+1 to make it more consistent (with preference for a public method).

Dawid

On Fri, Jun 28, 2024 at 2:16 AM Chris Hostetter <hossman_luc...@fucit.org> wrote:
>
> Over in Solr, there's an open jira regarding some "drift" that has
> happened over time between some of the lang specific stopword files that
> Solr ships in its default configset and the equivalent files that are
> provided in the lucene jars (and loaded by the corresponding lucene
> Analyzers via getResourceAsStream()).
>
> That got me thinking about adding some tooling to Solr's build to update
> these files automatically when we upgrade Lucene, which got me looking at
> what that would involve, which led me to realize there seem to be some
> inconsistencies in what static default CharArraySets are/aren't "public"
> in Lucene Analyzer classes.
>
> For example:
>
>  - Most (all?) Analyzer classes that have a default list
>    of stopwords seem to include a "public static CharArraySet
>    getDefaultStopSet()"
>
> ...but...
>
>  - Of the Analyzers that use ElisionFilter, only FrenchAnalyzer has a
>    "public static final CharArraySet DEFAULT_ARTICLES" -- IrishAnalyzer,
>    ItalianAnalyzer, & CatalanAnalyzer keep it private
>
>  - IrishAnalyzer also has a "private static final CharArraySet HYPHENATIONS"
>    that's documented as being important to use as stopwords when using
>    StandardTokenizer
>
>  - DutchAnalyzer has a (private) 'static final CharArrayMap<String>
>    DEFAULT_STEM_DICT' that it uses with StemmerOverrideFilter
>
>
> ...for my purposes, this is inconvenient, but not insurmountable, but
> practically speaking, the bigger concern I have for asking if folks think
> these kinds of static "defaults" should always be "public" is because it
> seems like any (lucene) user who starts out using something like "new
> IrishAnalyzer()" and then decides later that they want to write their own
> custom Analyzer to tweak behavior would probably prefer to do this...
>
>
>   protected TokenStreamComponents createComponents(String fieldName) {
>     final Tokenizer source = new StandardTokenizer();
>     TokenStream result = new StopFilter(source,
>                                         IrishAnalyzer.getDefaultHyphenations());
>     result = new ElisionFilter(result,
>                                IrishAnalyzer.getDefaultElisonArticles());
>
>     // special
>     result = doMyFancyCustomStuff(result);
>
>     result = new IrishLowerCaseFilter(result);
>     result = new StopFilter(result,
>                             IrishAnalyzer.getDefaultStopSet());
>     result = new SnowballFilter(result, new IrishStemmer());
>     return new TokenStreamComponents(source, result);
>   }
>
> ...but instead they have to read the source code for IrishAnalyzer and
> copy/paste the list of hyphenations & articles.
>
>
> Do we want to change/standardize this?
>
>
>
> -Hoss
> http://www.lucidworks.com/
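
For what "a public method" could look like, here is a rough sketch of the accessor pattern in plain Java, with no Lucene dependency so it runs on its own. Everything in it is an assumption for illustration: the class name ExampleAnalyzerDefaults, the use of java.util.Set instead of CharArraySet, and the set contents, which are placeholders and not the real IrishAnalyzer lists. The method names mirror the hypothetical getDefaultHyphenations() from the snippet above; they are not existing Lucene API.

```java
import java.util.Set;

// Hypothetical stand-in for an Analyzer class; not actual Lucene code.
final class ExampleAnalyzerDefaults {

    private ExampleAnalyzerDefaults() {}

    // Lazy holder: the sets are only built the first time an accessor is called.
    private static final class DefaultSetHolder {
        // Placeholder entries, NOT the real Lucene lists.
        static final Set<String> DEFAULT_ARTICLES = Set.of("a1", "a2");
        static final Set<String> HYPHENATIONS = Set.of("h1");
    }

    /** Returns an unmodifiable view of the default articles (hypothetical accessor). */
    public static Set<String> getDefaultArticles() {
        return DefaultSetHolder.DEFAULT_ARTICLES;
    }

    /** Returns an unmodifiable view of the default hyphenations (hypothetical accessor). */
    public static Set<String> getDefaultHyphenations() {
        return DefaultSetHolder.HYPHENATIONS;
    }

    public static void main(String[] args) {
        // A custom analyzer would pass these sets to its filters
        // instead of copy/pasting the lists from the source.
        System.out.println(getDefaultArticles().contains("a1"));
        System.out.println(getDefaultHyphenations().size());
    }
}
```

The fields stay private and callers only ever see an immutable set (Set.of rejects modification), so the defaults can later be loaded from a resource file or built differently without changing the public surface. That is the usual argument for a method over a public static field.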