Should all 'static final' CharArray(Set|Map)s in stock Analyzers be "public" ?

Chris Hostetter Thu, 27 Jun 2024 17:17:50 -0700

Over in Solr, there's an open Jira regarding some "drift" that has happened over time between some of the language-specific stopword files that Solr ships in its default configset and the equivalent files that are provided in the Lucene jars (and loaded by the corresponding Lucene Analyzers via getResourceAsStream()).

That got me thinking about adding some tooling to Solr's build to update these files automatically when we upgrade Lucene, which got me looking at what that would involve, which led me to realize there seem to be some inconsistencies in which static default CharArraySets are/aren't "public" in Lucene Analyzer classes.
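
To make the "drift" idea concrete, here is a rough sketch of the comparison such tooling might start from. Everything in it is illustrative: the class and method names are made up, the word lists are inline stand-ins for the two files, and it assumes one word per line with '|' starting a comment, which may not match every stopword file.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class StopwordDrift {

  /** Parses one word per line, dropping blank lines and anything after the comment marker. */
  static Set<String> parse(List<String> lines, char commentMarker) {
    Set<String> words = new TreeSet<>();
    for (String line : lines) {
      int idx = line.indexOf(commentMarker);
      String word = (idx >= 0 ? line.substring(0, idx) : line).trim();
      if (!word.isEmpty()) {
        words.add(word);
      }
    }
    return words;
  }

  /** Returns the words present in 'a' but missing from 'b'. */
  static Set<String> missingFrom(Set<String> a, Set<String> b) {
    Set<String> result = new TreeSet<>(a);
    result.removeAll(b);
    return result;
  }

  public static void main(String[] args) {
    // Inline stand-ins (assumption): real tooling would read the jar copy via
    // getResourceAsStream() and the configset copy from disk.
    List<String> fromJar = Arrays.asList("| comment", "le", "la | article", "les");
    List<String> fromConfigset = Arrays.asList("le", "la", "un");

    Set<String> jar = parse(fromJar, '|');
    Set<String> cfg = parse(fromConfigset, '|');

    System.out.println("only in jar: " + missingFrom(jar, cfg));
    System.out.println("only in configset: " + missingFrom(cfg, jar));
  }
}
```

A non-empty result in either direction would mean the two copies have drifted.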


For example:

- Most (all?) Analyzer classes that have a default list of stopwords seem to include a "public static CharArraySet getDefaultStopSet()"


        ...but...

- Of the Analyzers that use ElisionFilter, only FrenchAnalyzer has a "public static final CharArraySet DEFAULT_ARTICLES" -- IrishAnalyzer, ItalianAnalyzer, & CatalanAnalyzer keep it private

- IrishAnalyzer also has a "private static final CharArraySet HYPHENATIONS" that's documented as being important to use as stopwords when using StandardTokenizer

- DutchAnalyzer has a (private) 'static final CharArrayMap<String> DEFAULT_STEM_DICT' that it uses with StemmerOverrideFilter
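
One way to find these cases without reading every class by hand might be a small reflection check. The sketch below is only an illustration: FakeAnalyzer is a hypothetical stand-in (not a Lucene class), and String[] stands in for CharArraySet/CharArrayMap so the example runs with nothing but the JDK.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

public class StaticDefaultsAudit {

  /** Hypothetical stand-in for an Analyzer with mixed visibility on its defaults. */
  static class FakeAnalyzer {
    public static final String[] DEFAULT_ARTICLES = {"l", "d"};
    private static final String[] HYPHENATIONS = {"h", "n", "t"};
    private final int instanceField = 0; // not static, so the audit ignores it
  }

  /** Lists the 'static final' fields of the given type that are NOT public. */
  static List<String> nonPublicStaticFinals(Class<?> clazz, Class<?> fieldType) {
    List<String> names = new ArrayList<>();
    for (Field f : clazz.getDeclaredFields()) {
      int m = f.getModifiers();
      if (f.isSynthetic() || !fieldType.isAssignableFrom(f.getType())) {
        continue; // skip compiler-generated fields and unrelated types
      }
      if (Modifier.isStatic(m) && Modifier.isFinal(m) && !Modifier.isPublic(m)) {
        names.add(f.getName());
      }
    }
    return names;
  }

  public static void main(String[] args) {
    // With real Lucene classes, fieldType would be CharArraySet.class or
    // CharArrayMap.class (assumption: the same check applies unchanged).
    System.out.println(nonPublicStaticFinals(FakeAnalyzer.class, String[].class));
  }
}
```

Run over the stock Analyzer classes, something like this would produce the full list of "defaults" that are currently hidden.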

...for my purposes, this is inconvenient but not insurmountable. Practically speaking, the bigger reason I'm asking whether folks think these kinds of static "defaults" should always be "public" is that it seems like any (Lucene) user who starts out using something like "new IrishAnalyzer()" and then later decides to write their own custom Analyzer to tweak behavior would probably prefer to do this...



  protected TokenStreamComponents createComponents(String fieldName) {
    final Tokenizer source = new StandardTokenizer();
    TokenStream result = new StopFilter(source,
        IrishAnalyzer.getDefaultHyphenations());
    result = new ElisionFilter(result,
        IrishAnalyzer.getDefaultElisionArticles());

    // special
    result = doMyFancyCustomStuff(result);

    result = new IrishLowerCaseFilter(result);
    result = new StopFilter(result,
        IrishAnalyzer.getDefaultStopSet());
    result = new SnowballFilter(result, new IrishStemmer());
    return new TokenStreamComponents(source, result);
  }

...but instead they have to read the source code for IrishAnalyzer and copy/paste the list of hyphenations & articles.



Do we want to change/standardize this?



-Hoss
http://www.lucidworks.com/

