+1 to make it more consistent (with preference for a public method).
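
As a rough sketch of what I mean by a public method: keep the constant itself hidden in a holder class and expose it through a static accessor, the same shape as the existing getDefaultStopSet(). Everything below is illustrative only. The class name, the accessor name, and the values are made up, and a plain java.util.Set stands in for CharArraySet so the snippet compiles without Lucene on the classpath.

```java
import java.util.Set;

/** Hypothetical stand-in for an Analyzer class; not actual Lucene code. */
final class ExampleAnalyzerDefaults {
  private ExampleAnalyzerDefaults() {}

  /** Holder class: the set is only built when the accessor is first called. */
  private static final class DefaultSetHolder {
    // Placeholder values; Set.of returns an immutable set (Java 9+).
    static final Set<String> DEFAULT_ARTICLES = Set.of("d", "m", "b");
  }

  /** Public accessor, so callers never depend on the field directly. */
  public static Set<String> getDefaultArticles() {
    return DefaultSetHolder.DEFAULT_ARTICLES;
  }
}
```

A method leaves room to change how the set is stored or loaded later without breaking callers, which a public static final field does not.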

Dawid

On Fri, Jun 28, 2024 at 2:16 AM Chris Hostetter <hossman_luc...@fucit.org>
wrote:

>
> Over in Solr, there's an open Jira regarding some "drift" that has
> happened over time between some of the language-specific stopword files
> that Solr ships in its default configset and the equivalent files that are
> provided in the Lucene jars (and loaded by the corresponding Lucene
> Analyzers via getResourceAsStream()).
>
> That got me thinking about adding some tooling to Solr's build to update
> these files automatically when we upgrade Lucene, which got me looking at
> what that would involve, which led me to realize there seem to be some
> inconsistencies in which static default CharArraySets are/aren't "public"
> in Lucene Analyzer classes.
>
> For example:
>
> - Most (all?) Analyzer classes that have a default list
> of stopwords seem to include a "public static CharArraySet
> getDefaultStopSet()"
>
>         ...but...
>
> - Of the Analyzers that use ElisionFilter, only FrenchAnalyzer has a
> "public static final CharArraySet DEFAULT_ARTICLES" -- IrishAnalyzer,
> ItalianAnalyzer, & CatalanAnalyzer keep it private
>
> - IrishAnalyzer also has a "private static final CharArraySet HYPHENATIONS"
> that's documented as being important to use as stopwords when using
> StandardTokenizer
>
> - DutchAnalyzer has a (private) 'static final CharArrayMap<String>
> DEFAULT_STEM_DICT' that it uses with StemmerOverrideFilter
>
>
> ...for my purposes, this is inconvenient but not insurmountable.
> Practically speaking, the bigger reason I'm asking whether folks think
> these kinds of static "defaults" should always be "public" is that it
> seems like any (Lucene) user who starts out using something like "new
> IrishAnalyzer()" and then decides later that they want to write their own
> custom Analyzer to tweak behavior would probably prefer to do this...
>
>
>    protected TokenStreamComponents createComponents(String fieldName) {
>      final Tokenizer source = new StandardTokenizer();
>      TokenStream result = new StopFilter(source,
>          IrishAnalyzer.getDefaultHyphenations());
>      result = new ElisionFilter(result,
>          IrishAnalyzer.getDefaultElisionArticles());
>
>      // special
>      result = doMyFancyCustomStuff(result);
>
>      result = new IrishLowerCaseFilter(result);
>      result = new StopFilter(result,
>          IrishAnalyzer.getDefaultStopSet());
>      result = new SnowballFilter(result, new IrishStemmer());
>      return new TokenStreamComponents(source, result);
>    }
>
> ...but instead they have to read the source code for IrishAnalyzer and
> copy/paste the lists of hyphenations & articles.
>
>
> Do we want to change/standardize this?
>
>
>
> -Hoss
> http://www.lucidworks.com/