Re: Should all 'static final' CharArray(Set|Map)s in stock Analyzers be "public" ?

Uwe Schindler Sun, 30 Jun 2024 03:09:40 -0700

Hi again,

There's also one other problem with those sets: unfortunately they are modifiable, because they are not real "Set<String>" instances but CharArraySets. There is no 100% unmodifiable view of them. This was the main reason why we did not make them public for newer variants of analyzers. I think we should add unmodifiable "views" of CharArraySet, but this is also not 100% possible, as the underlying char[] cannot be protected.
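
To illustrate the point about the underlying char[], here is a minimal JDK-only sketch. It is not Lucene's CharArraySet: the class name CharKeyViewDemo and the list-of-char[] layout are assumptions made for the example. It only shows that a read-only wrapper can reject add/remove while the arrays it hands out stay mutable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a collection whose entries are raw char[] arrays.
// This is NOT Lucene's CharArraySet, only an illustration of the aliasing problem.
final class CharKeyViewDemo {

    /** Returns true if any entry holds exactly the characters of the given word. */
    static boolean contains(List<char[]> entries, String word) {
        char[] probe = word.toCharArray();
        for (char[] entry : entries) {
            if (Arrays.equals(entry, probe)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<char[]> backing = new ArrayList<>();
        backing.add("the".toCharArray());
        backing.add("and".toCharArray());

        // The wrapper blocks structural changes (add/remove/clear)...
        List<char[]> view = Collections.unmodifiableList(backing);
        try {
            view.add("or".toCharArray());
        } catch (UnsupportedOperationException expected) {
            System.out.println("add rejected");
        }

        // ...but every char[] it hands out is still the live, mutable array.
        view.get(0)[0] = 'X';
        System.out.println(contains(view, "the")); // false: the entry was changed in place
        System.out.println(contains(view, "Xhe")); // true
    }
}
```

A fully safe view would have to copy each char[] on the way out, which costs the allocation that a char[]-based set is presumably meant to avoid.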


Uwe

On 30.06.2024 at 12:02, Uwe Schindler wrote:

Hi,
I am fine with this. But on the other hand: why do you want to replicate the files into Solr's config folder? A Solr configuration should be able to load the stopwords file from resources, too. I was always wondering why we have so many files in the default configset, some of them strange, outdated examples.
Not sure what the best way to do this is. I think at the moment the factories don't load the defaults automatically, but they are able to load from a JAR file (depending on which ResourceLoader you use in Solr).
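
As a rough sketch of what "loading from resources" amounts to, the JDK-only example below reads a one-word-per-line list from any Reader. The class name StopwordResourceSketch, the method readWords, and the "#" comment convention are assumptions for illustration; this is not the code path the Lucene or Solr factories actually use.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper; the names and the comment convention are assumptions.
final class StopwordResourceSketch {

    /** Reads one word per line, skipping blank lines and lines that start with commentPrefix. */
    static Set<String> readWords(Reader in, String commentPrefix) {
        Set<String> words = new LinkedHashSet<>();
        try (BufferedReader reader = new BufferedReader(in)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim();
                if (!word.isEmpty() && !word.startsWith(commentPrefix)) {
                    words.add(word);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Collections.unmodifiableSet(words);
    }

    public static void main(String[] args) {
        // In a real setup the Reader would wrap a classpath stream, for example
        // SomeClass.class.getResourceAsStream("stopwords.txt") (hypothetical names).
        Set<String> words = readWords(new StringReader("# comment\nthe\n\nand\n"), "#");
        System.out.println(words); // [the, and]
    }
}
```

Reading straight from the JAR this way means there is no second copy of the file in a config folder that could fall out of date.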
Uwe

On 28.06.2024 at 02:16, Chris Hostetter wrote:
Over in Solr, there's an open Jira regarding some "drift" that has happened over time between some of the lang-specific stopword files that Solr ships in its default configset and the equivalent files that are provided in the Lucene jars (and loaded by the corresponding Lucene Analyzers via getResourceAsStream()).
That got me thinking about adding some tooling to Solr's build to update these files automatically when we upgrade Lucene, which got me looking at what that would involve, which led me to realize there seem to be some inconsistencies in which static default CharArraySets are/aren't "public" in Lucene Analyzer classes.
For example:
- Most (all?) Analyzer classes that have a default list of stopwords seem to include a "public static CharArraySet getDefaultStopSet()"
    ...but...
- Of the Analyzers that use ElisionFilter, only FrenchAnalyzer has a "public static final CharArraySet DEFAULT_ARTICLES" -- IrishAnalyzer, ItalianAnalyzer, & CatalanAnalyzer keep it private
- IrishAnalyzer also has a "private static final CharArraySet HYPHENATIONS" that's documented as being important to use as stopwords when using StandardTokenizer
- DutchAnalyzer has a (private) 'static final CharArrayMap<String> DEFAULT_STEM_DICT' that it uses with StemmerOverrideFilter
...for my purposes, this is inconvenient, but not insurmountable. Practically speaking, the bigger reason I'm asking whether folks think these kinds of static "defaults" should always be "public" is that any (Lucene) user who starts out using something like "new IrishAnalyzer()" and then decides later that they want to write their own custom Analyzer to tweak behavior would probably prefer to do this...
  protected TokenStreamComponents createComponents(String fieldName) {
    final Tokenizer source = new StandardTokenizer();
    TokenStream result = new StopFilter(source,
        IrishAnalyzer.getDefaultHyphenations());
    result = new ElisionFilter(result,
        IrishAnalyzer.getDefaultElisionArticles());

    // special
    result = doMyFancyCustomStuff(result);

    result = new IrishLowerCaseFilter(result);
    result = new StopFilter(result,
        IrishAnalyzer.getDefaultStopSet());
    result = new SnowballFilter(result, new IrishStemmer());
    return new TokenStreamComponents(source, result);
  }
...but instead they have to read the source code for IrishAnalyzer and copy/paste the list of hyphenations & articles.
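
For reference, the accessor pattern being asked about could look roughly like the JDK-only sketch below. Everything in it is hypothetical: ExampleAnalyzerDefaults is not a Lucene class, the entries are placeholders rather than any analyzer's real article list, and a plain Set<String> stands in for CharArraySet.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical analyzer-like class; not an actual Lucene class.
final class ExampleAnalyzerDefaults {

    // Placeholder entries, not the real article list of any analyzer.
    private static final Set<String> DEFAULT_ARTICLES =
        Collections.unmodifiableSet(new LinkedHashSet<>(Arrays.asList("x", "y")));

    /** Public read-only access, so custom analyzers need not copy the list. */
    public static Set<String> getDefaultArticles() {
        return DEFAULT_ARTICLES;
    }

    public static void main(String[] args) {
        System.out.println(getDefaultArticles().contains("x")); // true
        try {
            getDefaultArticles().add("z");
        } catch (UnsupportedOperationException expected) {
            System.out.println("read-only");
        }
    }
}
```

With String elements the read-only wrapper is sufficient; a char[]-backed set is harder to protect, which is the concern raised at the top of this thread.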
Do we want to change/standardize this?



-Hoss
http://www.lucidworks.com/

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: [email protected]


