Hi,
I am fine with this. But on the other hand: why do you want to replicate
the files into Solr's config folder? A Solr configuration should ideally
be able to load the stopwords file from resources, too. I have always
wondered why we have all those files in the default configset, some of
which are also strange, outdated examples.
Not sure what the best way to do this is. I think at the moment the
factories don't load the defaults automatically, but they are able to
load from the JAR file (depending on which ResourceLoader you use in Solr).
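For concreteness, a rough sketch of that idea in Java: point a StopFilterFactory at the stopword file that already ships inside lucene-analysis-common and let a classpath-based ResourceLoader resolve it. The resource path is hypothetical (the actual file name and format differ per language), and the package locations assume a recent Lucene; older releases kept ResourceLoader under org.apache.lucene.analysis.util.

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.core.StopFilterFactory;
import org.apache.lucene.util.ClasspathResourceLoader;

public class FactoryFromJar {
  public static void main(String[] args) throws Exception {
    Map<String, String> params = new HashMap<>();
    // hypothetical resource path; the real file name/format varies per language
    params.put("words", "org/apache/lucene/analysis/ga/stopwords.txt");
    params.put("ignoreCase", "true");
    StopFilterFactory factory = new StopFilterFactory(params);
    // in Solr this would be the SolrResourceLoader; ClasspathResourceLoader
    // resolves the path against the classpath, i.e. straight out of the JAR
    factory.inform(new ClasspathResourceLoader(FactoryFromJar.class.getClassLoader()));
    CharArraySet stopwords = factory.getStopWords();
    System.out.println("loaded " + stopwords.size() + " stopwords from the analysis JAR");
  }
}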
Uwe
On 28.06.2024 at 02:16, Chris Hostetter wrote:
Over in Solr, there's an open jira regarding some "drift" that has
happened over time between some of the lang-specific stopword files
that Solr ships in its default configset and the equivalent files
that are provided in the lucene jars (and loaded by the corresponding
lucene Analyzers via getResourceAsStream()).
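Concretely, that load looks roughly like this (a minimal sketch; the
"stopwords.txt" file name and the "#" comment character are assumptions
that vary per language):

import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.WordlistLoader;
import org.apache.lucene.analysis.ga.IrishAnalyzer;

public class BundledStopwords {
  public static void main(String[] args) throws Exception {
    // read the copy that ships inside lucene-analysis-common, the same way the
    // analyzer itself does: resolve the resource relative to the analyzer class
    try (Reader r = new InputStreamReader(
        IrishAnalyzer.class.getResourceAsStream("stopwords.txt"), // assumed name
        StandardCharsets.UTF_8)) {
      CharArraySet fromJar = WordlistLoader.getWordSet(r, "#"); // assumed comment char
      System.out.println(fromJar.size() + " stopwords read from the JAR");
    }
  }
}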
That got me thinking about adding some tooling to Solr's build to
update these files automatically when we upgrade Lucene, which got me
looking at what that would involve, which led me to realize there
seem to be some inconsistencies in what static default CharArraySets
are/aren't "public" in Lucene Analyzer classes.
For example:
- Most (all?) Analyzer classes that have a default list of stopwords
seem to include a "public static CharArraySet getDefaultStopSet()"
...but...
- Of the Analyzers that use ElisionFilter, only FrenchAnalyzer has a
"public static final CharArraySet DEFAULT_ARTICLES" -- IrishAnalyzer,
ItalianAnalyzer, & CatalanAnalyzer keep it private
- IrishAnalyzer also has a "private static final CharArraySet
HYPHENATIONS" that's documented as being important to use as stopwords
when using StandardTokenizer
- DutchAnalyzer has a (private) 'static final CharArrayMap<String>
DEFAULT_STEM_DICT' that it uses with StemmerOverrideFilter
(a quick sketch right after this list shows what is and isn't currently reachable)
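Here is that quick sketch of what currently does and does not compile
against lucene-analysis-common:

import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.fr.FrenchAnalyzer;
import org.apache.lucene.analysis.ga.IrishAnalyzer;

public class DefaultSetAccess {
  public static void main(String[] args) {
    // public today: the default stop sets, plus FrenchAnalyzer's elision articles
    CharArraySet frStops = FrenchAnalyzer.getDefaultStopSet();
    CharArraySet frArticles = FrenchAnalyzer.DEFAULT_ARTICLES;
    CharArraySet gaStops = IrishAnalyzer.getDefaultStopSet();
    System.out.println(frStops.size() + " / " + frArticles.size() + " / " + gaStops.size());

    // not public today: IrishAnalyzer's articles & hyphenations, DutchAnalyzer's stem dict
    // CharArraySet gaArticles = IrishAnalyzer.DEFAULT_ARTICLES;  // private -- won't compile
    // CharArraySet gaHyphens  = IrishAnalyzer.HYPHENATIONS;      // private -- won't compile
  }
}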
...for my purposes, this is inconvenient but not insurmountable.
Practically speaking, the bigger reason I'm asking whether folks think
these kinds of static "defaults" should always be "public" is that any
(lucene) user who starts out with something like "new IrishAnalyzer()"
and later decides they want to write their own custom Analyzer to
tweak behavior would probably prefer to do this...
protected TokenStreamComponents createComponents(String fieldName) {
  final Tokenizer source = new StandardTokenizer();
  TokenStream result = new StopFilter(source,
      IrishAnalyzer.getDefaultHyphenations());
  result = new ElisionFilter(result,
      IrishAnalyzer.getDefaultElisionArticles());
  // special: the custom tweaks go here
  result = doMyFancyCustomStuff(result);
  result = new IrishLowerCaseFilter(result);
  result = new StopFilter(result,
      IrishAnalyzer.getDefaultStopSet());
  result = new SnowballFilter(result, new IrishStemmer());
  return new TokenStreamComponents(source, result);
}
...but instead they have to read the source code for IrishAnalyzer and
copy/paste the list of hyphenations & articles.
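(If we did standardize this, one hypothetical shape for the change is a
fragment like the following inside IrishAnalyzer; neither accessor exists
today, and the names are just the ones used in the sketch above:)

// hypothetical additions inside IrishAnalyzer -- simple accessors that
// expose the private sets listed earlier in this thread
public static CharArraySet getDefaultElisionArticles() {
  return DEFAULT_ARTICLES;
}

public static CharArraySet getDefaultHyphenations() {
  return HYPHENATIONS;
}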
Do we want to change/standardize this?
-Hoss
http://www.lucidworks.com/
--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org