[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795860#action_12795860 ]
Robert Muir commented on LUCENE-2034:
-------------------------------------

I am back on a real computer and (as mentioned December 18th) I would like to commit this soon.

Simon, I only have one question: do you think it would be possible in the future to add an additional feature (under another issue) whereby:
* analyzers extending StopwordAnalyzerBase can have multiple stoplists depending upon Version
* StopwordAnalyzerBase.getStopwordSet requires a Version argument to match this behavior.

My reasoning is that we would then be able to improve stopword lists without breaking backwards compatibility. I am aware many people feel stopword lists are not that important, but for quite a few non-English languages they are very important, no matter how advanced the scoring mechanism is (see Persian for a great example of this).

I also think that in the future we might consider merging in the commongrams functionality that is currently duplicated in Nutch and Solr, so that these stoplists can be (ab)used with that method as well. So I think this kind of thing might become more important in the future.

I realize this is a new feature, so it shouldn't be under this issue, but if it means this design isn't viable, let me know. Otherwise I would like to commit this one first to make progress. I broke backwards compatibility when fixing the Arabic stopwords before, and I would like to not do this sort of thing again.
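To make the two bullet points above concrete, here is a rough sketch of what a Version-dependent stoplist lookup might look like. It is not the Lucene API: the class VersionedStopwords, its register method, and the V_29/V_30/V_31 constants are all hypothetical stand-ins, and the words are placeholder English terms rather than any real stoplist. The only point is that an improved list can be registered for a newer version while older versions keep resolving to the list they always had.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.EnumMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch only; none of these names exist in Lucene. */
final class VersionedStopwords {

    /** Stand-in for Lucene's Version constants (illustrative names). */
    enum Version { V_29, V_30, V_31 }

    // EnumMap iterates in the declaration order of the enum, oldest version first.
    private final Map<Version, Set<String>> byVersion =
            new EnumMap<>(Version.class);

    /** Registers the stoplist that takes effect starting at the given version. */
    VersionedStopwords register(Version since, String... words) {
        byVersion.put(since,
                Collections.unmodifiableSet(new HashSet<>(Arrays.asList(words))));
        return this;
    }

    /**
     * Returns the newest stoplist registered at or before matchVersion,
     * or an empty set if nothing was registered that early.
     */
    Set<String> getStopwordSet(Version matchVersion) {
        Set<String> result = Collections.emptySet();
        for (Map.Entry<Version, Set<String>> entry : byVersion.entrySet()) {
            if (entry.getKey().compareTo(matchVersion) <= 0) {
                result = entry.getValue(); // later entries override earlier ones
            }
        }
        return result;
    }
}
```

With a list registered for V_29 and an extended one for V_31, a caller asking for V_30 would still get the V_29 list, which is the backwards-compatibility property described above.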
> Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors
> -------------------------------------------------------------------------
>
>                 Key: LUCENE-2034
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2034
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>    Affects Versions: 2.9
>            Reporter: Simon Willnauer
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-2034,patch, LUCENE-2034,patch, LUCENE-2034.patch,
> LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch,
> LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch,
> LUCENE-2034.txt
>
>
> Due to the various tokenStream APIs we have had in Lucene, analyzer subclasses
> need to implement at least one of the methods returning a tokenStream. When
> you look at the code, it appears to be almost identical if both are
> implemented in the same analyzer. Each analyzer defines the same inner class
> (SavedStreams), which is unnecessary.
> In contrib almost every analyzer uses stopwords, and each of them creates its
> own way of loading them or defines a large number of ctors to load stopwords
> from a file, set, arrays, etc. Those ctors should be deprecated and
> eventually removed.