[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784303#action_12784303 ]
DM Smith commented on LUCENE-2034:
----------------------------------

Patch looks good. I like how this simplifies the classes.

Some comments based on my use case, which allows a user creating an index to decide whether to use Lucene's default stop words or no stop words at all. No stop words is the default. (I'm also allowing stemming to be optional, but on by default.) These two options require me to duplicate each of the contrib Analyzers while reusing their parts.

(If you're interested, each Lucene index is a whole book, where each paragraph is a document. Every word is potentially meaningful, so stop words are not used by default.)

Regarding stop words:
* Some of the analyzers allow null to be specified for the stop word list. Others require an empty set/file/reader. Those deriving from StopawareAnalyzer allow null. I'd like to see the ability to pass null carried through to the rest of the analyzers.
* Some of the analyzers are cluttered with stop word list processing. Maybe WordListLoader could be extended to handle the other ways that contrib/analyzers store their lists? Specifically, how about moving StopawareAnalyzer.loadStopwordSet(...) there? It seems like a better place for it.
* How about splitting the stop words out into their own class? (I'm digging the word lists out of the analyzers and the lack of uniformity is a pain. Having them standalone would be useful.)
* If not, how about adding public static Set<?> getDefaultStopSet() to StopawareAnalyzer?
* Shouldn't StopawareAnalyzer be in core and used in StopAnalyzer? Could it be merged into StopAnalyzer? Other than loadStopwordSet, it really only adds a method to get the current stop word list.

Regarding 3.1:
There are some TODOs in the code to make this or that private or final. If this is going to wait for 3.1, shouldn't they change?

On a separate note:
In WordListLoader the return types are not Set or Map, but HashSet and HashMap. What's up with that? Should anyone care what the particular implementation is?
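
To make the stop word suggestions above concrete, here is a minimal sketch of a standalone stop word holder that treats null as "no stop words" and exposes only the Set interface rather than HashSet. It is not taken from the patch or from Lucene: the class name StopWords, the orEmpty helper, and the three placeholder words are all hypothetical. Only the getDefaultStopSet() name comes from the comment above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical standalone stop word holder (not a Lucene class).
 * Illustrates two points: null means "no stop words", and callers
 * see the Set interface instead of a concrete HashSet.
 */
final class StopWords {

    // Placeholder words for illustration only, not Lucene's real default list.
    private static final Set<String> DEFAULT_STOP_SET =
            Collections.unmodifiableSet(new HashSet<>(Arrays.asList("a", "an", "the")));

    private StopWords() {
        // static helpers only
    }

    /** Returns the default stop words as an unmodifiable Set. */
    public static Set<String> getDefaultStopSet() {
        return DEFAULT_STOP_SET;
    }

    /**
     * Normalizes a caller-supplied stop set: null becomes an empty set,
     * anything else is defensively copied and made unmodifiable.
     */
    public static Set<String> orEmpty(Set<String> stopWords) {
        if (stopWords == null) {
            return Collections.<String>emptySet();
        }
        return Collections.unmodifiableSet(new HashSet<>(stopWords));
    }
}
```

An analyzer constructor could call StopWords.orEmpty(stopWords) once and then treat the result uniformly, so that null and an empty set behave the same way for every analyzer.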

> Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors
> -------------------------------------------------------------------------
>
>                 Key: LUCENE-2034
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2034
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>    Affects Versions: 2.9
>            Reporter: Simon Willnauer
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-2034,patch, LUCENE-2034,patch, LUCENE-2034.patch,
> LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch,
> LUCENE-2034.txt
>
>
> Due to the various tokenStream APIs we have had in Lucene, analyzer subclasses
> need to implement at least one of the methods returning a tokenStream. When
> you look at the code, it appears to be almost identical if both are
> implemented in the same analyzer. Each analyzer defines the same inner class
> (SavedStreams), which is unnecessary.
> In contrib, almost every analyzer uses stopwords, and each of them creates its
> own way of loading them or defines a large number of ctors to load stopwords
> from a file, set, arrays, etc. Those ctors should be deprecated and
> eventually removed.