[ 
https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784303#action_12784303
 ] 

DM Smith commented on LUCENE-2034:
----------------------------------

Patch looks good. I like how this simplifies the classes.

Some comments based on my use case, which allows a user creating an index to 
decide whether to use Lucene's default stop words or no stop words at all. No 
stop words is the default. (I'm also allowing stemming to be optional, but on 
by default.) These two options require me to duplicate each contrib Analyzer but 
reuse the parts. (If you're interested, each Lucene index is a whole book, 
where each paragraph is a document. Every word is potentially meaningful so 
stop words are not used by default.)

Regarding stop words:
* Some of the analyzers allow for null to be specified for the stop word list. 
Others require an empty set/file/reader. Those deriving from StopawareAnalyzer 
allow null. I'd like to see the ability to use null to follow through the rest 
of the analyzers.
* Some of the analyzers are cluttered with stopword list processing. Maybe 
WordListLoader could be extended to handle the other ways that 
contrib/analyzers store their lists? Specifically, how about moving 
StopawareAnalyzer.loadStopwordSet(...)? It seems to be a better place.
* How about splitting out the stop words to their own class? (I'm digging the 
word lists out of the analyzers and the lack of uniformity is a pain. Having 
them standalone would be useful.)
* If not, how about adding public static Set<?> getDefaultStopSet() to 
StopawareAnalyzer?
* Shouldn't StopawareAnalyzer be in core and used in StopAnalyzer? Could it be 
merged into StopAnalyzer? Other than loadStopwordSet, it really only adds a 
method to get the current stopword list.
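
To make the null and getDefaultStopSet() suggestions concrete, here is a rough 
sketch of what I have in mind. StopWordsSketch and orEmpty are hypothetical 
names that are not in the patch, and the three words are placeholders rather 
than Lucene's actual default list.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of the patch or of Lucene.
final class StopWordsSketch {

    // Placeholder words for illustration only; not Lucene's real default list.
    private static final Set<String> DEFAULT_STOP_SET =
        Collections.unmodifiableSet(
            new HashSet<String>(Arrays.asList("a", "an", "the")));

    private StopWordsSketch() {}

    // Static accessor so callers can get the defaults without building an analyzer.
    static Set<String> getDefaultStopSet() {
        return DEFAULT_STOP_SET;
    }

    // Treats null as "no stop words" so every analyzer could accept null uniformly.
    static Set<String> orEmpty(Set<String> stopWords) {
        if (stopWords == null) {
            return Collections.<String>emptySet();
        }
        // Defensive, read-only copy of the caller's set.
        return Collections.unmodifiableSet(new HashSet<String>(stopWords));
    }
}
```

An analyzer ctor could then pass its argument through orEmpty(...) once and 
never need a null check or an empty-file workaround afterwards.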

Regarding 3.1:
There are some TODOs in the code to make this or that private or final. If this 
is going to wait for 3.1, shouldn't they change?

On a separate note:
In WordListLoader the return types are not Set or Map, but HashSet and HashMap. 
What's up with that? Should anyone care what the particular implementation is?
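
As a sketch of what I mean (WordListSketch is a hypothetical stand-in, not the 
real WordListLoader, and its one-word-per-line format is an assumption), the 
method can declare the interface type and keep the HashSet as an internal 
detail:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.HashSet;
import java.util.Set;

// Hypothetical loader for illustration; not the real WordListLoader.
final class WordListSketch {

    private WordListSketch() {}

    // Declared return type is Set; callers never see the concrete class.
    static Set<String> getWordSet(Reader reader) throws IOException {
        Set<String> result = new HashSet<String>();
        BufferedReader br = new BufferedReader(reader);
        try {
            String line;
            // Assumes one word per line; blank lines are skipped.
            while ((line = br.readLine()) != null) {
                String word = line.trim();
                if (word.length() > 0) {
                    result.add(word);
                }
            }
        } finally {
            br.close();
        }
        return result;
    }
}
```

With that signature the implementation could later switch to another Set 
without breaking callers.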


> Massive Code Duplication in Contrib Analyzers - unify the analyzer ctors
> ------------------------------------------------------------------------
>
>                 Key: LUCENE-2034
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2034
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>    Affects Versions: 2.9
>            Reporter: Simon Willnauer
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-2034,patch, LUCENE-2034,patch, LUCENE-2034.patch, 
> LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, 
> LUCENE-2034.txt
>
>
> Due to the various tokenStream APIs we have had in Lucene, analyzer subclasses 
> need to implement at least one of the methods returning a tokenStream. When 
> you look at the code, it appears to be almost identical if both are 
> implemented in the same analyzer. Each analyzer defines the same inner class 
> (SavedStreams), which is unnecessary.
> In contrib, almost every analyzer uses stopwords, and each of them creates its 
> own way of loading them or defines a large number of ctors to load stopwords 
> from a file, set, array, etc. Those ctors should be deprecated and 
> eventually removed.
