[jira] Commented: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors

Simon Willnauer (JIRA) Tue, 01 Dec 2009 13:13:46 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784432#action_12784432
 ]


Simon Willnauer commented on LUCENE-2034:
-----------------------------------------

bq. Some of the analyzers allow for null to be specified for the stop word 
list. Others require an empty set/file/reader. Those deriving from 
StopawareAnalyzer allow null.
That is true: StopawareAnalyzer uses an empty set if you pass null.
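
To illustrate the null handling described above, here is a minimal sketch in plain Java. It is not the code from the patch: StopAwareSketch and NullTolerantAnalyzerSketch are hypothetical stand-ins for StopawareAnalyzer and one of its subclasses, and the copy into an unmodifiable set is an assumption made for the example.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the StopawareAnalyzer discussed above;
// this is not the class from the patch.
abstract class StopAwareSketch {
    protected final Set<?> stopwords;

    protected StopAwareSketch(Set<?> stopwords) {
        if (stopwords == null) {
            // null means "no stop words": normalize to an empty set
            this.stopwords = Collections.<Object>emptySet();
        } else {
            // defensive copy, so later changes by the caller have no effect
            this.stopwords = Collections.unmodifiableSet(new HashSet<Object>(stopwords));
        }
    }

    Set<?> getStopwordSet() {
        return stopwords;
    }
}

// Hypothetical concrete analyzer with a single Set<?> ctor.
final class NullTolerantAnalyzerSketch extends StopAwareSketch {
    NullTolerantAnalyzerSketch(Set<?> stopwords) {
        super(stopwords);
    }
}
```

With this shape, new NullTolerantAnalyzerSketch(null) behaves the same as passing an empty set, so subclasses never need their own null checks.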

bq. I'd like to see the ability to use null to follow through the rest of the 
analyzers.
Some of the analyzers are cluttered with stopword list processing.
The analyzers in this patch are a proof of concept rather than a complete list. 
Eventually, all analyzers that use stopwords will extend StopawareAnalyzer, which 
is the reason this class exists. This issue and some others aim to establish a 
consistent way of handling everything related to stopwords. We will also remove 
all the setters and keep only Set<?> ctors for consistency.

bq. If not how about adding public static Set<?> getDefaultStopSet() to 
StopawareAnalyzer?
The problem is that it is static, and it should be static. That's why we define 
it in each analyzer that uses stopwords. I would like to have it generalized, 
but this seems to be the ideal solution. We could have something like 
getDefaultStopSet(Class<? extends StopawareAnalyzer>), but I like the 
expressiveness of getDefaultStopSet() much better.
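
The two options above can be compared in a small sketch. Everything here is hypothetical: the analyzer names, the word lists, the lazy holder class, and the StopSetRegistrySketch lookup are assumptions for illustration, not code from the patch.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Option 1: each analyzer declares its own static getDefaultStopSet().
final class EnglishAnalyzerSketch {
    // The set is built once, when the holder class is first initialized.
    private static final class DefaultSetHolder {
        static final Set<?> DEFAULT_STOP_SET = Collections.unmodifiableSet(
                new HashSet<String>(Arrays.asList("a", "an", "the")));
    }

    static Set<?> getDefaultStopSet() {
        return DefaultSetHolder.DEFAULT_STOP_SET;
    }
}

final class GermanAnalyzerSketch {
    private static final class DefaultSetHolder {
        static final Set<?> DEFAULT_STOP_SET = Collections.unmodifiableSet(
                new HashSet<String>(Arrays.asList("der", "die", "das")));
    }

    static Set<?> getDefaultStopSet() {
        return DefaultSetHolder.DEFAULT_STOP_SET;
    }
}

// Option 2: one generalized lookup keyed by analyzer class.
final class StopSetRegistrySketch {
    private static final Map<Class<?>, Set<?>> DEFAULTS = new HashMap<Class<?>, Set<?>>();

    static {
        DEFAULTS.put(EnglishAnalyzerSketch.class, EnglishAnalyzerSketch.getDefaultStopSet());
        DEFAULTS.put(GermanAnalyzerSketch.class, GermanAnalyzerSketch.getDefaultStopSet());
    }

    static Set<?> getDefaultStopSet(Class<?> analyzerClass) {
        Set<?> set = DEFAULTS.get(analyzerClass);
        if (set == null) {
            // unknown analyzer: fall back to an empty set
            return Collections.<Object>emptySet();
        }
        return set;
    }
}
```

The call EnglishAnalyzerSketch.getDefaultStopSet() says directly which list is meant, while the class-keyed lookup needs a registry that has to be kept in sync with the analyzers.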

bq. How about splitting out the stop words to their own class? 
What do you mean by that? Can you elaborate?

bq. There are some TODOs in the code to make this or that private or final. If 
this is going to wait for 3.1 shouldn't they change?
They should actually go away, but I kept them in there because they are somewhat 
unrelated to this particular issue. Once this is in, we will work on removing 
the deprecated stuff and making the analyzers final (at least in contrib).

bq. In WordListLoader the return types are not Set or Map, but HashSet and 
HashMap. What's up with that? Should anyone care what the particular 
implementation is?
That is one thing I hate about WordListLoader. +1 towards Uwe working on them!

bq. I'm trying to figure out a way to specify a tokenizer/filter chain. (I've 
been trying to figure it out for a while, but not with much effort or success).
This has been discussed already, without much success. I cannot remember the 
issue (Robert, can you remember the factory issue?), but it was basically based 
on a factory pattern. This would also be my approach. That way we could get rid 
of almost every analyzer. I use such a pattern myself, and it works quite well.
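
As a rough illustration of the factory idea, the sketch below builds an analysis chain from a tokenizer factory and a list of filter factories. It is deliberately simplified and entirely hypothetical: AnalysisChainSketch and its factory methods are invented for this example, and they work on lists of strings, whereas real Lucene tokenizers and filters work on token streams.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;
import java.util.function.UnaryOperator;

// Hypothetical chain builder; not a Lucene API.
final class AnalysisChainSketch {
    private final Function<String, List<String>> tokenizer;
    private final List<UnaryOperator<List<String>>> filters =
            new ArrayList<UnaryOperator<List<String>>>();

    AnalysisChainSketch(Function<String, List<String>> tokenizer) {
        this.tokenizer = tokenizer;
    }

    // Filters are applied in the order they are added.
    AnalysisChainSketch addFilter(UnaryOperator<List<String>> filter) {
        filters.add(filter);
        return this;
    }

    List<String> analyze(String text) {
        List<String> tokens = tokenizer.apply(text);
        for (UnaryOperator<List<String>> filter : filters) {
            tokens = filter.apply(tokens);
        }
        return tokens;
    }

    // Factory for a tokenizer that splits on runs of whitespace.
    static Function<String, List<String>> whitespaceTokenizer() {
        return text -> {
            List<String> out = new ArrayList<String>();
            for (String token : text.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    out.add(token);
                }
            }
            return out;
        };
    }

    // Factory for a filter that lower-cases every token.
    static UnaryOperator<List<String>> lowerCaseFilter() {
        return tokens -> {
            List<String> out = new ArrayList<String>();
            for (String token : tokens) {
                out.add(token.toLowerCase(Locale.ROOT));
            }
            return out;
        };
    }

    // Factory for a filter that drops tokens found in the stop set.
    static UnaryOperator<List<String>> stopFilter(Set<String> stopwords) {
        return tokens -> {
            List<String> out = new ArrayList<String>();
            for (String token : tokens) {
                if (!stopwords.contains(token)) {
                    out.add(token);
                }
            }
            return out;
        };
    }
}
```

A chain such as new AnalysisChainSketch(whitespaceTokenizer()).addFilter(lowerCaseFilter()).addFilter(stopFilter(words)) replaces a dedicated analyzer class with configuration, which is the sense in which most analyzers could go away.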

bq. DM, I think we can have both? A method to get the default stopword list, 
but then they also happen to be in text files too?
+1 for having those words in files. Nevertheless, we will still have a default 
stopword list.
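
Keeping the words in a text file and still exposing them as a set could look roughly like the sketch below. StopwordFileSketch, its one-word-per-line format, and the comment prefix parameter are assumptions for illustration; this is not WordListLoader. Note that it returns Set rather than HashSet, in line with the return-type point raised earlier.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical loader: one stop word per line, blank lines and
// lines starting with the comment prefix are skipped.
final class StopwordFileSketch {
    static Set<String> load(Reader reader, String commentPrefix) {
        Set<String> words = new LinkedHashSet<String>();
        try (BufferedReader in = new BufferedReader(reader)) {
            String line;
            while ((line = in.readLine()) != null) {
                String word = line.trim();
                if (word.isEmpty() || word.startsWith(commentPrefix)) {
                    continue;
                }
                words.add(word);
            }
        } catch (IOException e) {
            // rethrow unchecked so callers such as static initializers stay simple
            throw new UncheckedIOException(e);
        }
        // expose the interface type only, and make the result read-only
        return Collections.unmodifiableSet(words);
    }
}
```

An analyzer's static default set could then be initialized once from a file bundled with the analyzer, so the file and the default list stay the same data.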

> Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors
> -------------------------------------------------------------------------
>
>                 Key: LUCENE-2034
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2034
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>    Affects Versions: 2.9
>            Reporter: Simon Willnauer
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-2034,patch, LUCENE-2034,patch, LUCENE-2034.patch, 
> LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, 
> LUCENE-2034.txt
>
>
> Due to the various tokenStream APIs we have had in Lucene, analyzer subclasses 
> need to implement at least one of the methods returning a tokenStream. When 
> you look at the code, it appears to be almost identical if both are 
> implemented in the same analyzer. Each analyzer defines the same inner class 
> (SavedStreams), which is unnecessary.
> In contrib, almost every analyzer uses stopwords, and each of them creates its 
> own way of loading them or defines a large number of ctors to load stopwords 
> from a file, set, arrays, etc. Those ctors should be deprecated and 
> eventually removed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

