[jira] Commented: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors

Simon Willnauer (JIRA) Fri, 04 Dec 2009 05:38:46 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785923#action_12785923
 ]


Simon Willnauer commented on LUCENE-2034:
-----------------------------------------

bq. I'm not sure about this. I think this is a partially true statement. I know 
I could look it up to be sure. I thought that the JLS required all static 
initializers to be run at first access to the class. So if one does not want 
the list of default stopwords, but wants something else in the class or is 
supplying an alternate set of stopwords, the default stopwords are initialized 
anyway.

DM, what you say is true, but the holder is a static nested class, and its 
static initializers run on first access to the holder itself, not to the 
enclosing class. That is exactly when they need to run, because the holder is 
only accessed once you ask for the default stopwords. It does not require any 
synchronization, as thread-safe class initialization is guaranteed by the JVM. 
What I like about it is that you can't introduce any synchronization problems: 
it is simple and declarative.
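
A minimal sketch of the holder idiom described above. The names 
ExampleAnalyzer, DefaultSetHolder and defaultSetLoads are made up for 
illustration and are not taken from the patch; the counter exists only to make 
the lazy initialization observable.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical analyzer-like class, not a Lucene class.
final class ExampleAnalyzer {
    // Illustration only: counts how often the default set is built.
    static int defaultSetLoads = 0;

    private final Set<String> stopwords;

    // The JVM runs this static initializer on first access to
    // DefaultSetHolder, not when ExampleAnalyzer itself is initialized.
    // Class initialization is thread-safe, so no explicit locking is needed.
    private static final class DefaultSetHolder {
        static final Set<String> DEFAULT_STOP_SET;
        static {
            defaultSetLoads++;
            DEFAULT_STOP_SET = Collections.unmodifiableSet(
                new HashSet<String>(Arrays.asList("a", "an", "the")));
        }
    }

    // Touching the holder here triggers the one-time initialization.
    static Set<String> getDefaultStopSet() {
        return DefaultSetHolder.DEFAULT_STOP_SET;
    }

    // Uses the defaults, so this constructor initializes the holder.
    ExampleAnalyzer() {
        this(DefaultSetHolder.DEFAULT_STOP_SET);
    }

    // Uses a caller-supplied set; the holder is never touched.
    ExampleAnalyzer(Set<String> stopwords) {
        this.stopwords = stopwords;
    }

    boolean isStopword(String term) {
        return stopwords.contains(term);
    }
}
```

Constructing the class with a custom set leaves the default set unbuilt; only 
the no-arg constructor or getDefaultStopSet() triggers the holder.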

bq. So the other benefit is that it is fully lazy. Though this is a small 
benefit.
See above.

bq. It could be made into a singleton (which would have been better in the 
first place), or static or both. I just tossed together one example, though 
extensive, to answer. Also, the matchVersion is not needed in the derived 
classes.
It already is a singleton. The holder makes it a lazily loaded static final 
singleton. MatchVersion will only be needed in derived classes if the 
tokenStreamComponents 


I personally don't like the various different ways you can load stopwords 
either, but my approach is a different one. Stopwords are mainly used in 
analyzers and filters. If you implement your own analyzer, we have a standard 
way to load them in StopawareAnalyzer. If you only use an analyzer, you should 
use WordlistLoader. If we fix WordlistLoader to return Set<?>, we are good to go 
with a single way for the user and a standard way for making a stop-aware 
analyzer. If you wrap this up in a class StopWords, then people do not know 
what to do with it once they want to load a stem-exclusion table.
Maybe I am missing something important, but I do not see the benefit of 
wrapping a Set<?> in another class. If there is one, please explain. :)
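
To illustrate the point about a plain Set, here is a rough sketch of a single 
loader whose result serves both as a stopword set and as a stem-exclusion 
table. SimpleWordLists and its load method are hypothetical; this is not 
WordlistLoader and does not mirror its signatures, and the file format (one 
word per line, '#' for comments) is an assumption.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, NOT Lucene's WordlistLoader.
final class SimpleWordLists {
    private SimpleWordLists() {}

    // Reads one word per line, trims whitespace, and skips blank lines
    // and lines starting with '#'. Closes the reader when done.
    static Set<String> load(Reader in) throws IOException {
        Set<String> words = new HashSet<String>();
        BufferedReader reader = new BufferedReader(in);
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim();
                if (word.length() > 0 && !word.startsWith("#")) {
                    words.add(word);
                }
            }
        } finally {
            reader.close();
        }
        // Unmodifiable, so the set can be shared safely between analyzers.
        return Collections.unmodifiableSet(words);
    }
}
```

Because the result is just a Set, the caller can pass it wherever a set of 
words is expected, without a dedicated StopWords wrapper type.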

Thanks

> Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors
> -------------------------------------------------------------------------
>
>                 Key: LUCENE-2034
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2034
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>    Affects Versions: 2.9
>            Reporter: Simon Willnauer
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-2034,patch, LUCENE-2034,patch, LUCENE-2034.patch, 
> LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, 
> LUCENE-2034.txt
>
>
> Due to the various tokenStream APIs we had in Lucene, analyzer subclasses 
> need to implement at least one of the methods returning a tokenStream. When 
> you look at the code it appears to be almost identical if both are 
> implemented in the same analyzer.  Each analyzer defines the same inner class 
> (SavedStreams), which is unnecessary.
> In contrib almost every analyzer uses stopwords, and each of them creates its 
> own way of loading them or defines a large number of ctors to load stopwords 
> from a file, set, arrays etc. Those ctors should be deprecated and 
> eventually removed.

