[jira] Commented: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors

DM Smith (JIRA) Wed, 02 Dec 2009 05:22:48 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784812#action_12784812
 ]


DM Smith commented on LUCENE-2034:
----------------------------------

bq. But I do not see the benefit compared to the current solution.
In an earlier post we discussed that it would be possible to replace analyzers 
with a factory pattern, as Solr does. The benefit of this variation (you are 
right, it is equivalent) is that it moves in that direction.

bq. To access a default stopword set you have to create an instance of a 
specific analyzer which is IMO not a very natural way.
It could be made into a singleton (which would have been better in the first 
place), or static, or both. I just tossed together one example, extensive 
though it was, as an answer. Also, the matchVersion is not needed in the 
derived classes. So here is an alternative:
{code}
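public class ArabicStopWords extends StopWords {
  // Single shared instance, created when this class is initialized.
  private static final StopWords instance = new ArabicStopWords();
  // Private ctor: matchVersion is fixed here rather than passed in by callers.
  private ArabicStopWords() {
    super(Version.LUCENE_30, null, null, false);
  }
  // Static accessor, so no analyzer instance is needed to get the defaults.
  public static Set<?> getDefaultStopWords() {
    return instance.getDefaultStopWords();
  }
}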
{code}

bq. I personally prefer the holder pattern as it is guaranteed to be lazy by 
the JVM.
I'm not sure about this; I think it is only partially true, though I have not 
looked it up. I thought that the JLS required *all* static initializers of a 
class to be run at first access to that class. So if one does not want the 
list of default stopwords, but wants something else in the class or is 
supplying an alternate set of stopwords, the default stopwords are initialized 
anyway.

So the other benefit of a separate class is that it is fully lazy, though this 
is a small benefit.
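
For reference, a minimal sketch of the two cases in question. As far as I 
understand the JLS class-initialization rules, a class's static initializers 
run when that class is first initialized, while a nested holder class is a 
separate class that is initialized only on its own first use. The names below 
(EagerStopWords, LazyStopWords, DefaultSetHolder) and the two-word set are 
made up for illustration and are not taken from the patch.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Eager variant (hypothetical): the default set is a static field of the class itself. */
class EagerStopWords {
  static int builds = 0; // counts set constructions, for demonstration only
  private static final Set<String> DEFAULT = build();

  private static Set<String> build() {
    builds++;
    return Collections.unmodifiableSet(new HashSet<String>(Arrays.asList("a", "the")));
  }

  // Unrelated static member: calling it initializes the class, which builds DEFAULT too.
  static String name() { return "eager"; }

  static Set<String> getDefaultStopSet() { return DEFAULT; }
}

/** Holder variant (hypothetical): the default set lives in a nested class. */
class LazyStopWords {
  static int builds = 0; // counts set constructions, for demonstration only

  // Initialized only when getDefaultStopSet() first touches DefaultSetHolder.DEFAULT.
  private static final class DefaultSetHolder {
    static final Set<String> DEFAULT = build();
  }

  private static Set<String> build() {
    builds++;
    return Collections.unmodifiableSet(new HashSet<String>(Arrays.asList("a", "the")));
  }

  // Unrelated static member: calling it initializes LazyStopWords but not the holder.
  static String name() { return "lazy"; }

  static Set<String> getDefaultStopSet() { return DefaultSetHolder.DEFAULT; }
}
```

Calling EagerStopWords.name() alone is enough to build the eager set, whereas 
LazyStopWords.builds stays at 0 until getDefaultStopSet() is first called.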

On another note, still regarding code placement:
StopFilter has a bunch of makeStopSet methods. WordListLoader has a few more. 
StopawareAnalyzer has another. My example has yet another. I think this creates 
confusion for end users and casual contributors as it is not clear how to 
proceed without looking at the code for examples. I'd like to see some kind of 
clarity/consolidation.
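
To make the consolidation idea concrete, here is a rough sketch of what a 
single entry point might look like. Everything in it is an assumption for 
illustration only: the StopSets class, its method names, and the file 
convention (one word per line, '#' for comments) are hypothetical and do not 
exist in Lucene.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical single place for building stop sets; not part of Lucene. */
final class StopSets {
  private StopSets() {} // static utility, no instances

  /** Builds an unmodifiable set, lowercasing entries when ignoreCase is true. */
  static Set<String> of(Collection<String> words, boolean ignoreCase) {
    Set<String> set = new HashSet<String>();
    for (String w : words) {
      set.add(ignoreCase ? w.toLowerCase(Locale.ROOT) : w);
    }
    return Collections.unmodifiableSet(set);
  }

  /** Convenience overload for arrays and literal word lists. */
  static Set<String> of(boolean ignoreCase, String... words) {
    return of(Arrays.asList(words), ignoreCase);
  }

  /** Reads one word per line; blank lines and lines starting with '#' are skipped. */
  static Set<String> load(Reader reader, boolean ignoreCase) throws IOException {
    BufferedReader br = new BufferedReader(reader);
    Set<String> words = new HashSet<String>();
    String line;
    while ((line = br.readLine()) != null) {
      String w = line.trim();
      if (w.length() > 0 && !w.startsWith("#")) {
        words.add(w);
      }
    }
    return of(words, ignoreCase);
  }
}
```

With something along these lines, arrays, collections and files would all 
funnel through one method, so users and contributors would have one place to 
look.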





> Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors
> -------------------------------------------------------------------------
>
>                 Key: LUCENE-2034
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2034
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>    Affects Versions: 2.9
>            Reporter: Simon Willnauer
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-2034,patch, LUCENE-2034,patch, LUCENE-2034.patch, 
> LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, 
> LUCENE-2034.txt
>
>
> Due to the various tokenStream APIs we had in Lucene, analyzer subclasses 
> need to implement at least one of the methods returning a tokenStream. When 
> you look at the code it appears to be almost identical if both are 
> implemented in the same analyzer.  Each analyzer defines the same inner class 
> (SavedStreams), which is unnecessary.
> In contrib almost every analyzer uses stopwords and each of them creates its 
> own way of loading them or defines a large number of ctors to load stopwords 
> from a file, set, arrays etc. Those ctors should be deprecated and 
> eventually removed.

