[jira] Commented: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors

Robert Muir (JIRA) Sat, 02 Jan 2010 05:10:19 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795860#action_12795860
 ]


Robert Muir commented on LUCENE-2034:
-------------------------------------

I am back on a real computer and (as mentioned on December 18th) I would like to 
commit this soon.

Simon, I only have one question: do you think it would be possible in the 
future to add an additional feature (under another issue) whereby:
* analyzers extending StopwordAnalyzerBase can have multiple stoplists 
depending upon Version
* StopwordAnalyzerBase.getStopwordSet requires a Version argument to match 
this behavior.
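
To make the two points above concrete, here is a minimal sketch in plain Java with no Lucene dependency. Everything in it is an assumption for illustration: the Version enum is a stand-in for Lucene's version constants, and the VersionedStopwords class, its add method, and the placeholder words are hypothetical names, not part of the attached patch or of StopwordAnalyzerBase.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.EnumMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for Lucene's version constants, declared oldest to newest.
enum Version { LUCENE_29, LUCENE_30, LUCENE_31 }

// Hypothetical holder for several stoplists keyed by the version that introduced them.
final class VersionedStopwords {
    private final EnumMap<Version, Set<String>> lists = new EnumMap<>(Version.class);

    // Register the stoplist that takes effect starting at the given version.
    VersionedStopwords add(Version since, String... words) {
        lists.put(since, Collections.unmodifiableSet(new HashSet<>(Arrays.asList(words))));
        return this;
    }

    // Return the newest stoplist registered at or before matchVersion,
    // or an empty set if none applies.
    Set<String> getStopwordSet(Version matchVersion) {
        Set<String> result = Collections.emptySet();
        // EnumMap iterates in declaration order, i.e. oldest version first.
        for (Map.Entry<Version, Set<String>> entry : lists.entrySet()) {
            if (entry.getKey().compareTo(matchVersion) > 0) {
                break;
            }
            result = entry.getValue();
        }
        return result;
    }
}
```

With something like this, an index built with LUCENE_29 would keep resolving to the original list, while callers passing LUCENE_31 would get the improved one.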

My reasoning is that we would then be able to improve stopword lists without 
breaking backwards compatibility.
I am aware many people feel stopword lists are not that important, but for quite 
a few non-English languages they are very important, no matter how advanced the 
scoring mechanism is (see Persian for a great example of this). 
I also think that in the future we might consider merging in the CommonGrams 
functionality that is currently duplicated in Nutch and Solr, so that these 
stoplists can be (ab)used with that method as well. So I think this kind of 
thing might become more important in the future.

I realize this is a new feature, so it shouldn't be under this issue, but if it 
means this design isn't viable, let me know. Otherwise I would like to 
commit this one first to make progress. I broke backwards compatibility when 
fixing the Arabic stopwords before, and I would like to not do this sort of 
thing again.


> Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors
> -------------------------------------------------------------------------
>
>                 Key: LUCENE-2034
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2034
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>    Affects Versions: 2.9
>            Reporter: Simon Willnauer
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-2034,patch, LUCENE-2034,patch, LUCENE-2034.patch, 
> LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, 
> LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, 
> LUCENE-2034.txt
>
>
> Due to the various tokenStream APIs we have had in Lucene, analyzer subclasses 
> need to implement at least one of the methods returning a tokenStream. When 
> you look at the code, it appears to be almost identical if both are 
> implemented in the same analyzer. Each analyzer defines the same inner class 
> (SavedStreams), which is unnecessary.
> In contrib almost every analyzer uses stopwords, and each of them creates its 
> own way of loading them or defines a large number of ctors to load stopwords 
> from a file, set, arrays, etc. Those ctors should be deprecated and 
> eventually removed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


