[jira] Commented: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors

DM Smith (JIRA) Tue, 01 Dec 2009 10:36:46 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784327#action_12784327
 ]


DM Smith commented on LUCENE-2034:
----------------------------------

Robert, I'd like them to be in files as well. But when it really gets down to 
it, a uniform interface to the default stop word list is what really 
matters to me.

Like your use case, I don't see the provided analyzers as much more than a 
suggestion and default implementation. Currently and in this patch, I have to 
use them to get to the stop words.

I'm trying to figure out a way to specify a tokenizer/filter chain. (I've been 
trying to figure it out for a while, but not with much effort or success). 
Something like:
{code}
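// StreamSpec is a proposed type, not an existing Lucene class.
TokenStream construct(Version v, String fieldName, Reader r, StreamSpec... specs) {
  // The first spec wraps the Reader and acts as the source (the tokenizer).
  // At least one spec is required; an empty array fails here.
  TokenStream result = specs[0].create(v, fieldName, r);
  // Each remaining spec wraps the stream built so far (the filters).
  for (int i = 1; i < specs.length; i++) {
    result = specs[i].create(v, fieldName, result);
  }
  return result;
}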
{code}

The purpose of the StreamSpec is to allow a late binding of tokenizers/filters 
into a chain.
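
To make the late-binding idea concrete, here is a minimal, self-contained sketch. It is not Lucene code: tokens are plain Strings carried by an Iterator rather than a TokenStream, the Version argument is left out, and every name in it (StreamSpecSketch, SourceSpec, FilterSpec, WHITESPACE, LOWERCASE, stopFilter, analyze) is hypothetical. It also splits StreamSpec into two interfaces, one for the source and one for the filters, which is only one possible reading of the idea.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical stand-in for the StreamSpec idea; tokens are Strings, not Lucene TokenStreams. */
final class StreamSpecSketch {

    /** Opens the chain: turns a Reader into a stream of tokens (the tokenizer role). */
    interface SourceSpec {
        Iterator<String> create(String fieldName, Reader reader) throws IOException;
    }

    /** Wraps the stream built so far (the token filter role). */
    interface FilterSpec {
        Iterator<String> create(String fieldName, Iterator<String> input);
    }

    /** Late binding: the specs only describe the chain; it is built when this is called. */
    static Iterator<String> construct(String fieldName, Reader reader,
                                      SourceSpec source, FilterSpec... filters) throws IOException {
        Iterator<String> result = source.create(fieldName, reader);
        for (FilterSpec filter : filters) {
            result = filter.create(fieldName, result);
        }
        return result;
    }

    /** Reads the whole Reader and splits it on runs of whitespace. */
    static final SourceSpec WHITESPACE = (fieldName, reader) -> {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[1024];
        for (int n = reader.read(buffer); n != -1; n = reader.read(buffer)) {
            text.append(buffer, 0, n);
        }
        String trimmed = text.toString().trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyIterator();
        }
        return Arrays.asList(trimmed.split("\\s+")).iterator();
    };

    /** Lower-cases each token lazily as it is pulled through the chain. */
    static final FilterSpec LOWERCASE = (fieldName, input) -> new Iterator<String>() {
        @Override public boolean hasNext() { return input.hasNext(); }
        @Override public String next() { return input.next().toLowerCase(Locale.ROOT); }
    };

    /** Drops every token contained in the given stop word set (eagerly, for brevity). */
    static FilterSpec stopFilter(Set<String> stopWords) {
        return (fieldName, input) -> {
            List<String> kept = new ArrayList<>();
            while (input.hasNext()) {
                String token = input.next();
                if (!stopWords.contains(token)) {
                    kept.add(token);
                }
            }
            return kept.iterator();
        };
    }

    /** Convenience wrapper: builds the chain over a String and collects the tokens. */
    static List<String> analyze(String fieldName, String text,
                                SourceSpec source, FilterSpec... filters) {
        try {
            Iterator<String> stream = construct(fieldName, new StringReader(text), source, filters);
            List<String> tokens = new ArrayList<>();
            while (stream.hasNext()) {
                tokens.add(stream.next());
            }
            return tokens;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this sketch, the stop word set is just another argument handed to a spec, so a caller can supply the default list or a custom one without going through an analyzer. For example, analyze("body", "The Quick brown THE fox", WHITESPACE, LOWERCASE, stopFilter(stops)) with stops containing only "the" yields the tokens quick, brown, fox.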

The other part would be to generate a Manifest with version info for Lucene, 
Java and each component that could be stored in (or with) the index. That way 
one could compare the manifest to see if the index needs to be rebuilt. This 
manifest could also be used to reconstruct the TokenStream.
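
A rough sketch of the comparison half of the manifest idea follows. Everything in it is an assumption for illustration: the AnalysisManifest class, the "name=version" line format, and the policy that any difference at all means a rebuild. It does not record the order of the chain, so it would not be enough on its own to reconstruct a TokenStream.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical manifest: component name -> version string, recorded when the index is built. */
final class AnalysisManifest {
    // Sorted so that serialize() produces the same text regardless of insertion order.
    private final Map<String, String> versions = new TreeMap<>();

    /** Records (or overwrites) the version of one component; returns this for chaining. */
    AnalysisManifest with(String component, String version) {
        versions.put(component, version);
        return this;
    }

    /** Writes one "name=version" line per component, suitable for storing with the index. */
    String serialize() {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, String> entry : versions.entrySet()) {
            out.append(entry.getKey()).append('=').append(entry.getValue()).append('\n');
        }
        return out.toString();
    }

    /** Reads the format written by serialize(); lines without a name before '=' are skipped. */
    static AnalysisManifest parse(String text) {
        AnalysisManifest manifest = new AnalysisManifest();
        for (String line : text.split("\n")) {
            int eq = line.indexOf('=');
            if (eq > 0) {
                manifest.with(line.substring(0, eq), line.substring(eq + 1));
            }
        }
        return manifest;
    }

    /** Assumed policy: any added, removed, or changed component means the index is stale. */
    boolean needsRebuild(AnalysisManifest current) {
        return !versions.equals(current.versions);
    }
}
```

The stored manifest would be parsed when the index is opened and compared against one built from the running environment. A real policy would probably be looser than strict equality, since not every version bump changes the tokens produced.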


> Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors
> -------------------------------------------------------------------------
>
>                 Key: LUCENE-2034
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2034
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>    Affects Versions: 2.9
>            Reporter: Simon Willnauer
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-2034,patch, LUCENE-2034,patch, LUCENE-2034.patch, 
> LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, 
> LUCENE-2034.txt
>
>
> Due to the various TokenStream APIs we have had in Lucene, analyzer subclasses 
> need to implement at least one of the methods returning a TokenStream. When 
> you look at the code, it appears to be almost identical if both are 
> implemented in the same analyzer. Each analyzer defines the same inner class 
> (SavedStreams), which is unnecessary.
> In contrib, almost every analyzer uses stop words, and each of them creates its 
> own way of loading them or defines a large number of ctors to load stop words 
> from a file, set, arrays, etc. Those ctors should be deprecated and 
> eventually removed.
