[jira] [Commented] (LUCENE-2309) Fully decouple IndexWriter from analyzers

Chris Male (JIRA) Fri, 22 Jul 2011 04:04:06 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13069502#comment-13069502
 ]


Chris Male commented on LUCENE-2309:
------------------------------------

I've thought about this issue some more and I feel there's a middle ground to 
be had.

{quote}
First, IW should not have to "know" how to get a TokenStream from an
IndexableField; it should only ask the Field for the token stream and get that
back and iterate its tokens.
{quote}

You're absolutely right, and this should be our first step.  It should be up to
the Field to produce its terms; IW should just iterate through them.
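
To make that first step concrete, here is a minimal sketch of the shape I have
in mind.  All of the names (TermSource, WhitespaceField, SketchIndexer) are
hypothetical stand-ins, not existing Lucene classes, and plain Strings stand in
for whatever attributes the real stream would carry:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for a field that knows how to produce its own terms.
interface TermSource {
    String name();
    Iterator<String> terms();
}

// One possible field: splits its value on whitespace and lowercases each term.
final class WhitespaceField implements TermSource {
    private final String name;
    private final String value;

    WhitespaceField(String name, String value) {
        this.name = name;
        this.value = value;
    }

    public String name() {
        return name;
    }

    public Iterator<String> terms() {
        List<String> out = new ArrayList<>();
        for (String t : value.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return out.iterator();
    }
}

// Hypothetical stand-in for the indexer: it never sees an analyzer,
// it only asks the field for its terms and iterates through them.
final class SketchIndexer {
    private final List<String> postings = new ArrayList<>();

    void index(TermSource field) {
        Iterator<String> it = field.terms();
        while (it.hasNext()) {
            postings.add(field.name() + ":" + it.next());
        }
    }

    List<String> postings() {
        return postings;
    }
}
```

The point of the sketch is only that the analysis decision lives entirely
inside the field; the indexer would work unchanged with any other TermSource.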

{quote}
Likewise, for multi-valued fields, IW shouldn't "see" the separate
values; it should just receive a single token stream, and under the
hood (in Document/Field impl) it's concatenating separate token
streams, adding posIncr/offset gaps, etc. This too is now hardwired
in indexer but shouldn't be. Maybe an app wants to insert custom
"separator" tokens between the values...
{quote}

I also totally agree.  We should strive to remove as much hardwiring as we can
and make it as flexible as possible.  But again, I see this as a step in the
process.
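
As a rough illustration of what "under the hood" could look like, the sketch
below concatenates the values of a multi-valued field into one stream, applies
a position increment gap at each value boundary, and optionally emits a custom
separator token.  GapToken and MultiValuedSketch are hypothetical names, offsets
are left out, and the way the gap is folded into the separator's increment is
just one possible choice:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical token: a term plus its position increment over the previous token.
final class GapToken {
    final String term;
    final int posIncr;

    GapToken(String term, int posIncr) {
        this.term = term;
        this.posIncr = posIncr;
    }
}

// Hypothetical multi-valued field that hides its separate values from the indexer.
final class MultiValuedSketch {
    private final List<List<String>> values = new ArrayList<>();
    private final int posIncrGap;
    private final String separator; // null means "no separator token"

    MultiValuedSketch(int posIncrGap, String separator) {
        this.posIncrGap = posIncrGap;
        this.separator = separator;
    }

    MultiValuedSketch addValue(String... terms) {
        values.add(Arrays.asList(terms));
        return this;
    }

    // Concatenates all values into the single stream the indexer would consume.
    List<GapToken> tokens() {
        List<GapToken> out = new ArrayList<>();
        int pendingGap = 0;
        for (int i = 0; i < values.size(); i++) {
            if (i > 0) {
                // Crossing a value boundary: remember the gap for the next token.
                pendingGap += posIncrGap;
                if (separator != null) {
                    out.add(new GapToken(separator, 1 + pendingGap));
                    pendingGap = 0;
                }
            }
            for (String term : values.get(i)) {
                out.add(new GapToken(term, 1 + pendingGap));
                pendingGap = 0;
            }
        }
        return out;
    }

    // Absolute positions implied by the increments; the first token lands on 0.
    static List<Integer> positions(List<GapToken> tokens) {
        List<Integer> out = new ArrayList<>();
        int pos = -1;
        for (GapToken t : tokens) {
            pos += t.posIncr;
            out.add(pos);
        }
        return out;
    }
}
```

With this shape the indexer only ever sees the flattened list; whether there is
a gap, a separator, or neither is entirely the field's business.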

{quote}
Second, this new idea to "invert" TokenStream into an AttrConsumer,
which I think is separate? I'm actually not sure I like such an
approach... it seems more confusing for simple usage? Ie, if I want
to analyze some text and iterate over the tokens... suddenly, instead
of a few lines of local code, I have to make a class instance with a
method that receives each token? It seems more convoluted? I
mean, for Lucene's limited internal usage of token stream, this is
fine, but for others who consume token streams... it seems more
cumbersome.
{quote}

I don't agree that this is separate.  For me the purpose of this issue is to
fully decouple IndexWriter from analyzers :) As such, how IW consumes the
terms it indexes is at the heart of the issue.  The inversion approach is a
suggestion for how we might tackle this in a flexible and extensible way, so I
don't see any reason to push it to another issue.  It's a way of fulfilling
this issue.
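
For anyone following along, the difference between the two consumption styles
can be sketched as below.  This is an illustration only: TermConsumer and
ConsumptionStyles are hypothetical names, and a plain Iterator of Strings stands
in for a real token stream.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical push-style callback: the producer hands each term to the consumer.
interface TermConsumer {
    void accept(String term);
}

final class ConsumptionStyles {
    // Pull style: the caller drives the loop and asks for each term in turn.
    static List<String> pull(Iterator<String> stream) {
        List<String> seen = new ArrayList<>();
        while (stream.hasNext()) {
            seen.add(stream.next());
        }
        return seen;
    }

    // Push ("inverted") style: the producer drives the loop and calls back.
    // Here a pull-style stream is adapted to a push-style consumer.
    static void push(Iterator<String> stream, TermConsumer consumer) {
        while (stream.hasNext()) {
            consumer.accept(stream.next());
        }
    }
}
```

Note that the push method is just an adapter over a pull-style stream, so code
that wants to keep iterating tokens locally would not have to change.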

I think there is also some confusion here.  I'm not suggesting we change all
usage of analysis.  If someone wants to consume TokenStream as is, so be it.
What I'm looking at changing here is how IW gets the terms it indexes, that's
all.  We've introduced abstractions like IndexableField to be flexible and
extensible.  I don't think there's anything wrong with examining the same thing
with TokenStream here.

I think Robert has stated here that he's comfortable continuing to use
TokenStream as the API for IW to get the terms it indexes.  Is that what others
feel too? I agree the inverted API I proposed is a little convoluted, and I'm
sure we can come up with a simple Consumable-like abstraction (which Robert
also suggested above).  But if people are content with TokenStream then there's
no need.

> Fully decouple IndexWriter from analyzers
> -----------------------------------------
>
>                 Key: LUCENE-2309
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2309
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Michael McCandless
>              Labels: gsoc2011, lucene-gsoc-11, mentor
>             Fix For: 4.0
>
>         Attachments: LUCENE-2309-analyzer-based.patch, LUCENE-2309.patch
>
>
> IndexWriter only needs an AttributeSource to do indexing.
> Yet, today, it interacts with Field instances, holds a private
> analyzer, invokes analyzer.reusableTokenStream, and has to deal with a
> wide variety of cases (it's not analyzed; it is analyzed but it's a
> Reader or String; it's pre-analyzed).
> I'd like to have IW only interact with attr sources that already
> arrived with the fields.  This would be a powerful decoupling -- it
> means others are free to make their own attr sources.
> They need not even use any of Lucene's analysis impls; eg they can
> integrate to other things like [OpenPipeline|http://www.openpipeline.org].
> Or make something completely custom.
> LUCENE-2302 is already a big step towards this: it makes IW agnostic
> about which attr is "the term", and only requires that it provide a
> BytesRef (for flex).
> Then I think LUCENE-2308 would get us most of the remaining way -- ie, if the
> FieldType knows the analyzer to use, then we could simply create a
> getAttrSource() method (say) on it and move all the logic IW has today
> onto there.  (We'd still need existing IW code for back-compat).
