[jira] [Commented] (LUCENE-2309) Fully decouple IndexWriter from analyzers

Robert Muir (JIRA) Sun, 17 Jul 2011 08:35:26 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066672#comment-13066672
 ]


Robert Muir commented on LUCENE-2309:
-------------------------------------

{quote}
What I'm exploring with this direction currently is how best to consume the 
terms of a Field while minimizing the exposure of Analyzer / TokenStream in the 
indexing process.
{quote}

I thought about this a lot, and I'm not sure we need to minimize TokenStream; I 
think it might be fine! There is really nothing much more minimal than this: 
it's just AttributeSource + incrementToken() + reset() + end() + close()...
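
To make that concrete, here is a rough sketch of the consumer side of such a 
contract. It is plain Java with no Lucene dependency: MiniTokenStream, 
WhitespaceStream, term() and consume() are hypothetical stand-ins invented for 
illustration, not Lucene's real classes or signatures.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in types; this does not use Lucene's real TokenStream.
final class TokenStreamContractSketch {

    /** Minimal contract: reset, incrementToken until false, end, close. */
    interface MiniTokenStream extends AutoCloseable {
        void reset();
        boolean incrementToken();
        String term();            // stands in for reading a term attribute
        void end();
        @Override void close();   // narrowed: no checked exception
    }

    /** Toy stream that splits its input on whitespace. */
    static final class WhitespaceStream implements MiniTokenStream {
        private final String text;
        private Iterator<String> it;
        private String current;

        WhitespaceStream(String text) { this.text = text; }

        @Override public void reset() {
            List<String> parts = new ArrayList<>();
            for (String p : text.split("\\s+")) {
                if (!p.isEmpty()) parts.add(p);
            }
            it = parts.iterator();
        }

        @Override public boolean incrementToken() {
            if (it == null) throw new IllegalStateException("reset() not called");
            if (!it.hasNext()) return false;
            current = it.next();
            return true;
        }

        @Override public String term() { return current; }
        @Override public void end() { current = null; }
        @Override public void close() { it = null; }
    }

    /** All an indexer-like consumer needs: the five calls and nothing else. */
    static List<String> consume(MiniTokenStream stream) {
        List<String> terms = new ArrayList<>();
        try (MiniTokenStream ts = stream) {   // close() runs even on failure
            ts.reset();
            while (ts.incrementToken()) {
                terms.add(ts.term());
            }
            ts.end();
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(consume(new WhitespaceStream("fully decouple IndexWriter")));
    }
}
```

The consumer never sees how the terms were produced, which is the sense in 
which the surface is already small.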

This issue is a little out of date; I actually think we have been decoupling 
IndexWriter from analysis even more... for example, IndexWriter now just pulls 
bytes from the TokenStream, etc.

I think we have been going in the right direction and should just be factoring 
out the things that don't belong in the indexer or in the analyzer (like this 
gap manipulation).

{quote}
What has felt natural to me is having Analyzer at the Field level. This is 
already kind of implemented anyway - if a Field returns a tokenStreamValue() 
then that's used for indexing, no matter what the 'analysis configuration' is.
{quote}

Yeah, but that's currently an expert option, not the normal case. In general, 
if we are saying IndexWriter doesn't take an Analyzer but has some schema that 
is a list of Fields, then QueryParser etc. need to take this list of fields 
also, so that query parsing is consistent with analysis. But I'm not sure I 
like this either: I think it only couples QueryParser with IndexWriter.

{quote}
Do you have any suggestions for other directions to follow?
{quote}

I don't think we should try to do a huge mega-change right now given that there 
are various "problems" we should fix... some of these I know are my pet peeves, 
but:
* fixing the Analyzers to only be reusable is important; non-reusable analyzers 
are a performance trap. We should do this regardless... and we can even 
backport this one easily to 3.x (backporting improvements to 
ReusableAnalyzerBase, deprecating tokenStream(), etc.)
* removing the positionIncrement/offsetGaps from Analyzer makes total sense to 
me; this is "decoupling IndexWriter from Analyzer" because these gaps make no 
sense to other analyzer consumers. So I think these gaps are in the wrong 
place, and should instead be in Field or whatever, which creates a 
ConcatenatedTokenStream behind the scenes to provide to IndexWriter. This is 
good for another reason too: maybe you need to do other things than munge 
offsets/positions across these gaps, and this concatenation would no longer be 
hardcoded in IndexWriter.
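
As a rough illustration of the second bullet above, the sketch below lets a 
field-like object own the gap and hand an already-concatenated term sequence 
to the indexer. It is plain Java with no Lucene dependency. MultiValuedField, 
PositionedTerm and concatenatedTerms() are hypothetical names, and the position 
arithmetic is a simplification (every term advances by one position, and the 
gap is added between values), not a claim about Lucene's exact behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: none of these types are Lucene classes.
final class ConcatenatedValuesSketch {

    /** A term together with its absolute position in the field. */
    static final class PositionedTerm {
        final String term;
        final int position;

        PositionedTerm(String term, int position) {
            this.term = term;
            this.position = position;
        }

        @Override public String toString() { return term + "@" + position; }
    }

    /** Owns the gap, so neither the analyzer nor the indexer has to know it. */
    static final class MultiValuedField {
        private final List<String> values;
        private final int positionGap;

        MultiValuedField(int positionGap, String... values) {
            this.positionGap = positionGap;
            this.values = Arrays.asList(values);
        }

        /** Concatenates the per-value terms, adding the gap between values. */
        List<PositionedTerm> concatenatedTerms() {
            List<PositionedTerm> out = new ArrayList<>();
            int next = 0;
            boolean first = true;
            for (String value : values) {
                if (!first) next += positionGap;   // gap only between values
                first = false;
                for (String term : value.trim().split("\\s+")) {
                    if (term.isEmpty()) continue;  // skip empty values
                    out.add(new PositionedTerm(term, next++));
                }
            }
            return out;
        }
    }

    public static void main(String[] args) {
        MultiValuedField field = new MultiValuedField(100, "quick fox", "lazy dog");
        // prints [quick@0, fox@1, lazy@102, dog@103]
        System.out.println(field.concatenatedTerms());
    }
}
```

With the gap living in the field, a different concatenation policy would only 
mean swapping out concatenatedTerms(); the consumer of the terms would not 
change.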

I've looked at doing e.g. the first of these, and I know it's a huge mondo pain 
in the ass; these "smaller" changes are really hard and a ton of work.
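
For readers unfamiliar with the reuse point in the first bullet, here is a 
minimal sketch of the idea, assuming per-thread caching of the analysis chain. 
It is plain Java with no Lucene dependency: Components, ReusingAnalyzer, 
reusableStream() and chainsBuilt are hypothetical names for illustration and do 
not reproduce ReusableAnalyzerBase.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of per-thread reuse; not Lucene's ReusableAnalyzerBase.
final class ReusableAnalyzerSketch {

    /** Stands in for an expensive tokenizer + filter chain. */
    static final class Components {
        private String text = "";

        void setText(String text) { this.text = text; }

        int countTokens() {
            String t = text.trim();
            return t.isEmpty() ? 0 : t.split("\\s+").length;
        }
    }

    /** Builds the chain once per thread, then only re-points it at new text. */
    static final class ReusingAnalyzer {
        final AtomicInteger chainsBuilt = new AtomicInteger();

        private final ThreadLocal<Components> cached = ThreadLocal.withInitial(() -> {
            chainsBuilt.incrementAndGet();   // counts how often we pay the setup cost
            return new Components();
        });

        Components reusableStream(String text) {
            Components c = cached.get();     // same instance on every call per thread
            c.setText(text);
            return c;
        }
    }

    public static void main(String[] args) {
        ReusingAnalyzer analyzer = new ReusingAnalyzer();
        int total = 0;
        for (String doc : new String[] {"one two", "three", "four five six"}) {
            total += analyzer.reusableStream(doc).countTokens();
        }
        // prints "6 tokens, chains built: 1"
        System.out.println(total + " tokens, chains built: " + analyzer.chainsBuilt.get());
    }
}
```

A non-reusing variant would construct a new Components for every document, 
which is the cost the first bullet is worried about.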


> Fully decouple IndexWriter from analyzers
> -----------------------------------------
>
>                 Key: LUCENE-2309
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2309
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Michael McCandless
>              Labels: gsoc2011, lucene-gsoc-11, mentor
>             Fix For: 4.0
>
>         Attachments: LUCENE-2309.patch
>
>
> IndexWriter only needs an AttributeSource to do indexing.
> Yet, today, it interacts with Field instances, holds a private
> analyzer, invokes analyzer.reusableTokenStream, and has to deal with a
> wide variety of cases (it's not analyzed; it is analyzed but it's a
> Reader or String; it's pre-analyzed).
> I'd like to have IW only interact with attr sources that already
> arrived with the fields.  This would be a powerful decoupling -- it
> means others are free to make their own attr sources.
> They need not even use any of Lucene's analysis impls; eg they can
> integrate to other things like [OpenPipeline|http://www.openpipeline.org].
> Or make something completely custom.
> LUCENE-2302 is already a big step towards this: it makes IW agnostic
> about which attr is "the term", and only requires that it provide a
> BytesRef (for flex).
> Then I think LUCENE-2308 would get us most of the remaining way -- ie, if the
> FieldType knows the analyzer to use, then we could simply create a
> getAttrSource() method (say) on it and move all the logic IW has today
> onto there.  (We'd still need existing IW code for back-compat).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
