[jira] Commented: (LUCENE-1058) New Analyzer for buffering tokens

Grant Ingersoll (JIRA) Tue, 27 Nov 2007 12:33:17 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12545995
 ]


Grant Ingersoll commented on LUCENE-1058:
-----------------------------------------

{quote}
Maybe I'm missing something?
{quote}

No, I don't think you are missing anything in that use case; it's just an 
example of its use.  I am not totally sold on this approach, but I mostly am 
:-) 

I had originally considered your option, but didn't feel it was satisfactory 
for the case where you are extracting things like proper nouns or generating 
a category value.  The more general case is where not all the tokens are 
needed (in fact, very few are).  In those cases, you have to go back through 
the whole list of cached tokens in order to extract the ones you want.  In 
fact, thinking some more on it, I am not sure my patch goes far enough: what 
if you want it to buffer in mid-stream?

For example, if you had:
StandardTokenizer
Proper Noun TF
LowerCaseTF
StopTF

Here, Proper Noun TF is solely responsible for setting aside proper nouns as 
it comes across them in the stream.
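
A rough sketch of that chain, in plain Java rather than against the Lucene 
API, might look like the following.  Everything here is an assumption made for 
illustration: the class names, the whitespace split standing in for 
StandardTokenizer, the tiny stop list, and the "leading capital means proper 
noun" rule are all hypothetical and are not taken from the patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java stand-in for a token pipeline; none of these names are Lucene API.
class SiphonSketch {

    // Hypothetical mid-stream filter: passes every token through unchanged and
    // copies the ones that look like proper nouns into a side buffer.
    static final class ProperNounSiphon implements Iterator<String> {
        private final Iterator<String> input;
        private final List<String> siphoned = new ArrayList<>();

        ProperNounSiphon(Iterator<String> input) {
            this.input = input;
        }

        @Override
        public boolean hasNext() {
            return input.hasNext();
        }

        @Override
        public String next() {
            String token = input.next();
            // Naive assumption: a leading capital marks a proper noun.
            if (!token.isEmpty() && Character.isUpperCase(token.charAt(0))) {
                siphoned.add(token);
            }
            return token;
        }

        List<String> siphoned() {
            return siphoned;
        }
    }

    // Holds the tokens for the main field plus the tokens set aside on the way.
    static final class Result {
        final List<String> mainField;
        final List<String> properNouns;

        Result(List<String> mainField, List<String> properNouns) {
            this.mainField = mainField;
            this.properNouns = properNouns;
        }
    }

    static Result analyze(String text) {
        // Stage 1: a whitespace split stands in for StandardTokenizer.
        Iterator<String> tokens = Arrays.asList(text.trim().split("\\s+")).iterator();
        // Stage 2: the siphon sits before lowercasing so it still sees the original case.
        ProperNounSiphon siphon = new ProperNounSiphon(tokens);
        // Stages 3 and 4: lowercase, then drop stop words (tiny illustrative list).
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "in", "a", "of"));
        List<String> mainField = new ArrayList<>();
        while (siphon.hasNext()) {
            String lower = siphon.next().toLowerCase(Locale.ROOT);
            if (lower.isEmpty() || stopWords.contains(lower)) {
                continue;
            }
            mainField.add(lower);
        }
        return new Result(mainField, siphon.siphoned());
    }

    public static void main(String[] args) {
        Result r = analyze("the committers met in Boston with Grant");
        System.out.println(r.mainField);   // [committers, met, boston, with, grant]
        System.out.println(r.properNouns); // [Boston, Grant]
    }
}
```

The point of the sketch is only the placement: the siphon collects the few 
tokens it cares about in a single pass, so nothing has to walk back through 
the full list of cached tokens afterwards.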

As for the convoluted cross-field logic, I don't think it is all that 
convoluted.  There are only two fields and the implementing Analyzer takes care 
of all of it.  The only real requirement the application has is that the 
fields be ordered correctly.  
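
To make the ordering requirement concrete, here is a small plain-Java sketch.  
It is not the CollaboratingAnalyzer from the patch and it does not use the 
Lucene API; the class, the field names "body" and "names", and the 
capitalization rule are hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Plain-Java illustration only; not the patch's CollaboratingAnalyzer or any Lucene API.
class TwoFieldSketch {
    private final String sourceField;
    private final String sinkField;
    // Tokens set aside while the source field is analyzed.
    private final List<String> buffer = new ArrayList<>();

    TwoFieldSketch(String sourceField, String sinkField) {
        this.sourceField = sourceField;
        this.sinkField = sinkField;
    }

    // Returns the tokens for one field of one document.
    List<String> tokens(String fieldName, String text) {
        if (fieldName.equals(sourceField)) {
            buffer.clear();
            List<String> out = new ArrayList<>();
            for (String t : text.trim().split("\\s+")) {
                if (t.isEmpty()) {
                    continue;
                }
                out.add(t.toLowerCase(Locale.ROOT));
                // Naive assumption: a leading capital marks a token worth setting aside.
                if (Character.isUpperCase(t.charAt(0))) {
                    buffer.add(t);
                }
            }
            return out;
        }
        if (fieldName.equals(sinkField)) {
            // The sink ignores its own text and replays whatever the source set aside.
            List<String> out = new ArrayList<>(buffer);
            buffer.clear();
            return out;
        }
        throw new IllegalArgumentException("unknown field: " + fieldName);
    }
}
```

If the application analyzes "body" first and "names" second, the second call 
returns the buffered tokens.  If it asks for "names" first, the buffer is 
still empty and the field gets nothing, which is the whole of the ordering 
constraint.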

I do agree somewhat about the pre-analysis approach, except for the case where 
there may be a large number of tokens in the source field, in which case you 
are holding them around in memory (maxFieldLength mitigates this to some 
extent).  Also, it puts the onus on the application writer to do it, when it 
could be pretty straightforward for Lucene to do it w/o its usual analysis 
pipeline.

At any rate, separate from the CollaboratingAnalyzer, I do think the 
CachedTokenFilter is useful, especially in supporting the pre-analysis approach.



> New Analyzer for buffering tokens
> ---------------------------------
>
>                 Key: LUCENE-1058
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1058
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-1058.patch, LUCENE-1058.patch, LUCENE-1058.patch, 
> LUCENE-1058.patch, LUCENE-1058.patch
>
>
> In some cases, it would be handy to have Analyzer/Tokenizer/TokenFilters that 
> could siphon off certain tokens and store them in a buffer to be used later 
> in the processing pipeline.
> For example, if you want to have two fields, one lowercased and one not, but 
> all the other analysis is the same, then you could save off the tokens to be 
> output for a different field.
> Patch to follow, but I am still not sure about a couple of things, mostly how 
> it plays with the new reuse API.
> See 
> http://www.gossamer-threads.com/lists/lucene/java-dev/54397?search_string=BufferingAnalyzer;#54397


