[jira] Commented: (LUCENE-1058) New Analyzer for buffering tokens

Grant Ingersoll (JIRA) Tue, 27 Nov 2007 13:05:14 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546002 ]


Grant Ingersoll commented on LUCENE-1058:
-----------------------------------------

{quote}
What if they wanted 3 fields instead of two?
{quote}
True.  I'll have to think about a more generic approach.  In some sense, I 
think two fields are often sufficient, but you are right that it isn't fully 
generic in the spirit of Lucene.  

To some extent, I was thinking that this could help optimize Solr's copyField 
mechanism.  In Solr's case, I think you often have copy fields that have 
marginal differences in the filters that are applied.  It would be useful for 
Solr to be able to optimize these so that it doesn't have to go through the 
whole analysis chain again.
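
As a rough illustration of the copyField idea, here is a minimal sketch in 
plain Java.  It does not use Solr's or Lucene's API; SharedChainSketch, 
sharedChain and applyTail are hypothetical names, and the whitespace 
tokenizer is only a stand-in for the common part of the analysis.  The shared 
part runs once, and each field then applies only its own differing filter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

// Hypothetical sketch: not Solr's copyField code and not Lucene's analysis API.
final class SharedChainSketch {

    // Stand-in for the part of the analysis chain both fields have in common:
    // split on whitespace and strip trailing punctuation.
    static List<String> sharedChain(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            String token = raw.replaceAll("\\p{Punct}+$", "");
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Apply the one filter that differs between the fields to already-analyzed tokens.
    static List<String> applyTail(List<String> tokens, UnaryOperator<String> tail) {
        List<String> out = new ArrayList<>(tokens.size());
        for (String token : tokens) {
            out.add(tail.apply(token));
        }
        return out;
    }

    public static void main(String[] args) {
        // The shared chain runs once ...
        List<String> shared = sharedChain("Lucene and Solr.");
        // ... and each field only pays for its own marginal difference.
        List<String> exactField = applyTail(shared, UnaryOperator.identity());
        List<String> lowerField = applyTail(shared, s -> s.toLowerCase(Locale.ROOT));
        System.out.println(exactField);
        System.out.println(lowerField);
    }
}
```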

{quote}
Isn't this what your current code does?
{quote}
No.  In my main use case (# of buffered tokens is << # of source tokens), the 
only tokens kept around are the (much) smaller subset of buffered tokens.  In 
the pre-analysis approach, you have to keep both the source field tokens and 
the buffered tokens, and you also add work, because Lucene then has to iterate 
over the cached token lists.  So you pay for the analysis in your application 
plus the storage of both token lists (likely one large, one small), and then 
in Lucene you pay for iterating over two lists.  In my approach, I think, you 
pay for the analysis plus the storage of one (small) list of tokens and the 
cost of iterating over that list.
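
The single-pass approach described above can be sketched roughly as follows.  
This is plain Java over iterators, not the attached patch and not Lucene's 
TokenStream API; SiphoningIterator, shouldBuffer and buffered() are 
hypothetical names chosen for illustration.  Every source token passes through 
once, and only the tokens matching the predicate are retained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-in for a buffering token filter; not Lucene's API.
final class SiphoningIterator implements Iterator<String> {
    private final Iterator<String> source;
    private final Predicate<String> shouldBuffer;
    // Only the siphoned subset is stored; source tokens are never cached.
    private final List<String> buffer = new ArrayList<>();

    SiphoningIterator(Iterator<String> source, Predicate<String> shouldBuffer) {
        this.source = source;
        this.shouldBuffer = shouldBuffer;
    }

    @Override
    public boolean hasNext() {
        return source.hasNext();
    }

    @Override
    public String next() {
        String token = source.next();
        if (shouldBuffer.test(token)) {
            buffer.add(token); // keep a copy for the second field
        }
        return token; // the primary field still sees every token
    }

    // Tokens siphoned off so far, for use by a later field.
    List<String> buffered() {
        return buffer;
    }
}

final class SiphonDemo {
    public static void main(String[] args) {
        List<String> sourceTokens = Arrays.asList("Lucene", "is", "an", "Apache", "project");
        // Illustrative rule: siphon off tokens that start with an uppercase letter.
        SiphoningIterator stream = new SiphoningIterator(
                sourceTokens.iterator(), t -> Character.isUpperCase(t.charAt(0)));
        while (stream.hasNext()) {
            stream.next(); // stands in for indexing the primary field
        }
        System.out.println(stream.buffered());
    }
}
```

After the primary field has been consumed, the storage held is proportional to 
the number of buffered tokens rather than to the number of source tokens.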

> New Analyzer for buffering tokens
> ---------------------------------
>
>                 Key: LUCENE-1058
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1058
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-1058.patch, LUCENE-1058.patch, LUCENE-1058.patch, 
> LUCENE-1058.patch, LUCENE-1058.patch
>
>
> In some cases, it would be handy to have Analyzer/Tokenizer/TokenFilters that 
> could siphon off certain tokens and store them in a buffer to be used later 
> in the processing pipeline.
> For example, if you want to have two fields, one lowercased and one not, but 
> all the other analysis is the same, then you could save off the tokens to be 
> output for a different field.
> Patch to follow, but I am still not sure about a couple of things, mostly how 
> it plays with the new reuse API.
> See 
> http://www.gossamer-threads.com/lists/lucene/java-dev/54397?search_string=BufferingAnalyzer;#54397

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
