[ https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12545995 ]
Grant Ingersoll commented on LUCENE-1058:
-----------------------------------------

{quote}
Maybe I'm missing something?
{quote}

No, I don't think you are missing anything in that use case; it's just an example of its use. And I am not totally sold on this approach, but mostly am :-)

I had originally considered your option, but didn't feel it was satisfactory for the case where you are extracting things like proper nouns, or where the analysis is generating a category value. The more general case is where not all the tokens are needed (in fact, very few are). In those cases, you have to go back through the whole list of cached tokens in order to extract the ones you want.

In fact, thinking some more on it, I am not sure my patch goes far enough, in the sense of: what if you want it to buffer mid-stream? For example, suppose you had:

  StandardTokenizer
  Proper Noun TF
  LowerCaseTF
  StopTF

where Proper Noun TF is solely responsible for setting aside proper nouns as it comes across them in the stream.

As for the convoluted cross-field logic, I don't think it is all that convoluted. There are only two fields, and the implementing Analyzer takes care of all of it. The only real requirement the application has is that the fields be ordered correctly.

I do agree somewhat about the pre-analysis approach, except for the case where there may be a large number of tokens in the source field, in which case you are holding them around in memory (maxFieldLength mitigates this to some extent). Also, it puts the onus on the app writer to do it, when it could be pretty straightforward for Lucene to do it w/o its usual analysis pipeline.

At any rate, separate from the CollaboratingAnalyzer, I do think the CachedTokenFilter is useful, especially in supporting the pre-analysis approach.
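To make the mid-stream case a bit more concrete, here is a rough sketch in plain Java. It is not the patch and uses no Lucene classes: TokenSource, properNounSiphon, and the other names are made up for illustration, the whitespace splitter merely stands in for StandardTokenizer, and the "starts with a capital letter" test is only a naive placeholder for real proper noun detection. The point is just that the siphoning filter sees tokens before lowercasing and stop removal, and sets aside only the few it cares about.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-ins for an analysis chain; none of these are Lucene classes.
final class SiphonSketch {

    /** Pull-style token source: returns null when the stream is exhausted. */
    interface TokenSource { String next(); }

    /** Stand-in tokenizer: splits on whitespace. */
    static TokenSource whitespace(String text) {
        Iterator<String> it = Arrays.asList(text.trim().split("\\s+")).iterator();
        return () -> it.hasNext() ? it.next() : null;
    }

    /** Passes every token through unchanged, copying capitalized ones into a side buffer. */
    static TokenSource properNounSiphon(TokenSource in, List<String> buffer) {
        return () -> {
            String t = in.next();
            // Naive placeholder heuristic (assumption): capitalized means proper noun.
            if (t != null && !t.isEmpty() && Character.isUpperCase(t.charAt(0))) {
                buffer.add(t);
            }
            return t;
        };
    }

    /** Lowercases each token. */
    static TokenSource lowerCase(TokenSource in) {
        return () -> {
            String t = in.next();
            return t == null ? null : t.toLowerCase(Locale.ROOT);
        };
    }

    /** Drops tokens found in the stop set. */
    static TokenSource stop(TokenSource in, Set<String> stopWords) {
        return () -> {
            String t = in.next();
            while (t != null && stopWords.contains(t)) {
                t = in.next();
            }
            return t;
        };
    }

    /** Consumes the whole stream into a list. */
    static List<String> drain(TokenSource in) {
        List<String> out = new ArrayList<>();
        for (String t = in.next(); t != null; t = in.next()) {
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> properNouns = new ArrayList<>();
        Set<String> stopWords = new HashSet<>(Arrays.asList("and", "the"));
        TokenSource chain = stop(
                lowerCase(
                        properNounSiphon(
                                whitespace("Grant visited Boston and the Lucene team"),
                                properNouns)),
                stopWords);
        System.out.println(drain(chain));   // tokens for the first field
        System.out.println(properNouns);    // buffered tokens for the second field
    }
}
```

Note that the side buffer is only filled once the first stream has been consumed, which is the same ordering requirement on the fields mentioned in the comment.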
> New Analyzer for buffering tokens
> ---------------------------------
>
>                 Key: LUCENE-1058
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1058
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-1058.patch, LUCENE-1058.patch, LUCENE-1058.patch, LUCENE-1058.patch, LUCENE-1058.patch
>
>
> In some cases, it would be handy to have Analyzer/Tokenizer/TokenFilters that could siphon off certain tokens and store them in a buffer to be used later in the processing pipeline.
> For example, if you want to have two fields, one lowercased and one not, but all the other analysis is the same, then you could save off the tokens to be output for a different field.
> Patch to follow, but I am still not sure about a couple of things, mostly how it plays with the new reuse API.
> See http://www.gossamer-threads.com/lists/lucene/java-dev/54397?search_string=BufferingAnalyzer;#54397