https://issues.apache.org/jira/browse/LUCENE-1058
On Nov 8, 2007, at 7:14 AM, Mark Miller wrote:
I think it is certainly useful, as I use something similar myself. My
implementation is not as generic as I would like (it requires a
special analyzer written for the task), but it works great for my
case. I use a CachingTokenFilter as well as a couple of ThreadLocals
so that I can have a stemmed and a non-stemmed index without having
to analyze twice. It saves me plenty in my benchmarks. A generic
solution would be awesome.
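A rough sketch of the pattern, in case it helps picture it. All the
class and field names here are invented, it uses the old
TokenStream.next() API, and it assumes the plain field is indexed
before the stemmed one so the buffer is already full when it is
replayed:

import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

public class StemmedAndPlainAnalyzer extends Analyzer {

  // Per-thread buffer so concurrent indexing threads don't step on
  // each other's tokens.
  private final ThreadLocal<List<Token>> cache = new ThreadLocal<List<Token>>();

  public TokenStream tokenStream(String fieldName, Reader reader) {
    if ("bodyStemmed".equals(fieldName)) {
      // Second field: replay the buffered tokens and stem them, so
      // the raw text is only tokenized once.
      final Iterator<Token> iter = cache.get().iterator();
      return new PorterStemFilter(new TokenStream() {
        public Token next() {
          return iter.hasNext() ? iter.next() : null;
        }
      });
    }
    // First field ("body"): tokenize normally, setting a copy of
    // each token aside as it streams past.
    final List<Token> buffer = new ArrayList<Token>();
    cache.set(buffer);
    return new TokenFilter(new WhitespaceTokenizer(reader)) {
      public Token next() throws IOException {
        Token t = input.next();
        if (t != null) {
          // Copy, since later filters may modify tokens in place.
          buffer.add(new Token(t.termText(), t.startOffset(),
              t.endOffset(), t.type()));
        }
        return t;
      }
    };
  }
}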
- Mark
Grant Ingersoll wrote:
From time to time, I have run across analysis problems where I want
to analyze a particular field only once, but I also want to "pluck"
certain tokens (one or more) out of the stream and then use them as
the basis for another field. For example, say I have a token
filter that can identify proper names, and I also want a field that
contains all the tokens. Currently, the way to do this is to
analyze your content for the whole field and then reanalyze the
field for the proper names, which is essentially what Solr's
copyField does. Another potential use case is when there are two
fields, one that is lowercased and one that isn't. In this case,
you could do all the analysis once, have the last filter set aside
the tokens before they are lowercased (or vice versa), and then,
when it comes time to index the lowercased field, Lucene just needs
to spit the token buffer back out.
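To make the first use case concrete, the "plucking" filter might
look something like this (the class name and the proper-name test
are invented placeholders):

import java.io.IOException;
import java.util.List;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Passes every token through untouched, but copies the ones that
// look like proper names into a side buffer that a second field can
// later be built from.
public class PluckingFilter extends TokenFilter {
  private final List<Token> plucked;

  public PluckingFilter(TokenStream input, List<Token> plucked) {
    super(input);
    this.plucked = plucked;
  }

  public Token next() throws IOException {
    Token t = input.next();
    if (t != null && isProperName(t)) {
      // Copy the token, since downstream filters may modify it in place.
      plucked.add(new Token(t.termText(), t.startOffset(),
          t.endOffset(), t.type()));
    }
    return t;
  }

  private boolean isProperName(Token t) {
    // Placeholder heuristic only; a real implementation would do more.
    String text = t.termText();
    return text.length() > 0 && Character.isUpperCase(text.charAt(0));
  }
}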
This has always struck me as wasteful, especially given a complex
analysis chain. What I am thinking of doing is injecting a
TokenFilter that can buffer these tokens; that TokenFilter can then
be shared by the Analyzer when it is time to analyze the other
field. Obviously, there are memory issues that need to be managed
and documented, but I think they could be controlled by the
application. For example, there likely aren't enough proper nouns
in a given document to create a huge memory footprint. Unless the
"filtering" TokenFilter is also expanding the stream and adding
other tokens, I would guess that most use cases would, in the worst
case, use as much memory as the original field analysis. At any
rate, some filter implementations could be designed to control
memory and discard tokens when full, or something along those lines.
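For instance, the buffering filter might take a cap and simply stop
recording once it fills up (again invented names; a sketch of the
idea, not a final design):

import java.io.IOException;
import java.util.List;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Copies tokens into the shared buffer until maxBufferedTokens is
// reached; after that they just flow through, so memory stays bounded.
public class BufferingTokenFilter extends TokenFilter {
  private final List<Token> buffer;
  private final int maxBufferedTokens;

  public BufferingTokenFilter(TokenStream input, List<Token> buffer,
      int maxBufferedTokens) {
    super(input);
    this.buffer = buffer;
    this.maxBufferedTokens = maxBufferedTokens;
  }

  public Token next() throws IOException {
    Token t = input.next();
    if (t != null && buffer.size() < maxBufferedTokens) {
      // Copy, so in-place modifications downstream don't leak into
      // the buffer.
      buffer.add(new Token(t.termText(), t.startOffset(),
          t.endOffset(), t.type()));
    }
    return t;
  }
}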
The CachingTokenFilter kind of does this, but it doesn't allow for
modifications and always gives you those same tokens back. It also
seems like the new Field.tokenStreamValue() and the TokenStream-based
Field constructor might help, but then you have the whole
construction problem. I suppose you could "pre-analyze" the content
and then make both Fields based on that pre-analysis.
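A sketch of that pre-analysis route, using the TokenStream-based
Field constructor (replay() is an invented helper that serves up
copies of the buffered tokens):

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class PreAnalyzeExample {

  public static Document buildDoc(Analyzer analyzer, String text)
      throws IOException {
    // Run the analysis chain exactly once, with the application
    // (not Lucene) driving the loop.
    List<Token> tokens = new ArrayList<Token>();
    TokenStream ts = analyzer.tokenStream("body", new StringReader(text));
    for (Token t = ts.next(); t != null; t = ts.next()) {
      tokens.add(t);
    }
    // Build both fields from the same buffer using the
    // TokenStream-based Field constructor; no second pass over the
    // raw text.
    Document doc = new Document();
    doc.add(new Field("body", replay(tokens)));
    doc.add(new Field("bodyLowered", new LowerCaseFilter(replay(tokens))));
    return doc;
  }

  // Serves copies of the buffered tokens, so in-place filters such
  // as LowerCaseFilter cannot corrupt the shared buffer.
  static TokenStream replay(List<Token> tokens) {
    final Iterator<Token> iter = tokens.iterator();
    return new TokenStream() {
      public Token next() {
        if (!iter.hasNext()) return null;
        Token t = iter.next();
        Token copy = new Token(t.termText(), t.startOffset(),
            t.endOffset(), t.type());
        copy.setPositionIncrement(t.getPositionIncrement());
        return copy;
      }
    };
  }
}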
I currently have two different approaches to this. The first is a
CachedAnalyzer and CachedTokenizer implementation that takes in a
List of tokens. The other is an abstract Analyzer that coordinates
the handoff of the buffer created by the first TokenStream and
gives it to the second. The former requires that you do the looping
over the TokenStream in the application, outside of Lucene; the
latter lets Lucene do what it normally does.
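The former looks roughly like this (the shape of the idea, not the
actual code):

import java.util.Iterator;
import java.util.List;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

// Replays a List of tokens that the application built by looping
// over the real analysis chain. A matching CachedAnalyzer would
// simply hand one of these back from tokenStream().
public class CachedTokenizer extends TokenStream {
  private final Iterator<Token> iter;

  public CachedTokenizer(List<Token> tokens) {
    this.iter = tokens.iterator();
  }

  public Token next() {
    return iter.hasNext() ? iter.next() : null;
  }
}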
Anyone have any thoughts on this? Is this useful (i.e. should I
add it in)?
-Grant
--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]