From time to time, I have run across analysis problems where I want to analyze a particular field only once, but I also want to "pluck" certain tokens (one or more) out of the stream and use them as the basis for another field. For example, say I have a token filter that can identify proper names and I also want a field that contains all the tokens. Currently, the way to do this is to analyze the content for the whole field and then reanalyze it for the proper names (essentially what Solr's copyField does). Another potential use case is having two fields, one lowercased and one not. In that case, you could do all the analysis once, have the last filter set aside the tokens before they are lowercased (or vice versa), and then, when it comes time to index the lowercased field, Lucene just needs to spit the token buffer back out.

This has always struck me as wasteful, especially given a complex analysis chain. What I am thinking of doing is injecting a TokenFilter that can buffer these tokens; that TokenFilter can then be shared by the Analyzer when it is time to analyze the other field. Obviously there are memory issues that need to be managed/documented, but I think they could be controlled by the application. For example, a given document likely doesn't contain so many proper nouns that buffering them would leave a huge memory footprint. Unless the buffering TokenFilter is also expanding the stream and adding tokens of its own, I would guess that in the worst case most use cases would consume about as much memory as the original field analysis. At any rate, some filter implementations could be designed to bound their memory and discard tokens when full, or something along those lines.
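To make the idea concrete, here is a minimal sketch of what such a buffering filter might look like, written against the attribute-based TokenStream API. The class name and the TokenPredicate hook are made up for illustration; they are not existing Lucene classes.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.util.AttributeSource;

// Hypothetical "tee" filter: passes every token through unchanged while
// setting aside a snapshot of the ones a caller-supplied predicate selects.
public final class BufferingTokenFilter extends TokenFilter {

  // Hypothetical hook for deciding which tokens to set aside; a proper-name
  // detector would inspect, say, a type attribute and answer true for names.
  public interface TokenPredicate {
    boolean keep(TokenStream stream);
  }

  private final List<AttributeSource.State> buffer = new ArrayList<>();
  private final TokenPredicate predicate;

  public BufferingTokenFilter(TokenStream input, TokenPredicate predicate) {
    super(input);
    this.predicate = predicate;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    if (predicate.keep(this)) {
      // captureState() copies all current attributes (term, offsets, etc.)
      buffer.add(captureState());
    }
    return true; // the token continues down the chain unmodified
  }

  public List<AttributeSource.State> getBuffer() {
    return buffer;
  }
}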

The CachingTokenFilter sort of does this, but it doesn't allow for modifications and always gives you the same tokens back. It also seems like the new Field.tokenStreamValue() and the TokenStream-based Field constructor might help, but then you have the whole construction problem. I suppose you could "pre-analyze" the content and then build both Fields from that pre-analysis.
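For reference, the pre-analysis route would look roughly like this. This is only a sketch against the current API, with TextField standing in for the TokenStream-based Field constructor, and the field names are made up:

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CachingTokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.TextField;

public class PreAnalyzeExample {
  // Analyze the text once and hand the resulting stream straight to a Field;
  // the indexer consumes the tokens without running an Analyzer again.
  static Document preAnalyze(String text) throws IOException {
    Analyzer analyzer = new StandardAnalyzer();
    TokenStream cached =
        new CachingTokenFilter(analyzer.tokenStream("body", new StringReader(text)));
    Document doc = new Document();
    doc.add(new TextField("body", cached));
    // A second field built from the same cache would get back exactly the
    // same tokens, which is the limitation described above.
    return doc;
  }
}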

I currently have two different approaches to this. The first is a CachedAnalyzer and CachedTokenizer implementation that takes in a List of tokens. The other is an abstract Analyzer that coordinates the handoff of the buffer created by the first TokenStream and gives it to the second. The first requires that you do the looping over the TokenStream in the application, outside of Lucene; the latter lets Lucene do what it normally does. A sketch of the replay half is below.
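For the replay side, here is a minimal sketch of what the CachedTokenizer half might look like. The class name is hypothetical, and it assumes a buffer of captured attribute states like the one built by the filter sketched earlier:

import java.io.IOException;
import java.util.List;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.util.AttributeSource;

// Hypothetical replay stream: constructed from the buffer built during the
// first field's analysis, it simply spits the captured tokens back out when
// the second field is indexed.
public final class BufferedTokenStream extends TokenStream {

  private final List<AttributeSource.State> states;
  private int upto = 0;

  // Sharing the AttributeSource of the original chain guarantees that
  // restoreState() sees the same attribute implementations it captured.
  public BufferedTokenStream(AttributeSource source, List<AttributeSource.State> states) {
    super(source);
    this.states = states;
  }

  @Override
  public boolean incrementToken() {
    if (upto == states.size()) {
      return false;
    }
    restoreState(states.get(upto++));
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    upto = 0;
  }
}

The second field would then be built from this stream, e.g. new TextField("names", new BufferedTokenStream(tee, tee.getBuffer())), so indexing it never has to touch the original analysis chain again.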

Anyone have any thoughts on this? Is this useful (i.e. should I add it in)?

-Grant
