From time to time, I have run across analysis problems where I want to analyze a particular field only once, but I also want to "pluck" certain tokens (one or more) out of the stream and use them as the basis for another field. For example, say I have a token filter that can identify proper names and I also want a field that contains all the tokens. Currently, the way to do this is to analyze the content for the whole field and then reanalyze it for the proper names (essentially what Solr's copyField does). Another potential use case is having two fields, one lowercased and one not. In that case, you could do all the analysis once, have the last filter set aside the tokens before they are lowercased (or vice versa), and then, when it comes time to index the lowercased field, Lucene just needs to spit the token buffer back out.

This has always struck me as wasteful, especially given a complex analysis chain. What I am thinking of doing is injecting a TokenFilter that can buffer these tokens; that TokenFilter can then be shared by the Analyzer when it is time to analyze the other field. Obviously there are memory issues that need to be managed/documented, but I think they could be controlled by the application. For example, a given document likely doesn't contain so many proper nouns that buffering them would leave a huge memory footprint. Unless the buffering TokenFilter is also expanding the stream and adding tokens of its own, I would guess that in the worst case most use cases would consume about as much memory as the original field analysis. At any rate, some filter implementations could be designed to bound their memory and discard tokens when full, or something along those lines.
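To make the idea concrete, here is a minimal sketch of what such a buffering filter might look like, written against the attribute-based TokenStream API. The class name and the TokenPredicate hook are made up for illustration; they are not existing Lucene classes.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.util.AttributeSource;

// Hypothetical "tee" filter: passes every token through unchanged while
// setting aside a snapshot of the ones a caller-supplied predicate selects.
public final class BufferingTokenFilter extends TokenFilter {

  // Hypothetical hook for deciding which tokens to set aside; a proper-name
  // detector would inspect, say, a type attribute and answer true for names.
  public interface TokenPredicate {
    boolean keep(TokenStream stream);
  }

  private final List<AttributeSource.State> buffer = new ArrayList<>();
  private final TokenPredicate predicate;

  public BufferingTokenFilter(TokenStream input, TokenPredicate predicate) {
    super(input);
    this.predicate = predicate;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    if (predicate.keep(this)) {
      // captureState() copies all current attributes (term, offsets, etc.)
      buffer.add(captureState());
    }
    return true; // the token continues down the chain unmodified
  }

  public List<AttributeSource.State> getBuffer() {
    return buffer;
  }
}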

The CachingTokenFilter sort of does this, but it doesn't allow for modifications and always gives you the same tokens back. It also seems like the new Field.tokenStreamValue() and the TokenStream-based Field constructor might help, but then you have the whole construction problem. I suppose you could "pre-analyze" the content and then build both Fields from that pre-analysis.
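For reference, the pre-analysis route would look roughly like this. This is only a sketch against the current API, with TextField standing in for the TokenStream-based Field constructor, and the field names are made up:

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CachingTokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.TextField;

public class PreAnalyzeExample {
  // Analyze the text once and hand the resulting stream straight to a Field;
  // the indexer consumes the tokens without running an Analyzer again.
  static Document preAnalyze(String text) throws IOException {
    Analyzer analyzer = new StandardAnalyzer();
    TokenStream cached =
        new CachingTokenFilter(analyzer.tokenStream("body", new StringReader(text)));
    Document doc = new Document();
    doc.add(new TextField("body", cached));
    // A second field built from the same cache would get back exactly the
    // same tokens, which is the limitation described above.
    return doc;
  }
}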

I currently have two different approaches to this. The first is a CachedAnalyzer and CachedTokenizer implementation that takes in a List of tokens. The other is an abstract Analyzer that coordinates the handoff of the buffer created by the first TokenStream and gives it to the second. The first requires that you do the looping over the TokenStream in the application, outside of Lucene; the latter lets Lucene do what it normally does. A sketch of the replay half is below.
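For the replay side, here is a minimal sketch of what the CachedTokenizer half might look like. The class name is hypothetical, and it assumes a buffer of captured attribute states like the one built by the filter sketched earlier:

import java.io.IOException;
import java.util.List;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.util.AttributeSource;

// Hypothetical replay stream: constructed from the buffer built during the
// first field's analysis, it simply spits the captured tokens back out when
// the second field is indexed.
public final class BufferedTokenStream extends TokenStream {

  private final List<AttributeSource.State> states;
  private int upto = 0;

  // Sharing the AttributeSource of the original chain guarantees that
  // restoreState() sees the same attribute implementations it captured.
  public BufferedTokenStream(AttributeSource source, List<AttributeSource.State> states) {
    super(source);
    this.states = states;
  }

  @Override
  public boolean incrementToken() {
    if (upto == states.size()) {
      return false;
    }
    restoreState(states.get(upto++));
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    upto = 0;
  }
}

The second field would then be built from this stream, e.g. new TextField("names", new BufferedTokenStream(tee, tee.getBuffer())), so indexing it never has to touch the original analysis chain again.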

Anyone have any thoughts on this? Is this useful (i.e. should I add it in)?

-Grant
