I think it is certainly useful, as I use something similar myself. My
implementation is not as generic as I would like (it requires a special
analyzer written specifically for the task), but it works great for my
case. I use a CachingTokenFilter as well as a couple of ThreadLocals so
that I can have a stemmed and a non-stemmed index without having to
analyze twice. It saves me plenty of time in my benchmarks. A generic
solution would be awesome.
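For what it's worth, the replay half of that trick looks roughly like
this. This is a minimal sketch, not the actual implementation: it is
written against Lucene's much later attribute-based API (5.x era) for
concreteness, and the field name and sample text are made up:

    import java.io.IOException;

    import org.apache.lucene.analysis.CachingTokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.en.PorterStemFilter;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class OnePassTwoStreams {
      public static void main(String[] args) throws IOException {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        // Tokenize exactly once; CachingTokenFilter records the tokens and
        // replays them from memory on every subsequent reset().
        CachingTokenFilter cached = new CachingTokenFilter(
            analyzer.tokenStream("body", "Walking the walked walker"));

        CharTermAttribute term = cached.addAttribute(CharTermAttribute.class);

        // First pass: the unstemmed tokens (this fills the cache).
        cached.reset();
        while (cached.incrementToken()) {
          System.out.println("plain: " + term);
        }

        // Second pass: the same cached tokens run through a stemmer; the
        // tokenizer never sees the text a second time.
        TokenStream stemmed = new PorterStemFilter(cached);
        stemmed.reset(); // rewinds the cache instead of re-reading the input
        while (stemmed.incrementToken()) {
          System.out.println("stemmed: " + term);
        }
        stemmed.close();
        analyzer.close();
      }
    }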
- Mark
Grant Ingersoll wrote:
From time to time, I have run across analysis problems where I want to
only analyze a particular field once, but I also want to "pluck"
certain tokens (one or more) out of the stream and then use them as
the basis for another field. For example, say I have a token filter
that can identify proper names and I also want a field that contains
all the tokens. Currently, the way to do this is to analyze the content
for the whole field and then reanalyze it for the proper names,
essentially doing what Solr's copyField does. Another potential use case
is when there are two fields, one that is lowercased and one that isn't.
In this case, you could do all the analysis once, have the last filter
set aside the tokens before they are lowercased (or vice versa), and
then, when it comes time to index the lowercased field, Lucene just
needs to spit the token buffer back out.
This has always struck me as wasteful, especially given a complex
analysis stream. What I am thinking of doing is injecting a TokenFilter
that can buffer these tokens; that TokenFilter can then be shared by the
Analyzer when it is time to analyze the other field. Obviously, there
are memory issues that need to be managed/documented, but I think they
could be controlled by the application. For example, there likely aren't
enough proper nouns in a given document to make for a huge memory
footprint. Unless the "filtering" TokenFilter is also expanding the
stream and adding other tokens, I would guess most use cases would, in
the worst case, use as much memory as the original field analysis. At
any rate, some filter implementations could be designed to control
memory and discard tokens when full, or something like that.
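A sketch of what such a buffering filter could look like, written
against Lucene's later attribute-based API (the class name and the
proper-noun test are stand-ins; a real filter would plug in whatever
identification logic the use case needs). Lucene itself later shipped
TeeSinkTokenFilter, which works along these lines:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.AttributeSource;

    public final class PluckingTokenFilter extends TokenFilter {
      private final List<AttributeSource.State> plucked = new ArrayList<>();
      private final CharTermAttribute term = addAttribute(CharTermAttribute.class);

      public PluckingTokenFilter(TokenStream input) {
        super(input);
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
          return false;
        }
        if (looksLikeProperName()) {
          // Set aside a full copy of the token's attribute state so
          // another field can be built from it later.
          plucked.add(captureState());
        }
        return true; // every token still flows through to the primary field
      }

      // Stand-in heuristic; a real implementation would do actual
      // proper-name detection here.
      private boolean looksLikeProperName() {
        return term.length() > 0 && Character.isUpperCase(term.charAt(0));
      }

      public List<AttributeSource.State> getPlucked() {
        return plucked;
      }
    }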
The CachingTokenFilter kind of does this, but it doesn't allow for
modifications and always gives you the same tokens back. It also seems
like the new Field.tokenStreamValue() and the TokenStream-based Field
constructor might help, but then you have the whole construction
problem. I suppose you could "pre-analyze" the content and then make
both Fields based on that pre-analysis.
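To make the buffered tokens usable as a Field again, the captured states
have to be turned back into a TokenStream. One hedged sketch of doing
that, again in later-API terms (BufferedTokenStream is hypothetical;
cloning the source's attributes keeps restoreState() compatible with the
states the buffering filter captured):

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.List;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.util.AttributeSource;

    public final class BufferedTokenStream extends TokenStream {
      private final List<AttributeSource.State> states;
      private Iterator<AttributeSource.State> it;

      public BufferedTokenStream(AttributeSource source,
                                 List<AttributeSource.State> states) {
        super(source.cloneAttributes()); // same attribute impls as the filter
        this.states = states;
      }

      @Override
      public boolean incrementToken() {
        if (it == null || !it.hasNext()) {
          return false;
        }
        restoreState(it.next()); // replay one buffered token
        return true;
      }

      @Override
      public void reset() throws IOException {
        super.reset();
        it = states.iterator();
      }
    }

With the plucking filter above, the second field would then be built
roughly as:

    doc.add(new TextField("proper_names",
        new BufferedTokenStream(plucker, plucker.getPlucked())));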
I currently have two different approaches to this. The first is a
CachedAnalyzer and CachedTokenizer implementation that takes in a List
of tokens. The other is an abstract Analyzer that coordinates the
handoff of the buffer created by the first TokenStream and gives it to
the second. The first requires that you do the looping over the
TokenStream in the application, outside of Lucene; the latter lets
Lucene do what it normally does.
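For the first approach, the application-side loop presumably looks
something like the following (a hedged sketch in later-API idioms;
PreAnalyze and collectTerms are made-up names):

    import java.io.IOException;
    import java.io.StringReader;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class PreAnalyze {
      // Consume the analysis chain once, outside of Lucene's indexing
      // loop, and collect the terms; the resulting List is what a
      // CachedAnalyzer/CachedTokenizer would later replay for the
      // second field.
      public static List<String> collectTerms(Analyzer analyzer,
          String field, String text) throws IOException {
        List<String> terms = new ArrayList<>();
        try (TokenStream ts = analyzer.tokenStream(field, new StringReader(text))) {
          CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
          ts.reset();
          while (ts.incrementToken()) {
            terms.add(term.toString());
          }
          ts.end();
        }
        return terms;
      }
    }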
Anyone have any thoughts on this? Is this useful (i.e. should I add
it in)?
-Grant