https://issues.apache.org/jira/browse/LUCENE-1058
On Nov 8, 2007, at 7:14 AM, Mark Miller wrote:
I think it is certainly useful, as I use something similar myself. My
implementation is not as generic as I would like (it requires a
special analyzer written for the task), but it works great for my
case. I use a CachingTokenFilter as well as a couple of ThreadLocals
so that I can have a stemmed and a non-stemmed index without having
to analyze twice. It saves me plenty in my benchmarks. A generic
solution would be awesome.
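A rough sketch of the pattern, in case it helps picture it. All the
class and field names here are invented, it uses the old
TokenStream.next() API, and it assumes the plain field is indexed
before the stemmed one so the buffer is already full when it is
replayed:

import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

public class StemmedAndPlainAnalyzer extends Analyzer {

  // Per-thread buffer so concurrent indexing threads don't step on
  // each other's tokens.
  private final ThreadLocal<List<Token>> cache = new ThreadLocal<List<Token>>();

  public TokenStream tokenStream(String fieldName, Reader reader) {
    if ("bodyStemmed".equals(fieldName)) {
      // Second field: replay the buffered tokens and stem them, so
      // the raw text is only tokenized once.
      final Iterator<Token> iter = cache.get().iterator();
      return new PorterStemFilter(new TokenStream() {
        public Token next() {
          return iter.hasNext() ? iter.next() : null;
        }
      });
    }
    // First field ("body"): tokenize normally, setting a copy of
    // each token aside as it streams past.
    final List<Token> buffer = new ArrayList<Token>();
    cache.set(buffer);
    return new TokenFilter(new WhitespaceTokenizer(reader)) {
      public Token next() throws IOException {
        Token t = input.next();
        if (t != null) {
          // Copy, since later filters may modify tokens in place.
          buffer.add(new Token(t.termText(), t.startOffset(),
              t.endOffset(), t.type()));
        }
        return t;
      }
    };
  }
}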
- Mark
Grant Ingersoll wrote:
From time to time, I have run across analysis problems where I want
to analyze a particular field only once, but I also want to "pluck"
certain tokens (one or more) out of the stream and then use them as
the basis for another field. For example, say I have a token
filter that can identify proper names, and I also want a field that
contains all the tokens. Currently, the way to do this is to
analyze your content for the whole field and then reanalyze the
field for the proper names, which is essentially what Solr's
copyField does. Another potential use case is when there are two
fields, one that is lowercased and one that isn't. In this case,
you could do all the analysis once, have the last filter set aside
the tokens before they are lowercased (or vice versa), and then,
when it comes time to index the lowercased field, Lucene just needs
to spit the token buffer back out.
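To make the first use case concrete, the "plucking" filter might
look something like this (the class name and the proper-name test
are invented placeholders):

import java.io.IOException;
import java.util.List;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Passes every token through untouched, but copies the ones that
// look like proper names into a side buffer that a second field can
// later be built from.
public class PluckingFilter extends TokenFilter {
  private final List<Token> plucked;

  public PluckingFilter(TokenStream input, List<Token> plucked) {
    super(input);
    this.plucked = plucked;
  }

  public Token next() throws IOException {
    Token t = input.next();
    if (t != null && isProperName(t)) {
      // Copy the token, since downstream filters may modify it in place.
      plucked.add(new Token(t.termText(), t.startOffset(),
          t.endOffset(), t.type()));
    }
    return t;
  }

  private boolean isProperName(Token t) {
    // Placeholder heuristic only; a real implementation would do more.
    String text = t.termText();
    return text.length() > 0 && Character.isUpperCase(text.charAt(0));
  }
}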
This has always struck me as wasteful, especially given a complex
analysis chain. What I am thinking of doing is injecting a
TokenFilter that can buffer these tokens; that TokenFilter can then
be shared by the Analyzer when it is time to analyze the other
field. Obviously, there are memory issues that need to be managed
and documented, but I think they could be controlled by the
application. For example, there likely aren't enough proper nouns
in a given document to create a huge memory footprint. Unless the
"filtering" TokenFilter is also expanding the stream and adding
other tokens, I would guess that most use cases would, in the worst
case, use as much memory as the original field analysis. At any
rate, some filter implementations could be designed to control
memory and discard tokens when full, or something along those lines.
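For instance, the buffering filter might take a cap and simply stop
recording once it fills up (again invented names; a sketch of the
idea, not a final design):

import java.io.IOException;
import java.util.List;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Copies tokens into the shared buffer until maxBufferedTokens is
// reached; after that they just flow through, so memory stays bounded.
public class BufferingTokenFilter extends TokenFilter {
  private final List<Token> buffer;
  private final int maxBufferedTokens;

  public BufferingTokenFilter(TokenStream input, List<Token> buffer,
      int maxBufferedTokens) {
    super(input);
    this.buffer = buffer;
    this.maxBufferedTokens = maxBufferedTokens;
  }

  public Token next() throws IOException {
    Token t = input.next();
    if (t != null && buffer.size() < maxBufferedTokens) {
      // Copy, so in-place modifications downstream don't leak into
      // the buffer.
      buffer.add(new Token(t.termText(), t.startOffset(),
          t.endOffset(), t.type()));
    }
    return t;
  }
}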
The CachingTokenFilter kind of does this, but it doesn't allow for
modifications and always gives you those same tokens back. It also
seems like the new Field.tokenStreamValue() and the TokenStream-based
Field constructor might help, but then you have the whole
construction problem. I suppose you could "pre-analyze" the content
and then make both Fields based on that pre-analysis.
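A sketch of that pre-analysis route, using the TokenStream-based
Field constructor (replay() is an invented helper that serves up
copies of the buffered tokens):

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class PreAnalyzeExample {

  public static Document buildDoc(Analyzer analyzer, String text)
      throws IOException {
    // Run the analysis chain exactly once, with the application
    // (not Lucene) driving the loop.
    List<Token> tokens = new ArrayList<Token>();
    TokenStream ts = analyzer.tokenStream("body", new StringReader(text));
    for (Token t = ts.next(); t != null; t = ts.next()) {
      tokens.add(t);
    }
    // Build both fields from the same buffer using the
    // TokenStream-based Field constructor; no second pass over the
    // raw text.
    Document doc = new Document();
    doc.add(new Field("body", replay(tokens)));
    doc.add(new Field("bodyLowered", new LowerCaseFilter(replay(tokens))));
    return doc;
  }

  // Serves copies of the buffered tokens, so in-place filters such
  // as LowerCaseFilter cannot corrupt the shared buffer.
  static TokenStream replay(List<Token> tokens) {
    final Iterator<Token> iter = tokens.iterator();
    return new TokenStream() {
      public Token next() {
        if (!iter.hasNext()) return null;
        Token t = iter.next();
        Token copy = new Token(t.termText(), t.startOffset(),
            t.endOffset(), t.type());
        copy.setPositionIncrement(t.getPositionIncrement());
        return copy;
      }
    };
  }
}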
I currently have two different approaches to this. The first is a
CachedAnalyzer and CachedTokenizer implementation that takes in a
List of tokens. The other is an abstract Analyzer that coordinates
the handoff of the buffer created by the first TokenStream and
gives it to the second. The former requires that you do the looping
over the TokenStream in the application, outside of Lucene; the
latter lets Lucene do what it normally does.
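The former looks roughly like this (the shape of the idea, not the
actual code):

import java.util.Iterator;
import java.util.List;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

// Replays a List of tokens that the application built by looping
// over the real analysis chain. A matching CachedAnalyzer would
// simply hand one of these back from tokenStream().
public class CachedTokenizer extends TokenStream {
  private final Iterator<Token> iter;

  public CachedTokenizer(List<Token> tokens) {
    this.iter = tokens.iterator();
  }

  public Token next() {
    return iter.hasNext() ? iter.next() : null;
  }
}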
Anyone have any thoughts on this? Is this useful (i.e. should I
add it in)?
-Grant
--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]