If I'm understanding correctly...
What about a SinkTokenizer that is backed by a Reader/Field instead of
the current one that stores everything in a List? This is more or less
the use case for the Tee/Sink implementations, with the exception that
we didn't plan for the Sink growing too large to hold in memory, but
that is easily overcome, IMO.
That is, you use a TeeTokenFilter that adds to your Sink, which
serializes to some storage, and then your SinkTokenizer just
deserializes. No need to change Fieldable or anything else.
Or maybe a Tokenizer backed by a Field would work, using a TermEnum
on the Field to serve up next() for the TokenStream.
Just thinking out loud...
-Grant
On Aug 27, 2008, at 10:47 AM, Andrzej Bialecki wrote:
Hi all,
I recently had a situation where I had to pass some metadata to the
Analyzer. This metadata was specific to a Document instance (the short
story is that the analysis of some fields depended on data coming from
other fields, and the number of possible values was too big to use a
separate field for each combination).
It would be nice to have an Analyzer.tokenStream(String fieldName,
Field f), or even better tokenStream(String fieldName, Document doc)
... but it's probably too intrusive to change this. I would be happy,
though, to have tokenStream(String, Fieldable), because then I could
provide my own Fieldable with metadata.
In the meantime, having neither option, I came up with an idea: use a
subclass of Reader, attach my metadata there, and then use this Reader
when creating a Field. However, I quickly discovered that if you set a
Reader on a Field, the field automatically becomes un-stored - not what
I wanted ... and Field is declared final, so no luck subclassing it to
fix that.
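The Reader subclass itself is simple - something along these lines
(the names are just illustrative):

import java.io.FilterReader;
import java.io.Reader;
import java.util.Map;

// a Reader that carries per-field metadata alongside the characters
public class MetadataReader extends FilterReader {
  private final Map metadata;

  public MetadataReader(Reader in, Map metadata) {
    super(in);
    this.metadata = metadata;
  }

  public Map getMetadata() {
    return metadata;
  }
}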
In the end I implemented a Fieldable which sort of breaks the contract
for Fieldable - but it works :). Namely, my Fieldable returns non-null
values from both readerValue() and stringValue(). The first method
returns my subclass of Reader with the metadata, and the second returns
the value to be stored.
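Sketched out, it looks roughly like this (a sketch only, assuming
AbstractField as the base class and the illustrative MetadataReader
from above):

import java.io.Reader;
import java.io.StringReader;
import java.util.Map;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.document.AbstractField;
import org.apache.lucene.document.Field;

// deliberately breaks the "exactly one non-null value" convention:
// readerValue() feeds inversion, stringValue() feeds storage
public class MetadataField extends AbstractField {
  private final String value;
  private final Map metadata;

  public MetadataField(String name, String value, Map metadata) {
    super(name, Field.Store.YES, Field.Index.TOKENIZED,
        Field.TermVector.NO);
    this.value = value;
    this.metadata = metadata;
  }

  // FieldsWriter stores this
  public String stringValue() { return value; }

  // DocInverterPerField inverts this
  public Reader readerValue() {
    return new MetadataReader(new StringReader(value), metadata);
  }

  public byte[] binaryValue() { return null; }
  public TokenStream tokenStreamValue() { return null; }
}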
The reason this works is that DocInverterPerField first checks the
tokenStreamValue, then the readerValue, and only then the stringValue,
which it converts to a Reader - so in my case it uses the supplied
readerValue. At the same time, FieldsWriter, which is responsible for
storing field values, uses just the stringValue (or binaryValue, but
that wasn't relevant in my case), which is also set to non-null.
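In other words, the inversion side effectively does something like
this (paraphrased from the precedence described above, not the actual
source):

TokenStream stream = field.tokenStreamValue();
if (stream == null) {
  Reader reader = field.readerValue();
  if (reader == null) {
    // last resort: wrap the stored string
    reader = new StringReader(field.stringValue());
  }
  stream = analyzer.tokenStream(field.name(), reader);
}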
So, here are my thoughts, and I'd appreciate any comments:
* Is this a justified use of the API? It works, at least at the moment
;), and I couldn't find any other way to accomplish this task.
* Could we perhaps relax the restriction on Fieldable so that it can
return non-null values from more than one method, and clearly document
the sequence in which they are processed? This is already hinted at in
the javadoc.
* I propose to add a new API to Analyzer:
public TokenStream tokenStream(String fieldName, Fieldable field);
to support use cases like the one I described above. The default
implementation could be something like this:
  public TokenStream tokenStream(String fieldName, Fieldable field) {
    Reader r = field.readerValue();
    if (r == null) {
      String s = field.stringValue();
      r = new StringReader(s);
    }
    return tokenStream(fieldName, r);
  }
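A custom Analyzer could then override it to get at the metadata, e.g.
(a sketch, again using the illustrative MetadataReader from above; the
"lowercase" key is made up):

import java.io.Reader;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.document.Fieldable;

public class MetadataAwareAnalyzer extends Analyzer {
  // the default per-Reader chain
  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new WhitespaceTokenizer(reader);
  }

  // the proposed overload: choose the chain per document
  public TokenStream tokenStream(String fieldName, Fieldable field) {
    Reader r = field.readerValue();
    if (r instanceof MetadataReader) {
      Map meta = ((MetadataReader) r).getMetadata();
      // branch on metadata carried with the field ("lowercase" is
      // just an example key)
      if ("true".equals(meta.get("lowercase"))) {
        return new LowerCaseFilter(new WhitespaceTokenizer(r));
      }
      return tokenStream(fieldName, r);
    }
    // otherwise fall through to the default implementation above
    return super.tokenStream(fieldName, field);
  }
}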
--
Best regards,
Andrzej Bialecki <><
http://www.sigram.com Contact: info at sigram dot com