If I'm understanding correctly...

What about a SinkTokenizer backed by a Reader/Field instead of the current one that stores everything in a List? This is more or less the use case for the Tee/Sink implementations, with the exception that we didn't plan for the Sink being too large, but that is easily overcome, IMO.

That is, you use a TeeTokenFilter that adds to your Sink, which serializes to some storage, and then your SinkTokenizer just deserializes. No need to change Fieldable or anything else.
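A plain-Java sketch of that round trip, with hypothetical names and no Lucene types (the real TeeTokenFilter/SinkTokenizer work on Token objects, not strings): tokens are teed to storage during the first analysis pass, and a second pass replays them without re-analyzing.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the tee/replay idea. A StringBuilder stands in
// for whatever external storage the Sink serializes to.
class TeeReplaySketch {

    // First pass: analyze once, tee each token out to storage.
    static StringBuilder teeToStorage(List<String> tokens) {
        StringBuilder storage = new StringBuilder();
        for (String t : tokens) {
            storage.append(t).append('\n'); // one token per line
        }
        return storage;
    }

    // Second pass: the "sink" deserializes tokens back from storage.
    static List<String> replayFromStorage(StringBuilder storage) {
        List<String> tokens = new ArrayList<String>();
        BufferedReader in = new BufferedReader(new StringReader(storage.toString()));
        String line;
        try {
            while ((line = in.readLine()) != null) {
                tokens.add(line);
            }
        } catch (IOException e) { // cannot happen for an in-memory reader
            throw new RuntimeException(e);
        }
        return tokens;
    }
}
```

The point of the pattern is that the expensive analysis runs once; the replay side only has to deserialize, so a Sink too large for a List just moves to disk.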

Or maybe just a Tokenizer backed by a Field would work, using a TermEnum on the Field to serve up next() for the TokenStream.

Just thinking out loud...

-Grant

On Aug 27, 2008, at 10:47 AM, Andrzej Bialecki wrote:

Hi all,

I recently had a situation where I had to pass some metadata information to Analyzer. This metadata was specific to a Document instance (short story is that the analysis of some fields depended on data coming from other fields, and the number of possible values was too big to use separate fields for each combination).

It would be nice to have an Analyzer.tokenStream(String fieldName, Field f), or even better tokenStream(String fieldName, Document doc) ... but it's probably too intrusive to change this. I would, however, be happy with tokenStream(String, Fieldable), because then I could provide my own Fieldable with metadata.

In the meantime, having neither option, I came up with an idea: I would use a subclass of Reader, attach my metadata there, and then use this Reader when creating a Field. However, I quickly discovered that if you set a Reader on a Field, the field automatically becomes un-stored - not what I wanted ... and Field is declared final, so no luck there.
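The Reader-with-metadata idea can be sketched in plain Java like this (MetadataReader is a hypothetical name, not a Lucene class):

```java
import java.io.FilterReader;
import java.io.Reader;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a Reader subclass that carries per-document metadata
// alongside the character stream. An Analyzer handed this Reader can
// downcast and consult the metadata when building its TokenStream.
class MetadataReader extends FilterReader {
    private final Map<String, String> metadata = new HashMap<String, String>();

    MetadataReader(Reader in) {
        super(in); // all read() calls delegate to the wrapped Reader
    }

    void put(String key, String value) {
        metadata.put(key, value);
    }

    String get(String key) {
        return metadata.get(key);
    }
}
```

An Analyzer's tokenStream(String, Reader) could then check `reader instanceof MetadataReader` and adjust its pipeline accordingly.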

In the end I implemented a Fieldable, which sort of breaks the contract for Fieldable - but it works :). Namely, my Fieldable returns non-null from both readerValue() and stringValue(): the first method returns my subclass of Reader with metadata, and the second returns the value to be stored.

The reason why it works is that DocInverterPerField first checks the tokenStreamValue, then the readerValue, and only then the stringValue that it converts to a Reader - so in my case it uses the supplied readerValue. At the same time, FieldsWriter, which is responsible for storing field values, uses just the stringValue (or binaryValue, but that wasn't relevant to my case), which is also set to non-null.
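The split described above can be sketched as follows (DualValueField is a hypothetical stand-in, not the actual Fieldable implementation): the inversion path checks readerValue() first and so sees the Reader, while the storage path calls stringValue() and sees the stored form.

```java
import java.io.Reader;
import java.io.StringReader;

// Hypothetical sketch of a field exposing both values. In the real setup
// readerValue() would return a metadata-carrying Reader subclass; here a
// plain StringReader keeps the example self-contained.
class DualValueField {
    private final String stored;

    DualValueField(String stored) {
        this.stored = stored;
    }

    // Non-null: consumed by the inversion path, which checks
    // readerValue() before falling back to stringValue().
    Reader readerValue() {
        return new StringReader(stored);
    }

    // Also non-null: consumed by the stored-fields path.
    String stringValue() {
        return stored;
    }
}
```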

So, here are my thoughts on this, and I'd appreciate any comments:

* Is this a justified use of the API? It works, at least at the moment ;) and I couldn't find any other way to accomplish this task.

* could we perhaps relax the restriction on Fieldable so that it can return non-null values from more than one method, and clearly document in what sequence they are processed? This is already hinted at in the javadoc.

* I propose to add a new API to Analyzer:

 public TokenStream tokenStream(String fieldName, Fieldable field);

to support use cases like the one I described above. The default implementation could be something like this:

 public TokenStream tokenStream(String fieldName, Fieldable field) {
   Reader r = field.readerValue();
   if (r == null) {
     String s = field.stringValue();
     r = new StringReader(s);
   }
   return tokenStream(fieldName, r);
 }


--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
