Hi all,
I recently ran into a situation where I had to pass some metadata to an
Analyzer. This metadata was specific to a Document instance (the short
story is that the analysis of some fields depended on data coming from
other fields, and the number of possible value combinations was too
large to use a separate field for each combination).
It would be nice to have an Analyzer.tokenStream(String fieldName, Field
f), or even better tokenStream(String fieldName, Document doc) ... but
changing this is probably too intrusive. That said, I would be happy
with just tokenStream(String, Fieldable), because then I could provide
my own Fieldable implementation carrying the metadata.
In the meantime, having neither option, I came up with an idea: use a
subclass of Reader, attach my metadata there, and then pass this Reader
when creating a Field. However, I quickly discovered that setting a
Reader on a Field automatically makes the field un-stored - not what I
wanted ... and Field is declared final, so no luck subclassing it either.
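The Reader idea can be sketched like this (a minimal, illustrative
example - the class and metadata keys are my own, not anything from
Lucene):

```java
import java.io.StringReader;
import java.util.Map;

// Sketch: a Reader subclass that carries per-field metadata alongside
// the character stream, so the analysis side can inspect it.
class MetadataReader extends StringReader {
    private final Map<String, String> metadata;

    MetadataReader(String text, Map<String, String> metadata) {
        super(text);
        this.metadata = metadata;
    }

    Map<String, String> getMetadata() {
        return metadata;
    }
}
```

The analyzer (or token stream) can then downcast the Reader it receives
and read the metadata before tokenizing.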
In the end I implemented a Fieldable, which sort of breaks the contract
for Fieldable - but it works :) . Namely, my Fieldable returns non-null
values from both readerValue() and stringValue(). The first method
returns my subclass of Reader with the metadata attached, and the second
returns the value to be stored.
The reason it works is that DocInverterPerField first checks the
tokenStreamValue, then the readerValue, and only then the stringValue
(which it converts to a Reader) - so in my case it uses the supplied
readerValue. At the same time, FieldsWriter, which is responsible for
storing field values, uses just the stringValue (or binaryValue, but
that wasn't relevant to my case), which is also set to non-null.
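To make the trick concrete, here is a self-contained sketch with
stand-ins (the real Fieldable interface has many more methods; the names
MiniFieldable, TaggedReader and MetadataFieldable are mine, for
illustration only):

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.Map;

// Stand-in Reader subclass that carries the metadata to the analyzer.
class TaggedReader extends StringReader {
    final Map<String, String> metadata;

    TaggedReader(String text, Map<String, String> metadata) {
        super(text);
        this.metadata = metadata;
    }
}

// Drastically reduced stand-in for Lucene's Fieldable interface.
interface MiniFieldable {
    Reader readerValue();   // consulted first by the inverter
    String stringValue();   // what the fields writer stores
}

// Both accessors return non-null: inversion sees the metadata-carrying
// Reader, while storage sees the plain String value.
class MetadataFieldable implements MiniFieldable {
    private final String value;
    private final Map<String, String> metadata;

    MetadataFieldable(String value, Map<String, String> metadata) {
        this.value = value;
        this.metadata = metadata;
    }

    public Reader readerValue() {
        return new TaggedReader(value, metadata);
    }

    public String stringValue() {
        return value;
    }
}
```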
So, here are my thoughts, and I'd appreciate any comments:
* Is this a justified use of the API? It works, at least for the moment
;) and I couldn't find any other way to accomplish the task.
* could we perhaps relax the restriction on Fieldable so that it can
return non-null values from more than one method, and clearly document
in what sequence they are processed? This is already hinted at in the
javadoc.
* I propose to add a new API to Analyzer:
public TokenStream tokenStream(String fieldName, Fieldable field);
to support use cases like the one described above. The default
implementation could look like this:
public TokenStream tokenStream(String fieldName, Fieldable field) {
  Reader r = field.readerValue();
  if (r == null) {
    String s = field.stringValue();
    r = new StringReader(s);
  }
  return tokenStream(fieldName, r);
}
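With that overload in place, a subclass could vary the analysis per
document by inspecting the Reader it gets from the field. A Lucene-free
sketch of the idea (TaggedStringReader and PerDocumentAnalyzer are
illustrative stand-ins, not Lucene classes; the real override would
return a TokenStream rather than a chain name):

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.Map;

// Stand-in for a metadata-carrying Reader supplied via readerValue().
class TaggedStringReader extends StringReader {
    final Map<String, String> metadata;

    TaggedStringReader(String text, Map<String, String> metadata) {
        super(text);
        this.metadata = metadata;
    }
}

// Stand-in for an Analyzer overriding the proposed
// tokenStream(String, Fieldable): the metadata on the Reader selects
// the analysis chain for this particular document.
class PerDocumentAnalyzer {
    String selectChain(String fieldName, Reader readerValue) {
        if (readerValue instanceof TaggedStringReader) {
            String lang = ((TaggedStringReader) readerValue).metadata.get("lang");
            if (lang != null) {
                return fieldName + ":" + lang; // e.g. a per-language chain
            }
        }
        return fieldName + ":default";
    }
}
```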
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com