Hello!
I have really long document field values. The tokens of these fields have the
form word|payload|position_increment. (I need to control position increments
and payloads manually.)
I collect these compound tokens for the entire document, join them with
'\t', and pass the resulting string to my custom analyzer.
(For really long field strings, something breaks in
UnicodeUtil.UTF16toUTF8() with an ArrayIndexOutOfBoundsException.)
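To make the format concrete, here is a minimal sketch of how such a field string could be assembled. The class and method names (CompoundTokens, encode, join) are hypothetical and exist only for illustration; they are not part of Lucene or of my actual code, and the payload is assumed to be an integer because the analyzer below uses IntegerEncoder.

```java
// Hypothetical helper (not a Lucene class): builds the tab-joined field value
// from compound tokens of the form word|payload|position_increment.
final class CompoundTokens {

    private CompoundTokens() {
    }

    // Encodes a single token. The payload is assumed to be an integer.
    static String encode(String word, int payload, int positionIncrement) {
        return word + '|' + payload + '|' + positionIncrement;
    }

    // Joins already encoded tokens with the '\t' delimiter that the
    // tokenizer is expected to split on.
    static String join(String... encodedTokens) {
        return String.join("\t", encodedTokens);
    }
}
```

For example, CompoundTokens.join(CompoundTokens.encode("cat", 7, 1), CompoundTokens.encode("feline", 7, 0)) would describe two tokens at the same position, the second one having a position increment of 0.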
The analyzer is just the following:
class AmbiguousTokenAnalyzer extends Analyzer {
    private PayloadEncoder encoder = new IntegerEncoder();

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new DelimiterTokenizer('\t', EngineInfo.ENGINE_VERSION, reader);
        TokenStream sink = new DelimitedPositionIncrementFilter(source, '|');
        sink = new CustomDelimitedPayloadTokenFilter(sink, '|', encoder);
        sink.addAttribute(OffsetAttribute.class);
        sink.addAttribute(CharTermAttribute.class);
        sink.addAttribute(PayloadAttribute.class);
        sink.addAttribute(PositionIncrementAttribute.class);
        return new TokenStreamComponents(source, sink);
    }
}
CustomDelimitedPayloadTokenFilter and DelimitedPositionIncrementFilter each have
an incrementToken() method that processes the rightmost "|aaa" part of a token.
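The filter sources are not included here, so the following is only a rough sketch of the string handling that such an incrementToken() step is assumed to perform: cut off everything after the last delimiter and keep the rest as the term. TrailingPart and splitLast are hypothetical names, and the sketch deliberately leaves out the Lucene attribute handling.

```java
// Hypothetical illustration (not a Lucene class): splits off the rightmost
// delimited part of a token, as each filter is assumed to do in turn.
final class TrailingPart {
    final String head; // text before the last delimiter
    final String tail; // text after the last delimiter, or null if none

    private TrailingPart(String head, String tail) {
        this.head = head;
        this.tail = tail;
    }

    static TrailingPart splitLast(String token, char delimiter) {
        int index = token.lastIndexOf(delimiter);
        if (index < 0) {
            // No delimiter found: the token is left unchanged.
            return new TrailingPart(token, null);
        }
        return new TrailingPart(token.substring(0, index), token.substring(index + 1));
    }
}
```

Applied to "cat|7|1", the first split would yield "cat|7" and "1" (the position increment), and a second split on "cat|7" would yield "cat" and "7" (the payload), which matches the order of the two filters in the analyzer.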
The field is configured as:
attributeFieldType.setIndexed(true);
attributeFieldType.setStored(true);
attributeFieldType.setOmitNorms(true);
attributeFieldType.setTokenized(true);
attributeFieldType.setStoreTermVectorOffsets(true);
attributeFieldType.setStoreTermVectorPositions(true);
attributeFieldType.setStoreTermVectors(true);
attributeFieldType.setStoreTermVectorPayloads(true);
The problem is that if I pass the whole field to the analyzer (one huge string,
via document.add(...)), it works OK, but if I pass it token by token,
something breaks at the search stage.
As I read somewhere, these two approaches should produce the same resulting
index. Maybe my analyzer is missing something?
--
Best Regards,
Igor Shalyminov