Hello,
Please help. I am lost in the TokenStream / Token / Analyzer API.
I am trying to figure out how to get the _token itself_, or at least the token
text, while working through the "Invoking the Analyzer" example (copied below,
and also at:
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/analysis/package-summary.html?is-external=true#package_description
)
The call ts.reflectAsString(true) returns lots of useful info:
org.apache.lucene.analysis.tokenattributes.CharTermAttribute#term=some,org.apache.lucene.analysis.tokenattributes.TermToBytesRefAttribute#bytes=[73 6f 6d 65],org.apache.lucene.analysis.tokenattributes.OffsetAttribute#startOffset=0,org.apache.lucene.analysis.tokenattributes.OffsetAttribute#endOffset=4,org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute#positionIncrement=1,org.apache.lucene.analysis.tokenattributes.TypeAttribute#type=<ALPHANUM>,org.apache.lucene.analysis.tokenattributes.KeywordAttribute#keyword=false
Yet how do I get the token itself? In this case, "some".
Thanks!
------ Example in the documentation --------
Version matchVersion = Version.LUCENE_XY; // Substitute desired Lucene version for XY
Analyzer analyzer = new StandardAnalyzer(matchVersion); // or any other analyzer
TokenStream ts = analyzer.tokenStream("myfield", new StringReader("some text goes here"));
// Called on ts: a bare addAttribute(...) only compiles inside a TokenStream subclass.
OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
try {
  ts.reset(); // Resets this stream to the beginning. (Required)
  while (ts.incrementToken()) {
    // Use AttributeSource.reflectAsString(boolean)
    // for token stream debugging.
    System.out.println("token: " + ts.reflectAsString(true));
    System.out.println("token start offset: " + offsetAtt.startOffset());
    System.out.println("  token end offset: " + offsetAtt.endOffset());
  }
  ts.end(); // Perform end-of-stream operations, e.g. set the final offset.
} finally {
  ts.close(); // Release resources associated with this stream.
}
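
------ Stopgap: reading the term out of the reflected string --------
The reflectAsString(true) output quoted above is a comma-separated list of
Class#key=value pairs, so as a debugging stopgap the term can be read straight
out of that string. This is only a minimal sketch: the class name
ReflectedTokenParser and its two methods are hypothetical helpers, not part of
Lucene, and it assumes no value contains a comma (a term with a comma in it
would break the split).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: parses the text produced by
// reflectAsString(true), i.e. "Class#key=value,Class#key=value,...".
public class ReflectedTokenParser {

    private static final String TERM_KEY =
        "org.apache.lucene.analysis.tokenattributes.CharTermAttribute#term";

    // Splits the reflected string into an ordered map keyed by "Class#key".
    // Assumption: values never contain a comma.
    public static Map<String, String> parse(String reflected) {
        Map<String, String> attrs = new LinkedHashMap<>();
        for (String part : reflected.split(",")) {
            int eq = part.indexOf('=');
            if (eq < 0) {
                continue; // not a key=value pair, skip it
            }
            attrs.put(part.substring(0, eq).trim(), part.substring(eq + 1));
        }
        return attrs;
    }

    // Returns the CharTermAttribute term, or null if the string has none.
    public static String term(String reflected) {
        return parse(reflected).get(TERM_KEY);
    }

    public static void main(String[] args) {
        String reflected =
            "org.apache.lucene.analysis.tokenattributes.CharTermAttribute#term=some,"
            + "org.apache.lucene.analysis.tokenattributes.OffsetAttribute#startOffset=0,"
            + "org.apache.lucene.analysis.tokenattributes.OffsetAttribute#endOffset=4";
        System.out.println(term(reflected)); // prints "some"
    }
}
```

In a live stream, the CharTermAttribute named in that output is presumably the
cleaner route: request it with ts.addAttribute(CharTermAttribute.class) before
ts.reset(), then call toString() on it after each incrementToken().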