[
https://issues.apache.org/jira/browse/LUCENE-579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660088#action_12660088
]
Andrew Duffy commented on LUCENE-579:
-------------------------------------
This issue is a duplicate of LUCENE-1448; the fix proposed there resolves the
problem comprehensively.
> TermPositionVector offsets incorrect if indexed field has multiple values and
> one ends with non-term chars
> ----------------------------------------------------------------------------------------------------------
>
> Key: LUCENE-579
> URL: https://issues.apache.org/jira/browse/LUCENE-579
> Project: Lucene - Java
> Issue Type: Bug
> Components: Analysis
> Affects Versions: 1.9
> Reporter: Keiron McCammon
> Attachments: offsets.patch
>
>
> If you add multiple values for a field with term vector positions and offsets
> enabled, and one of the values ends with a non-term character, then the
> offsets for the terms from subsequent values are wrong. For example (note the
> '.' in the first value):
>
> IndexWriter writer = new IndexWriter(directory, new SimpleAnalyzer(), true);
> Document doc = new Document();
> doc.add(new Field("", "one.", Field.Store.YES, Field.Index.TOKENIZED,
>     Field.TermVector.WITH_POSITIONS_OFFSETS));
> doc.add(new Field("", "two", Field.Store.YES, Field.Index.TOKENIZED,
>     Field.TermVector.WITH_POSITIONS_OFFSETS));
> writer.addDocument(doc);
> writer.optimize();
> writer.close();
>
> IndexSearcher searcher = new IndexSearcher(directory);
> Hits hits = searcher.search(new MatchAllDocsQuery());
> Highlighter highlighter = new Highlighter(new SimpleHTMLFormatter(),
>     new QueryScorer(new TermQuery(new Term("", "camera")),
>         searcher.getIndexReader(), ""));
> for (int i = 0; i < hits.length(); ++i) {
>     TermPositionVector v = (TermPositionVector)
>         searcher.getIndexReader().getTermFreqVector(hits.id(i), "");
>     StringBuilder str = new StringBuilder();
>     for (String s : hits.doc(i).getValues("")) {
>         str.append(s);
>         str.append(" ");
>     }
>     System.out.println(str);
>
>     TokenStream tokenStream = TokenSources.getTokenStream(v, false);
>     String[] terms = v.getTerms();
>     int[] freq = v.getTermFrequencies();
>     for (int j = 0; j < terms.length; ++j) {
>         System.out.print(terms[j] + ":" + freq[j] + ":");
>         int[] pos = v.getTermPositions(j);
>         System.out.print(Arrays.toString(pos));
>         TermVectorOffsetInfo[] offset = v.getOffsets(j);
>         for (int k = 0; k < offset.length; ++k) {
>             System.out.print(":");
>             System.out.print(str.substring(offset[k].getStartOffset(),
>                 offset[k].getEndOffset()));
>         }
>         System.out.println();
>     }
> }
> searcher.close();
>
> If I run the above I get:
> one:1:[0]:one
> two:1:[1]: tw
> Note that the offsets for the second term are off by 1.
> It seems that the length of the stored value is not taken into account when
> calculating the offsets for the terms of the next value.
> I noticed this problem when using the highlight contrib package, which can
> make use of term vectors for highlighting. I also noticed that the offset for
> the second string is +1 from the end of the previous value, so when
> concatenating the field's values to pass to the highlighter I had to append a
> ' ' character after each string... which is quite useful, but not documented
> anywhere.
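
The offset arithmetic described in the report can be sketched without Lucene on
the classpath. This is only an illustration of the reporter's hypothesis, not
Lucene's actual indexing code: it assumes tokens are runs of letters (roughly
what SimpleAnalyzer produces) and that each value's base offset advances either
by the end of the last token + 1 (the suspected behaviour) or by the full value
length + 1. The names OffsetSketch, Span, and offsets are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetSketch {

    /** A term with its start/end character offsets in the concatenated text. */
    static final class Span {
        final String term;
        final int start;
        final int end;

        Span(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Splits each value into runs of letters and assigns offsets relative to a
     * running base (an assumption-based model, not Lucene code).
     *
     * @param useValueLength if true, the base advances by value length + 1;
     *                       if false, by the last token's end offset + 1
     */
    static List<Span> offsets(String[] values, boolean useValueLength) {
        List<Span> spans = new ArrayList<Span>();
        int base = 0;
        for (String value : values) {
            int lastEnd = 0;
            int i = 0;
            while (i < value.length()) {
                if (!Character.isLetter(value.charAt(i))) {
                    i++;
                    continue;
                }
                int start = i;
                while (i < value.length() && Character.isLetter(value.charAt(i))) {
                    i++;
                }
                spans.add(new Span(value.substring(start, i), base + start, base + i));
                lastEnd = i;
            }
            // The trailing '.' of "one." is skipped when lastEnd is used here.
            base += (useValueLength ? value.length() : lastEnd) + 1;
        }
        return spans;
    }

    public static void main(String[] args) {
        String[] values = {"one.", "two"};
        // Same concatenation as in the report: one ' ' after each value.
        StringBuilder text = new StringBuilder();
        for (String v : values) {
            text.append(v).append(" ");
        }
        for (boolean useValueLength : new boolean[] {false, true}) {
            for (Span s : offsets(values, useValueLength)) {
                System.out.println(s.term + ":" + text.substring(s.start, s.end));
            }
        }
    }
}
```

With the values from the report, the first pass prints "one:one" and "two: tw"
(offsets 4-7, matching the output above), and the second pass prints "one:one"
and "two:two" (offsets 5-8).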