[jira] Created: (LUCENE-579) TermPositionVector offsets incorrect if indexed field has multiple values and one ends with non-term chars

Keiron McCammon (JIRA) Thu, 25 May 2006 14:21:13 -0700

TermPositionVector offsets incorrect if indexed field has multiple values and 
one ends with non-term chars
----------------------------------------------------------------------------------------------------------


         Key: LUCENE-579
         URL: http://issues.apache.org/jira/browse/LUCENE-579
     Project: Lucene - Java
        Type: Bug

  Components: Analysis  
    Versions: 1.9    
    Reporter: Keiron McCammon


If you add multiple values for a field with term vector positions and offsets 
enabled, and one of the values ends with a non-term character, then the offsets 
for the terms from subsequent values are wrong. For example (note the '.' at 
the end of the first value):

        IndexWriter writer = new IndexWriter(directory, new SimpleAnalyzer(),
            true);

        Document doc = new Document();

        doc.add(new Field("", "one.", Field.Store.YES, Field.Index.TOKENIZED,
            Field.TermVector.WITH_POSITIONS_OFFSETS));

        doc.add(new Field("", "two", Field.Store.YES, Field.Index.TOKENIZED,
            Field.TermVector.WITH_POSITIONS_OFFSETS));

        writer.addDocument(doc);

        writer.optimize();

        writer.close();

        IndexSearcher searcher = new IndexSearcher(directory);

        Hits hits = searcher.search(new MatchAllDocsQuery());

        Highlighter highlighter = new Highlighter(new SimpleHTMLFormatter(),
            new QueryScorer(new TermQuery(new Term("", "camera")),
                searcher.getIndexReader(), ""));

        for (int i = 0; i < hits.length(); ++i) {

            TermPositionVector v = (TermPositionVector)
                searcher.getIndexReader().getTermFreqVector(hits.id(i), "");

            StringBuilder str = new StringBuilder();

            for (String s : hits.doc(i).getValues("")) {

                str.append(s);
                str.append(" ");
            }
            
            System.out.println(str);

            TokenStream tokenStream = TokenSources.getTokenStream(v, false);

            String[] terms = v.getTerms();
            int[] freq = v.getTermFrequencies();

            for (int j = 0; j < terms.length; ++j) {

                System.out.print(terms[j] + ":" + freq[j] + ":");
                
                int[] pos = v.getTermPositions(j);
                
                System.out.print(Arrays.toString(pos));
                
                TermVectorOffsetInfo[] offset = v.getOffsets(j); 

                for (int k = 0; k < offset.length; ++k) {
                    
                    System.out.print(":");
                    System.out.print(str.substring(
                        offset[k].getStartOffset(), offset[k].getEndOffset()));
                }
                
                System.out.println();
            }
        }

        searcher.close();

If I run the above I get:
        one:1:[0]:one
        two:1:[1]: tw

Note that the offsets for the second term are off by 1: they select " tw" 
rather than "two".

It seems that the full length of the stored value (including any trailing 
non-term characters) is not taken into account when calculating the offsets 
for the terms of the next value.
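
To make the arithmetic concrete, here is a small stand-alone sketch that does 
not use Lucene at all. It assumes (this is only a guess that happens to match 
the output above, not a reading of the indexing code) that the base offset for 
the next value is advanced by the end offset of the last token plus 1, and 
compares that with advancing by the full value length plus 1. The class 
OffsetSketch and its methods are hypothetical names used for illustration, and 
the letter-run tokenizer is only a rough stand-in for SimpleAnalyzer.

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetSketch {

    /**
     * Returns {start, end} offsets for every run of letters in the values.
     * The flag selects how the base offset is advanced between values:
     * by the whole value length (true) or by the last token's end (false).
     */
    static List<int[]> offsets(String[] values, boolean advanceByValueLength) {
        List<int[]> result = new ArrayList<>();
        int base = 0;
        for (String value : values) {
            int lastEnd = 0; // end offset of the last token within this value
            int i = 0;
            while (i < value.length()) {
                if (!Character.isLetter(value.charAt(i))) {
                    i++; // skip non-term characters such as '.'
                    continue;
                }
                int start = i;
                while (i < value.length() && Character.isLetter(value.charAt(i))) {
                    i++;
                }
                result.add(new int[] { base + start, base + i });
                lastEnd = i;
            }
            // The "+ 1" mirrors the single separator between values.
            base += (advanceByValueLength ? value.length() : lastEnd) + 1;
        }
        return result;
    }

    /** Extracts the text covered by one {start, end} offset pair. */
    static String slice(String text, int[] offset) {
        return text.substring(offset[0], offset[1]);
    }

    public static void main(String[] args) {
        String[] values = { "one.", "two" };
        String text = "one. two "; // values joined with ' ', as in the report

        int[] assumedBuggy = offsets(values, false).get(1); // {4, 7}
        int[] expected = offsets(values, true).get(1);      // {5, 8}

        System.out.println("[" + slice(text, assumedBuggy) + "]"); // prints "[ tw]"
        System.out.println("[" + slice(text, expected) + "]");     // prints "[two]"
    }
}
```

Under the first rule the trailing '.' of "one." is never counted, so every 
offset in the second value lands one character early, which is exactly the 
" tw" shown above.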

I noticed this problem when using the highlight contrib package, which can make 
use of term vectors for highlighting. I also noticed that the offset for the 
second string is +1 from the end of the previous value, so when concatenating 
the field's values to pass to the highlighter I had to append a ' ' character 
after each string...which is quite useful, but not documented anywhere.
