Hey Lukas,
I was being simplistic when I said that the text and TokenStream must be
exactly the same. It's difficult to think of a reason why you would not
want them to be the same, though. Each Token records the offsets where it
can be found in the original text -- that is how the Highlighter knows
where to highlight in the original text with only the Tokens to
inspect. So if a Token is scored > 0, then the offsets for that Token
must be valid indexes into the text String (at least in the case of the
HTMLFormatter, which only marks Tokens that score > 0).
Now an issue I see you having:
The TokenStream for "example long text" is:
(term,startoffset,endoffset)
(example,0,7)
(long,8,12)
(text,13,17)
So for the query "example long" the Highlighter will highlight offsets
0-7 and 8-12 in the source text. In your example, with the text only
being "example", the attempt to highlight the Token "long" will index
into the source text at offset 8 and cause an out-of-bounds exception.
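You can see the failure without any Lucene code at all. Here is a minimal, stdlib-only sketch (the class name and the hard-coded offsets are just mine for illustration) of what the Highlighter effectively does when the Token offsets come from "example long text" but the stored text is only "example":

```java
public class OffsetMismatchDemo {
    public static void main(String[] args) {
        // Token (start, end) offsets as produced for the text "example long text":
        // (example,0,7) (long,8,12) (text,13,17)
        int[][] offsets = { {0, 7}, {8, 12}, {13, 17} };
        String storedText = "example"; // only 7 characters long

        for (int[] o : offsets) {
            // The Highlighter effectively calls substring(start, end) to copy
            // each region out of the source text it was handed.
            System.out.println(storedText.substring(o[0], o[1]));
        }
        // substring(0, 7) succeeds; substring(8, 12) throws
        // StringIndexOutOfBoundsException because the text is too short.
    }
}
```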
In your case you are even worse off because you are building the
TokenStream from a field that was added more than once. This gives you
seemingly wrong offsets of:
(example,0,7)
(long,14,18)
(text,22,26)
Each word has its space accounted for twice. Maybe there is a reason for
this, but it looks wrong. I have not investigated enough to know whether
TokenSources is responsible for this or whether core Lucene is the culprit.
Even if it were done differently, though, there would still be issues
with the spacing between words when you are adding the words one at a
time, with no spacing, to the same field.
Looking at your original email, though, you may be trying to do something
that is best done without the Highlighter.
In summary, you should use Document.getFields (more efficient if you
are getting more than one field anyway) and get around the offset issues
above.
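To make that concrete, here is a rough sketch of what I mean (untested, and it assumes the 2.2 API; it borrows the hits, i, and highlighter variables from your program below). The idea is to pull every "small" Field off the Document and highlight each stored value on its own, re-analyzing the value so the text and the TokenStream are guaranteed to agree:

```java
// Sketch only: highlight each value of the "small" field separately.
// Document.getFields(name) returns every Field added under that name.
Document doc = hits.doc(i);
Field[] smallFields = doc.getFields("small");
Analyzer analyzer = new StandardAnalyzer();
for (Field f : smallFields) {
    String value = f.stringValue();
    // Re-analyze the stored value so the TokenStream's offsets are
    // valid indexes into 'value' by construction.
    TokenStream ts = analyzer.tokenStream("small", new StringReader(value));
    String fragment = highlighter.getBestFragment(ts, value);
    if (fragment != null) {
        System.out.println(fragment);
    }
}
```

This sidesteps the term-vector offset problem entirely, at the cost of re-analyzing each value at highlight time.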
- Mark
Lukas Vlcek wrote:
Mark,
thank you for this. I will wait for your other responses.
This will keep me going on :-)
I didn't know that there is a design restriction in Lucene that the text and
TokenStream must be exactly the same (this still seems redundant to me; I
will dive into the Lucene API more).
BR
Lukas
On 7/29/07, Mark Miller <[EMAIL PROTECTED]> wrote:
I am going to try to write up some more info for you tomorrow, but
just to point out: I do think there is a bug in the way offsets are
being handled. I don't think it is causing your current problem (what
I mentioned is), but it will probably cause you problems down the road.
I will look into this further.
- Mark
Lukas Vlcek wrote:
Hi Lucene experts,
The following is a simple piece of Lucene code which generates a
StringIndexOutOfBoundsException. I am using the official Lucene 2.2.0
release. Can anyone tell me what is wrong with this code? Is this a bug or
a feature of Lucene? Any comments/hints highly welcomed!
In a nutshell, I have a document with two (or four) fields:
1) all
2-4) small
I use [all] for searching and [small] for highlighting.
[package and imports truncated...]
public class MemoryIndexCase {
    public static void main(String[] arg) {
        Document doc = new Document();
        doc.add(new Field("all", "example long text",
                Field.Store.NO, Field.Index.TOKENIZED));
        doc.add(new Field("small", "example",
                Field.Store.YES, Field.Index.UN_TOKENIZED,
                Field.TermVector.WITH_POSITIONS_OFFSETS));
        doc.add(new Field("small", "long",
                Field.Store.YES, Field.Index.UN_TOKENIZED,
                Field.TermVector.WITH_POSITIONS_OFFSETS));
        doc.add(new Field("small", "text",
                Field.Store.YES, Field.Index.UN_TOKENIZED,
                Field.TermVector.WITH_POSITIONS_OFFSETS));
        try {
            Directory idx = new RAMDirectory();
            IndexWriter writer = new IndexWriter(idx, new StandardAnalyzer(), true);
            writer.addDocument(doc);
            writer.optimize();
            writer.close();

            Searcher searcher = new IndexSearcher(idx);
            QueryParser qp = new QueryParser("all", new StandardAnalyzer());
            Query query = qp.parse("example text");
            Hits hits = searcher.search(query);

            Highlighter highlighter = new Highlighter(new QueryScorer(query));
            IndexReader ir = IndexReader.open(idx);
            for (int i = 0; i < hits.length(); i++) {
                String text = hits.doc(i).get("small");
                TermFreqVector tfv = ir.getTermFreqVector(hits.id(i), "small");
                TokenStream tokenStream =
                        TokenSources.getTokenStream((TermPositionVector) tfv);
                String result = highlighter.getBestFragment(tokenStream, text);
                System.out.println(result);
            }
        } catch (Throwable e) {
            e.printStackTrace();
        }
    }
}
The exception is:
java.lang.StringIndexOutOfBoundsException: String index out of range: 11
        at java.lang.String.substring(String.java:1935)
        at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:235)
        at org.apache.lucene.search.highlight.Highlighter.getBestFragments(Highlighter.java:175)
        at org.apache.lucene.search.highlight.Highlighter.getBestFragment(Highlighter.java:101)
        at org.lucenetest.MemoryIndexCase.main(MemoryIndexCase.java:70)
Best regards,
Lukas
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
---------------------------------------------------------------------