Hey Lukas,
I was being simplistic when I said that the text and TokenStream must be
exactly the same. It's difficult to think of a reason why you would not
want them to be the same, though. Each Token records the offsets where it
can be found in the original text -- that is how the Highlighter knows
where to highlight in the original text with only the Tokens to
inspect. So if a Token is scored > 0, then the offsets for that Token
must be valid indexes into the text String (at least in the case of the
HTMLFormatter, which only marks Tokens that score > 0).
Now an issue I see you having:
The TokenStream for "example long text" is:
(term,startoffset,endoffset)
(example,0,7)
(long,8,12)
(text,13,17)
So for the query "example long" the Highlighter will highlight offsets
0-7 and 8-12 in the source text. In your example, with the text only
being "example", the attempt to highlight the Token "long" will index
into the source text at offset 8 and cause an out-of-bounds exception.
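You can see the failure without any Lucene code at all. Here is a minimal, stdlib-only sketch (the class name and the hard-coded offsets are just mine for illustration) of what the Highlighter effectively does when the Token offsets come from "example long text" but the stored text is only "example":

```java
public class OffsetMismatchDemo {
    public static void main(String[] args) {
        // Token (start, end) offsets as produced for the text "example long text":
        // (example,0,7) (long,8,12) (text,13,17)
        int[][] offsets = { {0, 7}, {8, 12}, {13, 17} };
        String storedText = "example"; // only 7 characters long

        for (int[] o : offsets) {
            // The Highlighter effectively calls substring(start, end) to copy
            // each region out of the source text it was handed.
            System.out.println(storedText.substring(o[0], o[1]));
        }
        // substring(0, 7) succeeds; substring(8, 12) throws
        // StringIndexOutOfBoundsException because the text is too short.
    }
}
```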
In your case you are even worse off because you are building the
TokenStream from a field that was added more than once. This gives you
seemingly wrong offsets of:
(example,0,7)
(long,14,18)
(text,22,26)
Each word has its space accounted for twice. Maybe there is a reason for
this, but it looks wrong. I have not investigated enough to know whether
TokenSources is responsible for this or whether core Lucene is the culprit.
Even if it were done differently, though, there would still be issues
with the spacing between words when you are adding the words one at a
time, with no spacing, to the same field.
Looking at your original email, though, you may be trying to do something
that is best done without the Highlighter.
In summary, you should use Document.getFields (more efficient if you
are getting more than one field anyway) and get around the offset issues
above.
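To make that concrete, here is a rough sketch of what I mean (untested, and it assumes the 2.2 API; it borrows the hits, i, and highlighter variables from your program below). The idea is to pull every "small" Field off the Document and highlight each stored value on its own, re-analyzing the value so the text and the TokenStream are guaranteed to agree:

```java
// Sketch only: highlight each value of the "small" field separately.
// Document.getFields(name) returns every Field added under that name.
Document doc = hits.doc(i);
Field[] smallFields = doc.getFields("small");
Analyzer analyzer = new StandardAnalyzer();
for (Field f : smallFields) {
    String value = f.stringValue();
    // Re-analyze the stored value so the TokenStream's offsets are
    // valid indexes into 'value' by construction.
    TokenStream ts = analyzer.tokenStream("small", new StringReader(value));
    String fragment = highlighter.getBestFragment(ts, value);
    if (fragment != null) {
        System.out.println(fragment);
    }
}
```

This sidesteps the term-vector offset problem entirely, at the cost of re-analyzing each value at highlight time.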
- Mark
Lukas Vlcek wrote:
Mark,
thank you for this. I will wait for your other responses.
This will keep me going on :-)
I didn't know that there is a design restriction in Lucene that the text and
TokenStream must be exactly the same (this still seems redundant to me; I
will dive into the Lucene API more).
BR
Lukas
On 7/29/07, Mark Miller <[EMAIL PROTECTED]> wrote:
I am going to try to write up some more info for you tomorrow, but
just to point out: I do think there is a bug in the way offsets are
being handled. I don't think it is causing your current problem (what
I mentioned is), but it will probably cause you problems down the road.
I will look into this further.
- Mark
Lukas Vlcek wrote:
Hi Lucene experts,
The following is a simple piece of Lucene code which generates a
StringIndexOutOfBoundsException. I am using the official Lucene 2.2.0
release. Can anyone tell me what is wrong with this code? Is this a bug or
a feature of Lucene? Any comments/hints highly welcomed!
In a nutshell, I have a document with two (or four) fields:
1) all
2-4) small
I use [all] for searching and [small] for highlighting.
[package and imports truncated...]
public class MemoryIndexCase {
    public static void main(String[] arg) {
        Document doc = new Document();
        doc.add(new Field("all", "example long text",
                Field.Store.NO, Field.Index.TOKENIZED));
        doc.add(new Field("small", "example",
                Field.Store.YES, Field.Index.UN_TOKENIZED,
                Field.TermVector.WITH_POSITIONS_OFFSETS));
        doc.add(new Field("small", "long",
                Field.Store.YES, Field.Index.UN_TOKENIZED,
                Field.TermVector.WITH_POSITIONS_OFFSETS));
        doc.add(new Field("small", "text",
                Field.Store.YES, Field.Index.UN_TOKENIZED,
                Field.TermVector.WITH_POSITIONS_OFFSETS));
        try {
            Directory idx = new RAMDirectory();
            IndexWriter writer = new IndexWriter(idx, new StandardAnalyzer(), true);
            writer.addDocument(doc);
            writer.optimize();
            writer.close();

            Searcher searcher = new IndexSearcher(idx);
            QueryParser qp = new QueryParser("all", new StandardAnalyzer());
            Query query = qp.parse("example text");
            Hits hits = searcher.search(query);

            Highlighter highlighter = new Highlighter(new QueryScorer(query));
            IndexReader ir = IndexReader.open(idx);
            for (int i = 0; i < hits.length(); i++) {
                String text = hits.doc(i).get("small");
                TermFreqVector tfv = ir.getTermFreqVector(hits.id(i), "small");
                TokenStream tokenStream =
                        TokenSources.getTokenStream((TermPositionVector) tfv);
                String result = highlighter.getBestFragment(tokenStream, text);
                System.out.println(result);
            }
        } catch (Throwable e) {
            e.printStackTrace();
        }
    }
}
The exception is:
java.lang.StringIndexOutOfBoundsException: String index out of range: 11
        at java.lang.String.substring(String.java:1935)
        at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:235)
        at org.apache.lucene.search.highlight.Highlighter.getBestFragments(Highlighter.java:175)
        at org.apache.lucene.search.highlight.Highlighter.getBestFragment(Highlighter.java:101)
        at org.lucenetest.MemoryIndexCase.main(MemoryIndexCase.java:70)
Best regards,
Lukas
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
---------------------------------------------------------------------