Joerg Heinicke wrote:

On 09.03.2004 02:39, Vadim Gritsenko wrote:

  public void characters(char[] ch, int start, int length) {
      if (ch.length > 0 && start >= 0 && length > 1) {
-         String text = new String(ch, start, length);
          if (elementStack.size() > 0) {
              IndexHelperField tos = (IndexHelperField) elementStack.peek();
-             tos.appendText(text);
+             tos.appendText(ch, start, length);
          }
-         bodyText.append(text);
+         bodyText.append(' ');
+         bodyText.append(ch, start, length);
      }
  }



What will happen when the text "keyword" is streamed as two characters() events, "key" and "word"? I think it will become "key word", and indexing will break.
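To make the concern concrete, here is a minimal sketch that mirrors only the bodyText part of the patched characters() method. The class name UnconditionalSpaceDemo and the getBodyText() accessor are made up for illustration and are not part of LuceneIndexHandler; the element stack handling is left out.

```java
// Hypothetical stand-in for the patched handler; only the bodyText logic is copied.
class UnconditionalSpaceDemo {
    private final StringBuilder bodyText = new StringBuilder();

    // Same guard and append sequence as in the patch above.
    void characters(char[] ch, int start, int length) {
        if (ch.length > 0 && start >= 0 && length > 1) {
            bodyText.append(' ');               // space added on every event
            bodyText.append(ch, start, length);
        }
    }

    String getBodyText() {
        return bodyText.toString();
    }

    public static void main(String[] args) {
        UnconditionalSpaceDemo demo = new UnconditionalSpaceDemo();
        // One logical token delivered by the parser as two events.
        demo.characters("key".toCharArray(), 0, 3);
        demo.characters("word".toCharArray(), 0, 4);
        System.out.println("[" + demo.getBodyText() + "]"); // prints "[ key word]"
    }
}
```

The single token "keyword" ends up in the body text as two words, which is what would break a search for it.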


IIUC, the idea was to add a space between tags, so that <p>some</p><p>text</p> is not indexed as "sometext". If that's correct, then a better fix would be to add the space only if a boolean flag, had_start_or_end_element_in_between_char_events, is set.


Joerg?


Your mail was neither ignored nor accidentally deleted - I just didn't really know what to write, but marked it as important in a nice red color in Mozilla :)


:-)


Yes, I see your objection - and I already asked for such objections in the bug http://nagoya.apache.org/bugzilla/show_bug.cgi?id=25934 ;)

So what are the practical use cases in which this might occur? Maybe it's only a theoretical problem, depending on the "thing" the index is created from? On which SAX stream does the LuceneIndexHandler operate?


I remember there were already issues in other components with text being split across multiple characters() events. So think of this as preventive maintenance.


I also don't get what you mean by "had_start_or_end_element_in_between_char_events". But I had a look at endElement(). It gets the elements from a stack and already tests for text:
if (text != null && text.length() > 0) {
Would it make sense to add the space in endElement() if the element contains text, i.e. if the above is true?


This was my first thought... But then multiple closing tags will cause multiple spaces... So I thought this should work:

startElement:
   flag = true;

endElement:
   flag = true;

characters:
   if (flag) {
       x.append(' ');
       flag = false;
   }

Does it solve the problem?
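Spelled out in Java, the flag idea might look roughly like the sketch below. It is not the actual LuceneIndexHandler: the class name SpacingTextCollector, the field elementBoundarySeen and the getBodyText() accessor are made up for illustration, and the element stack handling is omitted. It also skips the space while the buffer is still empty, which goes slightly beyond the pseudocode above.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that collects body text and separates text from
// different elements with a single space.
class SpacingTextCollector extends DefaultHandler {
    private final StringBuilder bodyText = new StringBuilder();
    // True if a start or end tag was seen since the last characters() event.
    private boolean elementBoundarySeen = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        elementBoundarySeen = true;
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        elementBoundarySeen = true;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (length <= 0) {
            return;
        }
        // Add one space only when a tag separated this text from the previous
        // text; consecutive events inside one element are joined as-is.
        if (elementBoundarySeen && bodyText.length() > 0) {
            bodyText.append(' ');
        }
        elementBoundarySeen = false;
        bodyText.append(ch, start, length);
    }

    public String getBodyText() {
        return bodyText.toString();
    }
}
```

With this, "key" followed by "word" inside one element stays "keyword", while <p>some</p><p>text</p> gives "some text", and several closing tags in a row still produce only one space.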

Vadim



