Valery wrote:
Hi John,
(aren't you the same John Byrne who is a key contributor to the great
OpenSSI project?)
Nope, never heard of him! But with a great name like that I'm sure he'll go a long way :)

John Byrne-3 wrote:
I'm inclined to disagree with the idea that a token should not be split again downstream. I think that is actually a much easier way to handle it. I would have the tokenizer return the longest match, and then split it in a token filter. In fact, I have done this before and it has worked fine for me.


well, I could soften my position: if the token re-parsing is done by looking
at the current lexeme's value only, then it might perhaps be acceptable. In
contrast, if during your re-parsing you have to look at the upstream
character data "several filters back", then, IMHO, it is rather messy
and unacceptable.
If I understand you correctly, that's pretty much what I meant. By having the tokenizer emit larger tokens and splitting them in the filter, you never have to look upstream while holding state. You only look upstream for a new token after you have finished splitting the last one and sending its parts downstream.
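The buffering idea above can be sketched in plain Java. Note this deliberately does not use Lucene's actual TokenFilter/attribute API; the class and method names (SplittingFilter, next, the hyphen split rule) are illustrative placeholders for whatever splitting rule you actually need. The point is only the state management: pending parts of the last split token are drained before the filter pulls the next token from upstream.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

// Sketch of the buffering logic: the filter keeps the parts of the last
// split token in a queue and emits them one at a time, pulling from
// upstream only when the queue is empty. Names are hypothetical, not
// Lucene's API.
public class SplittingFilter {
    private final Iterator<String> upstream;            // stands in for the tokenizer
    private final Deque<String> pending = new ArrayDeque<>();

    public SplittingFilter(Iterator<String> upstream) {
        this.upstream = upstream;
    }

    // Placeholder split rule: break the longest match on '-'.
    private String[] split(String token) {
        return token.split("-");
    }

    // Returns the next (possibly split) token, or null at end of stream.
    public String next() {
        if (pending.isEmpty()) {
            if (!upstream.hasNext()) return null;
            for (String part : split(upstream.next())) {
                pending.addLast(part);
            }
        }
        return pending.pollFirst();
    }

    public static void main(String[] args) {
        SplittingFilter f = new SplittingFilter(
                List.of("wi-fi", "router").iterator());
        for (String t; (t = f.next()) != null; ) {
            System.out.println(t);   // prints wi, fi, router
        }
    }
}
```

All state lives in the filter's own queue, so nothing downstream ever needs to reach back into the original character stream.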


Regarding this part:

John Byrne-3 wrote:
I think you will have to maintain some state within the token filter [...]


I would wait for Simon's answer to the question "What do you expect from the
Tokenizer?"

Then I will give my 2cents on this and perhaps then I could sum up all
opinions and adopt a common conclusion.
:)

regards
Valery
