Valery wrote:
Hi John,
(aren't you the same John Byrne who is a key contributor to the great
OpenSSI project?)
Nope, never heard of him! But with a great name like that I'm sure he'll
go a long way :)
John Byrne-3 wrote:
I'm inclined to disagree with the idea that a token should not be split
again downstream. I think that is actually a much easier way to handle
it. I would have the tokenizer return the longest match, and then split
it in a token filter. In fact I have dones this before and it has worked
fine for me.
well, I could soften my position: if the token re-parsing is done by looking
into currentlexem value only, then it might be perhaps accepted. In
contrast, if during your re-parsing you have to look into the upstream
characters data "several filters backwards", then, IMHO, it is rather messy
and unacceptable.
If I understand you correctly, that's pretty much what I meant. By
having the first tokenizer pass larger tokens, and splitting them in the
filter, you never have to look upstream while storing state. You only
look upstream for a new token after you are finished splitting the last
one and sending the parts downstream.
Regarding this part:
John Byrne-3 wrote:
I think you will have to maintain some state within the token filter
[...]
I would wait for Simon's answer to the question "What do you expect from the
Tokenizer?"
Then I will give my 2cents on this and perhaps then I could sum up all
opinions and adopt a common conclusion.
:)
regards
Valery
------------------------------------------------------------------------
No virus found in this incoming message.
Checked by AVG - www.avg.com
Version: 8.5.392 / Virus Database: 270.13.63/2316 - Release Date: 08/20/09 18:06:00
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org