Re: Any Tokenizator friendly to C++, C#, .NET, etc ?

John Byrne Fri, 21 Aug 2009 04:10:47 -0700

Valery wrote:

Hi John,

(aren't you the same John Byrne who is a key contributor to the great
OpenSSI project?)

Nope, never heard of him! But with a great name like that I'm sure he'll go a long way :)

John Byrne-3 wrote:
I'm inclined to disagree with the idea that a token should not be split again downstream. I think that is actually a much easier way to handle it. I would have the tokenizer return the longest match, and then split it in a token filter. In fact I have done this before and it has worked fine for me.
Well, I could soften my position: if the token re-parsing is done by looking
into the current lexeme's value only, then it might perhaps be acceptable. In
contrast, if during your re-parsing you have to look into the upstream
character data "several filters backwards", then, IMHO, it is rather messy
and unacceptable.

If I understand you correctly, that's pretty much what I meant. By having the first tokenizer pass larger tokens, and splitting them in the filter, you never have to look upstream while storing state. You only look upstream for a new token after you are finished splitting the last one and sending the parts downstream.
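
A minimal sketch of that arrangement, written in plain Python rather than against the actual Lucene API. Every name here (`coarse_tokenize`, `SplittingFilter`, the `PART` pattern) is hypothetical, and the split rule is only an assumption chosen to keep names like C++, C# and .NET in one piece. The point is the control flow: the filter keeps a queue of pending parts and asks upstream for the next coarse token only once that queue is empty.

```python
import re
from collections import deque
from typing import Iterable, Iterator


def coarse_tokenize(text: str) -> Iterator[str]:
    """First pass: emit the longest runs of non-whitespace characters."""
    for match in re.finditer(r"\S+", text):
        yield match.group()


# Hypothetical split rule (an assumption for illustration only):
# a word with an optional leading '.' and an optional trailing '++' or '#',
# or a run of digits, or any other single non-space character.
PART = re.compile(r"\.?[A-Za-z][A-Za-z0-9]*(?:\+\+|#)?|\d+|[^\sA-Za-z0-9]")


class SplittingFilter:
    """Splits each coarse token by looking at that token's own text only."""

    def __init__(self, upstream: Iterable[str]):
        self._upstream = iter(upstream)
        self._pending = deque()  # parts of the current coarse token not yet sent

    def __iter__(self):
        return self

    def __next__(self) -> str:
        # Pull from upstream only after the previous token is fully emitted.
        while not self._pending:
            coarse = next(self._upstream)  # StopIteration here ends the stream
            self._pending.extend(m.group() for m in PART.finditer(coarse))
        return self._pending.popleft()


if __name__ == "__main__":
    for token in SplittingFilter(coarse_tokenize("C++, C#, .NET, etc ?")):
        print(token)
```

The only state the filter carries is the queue of parts from the token it is currently splitting, so it never needs to re-read character data from earlier stages.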

Regarding this part:

John Byrne-3 wrote:
I think you will have to maintain some state within the token filter [...]
I would wait for Simon's answer to the question "What do you expect from the
Tokenizer?"

Then I will give my 2 cents on this, and perhaps then I could sum up all
opinions and we could reach a common conclusion.
:)

regards
Valery