StandardTokenizer and Korean grouping with alphanum

Daniel Noll Sun, 21 Sep 2008 21:50:12 -0700

Hi all.

I have a question about Korean tokenisation. Currently there is a rule in StandardTokenizerImpl.jflex which looks like this:


ALPHANUM   = ({LETTER}|{DIGIT}|{KOREAN})+

I'm wondering if there was some good reason why it isn't:

ALPHANUM   = (({LETTER}|{DIGIT})+|{KOREAN}+)

Basically I'm seeing some tokens come back with mixed digits and Hangul, and I'm questioning the correctness of that.
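
To illustrate the difference, here is a rough sketch that emulates the two rules with Python's re module rather than JFlex. The character classes are simplified assumptions, not the grammar's real definitions: LETTER is reduced to ASCII letters, DIGIT to ASCII digits, and KOREAN to the Hangul syllable and Hangul Jamo ranges. The sample string and the tokens() helper are made up for the example.

```python
import re

# Simplified stand-ins for the grammar's macros (assumptions, not the
# real JFlex definitions, which cover far more characters).
LETTER = "A-Za-z"
DIGIT = "0-9"
KOREAN = "\uac00-\ud7af\u1100-\u11ff"  # Hangul syllables + Hangul Jamo

# Current rule: ALPHANUM = ({LETTER}|{DIGIT}|{KOREAN})+
CURRENT_RULE = re.compile(f"[{LETTER}{DIGIT}{KOREAN}]+")

# Proposed rule: ALPHANUM = (({LETTER}|{DIGIT})+|{KOREAN}+)
PROPOSED_RULE = re.compile(f"[{LETTER}{DIGIT}]+|[{KOREAN}]+")


def tokens(rule, text):
    """Return every non-overlapping match of the rule, left to right."""
    return rule.findall(text)


# Hypothetical input: a Latin word, digits directly followed by Hangul,
# and a Hangul-only word.
sample = "Lucene 2008년 검색"

print(tokens(CURRENT_RULE, sample))   # digits and Hangul stay in one token
print(tokens(PROPOSED_RULE, sample))  # the token splits at the script boundary
```

With the current rule the digits and the following Hangul come back as a single token; with the proposed rule they come back as two.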

Disclaimer: we're not performing any further processing of Korean in subsequent filters at the current point in time, and I don't know the language either.


Daniel


--
Daniel Noll                            Forensic and eDiscovery Software
Senior Developer                              The world's most advanced
Nuix                                                email data analysis
http://nuix.com/                                and eDiscovery software
