Hi all.

I have a question about Korean tokenisation. Currently there is a rule in StandardTokenizerImpl.jflex which looks like this:

ALPHANUM   = ({LETTER}|{DIGIT}|{KOREAN})+

I'm wondering if there was some good reason why it isn't:

ALPHANUM   = (({LETTER}|{DIGIT})+|{KOREAN}+)

Basically I'm seeing some tokens come back with mixed digits and Hangul, and I'm questioning the correctness of that.

Disclaimer: we're not performing any further processing of Korean in subsequent filters at the current point in time, and I don't know the language either.

Daniel


--
Daniel Noll                            Forensic and eDiscovery Software
Senior Developer                              The world's most advanced
Nuix                                                email data analysis
http://nuix.com/                                and eDiscovery software

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to