Hi all.
I have a question about Korean tokenisation. Currently there is a rule
in StandardTokenizerImpl.jflex which looks like this:
ALPHANUM = ({LETTER}|{DIGIT}|{KOREAN})+
I'm wondering if there was some good reason why it isn't:
ALPHANUM = (({LETTER}|{DIGIT})+|{KOREAN}+)
Basically I'm seeing some tokens come back with mixed digits and Hangul,
and I'm questioning the correctness of that.
Disclaimer: we're not performing any further processing of Korean in
subsequent filters at the current point in time, and I don't know the
language either.
Daniel
--
Daniel Noll Forensic and eDiscovery Software
Senior Developer The world's most advanced
Nuix email data analysis
http://nuix.com/ and eDiscovery software
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]