Steven A Rowe wrote:
Korean has been treated differently from Chinese and Japanese since
LUCENE-461 <https://issues.apache.org/jira/browse/LUCENE-461>.  The
grouping of Hangul with digits was introduced in this issue.

I did find LUCENE-461 during my search, and grouping the words together is certainly a lot better *if* there are spaces between them. I have found several cases where there are no spaces, but they are relatively rare, and the way it breaks the text now appears to produce better hits than when it was separating them out.

Really I was just wondering about the digits being mixed in. Maybe it's legitimate to have a digit in the middle of a sequence of Hangul, in the same way that AB3F is a legitimate product code in Latin characters.

You're right though: doing it differently would require a lot of jiggery to restrict the ranges down to each language again, instead of using [:letter:], which is much more convenient.
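
To make the trade-off concrete, here is a rough sketch in plain Java. It is not Lucene's grammar or API; the class and method names are made up for illustration, and it assumes the JDK's Character.UnicodeScript is an acceptable stand-in for per-language ranges. With perScript set to false, any letter or digit continues the current token (roughly the convenient [:letter:]-style grouping). With it set to true, a token breaks whenever the script changes or a digit starts or ends.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a hypothetical helper, not part of Lucene.
class HangulRunSketch {

    // Returns a label for the code point's token class, or null for a separator.
    static String classify(int cp, boolean perScript) {
        if (!Character.isLetterOrDigit(cp)) {
            return null;                       // whitespace, punctuation, etc.
        }
        if (!perScript) {
            return "ALPHANUM";                 // letters and digits share one class
        }
        if (Character.isDigit(cp)) {
            return "DIGIT";                    // digits form their own runs
        }
        return Character.UnicodeScript.of(cp).name(); // e.g. HANGUL, LATIN
    }

    // Splits text into runs of code points that share the same class.
    static List<String> runs(String text, boolean perScript) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String currentClass = null;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            String cls = classify(cp, perScript);
            // Flush the pending token at a separator or a class change.
            if ((cls == null || !cls.equals(currentClass)) && current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            if (cls != null) {
                current.appendCodePoint(cp);
            }
            currentClass = cls;
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

For the input "\uD55C\uAE003\uD55C\uAE00 AB3F" (two Hangul syllables, the digit 3, the same two syllables, a space, then AB3F), runs(text, false) keeps the digit inside both tokens, while runs(text, true) splits each of them into three pieces. That is the behaviour I was asking about: the grouped mode treats the Hangul case the same way as the AB3F case.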

Daniel


--
Daniel Noll                            Forensic and eDiscovery Software
Senior Developer                              The world's most advanced
Nuix                                                email data analysis
http://nuix.com/                                and eDiscovery software
