[ http://issues.apache.org/jira/browse/NUTCH-224?page=comments#action_12442140 ]

KuroSaka TeruHiko commented on NUTCH-224:
-----------------------------------------
[[ Old comment, sent by email on Tue, 13 Jun 2006 18:17:48 -0700 ]]

Thank you for taking care of this bug.

I can't read or write Korean. I reported this bug because the code does not look like it is able to handle Korean characters. So I can't really test the code. Your code inspection would be as good as mine.

Perhaps you can find some Korean volunteers on the nutch-user ML?

-kuro

> Nutch doesn't handle Korean text at all
> ---------------------------------------
>
>                 Key: NUTCH-224
>                 URL: http://issues.apache.org/jira/browse/NUTCH-224
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 0.7.1
>            Reporter: KuroSaka TeruHiko
>
> I was browsing NutchAnalysis.jj and found that
> Hangul Syllables (U+AC00 ... U+D7AF; U+xxxx means
> a Unicode character of the hex value xxxx) are not
> part of the LETTER or CJK class. This suggests to me that
> Nutch cannot handle Korean documents at all.
> I posted the above message to the nutch-user ML and Cheolgoo Kang [EMAIL PROTECTED]
> replied as:
> ------------------------------------------------------------------------------------
> There was a similar issue with Lucene's StandardTokenizer.jj.
> http://issues.apache.org/jira/browse/LUCENE-444
> and
> http://issues.apache.org/jira/browse/LUCENE-461
> I have almost no experience with Nutch, but you can handle it like
> those issues above.
> ------------------------------------------------------------------------------------
> Both fixes should probably be ported back to NutchAnalysis.jj.
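
For anyone who, like the reporter, cannot read Korean but wants to sanity-check the range quoted in the report, the following is a minimal sketch in plain Java. It does not touch NutchAnalysis.jj or any Nutch or Lucene API; the class name HangulCheck and its two methods are hypothetical, made up for this illustration. It only compares the U+AC00 ... U+D7AF range from the report against the JDK's own Unicode block table.

```java
import java.lang.Character.UnicodeBlock;

// Hypothetical helper for illustration only; not part of Nutch or Lucene.
public class HangulCheck {
    // Range quoted in the report: Hangul Syllables, U+AC00 ... U+D7AF.
    public static final int HANGUL_START = 0xAC00;
    public static final int HANGUL_END = 0xD7AF;

    /** True if the code point lies in the range quoted in the report. */
    public static boolean inReportedRange(int codePoint) {
        return codePoint >= HANGUL_START && codePoint <= HANGUL_END;
    }

    /** Same question, answered by the JDK's Unicode block table. */
    public static boolean isHangulSyllable(int codePoint) {
        return UnicodeBlock.of(codePoint) == UnicodeBlock.HANGUL_SYLLABLES;
    }

    public static void main(String[] args) {
        // Two Hangul syllables (U+D55C, U+AE00) followed by ASCII text.
        String sample = "\uD55C\uAE00 Nutch";
        sample.codePoints().forEach(cp ->
            System.out.printf("U+%04X reported-range=%b jdk-block=%b%n",
                cp, inReportedRange(cp), isHangulSyllable(cp)));
    }
}
```

A tokenizer grammar whose LETTER and CJK classes both exclude this range would presumably not treat such characters as word characters, which is the concern raised in the report; whether the ported Lucene fixes cover it still needs a reader of Korean to confirm on real documents.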