Nutch doesn't handle Korean text at all
---------------------------------------

         Key: NUTCH-224
         URL: http://issues.apache.org/jira/browse/NUTCH-224
     Project: Nutch
        Type: Bug
  Components: indexer
    Versions: 0.7.1
    Reporter: KuroSaka TeruHiko

I was browsing NutchAnalysis.jj and found that Hangul Syllables (U+AC00 ... U+D7AF; U+xxxx means a Unicode character with the hex value xxxx) are not part of the LETTER or CJK class. It seems to me that Nutch cannot handle Korean documents at all.

I posted the above message to the nutch-user mailing list, and Cheolgoo Kang [EMAIL PROTECTED] replied:

------------------------------------------------------------------------------------
There was a similar issue with Lucene's StandardTokenizer.jj:

http://issues.apache.org/jira/browse/LUCENE-444
and
http://issues.apache.org/jira/browse/LUCENE-461

I have almost no experience with Nutch, but you can handle it like those issues above.
------------------------------------------------------------------------------------

Both fixes should probably be ported to NutchAnalysis.jj.
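
To make the range in the report concrete, here is a minimal Java sketch that tests whether a character falls inside U+AC00 ... U+D7AF. It is not taken from Nutch or Lucene and it is not the grammar fix itself; the class name HangulCheck and both helper methods are hypothetical, chosen only for illustration. The sketch shows which characters a LETTER or CJK definition would need to cover before Korean text can be tokenized.

```java
class HangulCheck {
    // Range quoted in the report: the Hangul Syllables block, U+AC00 through U+D7AF.
    static final int HANGUL_FIRST = 0xAC00;
    static final int HANGUL_LAST = 0xD7AF;

    // Hypothetical helper: true if c falls inside the range from the report.
    static boolean isHangulSyllable(char c) {
        return c >= HANGUL_FIRST && c <= HANGUL_LAST;
    }

    // Hypothetical helper: counts the characters of text inside that range.
    static int countHangul(String text) {
        int n = 0;
        for (int i = 0; i < text.length(); i++) {
            if (isHangulSyllable(text.charAt(i))) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        // U+D55C and U+AE00 spell the word "Hangul"; escapes keep the source ASCII-only.
        String sample = "\uD55C\uAE00 Nutch";
        System.out.println(countHangul(sample));

        // Cross-check against the JDK's own Unicode block table.
        System.out.println(Character.UnicodeBlock.of(sample.charAt(0))
                == Character.UnicodeBlock.HANGUL_SYLLABLES);
    }
}
```

A tokenizer grammar whose character classes exclude this range has no token type that matches these characters, which is the behavior the report describes for NutchAnalysis.jj.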