[ https://issues.apache.org/jira/browse/NUTCH-224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12477190 ]
Steve Severance commented on NUTCH-224:
---------------------------------------

The PDF parser in 0.8.1 also fails on Korean text.

Steve

> Nutch doesn't handle Korean text at all
> ---------------------------------------
>
>                 Key: NUTCH-224
>                 URL: https://issues.apache.org/jira/browse/NUTCH-224
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 0.7.1
>            Reporter: KuroSaka TeruHiko
>
> I was browsing NutchAnalysis.jj and found that
> Hangul Syllables (U+AC00 ... U+D7AF; U+xxxx means
> a Unicode character of the hex value xxxx) are not
> part of the LETTER or CJK class. It seems to me that
> Nutch cannot handle Korean documents at all.
> I posted the above message to the nutch-user ML, and Cheolgoo Kang [EMAIL PROTECTED]
> replied:
> ------------------------------------------------------------------------------------
> There was a similar issue with Lucene's StandardTokenizer.jj.
> http://issues.apache.org/jira/browse/LUCENE-444
> and
> http://issues.apache.org/jira/browse/LUCENE-461
> I have almost no experience with Nutch, but you can handle it like
> those issues above.
> ------------------------------------------------------------------------------------
> Both fixes should probably be ported back to NutchAnalysis.jj.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
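
The gap described in the issue (Hangul Syllables falling outside both the LETTER and CJK classes) can be illustrated with a small sketch. The Hangul bounds are the ones quoted in the issue; the LETTER and CJK ranges below are hypothetical stand-ins chosen for illustration and are not the actual definitions from NutchAnalysis.jj. The helper names are likewise made up for this example.

```python
# Hangul Syllables block, using the bounds quoted in the issue text.
HANGUL_SYLLABLES = (0xAC00, 0xD7AF)

# HYPOTHETICAL character classes for illustration only; these are NOT the
# actual LETTER / CJK definitions from NutchAnalysis.jj.
HYPOTHETICAL_CLASSES = {
    "LETTER": [(0x0041, 0x005A), (0x0061, 0x007A)],  # basic Latin letters
    "CJK": [(0x3040, 0x30FF), (0x4E00, 0x9FFF)],     # kana, CJK ideographs
}


def in_ranges(code_point, ranges):
    """Return True if code_point lies inside any inclusive (low, high) range."""
    return any(low <= code_point <= high for low, high in ranges)


def is_hangul_syllable(ch):
    """Return True if the single character ch is in the Hangul Syllables block."""
    return in_ranges(ord(ch), [HANGUL_SYLLABLES])


def unclassified_chars(text, classes):
    """Return the characters of text that no class in `classes` covers."""
    all_ranges = [r for ranges in classes.values() for r in ranges]
    return [ch for ch in text if not in_ranges(ord(ch), all_ranges)]


if __name__ == "__main__":
    sample = "abc\ud55c\uad6d\u6f22"  # Latin letters, two Hangul syllables, one ideograph
    print(unclassified_chars(sample, HYPOTHETICAL_CLASSES))
    # Adding a class for the Hangul range closes the gap in this toy model.
    extended = dict(HYPOTHETICAL_CLASSES, HANGUL=[HANGUL_SYLLABLES])
    print(unclassified_chars(sample, extended))
```

In this toy model, characters that belong to no class are the ones a tokenizer built on those classes would have no rule for; whether the real grammar drops or splits them depends on rules not shown in this message.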