Nutch doesn't handle Korean text at all
---------------------------------------

         Key: NUTCH-224
         URL: http://issues.apache.org/jira/browse/NUTCH-224
     Project: Nutch
        Type: Bug
  Components: indexer  
    Versions: 0.7.1    
    Reporter: KuroSaka TeruHiko


I was browsing NutchAnalysis.jj and found that
Hangul Syllables (U+AC00 ... U+D7AF; U+xxxx means
a Unicode character of the hex value xxxx) are not
part of the LETTER or CJK class.  It seems to me that
Nutch cannot handle Korean documents at all.
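
To make the range concrete, here is a minimal Java sketch that tests whether
a character falls in the U+AC00 ... U+D7AF range cited above. It is not taken
from Nutch; the class and method names (HangulRangeCheck, isHangulSyllable,
countHangulSyllables) are hypothetical and exist only for illustration.

```java
public class HangulRangeCheck {
    // Bounds of the Hangul Syllables range as cited in this report.
    static final char HANGUL_START = '\uAC00';
    static final char HANGUL_END = '\uD7AF';

    // True if c lies inside the cited Hangul Syllables range.
    static boolean isHangulSyllable(char c) {
        return c >= HANGUL_START && c <= HANGUL_END;
    }

    // Counts how many chars of the text fall inside that range.
    static int countHangulSyllables(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (isHangulSyllable(text.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Three Hangul syllables (U+D55C U+AD6D U+C5B4) followed by ASCII text.
        String sample = "\uD55C\uAD6D\uC5B4 text";
        System.out.println(countHangulSyllables(sample) + " of "
                + sample.length() + " chars are Hangul syllables");
    }
}
```

A character class in the analyzer grammar would need to cover these code
points for Korean text to be tokenized.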

I posted the above message to the nutch-user ML, and Cheolgoo Kang [EMAIL PROTECTED]
replied:
------------------------------------------------------------------------------------
There was a similar issue with Lucene's StandardTokenizer.jj.

http://issues.apache.org/jira/browse/LUCENE-444

and

http://issues.apache.org/jira/browse/LUCENE-461

I have almost no experience with Nutch, but you can handle it like
those issues above.
------------------------------------------------------------------------------------

Both fixes should probably be ported back to NutchAnalysis.jj.





