[Nutch-general] Can't index Japanese PDF

Teruhiko Kurosaka Tue, 21 Mar 2006 09:24:11 -0800

In my quick experiments, Nutch 0.7.1 (with the bundled PDFBox,
which I thought wouldn't handle multibyte text) can
index Chinese PDFs but not Japanese PDFs.  A Japanese PDF
can be hit by entering an English word that happens to
appear in it, but the digest lines are shown corrupted.
Apparently, PDFBox has a problem extracting text only
from Japanese PDFs, not from PDFs in other languages,
so it's not a multibyte vs. single-byte issue.
Has anybody had any success indexing Japanese PDFs?


-kuro


