In my quick experiments, Nutch 0.7.1 (with the bundled PDFBox, which I thought wouldn't handle multibyte text) can index Chinese PDFs but not Japanese PDFs. Japanese PDFs can be hit by entering an English word that happens to appear in them, but the digest lines are shown corrupted. Apparently, PDFBox has a problem extracting text only from Japanese PDFs, not from PDFs in other languages, so it's not a multibyte vs. single-byte issue. Has anybody had any success indexing Japanese PDFs?
-kuro
_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
