I would forward this to [EMAIL PROTECTED]

-----Original Message-----
From: Teruhiko Kurosaka [mailto:[EMAIL PROTECTED]
Sent: Tuesday, March 21, 2006 12:23 PM
To: [email protected]
Subject: Can't index Japanese PDF

In my quick experiments, Nutch 0.7.1 (with the bundled PDFBox, which I thought wouldn't handle multibyte text) can index Chinese PDFs but not Japanese PDFs. Japanese PDFs do come up as hits when I search for an English word that happens to appear in them, but the digest lines are shown corrupted. Apparently, PDFBox has a problem extracting text from Japanese PDFs only, not from PDFs in other languages. It's not a multibyte vs. single-byte issue.

Has anybody had any success indexing Japanese PDFs?

-kuro

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
