You are correct that this is not a 1-byte versus 2-byte problem; it is an encoding issue. A PDF can encode text in many ways, and CJK languages happen to be more complex. PDFBox supports some cases but not all.
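As a rough illustration of "encoding, not byte width", here is a small sketch. It uses plain Python text codecs as a stand-in for a PDF font's encoding; this is not how PDFBox decodes text, and the sample string is made up rather than taken from the problem PDF.

```python
# Illustration only: plain text codecs stand in for a PDF font's encoding table.
# The sample string is an assumption, not content from the problem PDF.
text = "日本語"

sjis = text.encode("shift_jis")
eucjp = text.encode("euc_jp")

# Both encodings spend two bytes per character on this string,
# so the byte width alone does not tell them apart.
widths = (len(sjis), len(eucjp))

# Decoding with the matching table recovers the original text.
round_trip = sjis.decode("shift_jis")

# Decoding the same bytes with the wrong table yields garbage,
# much like the corrupted digest lines described in the quoted message.
garbled = sjis.decode("euc_jp", errors="replace")

print(widths)
print(round_trip == text)
print(garbled == text)
```

The point of the sketch is only that the extractor has to pick the right mapping for the bytes it reads; an extractor that handles two-byte text in one encoding can still fail on another.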
I would first encourage you to try the nightly build of PDFBox at http://www.pdfbox.org/dist; there have been a couple of fixes since the 0.7.2 release. If that still does not fix the problem, please create an issue on SourceForge and attach or upload (ftp.pdfbox.org) the problem PDF. Usually it can be fixed pretty quickly.

Ben

On Wed, 22 Mar 2006, Richard Braman wrote:

> I would forward this to [EMAIL PROTECTED]
>
> -----Original Message-----
> From: Teruhiko Kurosaka [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, March 21, 2006 12:23 PM
> To: [email protected]
> Subject: Can't index Japanese PDF
>
>
> In my quick experiments, Nutch 0.7.1 (with bundled PDFBox
> which I thought wouldn't handle multibyte text) can
> index Chinese PDFs but not Japanese PDFs. Japanese PDFs
> can be hit by entering an English word that happens to
> appear in them, but the digest lines are shown corrupted.
> Apparently, PDFBox is having a problem extracting
> text only from Japanese PDFs but not other language PDFs.
> It's not a multibyte vs single byte issue.
> Has anybody had any success indexing Japanese PDFs?
>
> -kuro
>
> _______________________________________________
> PDFBox-user mailing list
> [EMAIL PROTECTED]
> https://lists.sourceforge.net/lists/listinfo/pdfbox-user

_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
