You are correct that this is not a 1-byte versus 2-byte problem; it is an
encoding issue.  A PDF can encode text in many ways, and CJK languages
happen to be more complex.  PDFBox supports some of these cases but not
all.
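
To illustrate the point in general terms, here is a minimal sketch in plain
Java.  It does not use PDFBox at all, and the class name EncodingDemo and the
choice of UTF-16BE and ISO-8859-1 are assumptions made only for the example.
It shows that multibyte text survives when the bytes are decoded with the
mapping they were written in, and is corrupted when a different mapping is
applied to the very same bytes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Decode raw bytes with the given charset (hypothetical helper for this sketch).
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // Three Japanese characters, written with escapes so the source file
        // encoding does not matter.
        String japanese = "\u65E5\u672C\u8A9E";

        // UTF-16BE stores each of these characters in two bytes.
        byte[] raw = japanese.getBytes(StandardCharsets.UTF_16BE);

        // Same bytes, two different mappings.
        String right = decode(raw, StandardCharsets.UTF_16BE);
        String wrong = decode(raw, StandardCharsets.ISO_8859_1);

        System.out.println("bytes: " + raw.length);
        System.out.println("correct mapping round-trips: " + right.equals(japanese));
        System.out.println("wrong mapping round-trips: " + wrong.equals(japanese));
    }
}
```

The byte count is the same in both cases; only the choice of mapping differs,
which is why the symptom can show up for one language and not another.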

I would first encourage you to try the nightly build of PDFBox at
http://www.pdfbox.org/dist because there have been a couple of fixes
since the 0.7.2 release.  If that still does not fix the problem, please
create an issue on SourceForge and attach the problem PDF (or upload it
to ftp.pdfbox.org).  Usually it can be fixed pretty quickly.

Ben


On Wed, 22 Mar 2006, Richard Braman wrote:

> I would forward this to [EMAIL PROTECTED]
>
> -----Original Message-----
> From: Teruhiko Kurosaka [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, March 21, 2006 12:23 PM
> To: [email protected]
> Subject: Can't index Japanese PDF
>
>
> In my quick experiments, Nutch 0.7.1 (with bundled PDFBox
> which I thought wouldn't handle multibyte text) can
> index Chinese PDFs but not Japanese PDFs.  Japanese PDFs
> can be hit by entering an English word that happens to
> appear in them but the digest lines are shown corrupted.
> Apparently, PDFBox has a problem extracting text
> from Japanese PDFs only, not from PDFs in other languages.
> It's not a multibyte vs single byte issue.
> Has anybody had any success indexing Japanese PDFs?
>
> -kuro
>
>
>
> -------------------------------------------------------
> This SF.Net email is sponsored by xPML, a groundbreaking scripting language
> that extends applications into web and mobile media. Attend the live webcast
> and join the prime developer group breaking into this new coding territory!
> http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642
> _______________________________________________
> PDFBox-user mailing list
> [EMAIL PROTECTED]
> https://lists.sourceforge.net/lists/listinfo/pdfbox-user
>


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
