[Nutch-general] RE: Can't index Japanese PDF

Richard Braman Wed, 22 Mar 2006 20:25:10 -0800

I would forward this to [EMAIL PROTECTED]

-----Original Message-----
From: Teruhiko Kurosaka [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, March 21, 2006 12:23 PM
To: [email protected]
Subject: Can't index Japanese PDF



In my quick experiments, Nutch 0.7.1 (with the bundled PDFBox,
which I thought wouldn't handle multibyte text) can
index Chinese PDFs but not Japanese PDFs.  Japanese PDFs
can be hit by entering an English word that happens to
appear in them, but the digest lines are shown corrupted.
Apparently, PDFBox has a problem extracting text from
Japanese PDFs only, not from PDFs in other languages.
It's not a multibyte vs. single-byte issue.
Has anybody had any success indexing Japanese PDFs?

-kuro



