This is a known encoding issue with PDFBox text extraction: not all Chinese PDF files are supported right now.

Ben


[EMAIL PROTECTED] wrote:
Hi all,

While using Nutch 0.8 to parse some Chinese PDF files encoded in GBK, I always get the
error message "Unknown encoding for 'GBK-EUC-H'". Should I change some settings
or recompile the parse-pdf plugin?

thanks


