Huan LI created PDFBOX-1304:
-------------------------------

             Summary: Text extraction meets "Could not parse predefined CMAP" and returns just a small part of the content containing garbage chars.
                 Key: PDFBOX-1304
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1304
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.6.0
         Environment: Win7 32bits
            Reporter: Huan LI

I'm using pdfbox-1.6.0 to extract text from a Chinese PDF file (see the attachment "fj.pdf"). The extraction code looks like this:

[code]
stripper = new PDFTextStripper(encoding);
txt = stripper.getText(_pdfDoc);
[/code]

When getText() runs, the console prints:

[console]
May 06, 2012 4:13:51 PM org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKU2-UCS2'
May 06, 2012 4:13:51 PM org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKU2-UCS2'
May 06, 2012 4:13:51 PM org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKU2-UCS2'
May 06, 2012 4:13:51 PM org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKUE1-UCS2'
May 06, 2012 4:13:51 PM org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKUO1-UCS2'
May 06, 2012 4:13:51 PM org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKUF1-UCS2'
May 06, 2012 4:13:51 PM org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKU2-UCS2'
May 06, 2012 4:13:51 PM org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKUF1-UCS2'
May 06, 2012 4:13:51 PM org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKU2-UCS2'
May 06, 2012 4:13:51 PM org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKUE1-UCS2'
May 06, 2012 4:13:51 PM org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKUE1-UCS2'
May 06, 2012 4:13:51 PM org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKUF1-UCS2'
May 06, 2012 4:13:51 PM org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKU2-UCS2'
[/console]

After getText() returns, txt contains only a small part of the PDF content (much of it is missing) along with garbage characters such as "犖犑狌犣犎犗犝犔犻犺犅" (see the attachment "fj.txt").

I have heard that org.apache.pdfbox.cos.COSString.java had some errors in pdfbox-0.7.3. Has COSString.java been corrected in 1.6.0?
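
To make it easier to see which predefined CMaps could not be parsed, the console output above can be tallied by CMap name. The following is only a minimal sketch, assuming the log text has been captured into a string; the class name CmapLogSummary and the method countMissingCmaps are hypothetical helpers written for this report and are not part of PDFBox.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for this report; it is not a PDFBox class.
public class CmapLogSummary {

    // Matches the error message text shown in the console output of this report.
    private static final Pattern MISSING_CMAP =
            Pattern.compile("Could not parse predefined CMAP file for '([^']+)'");

    /** Counts how often each CMap name is reported, in order of first appearance. */
    public static Map<String, Integer> countMissingCmaps(String log) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        Matcher m = MISSING_CMAP.matcher(log);
        while (m.find()) {
            // group(1) is the CMap name between the single quotes
            counts.merge(m.group(1), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Shortened sample; in practice, paste the captured console output here.
        String sample =
                "Error: Could not parse predefined CMAP file for 'Founder-PKU2-UCS2'\n"
              + "Error: Could not parse predefined CMAP file for 'Founder-PKUE1-UCS2'\n"
              + "Error: Could not parse predefined CMAP file for 'Founder-PKU2-UCS2'\n";
        System.out.println(countMissingCmaps(sample));
    }
}
```

Applied to the full console output in this report, the distinct names should come out as Founder-PKU2-UCS2, Founder-PKUE1-UCS2, Founder-PKUO1-UCS2 and Founder-PKUF1-UCS2, which are the CMaps the attached file appears to rely on.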