[ https://issues.apache.org/jira/browse/PDFBOX-4549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16842353#comment-16842353 ]
ASF subversion and git services commented on PDFBOX-4549:
---------------------------------------------------------
Commit 1859443 from Tilman Hausherr in branch 'pdfbox/branches/issue45'
[ https://svn.apache.org/r1859443 ]
PDFBOX-4549: assume Identity-H when ToUnicode stream has no entries and
ToUnicode Ordering and Encoding have Identity-H
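
The fallback described in this commit message can be illustrated with a minimal, self-contained sketch. It is not the code from r1859443 and it does not use PDFBox classes: `ToUnicodeSketch`, its fields, and its `toUnicode` method are hypothetical names, and the exact condition (empty mapping table, CMap ordering mentioning "Identity", font encoding equal to "Identity-H") is one reading of the commit message, assumed here for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a parsed ToUnicode CMap; not a PDFBox class. */
final class ToUnicodeSketch {
    private final String ordering; // ordering declared by the CMap, may be null
    private final Map<Integer, String> mappings = new HashMap<>();

    ToUnicodeSketch(String ordering) {
        this.ordering = ordering;
    }

    void addMapping(int code, String unicode) {
        mappings.put(code, unicode);
    }

    /**
     * Returns the Unicode string for a character code, or null if unknown.
     * Assumed rule: an empty table plus Identity ordering and Identity-H
     * encoding means the code is treated as the Unicode code point itself.
     */
    String toUnicode(int code, String fontEncoding) {
        if (!mappings.isEmpty()) {
            return mappings.get(code); // normal case: explicit entries win
        }
        boolean identity = ordering != null
                && ordering.contains("Identity")
                && "Identity-H".equals(fontEncoding);
        if (identity && Character.isValidCodePoint(code)) {
            return new String(Character.toChars(code));
        }
        return null; // caller would log "No Unicode mapping"
    }

    public static void main(String[] args) {
        ToUnicodeSketch empty = new ToUnicodeSketch("Identity");
        StringBuilder text = new StringBuilder();
        // First four codes from the warnings quoted below
        for (int cid : new int[] {83, 116, 97, 114}) {
            text.append(empty.toUnicode(cid, "Identity-H"));
        }
        System.out.println(text); // prints "Star"
    }
}
```

The codes in the quoted warnings (83, 116, 97, 114, 87, 115) coincide with the ASCII letters S, t, a, r, W, s, which is at least consistent with an identity mapping for this particular file.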
> No Unicode mapping
> ------------------
>
> Key: PDFBOX-4549
> URL: https://issues.apache.org/jira/browse/PDFBOX-4549
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.15
> Reporter: Sergey Makarov
> Priority: Major
> Attachments: XO_Thames.zip, our_star_wars.pdf
>
>
> Hello. When I try to extract text from the attached PDF, I get empty output
> and many warnings. The font is attached as well.
> Acrobat Reader opens the file successfully; I can select and copy text and
> save it as text.
> My code:
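> {code:java}
> import java.io.File;
> import java.io.IOException;
> import org.apache.pdfbox.io.MemoryUsageSetting;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.text.PDFTextStripper;
>
> public class PdfTextTest {
>     private static void parseOne(String path) throws IOException {
>         File file = new File(path);
>         PDFTextStripper tStripper = new PDFTextStripper();
>         // Arguments: max main-memory bytes, max total storage bytes;
>         // temp files are written to the given directory.
>         MemoryUsageSetting memUsageSetting = MemoryUsageSetting.setupMixed(0, 500000000)
>                 .setTempDir(new File("/home/user/pdfBoxTest/newFiles/"));
>         // try-with-resources closes the document even if extraction throws.
>         try (PDDocument document = PDDocument.load(file, memUsageSetting)) {
>             if (!document.isEncrypted()) {
>                 String pdfFileInText = tStripper.getText(document);
>                 System.out.print(pdfFileInText);
>             }
>         }
>     }
> }{code}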
> Error:
> {code:java}
> May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDFont <init>
> WARNING: Invalid ToUnicode CMap in font HPDFAA+XOThames
> May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+83 (83) in font HPDFAA+XOThames
> May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+116 (116) in font HPDFAA+XOThames
> May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+97 (97) in font HPDFAA+XOThames
> May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+114 (114) in font HPDFAA+XOThames
> May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+87 (87) in font HPDFAA+XOThames
> May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+115 (115) in font HPDFAA+XOThames
> May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDFont <init>
> WARNING: Invalid ToUnicode CMap in font HPDFAB+DejaVuSansMono,Book
> {code}
>