[ https://issues.apache.org/jira/browse/PDFBOX-4549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17010851#comment-17010851 ]
Jorge Spinsanti commented on PDFBOX-4549:
-----------------------------------------

[~tilman] about the comment of [~tallison]
{quote}These are good points Michael Klink. See e.g.: [http://www.vintasoft.com/forums/viewtopic.php?t=2320] for willful/intentional obfuscation of test.
{quote}
Can you predict the obfuscation without text extraction?

If yes, [~tallison] could use it to throw on Tika an exception such as `PDFProtectedException` or similar?

> No Unicode mapping
> ------------------
>
>                 Key: PDFBOX-4549
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4549
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.15
>            Reporter: Sergey Makarov
>            Assignee: Tilman Hausherr
>            Priority: Major
>             Fix For: 2.0.16, 3.0.0 PDFBox
>
>         Attachments: XO_Thames.zip, our_star_wars.pdf
>
>
> Hello, if i try get text from pdf (attached), i will result empty out and
> many warns. Font attached also.
> Acrobat reader will open succeed, I can select, copy text and save as text
> my code:
> {code:java}
> private static void parseOne(String path) throws IOException {
>     String pdfFileInText;
>     PDFTextStripper tStripper;
>     File file = new File(path);
>     tStripper = new PDFTextStripper();
>     MemoryUsageSetting memUsageSetting = MemoryUsageSetting.setupMixed(0,
> 500000000).setTempDir(new File("/home/user/pdfBoxTest/newFiles/"));
>     PDDocument document = PDDocument.load(file, memUsageSetting);
>     if (!document.isEncrypted()) {
>         pdfFileInText = tStripper.getText(document);
>         System.out.print(pdfFileInText);
>     }
>     document.close();
> }{code}
> Error:
> {code:java}
> May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDFont <init>
> WARNING: Invalid ToUnicode CMap in font HPDFAA+XOThames
> May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+83 (83) in font HPDFAA+XOThames
> May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+116 (116) in font HPDFAA+XOThames
> May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+97 (97) in font HPDFAA+XOThames
> May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+114 (114) in font HPDFAA+XOThames
> May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+87 (87) in font HPDFAA+XOThames
> May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+115 (115) in font HPDFAA+XOThames
> May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDFont <init>
> WARNING: Invalid ToUnicode CMap in font HPDFAB+DejaVuSansMono,Book
> {code}
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
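
As a rough illustration of the idea in the comment above, the warnings quoted in the report could be tallied per font and compared against a cut-off before a caller decides to signal a problem. This is only a sketch under stated assumptions: it works on the warning text exactly as it appears in the quoted log and uses no PDFBox or Tika API. The class name `UnmappedCidCounter`, the method names, and the threshold are hypothetical, and nothing here detects obfuscation without first attempting extraction.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox or Tika.
public class UnmappedCidCounter {

    // Matches the warning text quoted in the report, for example:
    // "WARNING: No Unicode mapping for CID+83 (83) in font HPDFAA+XOThames"
    private static final Pattern NO_MAPPING = Pattern.compile(
            "No Unicode mapping for CID\\+(\\d+) \\((\\d+)\\) in font (\\S+)");

    /** Counts "No Unicode mapping" warnings per font name, in first-seen order. */
    public static Map<String, Integer> countByFont(List<String> logLines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : logLines) {
            Matcher m = NO_MAPPING.matcher(line);
            if (m.find()) {
                counts.merge(m.group(3), 1, Integer::sum);
            }
        }
        return counts;
    }

    /**
     * Assumed heuristic: treat the text as unextractable when the share of
     * unmapped glyphs reaches the given threshold (a value in 0.0 to 1.0).
     * Returns false when totalGlyphs is not positive.
     */
    public static boolean looksUnextractable(int unmapped, int totalGlyphs, double threshold) {
        if (totalGlyphs <= 0) {
            return false;
        }
        return (double) unmapped / totalGlyphs >= threshold;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
                "WARNING: Invalid ToUnicode CMap in font HPDFAA+XOThames",
                "WARNING: No Unicode mapping for CID+83 (83) in font HPDFAA+XOThames",
                "WARNING: No Unicode mapping for CID+116 (116) in font HPDFAA+XOThames");
        // Only the two "No Unicode mapping" lines are counted.
        System.out.println(countByFont(log));
    }
}
```

A caller could combine the per-font count with the number of glyphs it tried to extract and, above some chosen threshold, raise an exception of its own, such as the `PDFProtectedException` suggested above. Whether that is an appropriate signal is a design decision for the Tika side.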