[ 
https://issues.apache.org/jira/browse/PDFBOX-4549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17010421#comment-17010421
 ] 

Michael Klink commented on PDFBOX-4549:
---------------------------------------

[~Giorgy]

The problem is that a syntactically correct PDF may still produce gibberish 
during text extraction. As PDFBox does not dive into semantics, it won't 
identify such situations for you.

Thus, unless you have guarantees that your input PDFs can be expected to 
provide proper information for text extraction, you will always have to check.

Even worse, a PDF may be built deliberately to deceive text extraction, 
for example by leaving regular text untouched but swapping digits in numbers. 
In such a case even dictionary checks won't help.

Thus, I'm sorry, what I have are not really constructive ideas, merely 
warnings. Essentially: Don't trust text extracted from PDFs per se. Restrict 
yourself to PDFs from sources that guarantee they provide correct information 
for text extraction in their PDFs. Or at least double check.
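
As one possible form of that double check, here is a minimal sketch of a plausibility test on the extracted string. It is not part of PDFBox; the class name, method names and the 0.05 threshold are assumptions chosen for illustration. It only flags output that is blank or dominated by replacement, control, private-use or unassigned characters, which is what a broken ToUnicode mapping tends to produce. It cannot detect a PDF that deliberately maps glyphs to other valid characters, such as swapped digits.

```java
// Heuristic sanity check for text returned by a PDF text extractor.
// Hypothetical helper: ExtractionSanityCheck, suspiciousRatio, looksPlausible
// and the threshold used in main() are illustrative, not PDFBox API.
final class ExtractionSanityCheck {

    private ExtractionSanityCheck() {}

    /** Fraction of code points that rarely occur in genuine text (0.0 to 1.0). */
    static double suspiciousRatio(String text) {
        if (text.trim().isEmpty()) {
            return 1.0; // blank output is itself a red flag
        }
        int total = 0;
        int suspicious = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            total++;
            if (isSuspicious(cp)) {
                suspicious++;
            }
        }
        return (double) suspicious / total;
    }

    private static boolean isSuspicious(int cp) {
        if (cp == '\n' || cp == '\r' || cp == '\t') {
            return false; // ordinary layout characters
        }
        if (cp == 0xFFFD) {
            return true; // Unicode replacement character
        }
        int type = Character.getType(cp);
        return type == Character.CONTROL
                || type == Character.PRIVATE_USE
                || type == Character.UNASSIGNED
                || type == Character.SURROGATE; // unpaired surrogate
    }

    /** True if the share of suspicious code points does not exceed the limit. */
    static boolean looksPlausible(String text, double maxSuspiciousRatio) {
        return suspiciousRatio(text) <= maxSuspiciousRatio;
    }

    public static void main(String[] args) {
        System.out.println(looksPlausible("Star Wars", 0.05));                // true
        System.out.println(looksPlausible("\uE053\uE074\uE061\uE072", 0.05)); // false
        System.out.println(looksPlausible("", 0.05));                         // false
    }
}
```

In practice the string returned by the text stripper would be passed to `looksPlausible`, and documents that fail would be routed to manual review or OCR. A passing result says only that the text is not obviously broken, not that it is correct.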

> No Unicode mapping
> ------------------
>
>                 Key: PDFBOX-4549
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4549
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.15
>            Reporter: Sergey Makarov
>            Assignee: Tilman Hausherr
>            Priority: Major
>             Fix For: 2.0.16, 3.0.0 PDFBox
>
>         Attachments: XO_Thames.zip, our_star_wars.pdf
>
>
> Hello, if I try to get the text from the attached PDF, I get empty output 
> and many warnings. The font is attached as well.
>  Acrobat Reader opens the file successfully; I can select and copy text and 
> save it as text.
> My code:
> {code:java}
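> import java.io.File;
> import java.io.IOException;
> import org.apache.pdfbox.io.MemoryUsageSetting;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.text.PDFTextStripper;
>
> private static void parseOne(String path) throws IOException {
>     File file = new File(path);
>     // Mixed main-memory/temp-file buffering, with the values from the report.
>     MemoryUsageSetting memUsageSetting = MemoryUsageSetting
>             .setupMixed(0, 500000000)
>             .setTempDir(new File("/home/user/pdfBoxTest/newFiles/"));
>     // try-with-resources closes the document even if extraction throws.
>     try (PDDocument document = PDDocument.load(file, memUsageSetting)) {
>         if (!document.isEncrypted()) {
>             PDFTextStripper tStripper = new PDFTextStripper();
>             String pdfFileInText = tStripper.getText(document);
>             System.out.print(pdfFileInText);
>         }
>     }
> }{code}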
> Error:
> {code:java}
> May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDFont <init>
> WARNING: Invalid ToUnicode CMap in font HPDFAA+XOThames
> May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+83 (83) in font HPDFAA+XOThames
> May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+116 (116) in font HPDFAA+XOThames
> May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+97 (97) in font HPDFAA+XOThames
> May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+114 (114) in font HPDFAA+XOThames
> May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+87 (87) in font HPDFAA+XOThames
> May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+115 (115) in font HPDFAA+XOThames
> May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDFont <init>
> WARNING: Invalid ToUnicode CMap in font HPDFAB+DejaVuSansMono,Book
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
