[jira] [Commented] (PDFBOX-1706) Reading PDF documents that contain special characters (e.g. €) cause warning and invalid parse result

Robert Neumann (JIRA) Tue, 27 Aug 2013 04:24:03 -0700

    [ https://issues.apache.org/jira/browse/PDFBOX-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13751187#comment-13751187 ]


Robert Neumann commented on PDFBOX-1706:
----------------------------------------

I can confirm that extracting the text of the whole document works fine. 
However, when splitting the document, the aforementioned problem occurs. I 
just verified that.

We will modify our code to process the whole document instead of the splits. 
That should do the trick for us. Would you still want to consider this an 
issue?
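
A minimal sketch of what that change could look like, assuming the PDFBox 
1.8.x API (the same org.apache.pdfbox.util package the repro imports). The 
names WholeDocumentExtractor and extractPages are made up for illustration. 
It reads each page from the unsplit document by narrowing the stripper's page 
range, so Splitter is not involved. It is untested against the referenced PDF:

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

// Hypothetical helper, not part of PDFBox.
public class WholeDocumentExtractor {

    // Returns the text of each page, read from the unsplit source document.
    public static List<String> extractPages(File pdfFile) throws IOException {
        PDDocument doc = PDDocument.load(pdfFile);
        try {
            PDFTextStripper stripper = new PDFTextStripper();
            List<String> pages = new ArrayList<String>();
            // Page numbers in PDFTextStripper are 1-based and inclusive.
            for (int page = 1; page <= doc.getNumberOfPages(); page++) {
                stripper.setStartPage(page);
                stripper.setEndPage(page);
                pages.add(stripper.getText(doc));
            }
            return pages;
        } finally {
            // Close only after every page has been read.
            doc.close();
        }
    }

    public static void main(String[] args) throws IOException {
        List<String> pages = extractPages(new File(args[0]));
        System.out.println("Extracted " + pages.size() + " page(s)");
    }
}
```

The source document stays open until all pages have been processed, which is 
the main difference from the repro, where the source is closed before the 
split documents are parsed.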

Thanks for the fast reply!
                
> Reading PDF documents that contain special characters (e.g. €) cause warning 
> and invalid parse result
> -----------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-1706
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1706
>             Project: PDFBox
>          Issue Type: Bug
>          Components: PDFReader, PDModel
>    Affects Versions: 1.8.2, 2.0.0
>         Environment: Windows
>            Reporter: Robert Neumann
>              Labels: patch
>
> When trying to call stripper.getText on the PDF file 
> http://www.edi-energy.de/files2/EDI@Energy%20UTILMD%205.1_20130401.pdf, 
> PDFBox 1.8.2 emits the following warning:
> 08:48:20,222  WARN PDFStreamEngine:567 - java.io.IOException: Error: Could not find font(COSName{F7}) in map={F1=org.apache.pdfbox.pdmodel.font.PDTrueTypeFont@676825b5, F2=org.apache.pdfbox.pdmodel.font.PDTrueTypeFont@547e97d8}
> java.io.IOException: Error: Could not find font(COSName{F7}) in map={F1=org.apache.pdfbox.pdmodel.font.PDTrueTypeFont@676825b5, F2=org.apache.pdfbox.pdmodel.font.PDTrueTypeFont@547e97d8}
>                 at org.apache.pdfbox.util.operator.SetTextFont.process(SetTextFont.java:57)
>                 at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
>                 at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
>                 at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
>                 at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
>                 at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:455)
>                 at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:379)
>                 at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:335)
>                 at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:254)
> Interestingly, PDFBox 2.0 emits a different warning that calls out the 
> problem more precisely:
> Aug 27, 2013 9:35:30 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont extractToUnicodeEncoding
> SEVERE: Error: Could not load embedded ToUnicode CMap
> Aug 27, 2013 9:35:30 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont getSpaceWidth
> SEVERE: Can't determine the width of the space character using 250 as default
> java.lang.NullPointerException
>       at org.apache.pdfbox.pdmodel.font.PDSimpleFont.getSpaceWidth(PDSimpleFont.java:406)
>       at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:343)
>       at org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:62)
>       at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:529)
>       at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:258)
>       at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:225)
>       at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:205)
>       at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:455)
>       at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:379)
>       at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:335)
>       at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:254)
> We could track the problem down to pages that contain special 
> characters (e.g. €). In the referenced PDF document, pages that do not 
> contain such characters do not cause the above-mentioned warning. 
> The text parts in the document that cause the warning are not parsed 
> correctly. The parse result contains garbage bytes. 
> Adobe reader displays the entire document correctly.
> The following snippet should serve as a repro:
> package com.regiocom.bpo.mig;
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.FileNotFoundException;
> import java.io.IOException;
> import java.util.List;
> import org.apache.pdfbox.pdfparser.PDFParser;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.util.PDFTextStripper;
> import org.apache.pdfbox.util.Splitter;
> public class Repro {
>       
>       public Repro() {
>               
>               try {
>                       stripper = new PDFTextStripper();
>               } catch (IOException e) {
>                       e.printStackTrace();
>               }
>       }
>       // use this PDF as input:
>       // http://www.edi-energy.de/files2/EDI@Energy%20UTILMD%205.1_20130401.pdf
>       public void run(String pdfFile) {
>       
>               PDDocument[] documents = loadAndSplitFile(pdfFile, 1);
>       
>               for(PDDocument document : documents) {
>                       parse(document);
>               }
>       }
>       
>       private PDDocument[] loadAndSplitFile(String pdfFile, int splitPage) {
>                       
>               List<PDDocument> documents;
>               Splitter splitter = new Splitter();             
>               PDFParser parser;
>               
>               try {                   
>                       parser = new PDFParser(new FileInputStream(new File(pdfFile)));
>                       parser.parse();
>                       
>                       PDDocument doc = parser.getPDDocument();
>                       
>                       splitter.setSplitAtPage(splitPage);
>                       
>                       documents = splitter.split(doc);
>                       
>                       doc.close();
>                       
>                       return documents.toArray(new PDDocument[]{});
>               } catch (FileNotFoundException e) {
>                       e.printStackTrace();
>                       
>               } catch (IOException e) {
>                       e.printStackTrace();
>               }
>               
>               return null;
>       }
>       
>       private void parse(PDDocument pdfFile) {
>               try {
>                       stripper.getText(pdfFile);
>               } catch (IOException e) {
>                       e.printStackTrace();
>               }
>       }
>       
>       private PDFTextStripper stripper;
> }

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
