[jira] [Updated] (PDFBOX-1706) Reading PDF documents that contain special characters (e.g. €) cause warning and invalid parse result

John Hewson (JIRA) Sat, 08 Feb 2014 10:31:38 -0800

     [ https://issues.apache.org/jira/browse/PDFBOX-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


John Hewson updated PDFBOX-1706:
--------------------------------

    Component/s:     (was: Swing GUI)

> Reading PDF documents that contain special characters (e.g. €) cause warning 
> and invalid parse result
> -----------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-1706
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1706
>             Project: PDFBox
>          Issue Type: Bug
>          Components: PDModel
>    Affects Versions: 1.8.2, 2.0.0
>         Environment: Windows
>            Reporter: Robert Neumann
>              Labels: patch
>
> When calling stripper.getText on the PDF file 
> http://www.edi-energy.de/files2/EDI@Energy%20UTILMD%205.1_20130401.pdf, 
> PDFBox 1.8.2 emits the following warning:
> 08:48:20,222  WARN PDFStreamEngine:567 - java.io.IOException: Error: Could not find font(COSName{F7}) in map={F1=org.apache.pdfbox.pdmodel.font.PDTrueTypeFont@676825b5, F2=org.apache.pdfbox.pdmodel.font.PDTrueTypeFont@547e97d8}
> java.io.IOException: Error: Could not find font(COSName{F7}) in map={F1=org.apache.pdfbox.pdmodel.font.PDTrueTypeFont@676825b5, F2=org.apache.pdfbox.pdmodel.font.PDTrueTypeFont@547e97d8}
>                 at org.apache.pdfbox.util.operator.SetTextFont.process(SetTextFont.java:57)
>                 at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
>                 at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
>                 at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
>                 at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
>                 at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:455)
>                 at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:379)
>                 at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:335)
>                 at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:254)
> Interestingly, PDFBox 2.0 emits a different warning that calls out the 
> problem more precisely:
> Aug 27, 2013 9:35:30 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont extractToUnicodeEncoding
> SEVERE: Error: Could not load embedded ToUnicode CMap
> Aug 27, 2013 9:35:30 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont getSpaceWidth
> SEVERE: Can't determine the width of the space character using 250 as default
> java.lang.NullPointerException
>       at org.apache.pdfbox.pdmodel.font.PDSimpleFont.getSpaceWidth(PDSimpleFont.java:406)
>       at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:343)
>       at org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:62)
>       at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:529)
>       at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:258)
>       at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:225)
>       at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:205)
>       at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:455)
>       at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:379)
>       at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:335)
>       at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:254)
> We traced the problem to pages that contain special characters (e.g. €). 
> In the referenced PDF document, pages without such characters do not 
> trigger the warnings shown above. The text parts of the document that 
> trigger the warnings are not parsed correctly: the extracted result 
> contains garbage bytes. 
> Adobe Reader displays the entire document correctly.
> The following snippet should serve as a repro:
> package com.regiocom.bpo.mig;
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.FileNotFoundException;
> import java.io.IOException;
> import java.util.List;
> import org.apache.pdfbox.pdfparser.PDFParser;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.util.PDFTextStripper;
> import org.apache.pdfbox.util.Splitter;
> public class Repro {
>       
>       public Repro() {
>               
>               try {
>                       stripper = new PDFTextStripper();
>               } catch (IOException e) {
>                       e.printStackTrace();
>               }
>       }
>       // use this PDF as input:
>       // http://www.edi-energy.de/files2/EDI@Energy%20UTILMD%205.1_20130401.pdf
>       public void run(String pdfFile) {
>       
>               PDDocument[] documents = loadAndSplitFile(pdfFile, 1);
>       
>               for(PDDocument document : documents) {
>                       parse(document);
>               }
>       }
>       
>       private PDDocument[] loadAndSplitFile(String pdfFile, int splitPage) {
>                       
>               List<PDDocument> documents;
>               Splitter splitter = new Splitter();             
>               PDFParser parser;
>               
>               try {                   
>                       parser = new PDFParser(new FileInputStream(new File(pdfFile)));
>                       parser.parse();
>                       
>                       PDDocument doc = parser.getPDDocument();
>                       
>                       splitter.setSplitAtPage(splitPage);
>                       
>                       documents = splitter.split(doc);
>                       
>                       doc.close();
>                       
>                       return documents.toArray(new PDDocument[]{});
>               } catch (FileNotFoundException e) {
>                       e.printStackTrace();
>                       
>               } catch (IOException e) {
>                       e.printStackTrace();
>               }
>               
>               return null;
>       }
>       
>       private void parse(PDDocument pdfFile) {
>               try {
>                       stripper.getText(pdfFile);
>               } catch (IOException e) {
>                       e.printStackTrace();
>               }
>       }
>       
>       private PDFTextStripper stripper;
> }
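
The repro above discards the string returned by stripper.getText, so the garbage bytes mentioned in the report are not checked anywhere. The following is a minimal sketch of how extracted text could be screened for that symptom using only the Java standard library. The class and method names (ExtractedTextCheck, countSuspicious, looksGarbled) are hypothetical and not part of PDFBox. It assumes that the garbage shows up as control characters or as the Unicode replacement character U+FFFD; the report does not say what the corrupted output actually looks like, so treat that as a guess.

```java
// Hypothetical helper, not part of PDFBox: flags extracted text that
// contains an unusual share of control or replacement characters.
final class ExtractedTextCheck {

    private ExtractedTextCheck() {
    }

    // Counts code points that are ISO control characters (other than
    // tab, line feed, carriage return) or the replacement character U+FFFD.
    // Assumption: corrupted extraction results contain such characters.
    static int countSuspicious(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            boolean ordinaryWhitespace = cp == '\n' || cp == '\r' || cp == '\t';
            if (cp == 0xFFFD || (Character.isISOControl(cp) && !ordinaryWhitespace)) {
                count++;
            }
        }
        return count;
    }

    // Returns true if the share of suspicious code points exceeds the
    // given threshold (for example 0.05 for five percent).
    static boolean looksGarbled(String text, double threshold) {
        if (text.isEmpty()) {
            return false;
        }
        int total = text.codePointCount(0, text.length());
        return (double) countSuspicious(text) / total > threshold;
    }
}
```

In the repro, the result of stripper.getText(pdfFile) could be passed to ExtractedTextCheck.looksGarbled to report which of the split single-page documents are affected. A legitimate € (U+20AC) is not counted, so correctly extracted pages containing it should pass the check.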



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
