[
https://issues.apache.org/jira/browse/PDFBOX-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
John Hewson reopened PDFBOX-1988:
---------------------------------
Reopening because we leave issues open until the version they were fixed in is
released.
> PDFBox ExtractText issue of PDF with no embedded fonts
> ------------------------------------------------------
>
> Key: PDFBOX-1988
> URL: https://issues.apache.org/jira/browse/PDFBOX-1988
> Project: PDFBox
> Issue Type: Bug
> Components: Rendering, Text extraction
> Affects Versions: 1.8.4
> Environment: Windows 7
> Also, PASE on IBM i
> Reporter: Craig Strong
> Labels: patch
> Fix For: 1.8.5, 2.0.0
>
> Attachments: Test1.pdf
>
> Original Estimate: 120h
> Remaining Estimate: 120h
>
> I have been using PDFBox 1.8.4 to extract text from several different PDF
> files fine. I use the latest PDFBox app with ExtractText command line.
> There is one PDF that PDFBox (and iText) fails to extract any text even
> though I can extract the text with Adobe Reader and also pdftotext.exe part
> of XPdf. "java -jar pdfbox-app-1.8.4.jar ExtractText Test1.pdf Out.txt". I
> don't want to have to rely on using pdftotext.exe from a PC since this is
> part of an automated application. I think the error relates to an unknown
> font type and having to use the few fonts installed in the jar file. I tried
> running the API classes and trying to force a font from a certain location
> but I still got errors. I thought I loaded the font with the loadTTF method
> but I don't know if that did anything with the font. I would really like to
> have this working straight from the ExtractText class anyway.
> Here are the errors I am getting. I tried this from both a Windows 7 PC and
> our IBM i in the PASE environment but I get the same errors. The section
> starting processEncodedText and on repeats a few times so I just included the
> first entries.
>
> Mar 10, 2014 3:50:44 PM org.apache.pdfbox.pdmodel.font.PDFontFactory
> createFont
> WARNING: Substituting TrueType for unknown font subtype=
>
> Mar 10, 2014 3:50:45 PM org.apache.pdfbox.util.PDFStreamEngine
> processOperator
> WARNING: java.lang.NullPointerException
>
> Throwable occurred: java.lang.NullPointerException
>
> at
> org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.loadDescriptorDictionary(PDTrueTypeFont.java:375)
> at
> org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.ensureFontDescriptor(PDTrueTypeFont.java:221)
>
> at
> org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:119)
>
> at
> org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:121)
>
> at
> org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:204)
>
> at
> org.apache.pdfbox.util.PDFStreamEngine.getFonts(PDFStreamEngine.java:604)
>
> at
> org.apache.pdfbox.util.operator.SetTextFont.process(SetTextFont.java:54)
>
> at
> org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
>
> at
> org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
> at
> org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
> at
> org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
>
> at
> org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:456)
>
> at
> org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:381)
>
> at
> org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:340)
>
> at
> org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:275)
>
> at org.apache.pdfbox.ExtractText.main(ExtractText.java:85)
>
> at org.apache.pdfbox.PDFBox.main(PDFBox.java:58)
>
> Mar 10, 2014 3:50:45 PM org.apache.pdfbox.util.PDFStreamEngine
> processEncodedText
> WARNING: java.lang.NullPointerException
>
> Throwable occurred: java.lang.NullPointerException
>
> at
> org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:355)
> at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:45)
>
> at
> org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
>
> at
> org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
>
> at
> org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
>
> at
> org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
>
> at
> org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:456)
>
> at
> org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:381)
>
> at
> org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:340)
>
> at
> org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:275)
>
> at org.apache.pdfbox.ExtractText.main(ExtractText.java:85)
>
> at org.apache.pdfbox.PDFBox.main(PDFBox.java:58)
>
> Mar 10, 2014 3:50:45 PM org.apache.pdfbox.util.PDFStreamEngine
> processOperator
> WARNING: java.lang.NullPointerException
>
> Throwable occurred: java.lang.NullPointerException
>
> at
> org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:364)
> at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:45)
>
> at
> org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
>
> at
> org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
>
> at
> org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
>
> at
> org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
>
> at
> org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:456)
>
> at
> org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:381)
>
> at
> org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:340)
>
> at
> org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:275)
>
> at org.apache.pdfbox.ExtractText.main(ExtractText.java:85)
>
> at org.apache.pdfbox.PDFBox.main(PDFBox.java:58)
>
> Thanks,
> Craig Strong
--
This message was sent by Atlassian JIRA
(v6.2#6252)