[jira] [Issue Comment Edited] (PDFBOX-569) Text-Extraction of PDF fails

LynX (JIRA) Fri, 16 Sep 2011 11:21:31 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13106654#comment-13106654
 ]


LynX edited comment on PDFBOX-569 at 9/16/11 6:19 PM:
------------------------------------------------------

Hello,

This issue has been open for a long time, and I was still able to reproduce it 
with the latest sources from SVN.
Yes, this PDF file contains two objects numbered 204 0:

{code}
....
....
...
----------- Offset 698060 ----------- 
204 0 obj
<<
/Type /FontDescriptor
/FontName /Tele-GroteskNor,Italic
/Ascent 820
/CapHeight 500
/Descent -180
/Flags 32
/FontBBox [-68 -250 1175 880]
/ItalicAngle 0
/StemV 0
/AvgWidth 357
/Leading 0
/MaxWidth 1243
/XHeight 250
>>
endobj
....
....
....
endstream
endobj
----------- Offset 705884 ----------- 
204 0 obj
<< 
/Type /Font 
/Subtype /Type1 
/Name /F579 
/BaseFont /Tele-GroteskNor,Italic 
/FirstChar 30 
/LastChar 255 
/Widths 205 0 R 
/Encoding /WinAnsiEncoding 
/FontDescriptor 206 0 R 
>> 
endobj
...
...
...
{code}

Only the second object is present in the xref table, so it should be used 
instead of the first one. 

The current PDFParser code is able to handle such situations. When it finds 
two objects with the same object number and generation, it saves the offset 
of the second one in conflictList. Once the whole document has been parsed, 
it calls:

{code}
ConflictObj.resolveConflicts(document, conflictList);
{code}

This keeps only those objects whose offset is found in the xref table, so it 
should keep the second one, which is the right result.
But it fails to do so because the offset of the second object is calculated 
incorrectly.
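
To illustrate the idea, here is a minimal, self-contained sketch of resolving 
duplicate objects by their xref offset. It is not the PDFBox implementation: 
the class XrefConflictSketch, the ParsedObject type, and the resolve method 
are hypothetical names made up for this example, and the fallback to the 
first parsed object when no offset matches is an assumption. The offsets 
698060 and 705884 are taken from the dump above; the value 705890 is an 
invented stand-in for a miscalculated offset.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these types are PDFBox classes.
final class XrefConflictSketch {

    /** One "N G obj" occurrence together with the offset the parser recorded. */
    static final class ParsedObject {
        final String key;   // object number and generation, e.g. "204 0"
        final long offset;  // byte offset recorded while parsing
        final String type;  // value of /Type, e.g. "Font"

        ParsedObject(String key, long offset, String type) {
            this.key = key;
            this.offset = offset;
            this.type = type;
        }
    }

    /**
     * For each key, keep the candidate whose recorded offset equals the xref
     * entry. If no candidate matches, the first one parsed is kept
     * (an assumption made for this sketch).
     */
    static Map<String, ParsedObject> resolve(List<ParsedObject> parsed,
                                             Map<String, Long> xref) {
        Map<String, ParsedObject> result = new HashMap<>();
        for (ParsedObject candidate : parsed) {
            Long expected = xref.get(candidate.key);
            boolean matchesXref =
                    expected != null && expected.longValue() == candidate.offset;
            if (matchesXref || !result.containsKey(candidate.key)) {
                result.put(candidate.key, candidate);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Only the second 204 0 object is listed in the xref table.
        Map<String, Long> xref = new HashMap<>();
        xref.put("204 0", 705884L);

        // Offsets recorded correctly: the /Font object wins.
        List<ParsedObject> correct = new ArrayList<>();
        correct.add(new ParsedObject("204 0", 698060L, "FontDescriptor"));
        correct.add(new ParsedObject("204 0", 705884L, "Font"));
        System.out.println(resolve(correct, xref).get("204 0").type);

        // Second offset recorded wrongly (705890 is a made-up value):
        // nothing matches the xref entry, so the /FontDescriptor survives.
        List<ParsedObject> wrong = new ArrayList<>();
        wrong.add(new ParsedObject("204 0", 698060L, "FontDescriptor"));
        wrong.add(new ParsedObject("204 0", 705890L, "Font"));
        System.out.println(resolve(wrong, xref).get("204 0").type);
    }
}
```

With correct offsets the sketch keeps the /Font object. With a wrong offset 
for the second object it keeps the /FontDescriptor, which is consistent with 
the "Cannot create font if /Type is not /Font" exception in the report.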

A fix for this problem is provided in the attached patch.

Regards,
LX



      was (Author: devlynx):
    Fix for offset problem
  
> Text-Extraction of PDF fails
> ----------------------------
>
>                 Key: PDFBOX-569
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-569
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.8.0-incubator
>         Environment: 1.6.0_11
>            Reporter: Stephan Götter
>            Priority: Blocker
>         Attachments: TextExtractionFix-569.patch, b820GL0204.pdf
>
>
> Using trunk, this exception occurs when extracting text from the attached PDF.
> [WARN] PDFParser - invalid xref line: 0
> java.io.IOException: Cannot create font if /Type is not /Font.  
> Actual=COSName{FontDescriptor}
>       at 
> org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:95)
>       at 
> org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:68)
>       at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:117)
>       at 
> org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:206)
>       at 
> org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:188)
>       at 
> org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:367)
>       at 
> org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:291)
>       at 
> org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:247)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
