[jira] Resolved: (PDFBOX-267) CMap parse fails during text extract

Jukka Zitting (JIRA) Wed, 15 Dec 2010 13:56:39 -0800

     [ 
https://issues.apache.org/jira/browse/PDFBOX-267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jukka Zitting resolved PDFBOX-267.
----------------------------------

    Resolution: Incomplete

Test document not available.

> CMap parse fails during text extract
> ------------------------------------
>
>                 Key: PDFBOX-267
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-267
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>            Priority: Minor
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1702313
> Originally submitted by matthillsdon on 2007-04-17 09:21.
> Unfortunately I cannot supply the PDF file.  Any suggestion appreciated.
> Exception in thread "main" java.io.IOException: Error: expected the end of a dictionary.
>         at org.fontbox.cmap.CMapParser.parseNextToken(CMapParser.java:220)
>         at org.fontbox.cmap.CMapParser.parse(CMapParser.java:79)
>         at org.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:535)
>         at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:387)
>         at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:325)
>         at org.pdfbox.util.operator.ShowText.process(ShowText.java:64)
>         at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
>         at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
>         at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
>         at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
>         at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
>         at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
>         at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
> ...
> [attachment on SourceForge]
> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&aid=1702313&file_id=226802
> ExtractFonts.java (text/java), 1721 bytes
> A simple program to extract fonts and CMap streams
> [comment on SourceForge]
> Originally sent by matthillsdon.
> Logged In: YES 
> user_id=701665
> Originator: YES
> Sorry for the delay.  Updated extract output at
> http://www.hillsdon.net/CMapDocument3.pdf
> Stack trace for text extract as before:
> Exception in thread "main" java.io.IOException: Error: expected the end of a dictionary.
>         at org.fontbox.cmap.CMapParser.parseNextToken(CMapParser.java:269)
>         at org.fontbox.cmap.CMapParser.parse(CMapParser.java:117)
> ...
> Thanks, Matt.
> [comment on SourceForge]
> Originally sent by benlitchfield.
> Logged In: YES 
> user_id=601708
> Originator: NO
> Hi Matt,
> any update?
> Ben
> [comment on SourceForge]
> Originally sent by benlitchfield.
> Logged In: YES 
> user_id=601708
> Originator: NO
> OK, I looked at it some more.  Please get the latest nightly build and try
> running ExtractText on your original PDF again.  If it doesn't work, run
> ExtractFonts again (using the nightly build) and post the results.
> The issue is that there is some extra data at the end of the CMap stream.
> Tonight I happened to fix a parsing issue involving extra data at the end
> of the stream for a different user.  I don't know if this is the same
> issue, but I'd rather have you try the nightly build than have me chasing
> a ghost.
> Ben
> [comment on SourceForge]
> Originally sent by matthillsdon.
> Logged In: YES 
> user_id=701665
> Originator: YES
> Output with the decryption here
> http://www.hillsdon.net/CMapDocument2.pdf
> Thanks.
> [comment on SourceForge]
> Originally sent by benlitchfield.
> Logged In: YES 
> user_id=601708
> Originator: NO
> Shoot, I think your document was encrypted.  It needs to be decrypted for the
> extraction to work; I should have had that as part of the program.  Can you
> take the attached program, add these lines after the PDDocument.load call:
> if( doc.isEncrypted() )
> {
>     doc.decrypt( "" );
> }
> and resend the CMapDocument.pdf?
> Thanks,
> Ben
> [comment on SourceForge]
> Originally sent by matthillsdon.
> Logged In: YES 
> user_id=701665
> Originator: YES
> Result too large to attach.  Please see
> http://www.hillsdon.net/CMapDocument.pdf
> [comment on SourceForge]
> Originally sent by benlitchfield.
> Logged In: YES 
> user_id=601708
> Originator: NO
> Attached is a simple Java program that will create a new pseudo PDF document
> that contains just the font information.  Please run it on the problem PDF
> and upload the resulting CmapDocument.pdf.
> It is a simple command-line program; first compile it, then run it like this:
> java ExtractFonts my.pdf
> Let me know if you have any questions about getting it running.
> Ben
> File Added: ExtractFonts.java
> [comment on SourceForge]
> Originally sent by matthillsdon.
> Logged In: YES 
> user_id=701665
> Originator: YES
> No change unfortunately - with FontBox-0.2.0-dev-20070424 the stack trace is 
> identical.
> Exception in thread "main" java.io.IOException: Error: expected the end of a dictionary.
>         at org.fontbox.cmap.CMapParser.parseNextToken(CMapParser.java:269)
> ...
> [comment on SourceForge]
> Originally sent by benlitchfield.
> Logged In: YES 
> user_id=601708
> Originator: NO
> I just updated the CMapParser with a fix for a bug from
> https://sourceforge.net/forum/message.php?msg_id=4269559
> Please get tonight's FontBox build and give it a try:
> http://www.fontbox.org/fontbox
> [comment on SourceForge]
> Originally sent by matthillsdon.
> Logged In: YES 
> user_id=701665
> Originator: YES
> Hi Ben, thanks for the quick response.
> Using the nightly build [1] the stack trace is the same except for line 
> numbers:
> Exception in thread "main" java.io.IOException: Error: expected the end of a dictionary.
>         at org.fontbox.cmap.CMapParser.parseNextToken(CMapParser.java:269)
>         at org.fontbox.cmap.CMapParser.parse(CMapParser.java:117)
>         at org.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:509)
>         at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:380)
>         at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:325)
>         at org.pdfbox.util.operator.ShowText.process(ShowText.java:64)
>         at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
>         at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
>         at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
>         at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
>         at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
>         at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
>         at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
> ...
> Extracting the fonts sounds ideal.
> [1] http://www.pdfbox.org/dist/PDFBox-0.7.4-dev-20070418.zip
> [comment on SourceForge]
> Originally sent by benlitchfield.
> Logged In: YES 
> user_id=601708
> Originator: NO
> Hi Matt,
> Can you try one thing for me first: upgrade to the latest nightly build of
> PDFBox (http://www.pdfbox.org/dist/) and see if this is still an issue?
> There have been some changes to the CMapParser.
> If it is still an issue, I think we can write a simple program to extract just
> the fonts from your PDF, and that should be enough for me to fix the bug.
> Ben
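
One comment in the thread attributes the failure to extra data at the end of the CMap stream. The sketch below only illustrates that general idea: a reader that stops at the closing keyword and ignores anything after it. It is not FontBox's CMapParser and not the fix that was applied; the class name, the method name, and the choice of "endcmap" as the stopping keyword are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; this is not FontBox's CMapParser. */
class TrailingDataSketch {

    /**
     * Collects whitespace-separated tokens up to and including the first
     * "endcmap" keyword and ignores whatever follows it in the stream.
     */
    static List<String> tokensUpToEndCMap(String stream) {
        List<String> tokens = new ArrayList<>();
        for (String token : stream.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // splitting an empty input yields one empty token
            }
            tokens.add(token);
            if (token.equals("endcmap")) {
                break; // stop here; trailing data is never examined
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Assumed sample input: a tiny CMap-like body followed by stray data.
        String stream = "begincmap <00> <FF> endcmap stray-data >>";
        System.out.println(tokensUpToEndCMap(stream));
    }
}
```

A stricter reader that kept consuming tokens after the closing keyword could trip over the stray ">>" in the sample input, which is the kind of failure this stopping rule avoids.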

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
