[ 
https://issues.apache.org/jira/browse/PDFBOX-139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12728020#action_12728020
 ] 

Daniel Wilson commented on PDFBOX-139:
--------------------------------------

I haven't done this on THIS item, but really there are 2 options:

1.  Implement the missing items. I'm not intimately familiar with that part of 
the project, so can't say what that would involve.

2.  Catch the error and continue.  Likely something will be skipped, but the 
rest of the PDF, presumably, could deliver its text.  

BTW, which version are you using?  Between 0.7.3 and the trunk version, we've 
added a LOT of error-trapping.

> The CMapParser does not recognize essential cmap operators
> ----------------------------------------------------------
>
>                 Key: PDFBOX-139
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-139
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1438028
> Originally submitted by vdimchev on 2006-02-24 03:48.
> The bug is directly related to the following bug I 
> discovered in the database:
> [ 1208652 ] PDFTextStripper.writeText Exception:Unknown 
> encoding for ..
> I'll try to exlain it again here and supply enough 
> resources for its fix.
> The problem is that the current implementation of 
> CMapParser class supports only the beginbfchar and 
> beginbfrange operators.
> This is not enough and causes the invokation to 
> PDFTextStripper.writeText() to throw IOException with 
> the following message: Unknown encoding for 'Identity-
> V'.
> I also managed to produce the message: "Unknown 
> encoding for '90ms-RKSJ-H'.
> The complete stacktrace is:
> java.io.IOException: Unknown encoding for 'Identity-V'
>         at org.pdfbox.encoding.EncodingManager.
> getEncoding(EncodingManager.java:83)
>         at org.pdfbox.pdmodel.font.PDFont.
> getEncoding(PDFont.java:627)
>         at org.pdfbox.pdmodel.font.PDFont.
> encode(PDFont.java:476)
>         at org.pdfbox.util.PDFStreamEngine.
> showString(PDFStreamEngine.java:332)
>         at org.pdfbox.util.operator.ShowText.
> process(ShowText.java:66)
>         at org.pdfbox.util.PDFStreamEngine.
> processOperator(PDFStreamEngine.java:494)
>         at org.pdfbox.util.PDFStreamEngine.
> processSubStream(PDFStreamEngine.java:207)
>         at org.pdfbox.util.PDFStreamEngine.
> processStream(PDFStreamEngine.java:160)
>         at org.pdfbox.util.PDFTextStripper.
> processPage(PDFTextStripper.java:355)
>         at org.pdfbox.util.PDFTextStripper.
> processPages(PDFTextStripper.java:268)
>         at org.pdfbox.util.PDFTextStripper.
> writeText(PDFTextStripper.java:220)
> In fact the cause of this exception is that the 
> CMapParser does not recognize the begincidchar and 
> begincidrange operators (in the case of the 90ms-RKSJ-
> H) encoding and usecmap operator in the case of 
> Identity-V encoding.
> The cmap files for these encodings are not properly 
> parsed and the corresponding Cmap objects do not 
> contain neither one nor two byte mappings, further the 
> lookup() method returns null.
> I'll attach two samples for the 90ms-RKSJ-H encoding 
> and one for the Identity-V encoding.
> I'll attach cmap reference also.
> [attachment on SourceForge]
> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&aid=1438028&file_id=168711
> 5014.CIDFont_Spec.rar (application/octet-stream), 240282 bytes
> Reference, containing CMAP description
> [attachment on SourceForge]
> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&aid=1438028&file_id=168709
> ken1.pdf (application/pdf), 33713 bytes
> The Identity-V sample
> [attachment on SourceForge]
> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&aid=1438028&file_id=168708
> tp0404-2a.pdf (application/pdf), 11434 bytes
> The second 90ms-RKSJ-H sample
> [attachment on SourceForge]
> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&aid=1438028&file_id=168705
> nan_youkou.pdf (application/pdf), 7663 bytes
> The first 90ms-RKSJ-H sample

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to