[jira] [Commented] (PDFBOX-4749) Text Exctraction leads to weird result - toUnicodeCMap is 'AdHoc-UCS'

Michael Klink (Jira) Tue, 21 Jan 2020 04:43:05 -0800


    [ 
https://issues.apache.org/jira/browse/PDFBOX-4749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17020194#comment-17020194
 ]


Michael Klink commented on PDFBOX-4749:
---------------------------------------

Indeed, it looks like the example document shows the same error as the document 
in that Stack Overflow question.

Please be aware, though, that the document here contains not only fonts with 
that *Encoding* / *ToUnicode* mismatch but also other fonts. Thus, the proposal 
in my answer to that question (to remove the *ToUnicode* entry from all fonts) 
must be refined for the document here: remove the *ToUnicode* maps only from 
fonts with a single-byte encoding, e.g. *WinAnsiEncoding*.
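
As an illustration of the refinement described above, here is a minimal, library-free sketch. It does not use the PDFBox API: `FontModel`, `decode`, `isSingleByteEncoding` and `dropToUnicodeFromSingleByteFonts` are hypothetical names invented for this example, and the map holds only a few entries copied from the 'AdHoc-UCS' mapping quoted in the issue. The fallback in `decode` assumes that, for the codes used here, *WinAnsiEncoding* agrees with Latin-1.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class ToUnicodeSketch {

    /** Hypothetical stand-in for a PDF font: an encoding name plus an optional ToUnicode map. */
    static final class FontModel {
        final String encoding;
        Map<Integer, String> toUnicode; // null means "no ToUnicode entry"

        FontModel(String encoding, Map<Integer, String> toUnicode) {
            this.encoding = encoding;
            this.toUnicode = toUnicode;
        }
    }

    // Assumption: these two names are treated as "single-byte" for this sketch.
    private static final Set<String> SINGLE_BYTE =
            new HashSet<>(Arrays.asList("WinAnsiEncoding", "MacRomanEncoding"));

    /** A few entries copied from the 'AdHoc-UCS' mapping quoted in the issue. */
    static Map<Integer, String> adHocUcsExcerpt() {
        Map<Integer, String> m = new HashMap<>();
        m.put(36, "A");
        m.put(41, "F");
        m.put(70, "c");
        m.put(108, "\u00e4");
        return m;
    }

    static boolean isSingleByteEncoding(String encoding) {
        return SINGLE_BYTE.contains(encoding);
    }

    /** ToUnicode wins when it has an entry; otherwise fall back to the encoding. */
    static String decode(FontModel font, int[] codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            String mapped = font.toUnicode == null ? null : font.toUnicode.get(code);
            // Fallback assumption: Latin-1 stands in for WinAnsiEncoding here.
            sb.append(mapped != null ? mapped : String.valueOf((char) code));
        }
        return sb.toString();
    }

    /** The refined proposal: drop ToUnicode only where the encoding is single-byte. */
    static void dropToUnicodeFromSingleByteFonts(Iterable<FontModel> fonts) {
        for (FontModel font : fonts) {
            if (isSingleByteEncoding(font.encoding)) {
                font.toUnicode = null;
            }
        }
    }

    public static void main(String[] args) {
        int[] codes = {65, 109, 32, 70, 228, 104, 114, 119, 101, 103}; // "Am Fährweg"
        FontModel winAnsi = new FontModel("WinAnsiEncoding", adHocUcsExcerpt());
        FontModel composite = new FontModel("Identity-H", adHocUcsExcerpt());

        System.out.println(decode(winAnsi, codes)); // 'F' (70) comes out as 'c'
        dropToUnicodeFromSingleByteFonts(Arrays.asList(winAnsi, composite));
        System.out.println(decode(winAnsi, codes)); // 'F' is restored
        System.out.println(composite.toUnicode != null); // composite font keeps its map
    }
}
```

In an actual PDFBox-based fix the same decision would presumably be made per font dictionary, removing its *ToUnicode* entry only when the *Encoding* entry names a single-byte encoding; the model above only shows why the other fonts should be left alone.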

> Text Exctraction leads to weird result - toUnicodeCMap is 'AdHoc-UCS'
> ---------------------------------------------------------------------
>
>                 Key: PDFBOX-4749
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4749
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 2.0.18
>            Reporter: Benoit Lacelle
>            Priority: Major
>         Attachments: 2020-01 Vodafone Invoice.pdf
>
>
> Consider the attached PDF, specifically this text on the first page:
> "Am Fährweg"
> It appears the code for the first character 'A' is 65 and is parsed 
> correctly, while the code for the fourth character 'F' is 70, which is parsed 
> as a 'c'.
> org.apache.pdfbox.pdmodel.font.PDFont.toUnicode(int) relies on a CMap named 
> 'AdHoc-UCS' whose mapping is:
> {129=ü, 3= , 8=%, 9=&, 11=(, 12=), 15=,, 16=-, 17=., 18=/, 19=0, 20=1, 21=2, 
> 22=3, 23=4, 24=5, 25=6, 26=7, 27=8, 28=9, 29=:, 34=?, 36=A, 37=B, 38=C, 39=D, 
> 40=E, 41=F, 42=G, 43=H, 44=I, 46=K, 47=L, 48=M, 49=N, 50=O, 51=P, 53=R, 54=S, 
> 55=T, 56=U, 57=V, 58=W, 59=X, 61=Z, 68=a, 69=b, 70=c, 71=d, 72=e, 73=f, 74=g, 
> 75=h, 76=i, 78=k, 79=l, 80=m, 81=n, 82=o, 83=p, 85=r, 86=s, 87=t, 88=u, 89=v, 
> 90=w, 93=z, 95=|, 108=ä, 124=ö}
> -> 'A' is parsed as 'A' because its code 65 has no entry in the CMap, while 
> 'F' (code 70) collides with the entry mapping 70 to c.
> The document is correctly parsed in Acrobat Reader.



