[
https://issues.apache.org/jira/browse/PDFBOX-4749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17092706#comment-17092706
]
Andreas Lehmkühler commented on PDFBOX-4749:
--------------------------------------------
I've added support for the original byte length of all mappings of a CMap.
These mappings work correctly as long as the original length of the input
value is available. PDFBox converts the byte value to an integer at an early
stage of the parsing process, so the information about the original code
length is lost. We should consider refactoring that part of the code. However,
I've found a way to estimate the original code length for the given issue so
that the mapping works.
I'd appreciate any feedback.
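To illustrate why the code length matters, here is a minimal Java sketch. It is
not the PDFBox implementation: the class CodeLengthSketch and the helpers toInt
and key are hypothetical names, and the sketch assumes, purely for illustration,
that the CMap entry for 70 was declared with a two-byte source code while the
page content supplies a one-byte code.

```java
import java.util.HashMap;
import java.util.Map;

class CodeLengthSketch {

    // Convert a big-endian byte sequence to an int, as a parser might do early on.
    // The number of bytes is not recoverable from the result.
    static int toInt(byte[] bytes) {
        int value = 0;
        for (byte b : bytes) {
            value = (value << 8) | (b & 0xFF);
        }
        return value;
    }

    // Hypothetical composite key: code length in the high 32 bits, code in the low 32 bits.
    static long key(int length, int code) {
        return ((long) length << 32) | (code & 0xFFFFFFFFL);
    }

    public static void main(String[] args) {
        byte[] oneByte = {0x46};        // assumed input code for 'F'
        byte[] twoBytes = {0x00, 0x46}; // assumed two-byte CMap source code mapped to "c"

        // Keyed by the integer only: both byte sequences collapse to 70.
        Map<Integer, String> byCode = new HashMap<>();
        byCode.put(toInt(twoBytes), "c");
        System.out.println(byCode.get(toInt(oneByte))); // prints "c"

        // Keyed by length and code: the one-byte input no longer matches.
        Map<Long, String> byLengthAndCode = new HashMap<>();
        byLengthAndCode.put(key(twoBytes.length, toInt(twoBytes)), "c");
        System.out.println(byLengthAndCode.get(key(oneByte.length, toInt(oneByte)))); // prints "null"
    }
}
```

With the length kept as part of the key, a miss on the one-byte code leaves room
for a fallback instead of returning the unrelated two-byte entry. When only the
integer is available, the length has to be estimated, which is the workaround
described above.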
> Text Extraction leads to weird result - toUnicodeCMap is 'AdHoc-UCS'
> --------------------------------------------------------------------
>
> Key: PDFBOX-4749
> URL: https://issues.apache.org/jira/browse/PDFBOX-4749
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.18
> Reporter: Benoit Lacelle
> Assignee: Andreas Lehmkühler
> Priority: Major
> Fix For: 3.0.0 PDFBox
>
> Attachments: PDFBOX-4749-reduced.pdf
>
>
> Consider the attached PDF, specifically the text on the first page:
> "Am Fährweg"
> It appears the code for the first character 'A' is 65 and is parsed
> correctly, while the code for the fourth character 'F' is 70 which is parsed
> as a 'c'.
> org.apache.pdfbox.pdmodel.font.PDFont.toUnicode(int) relies on a CMap named
> 'AdHoc-UCS' whose mapping is:
> {129=ü, 3= , 8=%, 9=&, 11=(, 12=), 15=,, 16=-, 17=., 18=/, 19=0, 20=1, 21=2,
> 22=3, 23=4, 24=5, 25=6, 26=7, 27=8, 28=9, 29=:, 34=?, 36=A, 37=B, 38=C, 39=D,
> 40=E, 41=F, 42=G, 43=H, 44=I, 46=K, 47=L, 48=M, 49=N, 50=O, 51=P, 53=R, 54=S,
> 55=T, 56=U, 57=V, 58=W, 59=X, 61=Z, 68=a, 69=b, 70=c, 71=d, 72=e, 73=f, 74=g,
> 75=h, 76=i, 78=k, 79=l, 80=m, 81=n, 82=o, 83=p, 85=r, 86=s, 87=t, 88=u, 89=v,
> 90=w, 93=z, 95=|, 108=ä, 124=ö}
> -> 'A' is parsed as 'A' because its code, 65, is not in the CMap's mapping,
> while 'F' conflicts with the entry that maps 70 to 'c'.
> The document is correctly parsed in Acrobat Reader.