[ 
https://issues.apache.org/jira/browse/PDFBOX-1919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029078#comment-14029078
 ] 

Tilman Hausherr commented on PDFBOX-1919:
-----------------------------------------

I'm not the text extraction specialist here, so this is just a first look. What 
looks odd to me are the ToUnicode tables of the two fonts:
{code}
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo
<< /Registry (Adobe)
/Ordering (UCS) /Supplement 0 >> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
13 beginbfchar
<0015> <0020>
<0036> <0041>   <============
<0039> <0044>   <============
<003A> <0065>
<003B> <0066>
<003D> <0068>
<003E> <0049>   <============
<0041> <004C>   <============
<0043> <004E>   <============
<0044> <006F>
<0045> <0070>
<0047> <0052>   <============
<0049> <0074>
endbfchar
endcmap CMapName currentdict /CMap defineresource pop end end
{code}
{code}
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo
<< /Registry (Adobe)
/Ordering (UCS) /Supplement 0 >> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
21 beginbfchar
<0015> <0020>
<0023> <002E>
<0024> <002F>
<0036> <0061>
<0038> <0063>
<0039> <0064>
<003A> <0065>
<003B> <0046>   <============
<003C> <0067>
<003D> <0068>
<003E> <0069>
<0041> <006C>
<0042> <006D>
<0043> <006E>
<0044> <006F>
<0045> <0070>
<0047> <0072>
<0048> <0073>
<0049> <0074>
<004A> <0075>
<004C> <0077>
endbfchar
endcmap CMapName currentdict /CMap defineresource pop end end
{code}

Look at the right column of the rows I marked. The Unicode targets mix uppercase 
and lowercase letters. For example, in the second table the consecutive codes 
<003A>, <003B> and <003C> map to 0065 ("e"), 0046 ("F") and 0067 ("g").
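
To make the check repeatable, here is a rough sketch (Python, not PDFBox code) that pulls the bfchar pairs out of a ToUnicode CMap and lists the ones whose target is an uppercase letter. The function names and the SAMPLE excerpt are my own for illustration, and the parser assumes one 4-digit code and one 4-digit target per line, as in the tables above; real CMaps can also use bfrange or multi-character targets, which this ignores.

```python
import re

# One bfchar entry: <source code> <unicode target>, both 4 hex digits (assumption).
BFCHAR_RE = re.compile(r"<([0-9A-Fa-f]{4})>\s+<([0-9A-Fa-f]{4})>")

# Three consecutive rows taken from the second table above.
SAMPLE = """\
3 beginbfchar
<003A> <0065>
<003B> <0046>
<003C> <0067>
endbfchar
"""


def parse_bfchar(cmap_text):
    """Return {code: character} for every pair inside a beginbfchar/endbfchar block."""
    mapping = {}
    in_block = False
    for line in cmap_text.splitlines():
        stripped = line.strip()
        if stripped.endswith("beginbfchar"):
            in_block = True
        elif stripped == "endbfchar":
            in_block = False
        elif in_block:
            match = BFCHAR_RE.match(stripped)
            if match:
                code = int(match.group(1), 16)
                mapping[code] = chr(int(match.group(2), 16))
    return mapping


def uppercase_targets(mapping):
    """Return only the entries whose Unicode target is an uppercase letter."""
    return {code: ch for code, ch in mapping.items() if ch.isupper()}


if __name__ == "__main__":
    for code, ch in sorted(uppercase_targets(parse_bfchar(SAMPLE)).items()):
        print(f"<{code:04X}> -> {ch!r}")
```

For the three-row excerpt, only <003B> is reported, which is the row marked in the second table. Whether the uppercase or the lowercase targets are the wrong ones is something this cannot tell; it only makes the mix visible.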

> Font descriptor flags are not implemented
> -----------------------------------------
>
>                 Key: PDFBOX-1919
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1919
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.5, 1.8.6, 2.0.0
>            Reporter: Corentin Regal
>         Attachments: PDFBOX-1919.pdf, PDFBOX-1919.txt
>
>
> The font descriptor flags are not set.
> They are described in the "PDF Reference 1.7", section 5.7.1, "Font 
> Descriptor Flags".
> The methods in PDFontDescriptor are ready but never called:
> setFlags()
> setSerif()
> setAllCap(), which is used in a lot of PDFs
> ...
> I saw some TODOs in the code that relate to this issue. Is it planned to be 
> implemented soon?



--
This message was sent by Atlassian JIRA
(v6.2#6252)
