[jira] [Commented] (PDFBOX-3719) pdfbox parses spaces as tabs

Tilman Hausherr (JIRA) Thu, 16 Mar 2017 10:12:57 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-3719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15928448#comment-15928448
 ]


Tilman Hausherr commented on PDFBOX-3719:
-----------------------------------------

The behavior is correct. Your font TT2 has this ToUnicode content:
{code}
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo <<
  /Registry (Adobe)
  /Ordering (UCS)
  /Supplement 0
>> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<00><FF>
endcodespacerange
26 beginbfrange
<21><21><0044>
<22><22><0075>
<23><23><006d>
<24><24><0079>
<25><25><0009>  <==== that's a tab
<26><26><0064>
<27><27><006f>
<28><28><0063>
<29><29><0065>
<2a><2a><006e>
<2b><2b><0074>
<2c><2c><0066>
<2d><2d><0072>
<2e><2e><0061>
<2f><2f><0067>
<30><30><0078>
<31><31><0069>
<32><32><0053>
<33><33><0031>
<34><34><0054>
<35><35><0068>
<36><36><0073>
<37><37><0062>
<38><38><0077>
<39><39><0032>
<3a><3a><0076>
endbfrange
endcmap
CMapName currentdict /CMap defineresource pop
end
end
{code}
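Hex 25 is the character code for "%", and the bfrange entry above maps it to 
U+0009, which is a tab. If you look at your content stream with PDFDebugger, 
you'll see that "%" is used a lot with font TT2, in the places where the page 
shows a space. So each of those is extracted as a tab.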
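
To make the lookup above concrete, here is a minimal sketch in plain Java that reads bfrange entries like the ones in the CMap and lists the character codes mapped to a control character. It is not PDFBox code: the class and method names (ToUnicodeCheck, parseBfRanges, controlCodes) are made up for illustration, and it assumes only the simple <lo><hi><dst> form with destinations of at most four hex digits, as in this file.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox.
final class ToUnicodeCheck {
    // Simple bfrange form only: <srcLo><srcHi><dst>, each 1 to 4 hex digits.
    private static final Pattern BFRANGE = Pattern.compile(
            "<([0-9A-Fa-f]{1,4})>\\s*<([0-9A-Fa-f]{1,4})>\\s*<([0-9A-Fa-f]{1,4})>");

    /** Returns character code -> Unicode value for every bfrange entry found. */
    public static Map<Integer, Integer> parseBfRanges(String cmap) {
        Map<Integer, Integer> mapping = new LinkedHashMap<>();
        Matcher m = BFRANGE.matcher(cmap);
        while (m.find()) {
            int lo = Integer.parseInt(m.group(1), 16);
            int hi = Integer.parseInt(m.group(2), 16);
            int dst = Integer.parseInt(m.group(3), 16);
            // Within a range, the destination increments with the source code.
            for (int code = lo; code <= hi; code++) {
                mapping.put(code, dst + (code - lo));
            }
        }
        return mapping;
    }

    /** Returns the character codes whose Unicode value is below U+0020. */
    public static List<Integer> controlCodes(Map<Integer, Integer> mapping) {
        List<Integer> result = new ArrayList<>();
        for (Map.Entry<Integer, Integer> e : mapping.entrySet()) {
            if (e.getValue() < 0x20) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Three entries copied from the ToUnicode content shown above.
        String sample = "<24><24><0079>\n<25><25><0009>\n<26><26><0064>\n";
        Map<Integer, Integer> mapping = parseBfRanges(sample);
        for (int code : controlCodes(mapping)) {
            // Prints: code 0x25 -> U+0009
            System.out.println(String.format("code 0x%02x -> U+%04X", code, mapping.get(code)));
        }
    }
}
```

Run against the full CMap text, a check like this should report only code 0x25, which is the entry marked as a tab above.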

You should complain to the creator of the file, "Mac OS X 10.12.3 Quartz 
PDFContext", and ask why there is a TAB in the ToUnicode content.

> pdfbox parses spaces as tabs 
> -----------------------------
>
>                 Key: PDFBOX-3719
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3719
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.13
>            Reporter: Ahmed Eltayeb
>         Attachments: DummyDoc.docx, DummyDoc.pdf
>
>
> I converted the attached Word document "DummyDoc.docx" to PDF, 
> then used pdfbox 1.8 to extract the text:
> java -jar pdfbox-app-1.8.13.jar ExtractText "DummyDoc.pdf" txt.txt
> The generated output is:
> Dummy document        for     tag     extraction      
>       
> Section       1       
>       
> \\DummyTagOne_01  
> This  is      text    body    one     
>       
> \\DummyTagOne_02  
> This  is      text    body    two     
>       
> Section       2       
> \\DummyTagTwo_01  
> This  is      text    body    three   
>       
> \\DummyTagTwo_02  
> This  is      text    body    four    
>       
> \\DummyTagTwo_03  
> This  is      text    body    five    
> As you can see, the output is "This  is      text    body    one     " instead of 
> "This is text body one     ", and so on.



