[
https://issues.apache.org/jira/browse/PDFBOX-3719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15928448#comment-15928448
]
Tilman Hausherr commented on PDFBOX-3719:
-----------------------------------------
The behavior is correct. Your font TT2 has this ToUnicode content:
{code}
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo <<
/Registry (Adobe)
/Ordering (UCS)
/Supplement 0
>> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<00><FF>
endcodespacerange
26 beginbfrange
<21><21><0044>
<22><22><0075>
<23><23><006d>
<24><24><0079>
<25><25><0009> <==== that's a tab
<26><26><0064>
<27><27><006f>
<28><28><0063>
<29><29><0065>
<2a><2a><006e>
<2b><2b><0074>
<2c><2c><0066>
<2d><2d><0072>
<2e><2e><0061>
<2f><2f><0067>
<30><30><0078>
<31><31><0069>
<32><32><0053>
<33><33><0031>
<34><34><0054>
<35><35><0068>
<36><36><0073>
<37><37><0062>
<38><38><0077>
<39><39><0032>
<3a><3a><0076>
endbfrange
endcmap
CMapName currentdict /CMap defineresource pop
end
end
{code}
Hex 25 is "%", and 0009 is a tab, so every "%" byte shown with font TT2 is
extracted as a tab. If you look at your content stream with PDFDebugger,
you'll see that "%" is used for font TT2 a lot.
You should complain to the creator of the file, "Mac OS X 10.12.3 Quartz
PDFContext", and ask why there is a TAB in the ToUnicode content.
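To make the lookup concrete, here is a minimal sketch (plain Python, not PDFBox code) that builds a code-to-Unicode table from the first few bfrange entries quoted above and decodes a byte string with it. The helper names and the sample byte string are made up for illustration; the bytes are constructed from the table, not copied from the actual content stream of DummyDoc.pdf.

```python
import re

# First entries of the bfrange section quoted above (excerpt, not the full table).
BFRANGE = """
<21><21><0044>
<22><22><0075>
<23><23><006d>
<24><24><0079>
<25><25><0009>
<26><26><0064>
<27><27><006f>
<28><28><0063>
<29><29><0065>
<2a><2a><006e>
<2b><2b><0074>
"""

ENTRY = re.compile(r"<([0-9a-fA-F]+)><([0-9a-fA-F]+)><([0-9a-fA-F]+)>")


def parse_bfrange(text):
    """Build a {character code: Unicode string} map from bfrange lines.

    Simplification: the destination is treated as a single code point
    that increments across the range, which is enough for this table.
    """
    mapping = {}
    for lo, hi, dst in ENTRY.findall(text):
        start, end, base = int(lo, 16), int(hi, 16), int(dst, 16)
        for offset in range(end - start + 1):
            mapping[start + offset] = chr(base + offset)
    return mapping


def decode(codes, mapping):
    """Map each one-byte character code to its Unicode value."""
    return "".join(mapping.get(code, "\ufffd") for code in codes)


cmap = parse_bfrange(BFRANGE)

# Hypothetical string bytes; 0x25 ("%") sits where a space is expected.
sample = bytes.fromhex("2122232324252627282223292a2b")
print(repr(decode(sample, cmap)))
```

With this table, code 0x25 decodes to U+0009, so the extractor has no way to produce a space there unless the ToUnicode entry itself is changed.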
> pdfbox parses spaces as tabs
> -----------------------------
>
> Key: PDFBOX-3719
> URL: https://issues.apache.org/jira/browse/PDFBOX-3719
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.8.13
> Reporter: Ahmed Eltayeb
> Attachments: DummyDoc.docx, DummyDoc.pdf
>
>
> I converted this PDF from the attached Word document "DummyDoc.docx".
> Then I used PDFBox 1.8 to extract the text:
> java -jar pdfbox-app-1.8.13.jar ExtractText "DummyDoc.pdf" txt.txt
> and the generated output is:
> Dummy document for tag extraction
>
> Section 1
>
> \\DummyTagOne_01
> This is text body one
>
> \\DummyTagOne_02
> This is text body two
>
> Section 2
> \\DummyTagTwo_01
> This is text body three
>
> \\DummyTagTwo_02
> This is text body four
>
> \\DummyTagTwo_03
> This is text body five
> As you can see, the output has tabs where spaces are expected, for
> example in "This is text body one", and so on.