[
https://issues.apache.org/jira/browse/PDFBOX-3166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15060317#comment-15060317
]
Tilman Hausherr commented on PDFBOX-3166:
-----------------------------------------
I assume you mean the two spaces before the "1" at the beginning. There really
is one space before and after the 1 in the PDF (look for Tj):
{code}
/Artifact << /Attached [ /Bottom ] /Type /Pagination >> BDC
BT
/TT0 9 Tf
90.02 51.72 Td
( ) Tj
ET
q
295.42 49.62 4.5 10.32 re
W*
n
q
295.42 49.62 4.5 10.32 re
W*
n
BT
/TT0 9 Tf
295.42 51.78 Td
(1) Tj
ET
Q
q
295.42 49.62 4.5 10.32 re
W*
n
BT
/TT0 9 Tf
299.92 51.78 Td
( ) Tj
ET
EMC
Q
Q
{code}
The second space is there because the real space (at x = 90.02) is too far away
from the 1 (at x = 295.42). See also the attached image, which shows where the
space is.
I am aware that Adobe Reader and PDF.js do not output that space. But I don't
consider this to be important, and fixing this "problem" might introduce new
problems, because text extraction is very sensitive to changes. You can remove
leading or trailing spaces with trim, or collapse double spaces with replace.
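A minimal sketch of that cleanup, applied to text after it has been extracted.
It uses only java.lang.String methods; the class name SpaceCleanup and the
helpers cleanLine and cleanText are made up for this example and are not part
of PDFBox. Note that trim only strips characters up to U+0020, so a full-width
space (U+3000) would be left alone.

```java
public class SpaceCleanup {

    // Hypothetical helper: strip leading/trailing whitespace from one line
    // and collapse every run of two or more spaces into a single space.
    static String cleanLine(String line) {
        return line.trim().replaceAll(" {2,}", " ");
    }

    // Hypothetical helper: apply cleanLine to each line of the extracted
    // text, keeping the line structure (line breaks are normalized to \n).
    static String cleanText(String text) {
        String[] lines = text.split("\\R", -1);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines.length; i++) {
            if (i > 0) {
                sb.append('\n');
            }
            sb.append(cleanLine(lines[i]));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for a string returned by a text extractor.
        String extracted = "  1 \n2015 年  12 月";
        System.out.println(cleanText(extracted));
    }
}
```

A plain replace("  ", " ") only shortens each pair of spaces once, so a run of
three spaces would still leave two; the regular expression above handles runs
of any length in one pass.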
The good side of your issue is that if the only thing you're complaining about
is a space, it means that the rest is pretty good :-)
> Unwanted spaces before number in chinese text extraction
> --------------------------------------------------------
>
> Key: PDFBOX-3166
> URL: https://issues.apache.org/jira/browse/PDFBOX-3166
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.0
> Environment: Windows
> Reporter: Gang Luo
> Labels: test
> Attachments: 1201830823-marked-1.png
>
> Original Estimate: 72h
> Remaining Estimate: 72h
>
> Unwanted spaces before number in chinese date text .
> such as this pdf file
> http://www.cninfo.com.cn/finalpage/2015-12-12/1201830823.PDF