[
https://issues.apache.org/jira/browse/PDFBOX-3166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15061436#comment-15061436
]
Tilman Hausherr commented on PDFBOX-3166:
-----------------------------------------
{quote}
But it cannot eliminate the space before the 1, even if I set a setSpacingTolerance value.
{quote}
Because the space is really there, see the image. The spacing tolerance helps
to decide whether characters are separated or not. You can play with that one if
you have some special documents where words appear split or always together.
Try it on a document with western text, there it will be more obvious because
one word = several glyphs: depending on the value, the sentence I just wrote
would be extracted as "Tryitonadocumentwithwesterntext" or "Tr y it on a doc
ume nt w ith wes te rn te xt".
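To make the effect of the tolerance more concrete, here is a toy sketch. It is
not the PDFBox implementation; the class name, the method, the glyph positions
and the "gap > spaceWidth * tolerance" rule are all assumptions made up for
illustration. It only shows why a high value glues words together and a low
value splits them.

```java
// Toy model only: NOT PDFBox code. All names and numbers are hypothetical.
final class SpacingToleranceSketch {

    // Joins glyphs into a string, inserting a space whenever the horizontal
    // gap to the previous glyph exceeds spaceWidth * tolerance.
    static String join(char[] glyphs, float[] startX, float[] endX,
                       float spaceWidth, float tolerance) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < glyphs.length; i++) {
            if (i > 0 && startX[i] - endX[i - 1] > spaceWidth * tolerance) {
                sb.append(' ');
            }
            sb.append(glyphs[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up positions for the glyphs of "Try it": gaps are 1, 0, 6, 0.
        char[] glyphs = {'T', 'r', 'y', 'i', 't'};
        float[] startX = {0f, 11f, 18f, 34f, 37f};
        float[] endX = {10f, 18f, 28f, 37f, 42f};
        float spaceWidth = 8f;

        System.out.println(join(glyphs, startX, endX, spaceWidth, 1.0f)); // Tryit
        System.out.println(join(glyphs, startX, endX, spaceWidth, 0.5f)); // Try it
        System.out.println(join(glyphs, startX, endX, spaceWidth, 0.1f)); // T ry it
    }
}
```

Note that no tolerance value removes a gap that is as wide as a real space,
which is the situation in the attached image.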
No, there is no API to remove the space before the "1" because it really exists
in the PDF. PDF files are created by a wide variety of software and there are
often surprises.
As I said, just add a trim() to each line.
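A minimal sketch of what I mean, in plain Java. The class and method names are
hypothetical, and the sample string is invented; in your application it would
be whatever string your text extraction returns (for example from
PDFTextStripper.getText).

```java
import java.util.Arrays;
import java.util.stream.Collectors;

// Hypothetical helper: trims leading and trailing whitespace from every line.
final class LineTrimmer {

    static String trimLines(String extracted) {
        return Arrays.stream(extracted.split("\\R")) // split on any line break
                .map(String::trim)                   // drop spaces at both ends
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        // Invented sample text standing in for the extracted text.
        String extracted = " 12 first line\n  second line  ";
        System.out.println(trimLines(extracted));
    }
}
```

This only removes spaces at the start and end of a line; spaces inside a line
are left alone.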
If you need more help, tell us what your application is about and why the space
is a problem.
> Unwanted spaces before number in chinese text extraction
> --------------------------------------------------------
>
> Key: PDFBOX-3166
> URL: https://issues.apache.org/jira/browse/PDFBOX-3166
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.0
> Environment: Windows
> Reporter: Gang Luo
> Labels: test
> Attachments: 1201830823-marked-1.png
>
> Original Estimate: 72h
> Remaining Estimate: 72h
>
> Unwanted spaces appear before numbers in Chinese date text,
> for example in this PDF file:
> http://www.cninfo.com.cn/finalpage/2015-12-12/1201830823.PDF