[
https://issues.apache.org/jira/browse/PDFBOX-3166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15061254#comment-15061254
]
Gang Luo commented on PDFBOX-3166:
----------------------------------
Text extraction is very sensitive to changes. Yes, I see. Is there an API that
controls whether a space character is emitted or not?
I tried PDFTextStripper.setSpacingTolerance(), but increasing the value does not
eliminate the space before the 1.
PDFTextStripper stripper = new PDFTextStripper();
stripper.setSpacingTolerance(800.0f); //0.08f
If I reduce the setSpacingTolerance value, it adds a space after the date number.
The rest of the output is pretty good.
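One possible workaround, independent of PDFBox, is to clean up the extracted
string afterwards. The following is only a minimal sketch, assuming that any
space or tab between a Chinese character and an ASCII digit is unwanted. The
class CjkDigitSpaces and its method strip are hypothetical names for
illustration, not part of the PDFBox API. Chinese characters are written as
Unicode escapes so the file compiles under any source encoding.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not a PDFBox class): post-processes extracted text.
final class CjkDigitSpaces {

    // Matches runs of space, tab, or ideographic space that sit directly
    // between a Han character and an ASCII digit, in either order.
    private static final Pattern GAP = Pattern.compile(
            "(?<=\\p{IsHan})[ \\t\\u3000]+(?=[0-9])"
          + "|(?<=[0-9])[ \\t\\u3000]+(?=\\p{IsHan})");

    private CjkDigitSpaces() {
    }

    // Returns the text with those gaps removed; all other spacing is kept.
    static String strip(String text) {
        return GAP.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // U+5E74, U+6708 and U+65E5 are the year, month and day characters.
        String extracted = "2015 \u5e74 12 \u6708 12 \u65e5";
        System.out.println(strip(extracted));
    }
}
```

This would be applied to the string returned by stripper.getText(document).
It also removes legitimate spaces between Chinese text and numbers, so it only
fits documents where such spaces never carry meaning.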
> Unwanted spaces before number in chinese text extraction
> --------------------------------------------------------
>
> Key: PDFBOX-3166
> URL: https://issues.apache.org/jira/browse/PDFBOX-3166
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.0
> Environment: Windows
> Reporter: Gang Luo
> Labels: test
> Attachments: 1201830823-marked-1.png
>
> Original Estimate: 72h
> Remaining Estimate: 72h
>
> Unwanted spaces appear before numbers in Chinese date text,
> for example in this PDF file:
> http://www.cninfo.com.cn/finalpage/2015-12-12/1201830823.PDF
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)