[jira] [Commented] (PDFBOX-3166) Unwanted spaces before number in chinese text extraction

Tilman Hausherr (JIRA) Wed, 16 Dec 2015 20:18:07 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-3166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15061436#comment-15061436
 ]


Tilman Hausherr commented on PDFBOX-3166:
-----------------------------------------

{quote}
But it cannot eliminate space before the 1 , if I add setSpacingTolerance value.
{quote}
Because the space is really there, see the image. The spacing tolerance helps
to decide whether adjacent characters are separated by a space or not. You can
play with that one if you have some special documents where words appear split
or always run together. Try it on a document with Western text, there it will
be more obvious because one word = several glyphs: depending on the value, the
sentence I just wrote would be extracted as "Tryitonadocumentwithwesterntext"
or "Tr y it on a doc ume nt w ith wes te rn te xt".
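
To make the effect of the tolerance concrete, here is a toy sketch of the idea. It is not PDFBox's actual algorithm: the GapDemo class, the Glyph record, the sample coordinates and the "gap > spaceWidth * tolerance" rule are all assumptions made up for illustration. It only shows how a single threshold can turn the same glyph positions into merged words, correct words, or split words.

```java
import java.util.Arrays;
import java.util.List;

// Toy model only: NOT how PDFBox decides word breaks internally.
class GapDemo {
    // Hypothetical glyph: character, left edge and width in page units.
    static final class Glyph {
        final char ch;
        final double x;
        final double width;

        Glyph(char ch, double x, double width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    // Inserts a space whenever the gap between two neighbouring glyphs
    // exceeds spaceWidth * tolerance (an assumed, simplified rule).
    static String join(List<Glyph> glyphs, double spaceWidth, double tolerance) {
        StringBuilder sb = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                double gap = g.x - (prev.x + prev.width);
                if (gap > spaceWidth * tolerance) {
                    sb.append(' ');
                }
            }
            sb.append(g.ch);
            prev = g;
        }
        return sb.toString();
    }

    // Made-up positions: small gaps (0.5) inside words, one larger gap (3.0).
    static List<Glyph> sample() {
        return Arrays.asList(
                new Glyph('T', 0.0, 5.0),
                new Glyph('r', 5.5, 5.0),
                new Glyph('y', 11.0, 5.0),
                new Glyph('i', 19.0, 5.0),
                new Glyph('t', 24.5, 5.0));
    }

    public static void main(String[] args) {
        double spaceWidth = 4.0; // assumed width of a space glyph
        System.out.println(join(sample(), spaceWidth, 1.0)); // high tolerance
        System.out.println(join(sample(), spaceWidth, 0.5)); // medium tolerance
        System.out.println(join(sample(), spaceWidth, 0.1)); // low tolerance
    }
}
```

With these made-up numbers the three calls give "Tryit", "Try it" and "T r y i t". A gap that is genuinely as wide as a space, like the one before the "1" in the reported PDF, stays a space for any reasonable tolerance value.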

No, there is no API to remove the space before the "1" because it really exists 
in the PDF. PDF files are created by a wide variety of software and there are 
often surprises.

As I said, just add a trim() to each line.
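
A minimal sketch of that post-processing step, assuming the extracted text is already available as a String (for example the result of the text stripper). The LineTrimmer class and the sample input are hypothetical; only standard Java library calls are used.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

class LineTrimmer {
    // Splits the extracted text into lines, trims each line, and joins
    // the lines again with '\n'. The -1 limit keeps trailing empty lines.
    static String trimLines(String extracted) {
        return Arrays.stream(extracted.split("\\R", -1))
                .map(String::trim)
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        // Hypothetical extraction result with a leading space before the "1".
        String extracted = " 12 December 2015\r\n  second line  \n";
        System.out.print(trimLines(extracted));
    }
}
```

Note that String.trim() only removes characters up to U+0020. If the unwanted character turns out to be something like the ideographic space U+3000, String.strip() (Java 11 and later) would be needed instead.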

If you need more help, tell us what your application is about and why the space 
is a problem.

> Unwanted spaces before number in chinese text extraction
> --------------------------------------------------------
>
>                 Key: PDFBOX-3166
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3166
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.0
>         Environment: Windows
>            Reporter: Gang Luo
>              Labels: test
>         Attachments: 1201830823-marked-1.png
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Unwanted spaces before number in chinese date text .
> such as this pdf file
> http://www.cninfo.com.cn/finalpage/2015-12-12/1201830823.PDF



