[jira] [Comment Edited] (PDFBOX-4480) Problem extracting text in newline characters and spaces between words

Tilman Hausherr (JIRA) Tue, 05 Mar 2019 09:45:28 -0800


    [ 
https://issues.apache.org/jira/browse/PDFBOX-4480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16784704#comment-16784704
 ]


Tilman Hausherr edited comment on PDFBOX-4480 at 3/5/19 5:44 PM:
-----------------------------------------------------------------

This is a duplicate of PDFBOX-3464; see the explanation there. I'll probably 
commit the solution proposed there: Adobe can extract these files properly, 
and so should we, even if the font is not OK. I tested my files and the 
differences are acceptable. However, this change might adversely affect the 
"big" regression test that Tim runs shortly before a release. If that 
happens, I would revert it, and you should switch to a plan B, which is to 
use the actual glyph height in LegacyPDFStreamEngine (I can't find it right 
now, but I have it somewhere, and it was also mentioned in a JIRA issue).
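
To illustrate why the height matters, here is a minimal sketch of the general 
idea, not the PDFBox implementation. It assumes a simplified model in which two 
words belong to the same line when their vertical extents overlap, and a space 
is only inserted when the next word starts to the right of the previous one. 
The class LineBreakSketch, its Word type, the sample coordinates, and the 
1.0 gap threshold are all hypothetical. With an inflated height (as a broken 
font might report), the two lines overlap and are merged without a space; with 
a height closer to the actual glyph height, the line break is detected.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified model of line-break detection; not PDFBox code. */
final class LineBreakSketch {

    /** One extracted word with its horizontal span, baseline and height. */
    static final class Word {
        final String text;
        final float xStart;
        final float xEnd;
        final float y;      // baseline; y grows downwards in this model
        final float height; // height reported for the glyphs

        Word(String text, float xStart, float xEnd, float y, float height) {
            this.text = text;
            this.xStart = xStart;
            this.xEnd = xEnd;
            this.y = y;
            this.height = height;
        }
    }

    /** Two words share a line if their extents [y - height, y] overlap. */
    static boolean sameLine(Word a, Word b) {
        return a.y - a.height < b.y && b.y - b.height < a.y;
    }

    /** Joins words, adding a newline between lines and a space between words. */
    static String join(List<Word> words) {
        StringBuilder sb = new StringBuilder();
        Word prev = null;
        for (Word w : words) {
            if (prev != null) {
                if (!sameLine(prev, w)) {
                    sb.append('\n');
                } else if (w.xStart - prev.xEnd > 1.0f) {
                    // assumed threshold: a visible horizontal gap means a space
                    sb.append(' ');
                }
            }
            sb.append(w.text);
            prev = w;
        }
        return sb.toString();
    }

    /** "main" ends one line, "Bsk" starts the next line 12 units below. */
    static List<Word> sample(float height) {
        List<Word> words = new ArrayList<>();
        words.add(new Word("main", 10f, 40f, 100f, height));
        words.add(new Word("Bsk", 10f, 35f, 112f, height));
        return words;
    }

    public static void main(String[] args) {
        System.out.println(join(sample(20f))); // inflated height
        System.out.println(join(sample(10f))); // height close to the glyphs
    }
}
```

In this model the inflated height of 20 makes the two extents overlap, so the 
words are treated as one line, and because "Bsk" starts to the left of where 
"main" ended, no space is inserted either.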


was (Author: tilman):
This is a duplicate of PDFBOX-3464, see the explanation there. I'll probably 
commit the proposed solution from there - Adobe can extract these files 
properly and so should be, even if the font is not OK. I tested my files and 
the differences are acceptable. However it is possible that this has a bad 
influence on the "big" regression test by Tim that is done shortly before a 
release. If this happens, then I would revert and you should switch to a plan B 
which is to use the actual glyph height in LegacyPDFStreamEngine (I can't find 
it right now but I have it somewhere, and it was also mentioned in a JIRA 
issue).

> Problem extracting text in newline characters and spaces between words
> ----------------------------------------------------------------------
>
>                 Key: PDFBOX-4480
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4480
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.13
>         Environment: macOs
>            Reporter: ANIL SANGHANI
>            Priority: Major
>              Labels: textextraction
>         Attachments: Document.txt, Narasimhan S.pdf
>
>
>  
> I have a PDF file, when I try to extract its text using
> It ignores some line breaks, so the last word of one line and the first word 
> of the next line appear as one word with no space between them.
> For example, in the attached PDF:
> "main Bsk" is extracted as "mainBsk"
> [[email protected] Bangalore|mailto:[email protected]] 
> as [email protected]



