[jira] [Commented] (PDFBOX-4322) Extract Text feature is not working for some part of PDF

Tilman Hausherr (JIRA) Mon, 24 Sep 2018 21:06:12 -0700


    [ 
https://issues.apache.org/jira/browse/PDFBOX-4322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16626744#comment-16626744
 ]


Tilman Hausherr commented on PDFBOX-4322:
-----------------------------------------

My plan (unless other opinions come up) is to make the change in the 2.0 
branch after the release of 2.0.12, which is hopefully soon. The fix would 
then be in 2.0.13, which would be released in 3-4 months. Alternatively, you 
could use a snapshot, which would appear within hours of the commit. A third 
possibility is that you take the source code of the 2.0.2 release (you seem to 
insist on that one?), change a few lines in PDFont.java, and build locally 
(there will NOT be a modified 2.0.2 release).

I'd still recommend that you wait until Tim has done the regression test. This 
test covers 250000 PDF files, and the results are analysed to see whether the 
extraction is better or not.

The "worst" that could happen is that we get more PDFs with garbled text than 
before, as in PDFBOX-3123.
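
To illustrate what "better or not" could mean for a single file, here is a 
minimal sketch that compares the text extracted before and after a change. It 
is NOT the actual regression test: the class name, the Verdict enum and the 
heuristic itself (count U+FFFD and non-whitespace control characters as 
"garbled", everything else that is not whitespace as "readable") are 
assumptions made up for this example. The input strings would come from 
whatever extraction tool is used; none is called here.

```java
// Hypothetical helper, not part of PDFBox: compares two extraction results.
class ExtractionComparison {

    enum Verdict { BETTER, WORSE, UNCHANGED }

    // A code point is treated as garbled if it is the Unicode replacement
    // character or a control character that is not whitespace (assumption).
    static boolean isGarbled(int cp) {
        return cp == 0xFFFD
                || (Character.isISOControl(cp) && !Character.isWhitespace(cp));
    }

    // Number of garbled code points in the extracted text.
    static long garbledCount(String text) {
        return text.codePoints().filter(ExtractionComparison::isGarbled).count();
    }

    // Number of code points that are neither whitespace nor garbled.
    static long readableCount(String text) {
        return text.codePoints()
                .filter(cp -> !Character.isWhitespace(cp) && !isGarbled(cp))
                .count();
    }

    // WORSE if the new text has more garbled characters; BETTER if it has
    // no more garbled characters and more readable ones; else UNCHANGED.
    static Verdict compare(String before, String after) {
        long garbledBefore = garbledCount(before);
        long garbledAfter = garbledCount(after);
        if (garbledAfter > garbledBefore) {
            return Verdict.WORSE;
        }
        if (readableCount(after) > readableCount(before)) {
            return Verdict.BETTER;
        }
        return Verdict.UNCHANGED;
    }

    public static void main(String[] args) {
        // Sample strings invented for the example.
        String before = "Lending Specialist:";
        String after = "Lending Specialist: some value";
        System.out.println(compare(before, after));
    }
}
```

A real comparison over many files would need a more careful measure, since 
garbled text (as in PDFBOX-3123) often consists of perfectly valid but wrong 
characters that a count like this cannot detect.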

> Extract Text feature is not working for some part of PDF
> --------------------------------------------------------
>
>                 Key: PDFBOX-4322
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4322
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.2, 2.0.11
>            Reporter: Amit Maheshwari
>            Priority: Major
>             Fix For: 2.0.13, 3.0.0 PDFBox
>
>         Attachments: PDFBOX-4322-Empty-ToUnicode-reduced.pdf, pdf__1.pdf, 
> pdf__1.pdf.xml
>
>
> Text Extraction feature cannot extract text from attached pdf properly.
>  
> Text inside of rectangle box (e.g value of Lending Specialist and others) is 
> not getting extracted.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

