[ https://issues.apache.org/jira/browse/PDFBOX-4322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16626744#comment-16626744 ]
Tilman Hausherr commented on PDFBOX-4322:
-----------------------------------------
My plan (unless other opinions come up) is to make the change in the 2.0
branch after the release of 2.0.12, which is hopefully soon. The fix would then
be in 2.0.13, which would be released in 3-4 months. Alternatively, you can use
a snapshot build, which would appear within hours of the commit. A third
possibility is to take the source code of the 2.0.2 release (you seem to insist
on that one?), change a few lines in PDFont.java, and build locally (there
will NOT be a modified 2.0.2 release).
I'd still recommend that you wait until Tim has run the regression test. It
extracts text from 250000 PDF files and analyses whether the extraction gets
better or not.
The "worst" that could happen is that we get more PDFs with garbled text than
before, as in PDFBOX-3123.
> Extract Text feature is not working for some part of PDF
> --------------------------------------------------------
>
> Key: PDFBOX-4322
> URL: https://issues.apache.org/jira/browse/PDFBOX-4322
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.2, 2.0.11
> Reporter: Amit Maheshwari
> Priority: Major
> Fix For: 2.0.13, 3.0.0 PDFBox
>
> Attachments: PDFBOX-4322-Empty-ToUnicode-reduced.pdf, pdf__1.pdf,
> pdf__1.pdf.xml
>
>
> The text extraction feature cannot properly extract text from the attached PDF.
>
> Text inside the rectangle boxes (e.g. the value of Lending Specialist and
> others) is not extracted.