[ https://issues.apache.org/jira/browse/PDFBOX-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896950#comment-15896950 ]
Roman commented on PDFBOX-3710:
-------------------------------

I see that the cyan boxes are drawn in a separate loop. It seems the only workaround is to use that loop to add the "lost text" back. But this is going to be very problematic: we would need to determine which characters will be missing from the textPositions list and implement a separate way of processing them (I am not sure this is possible at all without a TextPosition object).

> Text Stripper in 2.0 lost some texts - regression
> -------------------------------------------------
>
>                 Key: PDFBOX-3710
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3710
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>            Reporter: Roman
>         Attachments: highlight19.pdf_page1-marked-1.png, highlight19.pdf_page1.pdf, regression_in_blue.png
>
>
> After migrating our app from PDFBox 1.8 to 2.0, we noticed a regression: 4 lines of text have disappeared. These are the three lines marked with a black bullet and the word "OVERALL", which is placed higher up in the table.
> The problematic PDF is attached - [^highlight19.pdf_page1.pdf]
> Also attached is the result of the [DrawPrintTextLocations|https://apache.googlesource.com/pdfbox/+/trunk/examples/src/main/java/org/apache/pdfbox/examples/util/DrawPrintTextLocations.java] example - [highlight19.pdf_page1-marked-1.png|https://issues.apache.org/jira/secure/attachment/12856229/highlight19.pdf_page1-marked-1.png]
> Notice that the Unicode values and the red and blue boxes are missing for the problematic text. The main problem is that these glyphs are absent from the *textPositions* parameter that is passed to the *writeString* function, line #275. In version 1.8 these characters ARE present, so their positions and character codes could be extracted without problems in our app.
> Also attached is a picture of the regression in our app - [^regression_in_blue.png]. Here, blue boxes are drawn where text WAS present and has since disappeared. (The purple boxes are OK and should be ignored.)
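
To illustrate the step described in the comment (working out which characters never reach the textPositions list), here is a minimal sketch in plain Java. It deliberately does not call PDFBox: `Glyph`, `MissingGlyphFinder` and `findMissing` are hypothetical names, and `Glyph` is only a stand-in for the character and position data a TextPosition would carry. It assumes the glyphs drawn by the separate loop and the glyphs passed to writeString can both be collected into lists, which may not be possible in practice.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of PDFBox.
class MissingGlyphFinder {

    /** Hypothetical stand-in for the data a TextPosition would carry. */
    static final class Glyph {
        final String unicode;
        final float x;
        final float y;

        Glyph(String unicode, float x, float y) {
            this.unicode = unicode;
            this.x = x;
            this.y = y;
        }

        /** Identity key: character plus position rounded to 0.1 units. */
        String key() {
            return unicode + "@" + Math.round(x * 10) + "," + Math.round(y * 10);
        }
    }

    /**
     * Returns the glyphs that were drawn on the page but are absent from
     * the list reported to the text-writing callback.
     */
    static List<Glyph> findMissing(List<Glyph> drawn, List<Glyph> reported) {
        Set<String> seen = new HashSet<>();
        for (Glyph g : reported) {
            seen.add(g.key());
        }
        List<Glyph> missing = new ArrayList<>();
        for (Glyph g : drawn) {
            if (!seen.contains(g.key())) {
                missing.add(g);
            }
        }
        return missing;
    }
}
```

The rounding in `key()` is a crude tolerance chosen for illustration: two positions that fall on opposite sides of a rounding boundary would still be treated as different glyphs, so a real comparison would probably need a distance-based match.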