[ https://issues.apache.org/jira/browse/PDFBOX-439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722610#action_12722610 ]
Brian Carrier commented on PDFBOX-439:
--------------------------------------

A few more details on what we tried:
- Our goal was to detect the overlaps based on text coordinates, using logic similar to how we currently detect and merge diacritics (see PDFBOX-444).
- There is an open question about how to efficiently search through the existing TextPositions to find the overlap, because we are not storing them in sorted order. Our initial, basic approach compared each new TextPosition against the existing TextPositions, and this caused the regression tests to take 4 times as long. Storing them in sorted order would make the search more efficient, but there has been a desire to preserve the non-sorted order of the text chunks.
- In general, the merging approach worked, except that some files in the regression tests had character widths of 0 and others had very large widths. The 0s occurred because the character width is not currently being calculated in processEncodedText() for rotated text; we could not find the source of the very large widths.
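
To make the search question above concrete, here is a rough sketch of one possible way to look for overlaps without sorting the stored chunks. It is not the code we tried and not PDFBox API: Glyph and OverlapFilter are hypothetical names, Glyph is only a stand-in for TextPosition, and the sketch assumes the duplicated output comes from identical text drawn at nearly the same coordinates. It keeps the chunks in arrival order and adds a second index keyed by the text itself, so a new chunk is compared only against earlier chunks with the same text rather than against everything seen so far.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a positioned piece of text; not the PDFBox TextPosition class. */
final class Glyph {
    final String text;
    final float x;
    final float y;
    final float width;

    Glyph(String text, float x, float y, float width) {
        this.text = text;
        this.x = x;
        this.y = y;
        this.width = width;
    }
}

/** Keeps glyphs in arrival order and drops any glyph drawn on top of an identical one. */
final class OverlapFilter {
    // Arrival order is preserved here; nothing is ever sorted.
    private final List<Glyph> inOrder = new ArrayList<>();
    // Secondary index: text -> glyphs already accepted with that text.
    private final Map<String, List<Glyph>> byText = new HashMap<>();
    // Fraction of the glyph width within which two positions count as the same (assumed value).
    private final float tolerance;

    OverlapFilter(float tolerance) {
        this.tolerance = tolerance;
    }

    /** Returns true if the glyph was kept, false if it overlapped an earlier identical glyph. */
    boolean add(Glyph g) {
        List<Glyph> sameText = byText.computeIfAbsent(g.text, k -> new ArrayList<>());
        for (Glyph seen : sameText) {
            if (overlaps(seen, g)) {
                return false;
            }
        }
        sameText.add(g);
        inOrder.add(g);
        return true;
    }

    private boolean overlaps(Glyph a, Glyph b) {
        // The limit scales with the reported width, so a width of 0 only matches
        // exact coordinates and a very large width matches far too much.
        float limit = tolerance * Math.max(a.width, b.width);
        return Math.abs(a.x - b.x) <= limit && Math.abs(a.y - b.y) <= limit;
    }

    List<Glyph> glyphs() {
        return inOrder;
    }
}
```

Note that the overlap test depends directly on the reported character width, which is why the 0 and very large widths mentioned above matter: with a check like this, a width of 0 would miss near-duplicates and a very large width would merge text that does not actually overlap.
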
> Incorrect text for Exolab.pdf in Regression Test
> ------------------------------------------------
>
> Key: PDFBOX-439
> URL: https://issues.apache.org/jira/browse/PDFBOX-439
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Reporter: Justin LeFebvre
>
> When looking through text for an unrelated issue, I noticed that the file
> Exolab.pdf in the regression test produced the following line,
> JAJAVVAA CODINING STANDAG
> STANDARD.......................................................................................................................1
> when the line should say,
> JAVA CODING STANDARD
> .......................................................................................................................1
> Also this line,
> 5 COD5 CODE EXAMPLMPLES................................S
> ...................................................................................................................................26
> should be
> 5 CODE
> EXAMPLES...................................................................................................................................26
> However, Adobe has trouble with this one as well.
> These two issues only occurred when the file was run with the -sort option
> enabled.
> However, in both the unsorted and sorted tests this line was improperly
> handled:
> APPENDIX A : DOCUMENT HISTORYT HISTORYT
> HISTORY...................................................................................................33
>
> should produce
> APPENDIX A : DOCUMENT HISTORY
> ...................................................................................................33
> I ran into this test using the current trunk.
> The Exolab.pdf file is located in the ..\source\trunk\test\input folder