[ 
https://issues.apache.org/jira/browse/PDFBOX-439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722610#action_12722610
 ] 

Brian Carrier commented on PDFBOX-439:
--------------------------------------

A few more details on what we tried:
- Our goal was to detect the overlaps based on text coordinates and use logic 
similar to how we are currently detecting and merging in diacritics (see 
PDFBOX-444). 
- There is an open question about how to efficiently search through existing 
TextPositions to find the overlap, because we are not storing them in sorted 
order.  We initially took the basic approach of comparing each new 
TextPosition with every existing TextPosition, and this caused the regression 
tests to take 4 times as long.  Storing them in sorted order would make the 
search more efficient, but there has been a desire to preserve the non-sorted 
order of the text chunks.
- In general, the merging approach worked, except that we found some files in 
the regression tests that had character widths of 0 and others that had very 
large widths. The 0s occur because the character width is not currently being 
calculated in processEncodedText() for rotated text. We could not find the 
source of the very large widths.
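
To make the search question above concrete, here is a minimal sketch of one 
possible way to avoid the pairwise comparison without sorting the stored 
text: keep the positions in arrival order and maintain a separate grid index 
keyed by coordinates. This is not what we implemented and it does not use 
PDFBox classes. Glyph, OverlapFilter, the cell size, and the tolerance are 
all hypothetical names and values, and it assumes an "overlap" means the same 
character drawn again at nearly the same coordinates.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a text position; NOT the PDFBox TextPosition class. */
final class Glyph {
    final String text;
    final float x, y, width;

    Glyph(String text, float x, float y, float width) {
        this.text = text;
        this.x = x;
        this.y = y;
        this.width = width;
    }
}

/** Keeps glyphs in arrival order and uses a grid index to spot duplicates. */
final class OverlapFilter {
    private final float cell;       // grid cell size (assumed value, in text units)
    private final float tolerance;  // max offset at which two glyphs count as the same
    private final List<Glyph> ordered = new ArrayList<>();
    private final Map<Long, List<Glyph>> grid = new HashMap<>();

    OverlapFilter(float cell, float tolerance) {
        // Checking only the 3x3 neighbouring cells is sufficient when tolerance <= cell.
        if (cell <= 0 || tolerance < 0 || tolerance > cell) {
            throw new IllegalArgumentException("need 0 <= tolerance <= cell and cell > 0");
        }
        this.cell = cell;
        this.tolerance = tolerance;
    }

    private static long key(long cx, long cy) {
        return (cx << 32) ^ (cy & 0xffffffffL);
    }

    /** Returns true if the glyph was kept, false if it was dropped as a duplicate. */
    boolean add(Glyph g) {
        long cx = (long) Math.floor(g.x / cell);
        long cy = (long) Math.floor(g.y / cell);
        for (long dx = -1; dx <= 1; dx++) {
            for (long dy = -1; dy <= 1; dy++) {
                List<Glyph> bucket = grid.get(key(cx + dx, cy + dy));
                if (bucket == null) {
                    continue;
                }
                for (Glyph other : bucket) {
                    if (other.text.equals(g.text)
                            && Math.abs(other.x - g.x) <= tolerance
                            && Math.abs(other.y - g.y) <= tolerance) {
                        return false;
                    }
                }
            }
        }
        grid.computeIfAbsent(key(cx, cy), k -> new ArrayList<>()).add(g);
        ordered.add(g);
        return true;
    }

    /** Glyphs that were kept, in the order they arrived (never re-sorted). */
    List<Glyph> glyphs() {
        return ordered;
    }
}
```

Each new glyph is compared only against glyphs in its own and adjacent grid 
cells, so the cost per insertion depends on local density rather than on the 
total number of stored positions. The sketch deliberately uses a fixed 
tolerance instead of the glyph width, since the widths we saw could be 0 or 
very large.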



> Incorrect text for Exolab.pdf in Regression Test
> ------------------------------------------------
>
>                 Key: PDFBOX-439
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-439
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>            Reporter: Justin LeFebvre
>
> When looking through text for an unrelated issue, I noticed that the file 
> Exolab.pdf in the regression test produced the following line,
> JAJAVVAA CODINING STANDAG 
> STANDARD.......................................................................................................................1
> when the line should say,
> JAVA CODING STANDARD 
> .......................................................................................................................1
> Also this line,
> 5 COD5 CODE EXAMPLMPLES................................S 
> ...................................................................................................................................26
> should be
> 5 CODE 
> EXAMPLES...................................................................................................................................26
> However, Adobe has trouble with this one as well. 
> These two issues only occurred when the file was run with the -sort option 
> enabled. 
> However, in both the unsorted and sorted tests this line was improperly 
> handled:
> APPENDIX A : DOCUMENT HISTORYT HISTORYT 
> HISTORY...................................................................................................33
>  
> should produce
> APPENDIX A : DOCUMENT HISTORY 
> ...................................................................................................33
> I ran into this test using the current trunk. 
> The Exolab.pdf file is located in the ..\source\trunk\test\input folder 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
