[jira] Created: (PDFBOX-662) PDFTextStripper character suppression

Mel Martinez (JIRA) Mon, 15 Mar 2010 15:28:53 -0700

PDFTextStripper character suppression
-------------------------------------


                 Key: PDFBOX-662
                 URL: https://issues.apache.org/jira/browse/PDFBOX-662
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.0.0
         Environment: any
            Reporter: Mel Martinez


When parsing the file posted as an example for PDFBOX-659, I noticed that 
numerous characters were missing from the extracted text.

They are getting 'suppressed' in the 
PDFTextStripper.processTextPosition(TextPosition) method, in a section that is 
meant to filter out duplicate characters found in some MS Word-generated 
documents.

The problem is that the filter is over-zealous (in the case of this document) 
and matches real characters against other real characters in the text.  Example:

   This is some text that has the letter 'e' in it multiple times.

The filter might match one of the later 'e's to an earlier 'e' incorrectly (for 
example, the one at the end of 'some'), resulting in the extracted text:

   This is some text that has the letter 'e' in it multiple tims.

From what I can tell, this is because it is using the raw, padded coordinates 
rather than resolved coordinates.

The example PDF document (see PDFBOX-659) has pages that use both positive and 
negative raw coordinates that, on my cursory inspection, do not always resolve 
against the same offset point.

The suppression test logic compares TextPosition elements that seem to have 
different offsets, possibly due to different amounts of padding.  Thus the 
'overlap' that it detects is wrong.  It is not comparing apples to apples.
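
The failure mode described above can be sketched in isolation.  The following
is a minimal, self-contained illustration, not PDFBox's actual code: the Glyph
class, its frameOffset field, the extract method, the sample data and the
tolerance value are all hypothetical names and numbers invented for this
sketch.  It assumes two runs of text whose raw x coordinates are expressed
relative to different offsets, and shows how a same-character overlap test on
raw coordinates drops a real 'e', while the same test on resolved coordinates
keeps it.

```java
import java.util.ArrayList;
import java.util.List;

class OverlapSuppressionSketch {

    /** Hypothetical stand-in for a text position (not the PDFBox TextPosition class). */
    static final class Glyph {
        final char c;             // the character
        final double rawX;        // x coordinate as stored, relative to its own frame
        final double frameOffset; // assumed offset of that frame on the page

        Glyph(char c, double rawX, double frameOffset) {
            this.c = c;
            this.rawX = rawX;
            this.frameOffset = frameOffset;
        }

        /** Coordinate in a common frame, so that glyphs are comparable. */
        double resolvedX() {
            return rawX + frameOffset;
        }
    }

    /**
     * Keeps a glyph unless an already kept glyph has the same character and an
     * x coordinate within the given tolerance. Compares raw or resolved
     * coordinates depending on useResolved.
     */
    static String extract(List<Glyph> glyphs, boolean useResolved, double tolerance) {
        List<Glyph> kept = new ArrayList<Glyph>();
        StringBuilder out = new StringBuilder();
        for (Glyph g : glyphs) {
            double gx = useResolved ? g.resolvedX() : g.rawX;
            boolean duplicate = false;
            for (Glyph k : kept) {
                double kx = useResolved ? k.resolvedX() : k.rawX;
                if (k.c == g.c && Math.abs(gx - kx) < tolerance) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                kept.add(g);
                out.append(g.c);
            }
        }
        return out.toString();
    }

    /** "some" in a frame with offset 0, then "times" in a frame with offset 100. */
    static List<Glyph> sample() {
        List<Glyph> glyphs = new ArrayList<Glyph>();
        glyphs.add(new Glyph('s', 10, 0));
        glyphs.add(new Glyph('o', 16, 0));
        glyphs.add(new Glyph('m', 22, 0));
        glyphs.add(new Glyph('e', 31, 0));
        glyphs.add(new Glyph('t', 20, 100));
        glyphs.add(new Glyph('i', 24, 100));
        glyphs.add(new Glyph('m', 27, 100));
        glyphs.add(new Glyph('e', 31, 100)); // same raw x as the 'e' in "some"
        glyphs.add(new Glyph('s', 37, 100));
        return glyphs;
    }

    public static void main(String[] args) {
        System.out.println(extract(sample(), false, 1.0)); // raw comparison
        System.out.println(extract(sample(), true, 1.0));  // resolved comparison
    }
}
```

With this sample data, the raw comparison yields "sometims" and the resolved
comparison yields "sometimes".  The only difference between the two runs is
whether the coordinates are brought into a common frame before the overlap
test.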

The document renders perfectly in Acrobat, so I believe we are not handling 
the coordinates correctly.

A workaround is to disable the filtering by setting the 

PDFTextStripper.setSuppressDuplicateOverlappingText(boolean)

attribute to false.  But that only hides the fact that the logic is wrong.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
