[jira] [Updated] (PDFBOX-5580) PDFTextStripperByArea ignores text for overlapping areas (regions) when suppressing duplicate overlapping text

Sebastian Holzki (Jira) Thu, 23 Mar 2023 04:08:27 -0700


     [ https://issues.apache.org/jira/browse/PDFBOX-5580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Sebastian Holzki updated PDFBOX-5580:
-------------------------------------
    Description: 
h3. Problem

Recently we encountered duplicated text in our clients' PDF documents. Such 
duplicates are typically created by applications to simulate bold text when no 
bold variant of a font is available. Fortunately, PDFBox's 
PDFTextStripperByArea has logic, inherited from PDFTextStripper, to ignore 
exact duplicates at the same position in these situations. So we changed 
setSuppressDuplicateOverlappingText(false) to 
setSuppressDuplicateOverlappingText(true).

However, with this setting the text of multiple regions is not extracted 
correctly when certain conditions are met:

When multiple regions overlap each other and would yield exactly the same 
text, the text of the first region is extracted correctly, but every following 
region with the same text remains empty.

We believe this is a bug: PDFTextStripperByArea does not apply duplicate 
suppression correctly across regions.
h3. Possible cause

While investigating this problem, we found that PDFTextStripperByArea swaps 
charactersByArticle for each region and interprets a single page multiple 
times (once per region). In PDFTextStripper, a private HashMap named 
characterListMapping keeps track of possible duplicate characters and their 
positions. This HashMap is not reset after each region extraction, so 
characters that were already recorded are ignored for subsequent regions.

Since the HashMap is private, we were not able to subclass 
PDFTextStripperByArea and adjust its behavior to test this finding.
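
The suspected mechanism can be illustrated with a simplified model. This is a 
sketch only and not PDFBox code: the class SharedDuplicateMapModel, the Glyph 
type and the x-range regions are hypothetical stand-ins, and the map only 
plays the role that the report attributes to characterListMapping. It assumes, 
as described above, that the same glyphs are interpreted once per region while 
the duplicate-tracking map is never cleared in between.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Simplified model of the suspected cause; not PDFBox code. */
final class SharedDuplicateMapModel {

    /** Hypothetical stand-in for a text position: one character at (x, y). */
    static final class Glyph {
        final String text;
        final float x;
        final float y;

        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Plays the role of characterListMapping: character -> positions already seen.
    private final Map<String, Set<String>> seen = new HashMap<>();

    /** Returns true the first time a character is seen at a position, false afterwards. */
    private boolean showCharacter(Glyph g) {
        return seen.computeIfAbsent(g.text, k -> new HashSet<>()).add(g.x + "," + g.y);
    }

    /** Interprets the same glyphs once per region without clearing 'seen' in between. */
    Map<String, String> extract(List<Glyph> page, Map<String, float[]> regions) {
        Map<String, String> textByRegion = new LinkedHashMap<>();
        for (Map.Entry<String, float[]> region : regions.entrySet()) {
            float minX = region.getValue()[0];
            float maxX = region.getValue()[1];
            StringBuilder text = new StringBuilder();
            for (Glyph g : page) {
                // The duplicate check runs only for glyphs inside the region.
                if (g.x >= minX && g.x <= maxX && showCharacter(g)) {
                    text.append(g.text);
                }
            }
            textByRegion.put(region.getKey(), text.toString());
        }
        return textByRegion;
    }

    public static void main(String[] args) {
        List<Glyph> page = Arrays.asList(new Glyph("H", 50, 330), new Glyph("i", 58, 330));
        Map<String, float[]> regions = new LinkedHashMap<>();
        regions.put("A", new float[] {45, 169}); // x range of areaA in the reproduction
        regions.put("B", new float[] {43, 173}); // x range of areaB in the reproduction
        Map<String, String> result = new SharedDuplicateMapModel().extract(page, regions);
        System.out.println("A: " + result.get("A")); // prints "A: Hi"
        System.out.println("B: " + result.get("B")); // prints "B: " (empty)
    }
}
```

In this model the second region stays empty because every glyph it contains 
was already recorded while the first region was processed. Whether PDFBox 
behaves the same way internally is not confirmed here.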
h3. Workaround

Extracting the regions one at a time for every page works fine. So far we 
have not seen any performance disadvantage.
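
The idea behind this workaround can be sketched with a small model. Again this 
is not PDFBox code: OneRegionPerCallModel, Glyph and Region are hypothetical 
names, and the model assumes that duplicate tracking starts empty for each 
extraction call but is shared by all regions handled within that call.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Simplified model of the workaround; not PDFBox code. */
final class OneRegionPerCallModel {

    /** Hypothetical glyph: one character at (x, y). */
    static final class Glyph {
        final String text;
        final float x;
        final float y;

        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Hypothetical region: a name plus the x range it covers. */
    static final class Region {
        final String name;
        final float minX;
        final float maxX;

        Region(String name, float minX, float maxX) {
            this.name = name;
            this.minX = minX;
            this.maxX = maxX;
        }
    }

    /** Stand-in for one extraction call (assumption: 'seen' starts empty per call). */
    static Map<String, String> extractRegions(List<Glyph> page, List<Region> regions) {
        Set<String> seen = new HashSet<>();
        Map<String, String> textByRegion = new LinkedHashMap<>();
        for (Region region : regions) {
            StringBuilder text = new StringBuilder();
            for (Glyph g : page) {
                boolean inside = g.x >= region.minX && g.x <= region.maxX;
                // A glyph is kept only if this call has not recorded it before.
                if (inside && seen.add(g.text + "@" + g.x + "," + g.y)) {
                    text.append(g.text);
                }
            }
            textByRegion.put(region.name, text.toString());
        }
        return textByRegion;
    }

    /** Workaround: one extraction call per region, so no state is carried over. */
    static Map<String, String> extractOneAtATime(List<Glyph> page, List<Region> regions) {
        Map<String, String> textByRegion = new LinkedHashMap<>();
        for (Region region : regions) {
            textByRegion.putAll(extractRegions(page, Collections.singletonList(region)));
        }
        return textByRegion;
    }

    public static void main(String[] args) {
        List<Glyph> page = Arrays.asList(new Glyph("H", 50, 330), new Glyph("i", 58, 330));
        List<Region> regions = Arrays.asList(new Region("A", 45, 169), new Region("B", 43, 173));
        System.out.println(extractRegions(page, regions));    // prints {A=Hi, B=}
        System.out.println(extractOneAtATime(page, regions)); // prints {A=Hi, B=Hi}
    }
}
```

With the real API this would presumably correspond to registering a single 
region before each extractRegions call. That mapping is an assumption based on 
the workaround described above and has not been verified against PDFBox here.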
h3. Reproduction

The attached PDF file does not contain any duplicate overlapping text, since 
none is needed to reproduce the issue.

 
{code:java}
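import java.awt.geom.Rectangle2D;
import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripperByArea;

// Fragment: run inside a method that declares "throws IOException".
try (final PDDocument doc = PDDocument.load(new File("C:\\Source\\test.pdf"))) {
    final PDFTextStripperByArea stripper = new PDFTextStripperByArea();
    stripper.setSuppressDuplicateOverlappingText(true);
    stripper.setPageEnd("");

    // Two overlapping regions that cover the same text.
    final Rectangle2D areaA = new Rectangle2D.Double(45, 319, 124, 19);
    final Rectangle2D areaB = new Rectangle2D.Double(43, 319, 130, 19);

    stripper.addRegion("A", areaA);
    stripper.addRegion("B", areaB);

    // Both regions are extracted in a single call.
    stripper.extractRegions(doc.getPage(0));

    // As reported above: the first region yields the text, the following one stays empty.
    System.out.println("A: " + stripper.getTextForRegion("A"));
    System.out.println("B: " + stripper.getTextForRegion("B"));
}
{code}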
 

 

  was:
h3. Problem

Recently we encountered duplicate texts in our clients PDF documents which are 
typically created by applications to simulate some kind of bold text when no 
bold variant of a font is available. Fortunately, PDFBox's 
PDFTextStripperByArea has some logic to ignore exact duplicates at the same 
positions for these situations (which is inherited from the normal 
PDFTextStripper). So we changed from setSuppressDuplicateOverlappingText(false) 
to true.

But we encountered that texts for multiple regions are not extracted correctly 
in this case when some special conditions are met:

When using multiple regions which overlap each other and would provide exactly 
the same text, the first region text is extracted correctly but any following 
region with same text remains empty.

We believe this is a bug due to duplicate suppression not being respected 
correctly in PDFTextStripperByArea.
h3. Possible cause

While investigating this problem we found that PDFTextStripperByArea swaps 
charactersByArticle for multiple regions and interprets a single page multiple 
times (once for each region). In PDFTextStripper a private HashMap 
characterListMapping keeps track of possible duplicate symbols with their 
positions. The HasMap is not being reset after each region extraction which 
leads to characters being ignored for subsequent areas.

Since the HashMap is private we were not able to subclass and customize 
PDFTextStripperByArea with some adjusted behavior to test this finding.
h3. Workaround

When extracting regions one at a time for every page everything works fine. We 
currently don't see any performance disadvantages.
h3. Reproduction

The attached PDF file does not actually include duplicate overlapping text 
which is actually not needed to reproduce the issue.

 
{code:java}
try (final PDDocument doc = PDDocument.load(new File("C:\\Source\\test.pdf"))) {
    final PDFTextStripperByArea stripper = new PDFTextStripperByArea();
    stripper.setSuppressDuplicateOverlappingText(true);
    stripper.setPageEnd("");

    final Rectangle2D areaA = new Rectangle2D.Double(45, 319, 124, 19);
    final Rectangle2D areaB = new Rectangle2D.Double(43, 319, 130, 19);

    stripper.addRegion("A", areaA);
    stripper.addRegion("B", areaB);

    stripper.extractRegions(doc.getPage(0));

    System.out.println("A: " + stripper.getTextForRegion("A"));
    System.out.println("B: " + stripper.getTextForRegion("B"));
} {code}
 

 


> PDFTextStripperByArea ignores text for overlapping areas (regions) when 
> suppressing duplicate overlapping text
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-5580
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5580
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.27
>            Reporter: Sebastian Holzki
>            Priority: Minor
>         Attachments: test.pdf



