[jira] (PDFBOX-5580) PDFTextStripperByArea ignores text for overlapping areas (regions) when suppressing duplicate overlapping text

Tilman Hausherr (Jira) Fri, 17 May 2024 06:55:13 -0700


    [ https://issues.apache.org/jira/browse/PDFBOX-5580 ]



    Tilman Hausherr deleted comment on PDFBOX-5580:
    -----------------------------------------

was (Author: jira-bot):
Commit 1917787 from Tilman Hausherr in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1917787 ]

PDFBOX-5660, PDFBOX-5580: revert commit due to incompatibility with 
PDFTextStripperByArea

> PDFTextStripperByArea ignores text for overlapping areas (regions) when 
> suppressing duplicate overlapping text
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-5580
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5580
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.27, 3.0.0 PDFBox
>            Reporter: Sebastian Holzki
>            Priority: Minor
>         Attachments: test.pdf
>
>
> h3. Problem
> Recently we encountered duplicate text in our clients' PDF documents. Such 
> duplicates are typically created by applications to simulate bold text when 
> no bold variant of a font is available. Fortunately, PDFBox's 
> PDFTextStripperByArea has logic, inherited from the normal PDFTextStripper, 
> to ignore exact duplicates at the same position. So we changed from 
> setSuppressDuplicateOverlappingText(false) to true.
> However, we found that text is then not extracted correctly for multiple 
> regions under specific conditions:
> When multiple regions overlap each other and would yield exactly the same 
> text, the text of the first region is extracted correctly, but every 
> following region with the same text remains empty.
> We believe this is a bug: duplicate suppression is not handled correctly in 
> PDFTextStripperByArea.
> h3. Possible cause
> While investigating this problem, we found that PDFTextStripperByArea swaps 
> charactersByArticle for multiple regions and interprets a single page 
> multiple times (once for each region). In PDFTextStripper, a private HashMap, 
> characterListMapping, keeps track of possible duplicate symbols with their 
> positions. The HashMap is not reset after each region extraction, which 
> causes characters to be ignored for subsequent areas.
> Since the HashMap is private, we were not able to subclass 
> PDFTextStripperByArea and adjust its behavior to test this finding.
> h3. Workaround
> Extracting the regions one at a time for every page works fine. 
> So far we have not seen any performance disadvantage.
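To illustrate the possible cause and the workaround described above, here is a minimal, self-contained toy model in plain Java. It is not PDFBox code: the class DuplicateSuppressionSketch, its Glyph and Region types, the resetPerRegion flag and the glyph positions are all hypothetical, and exact string keys stand in for whatever position comparison PDFBox actually performs. Only the two region rectangles are taken from the report. The sketch shows how one duplicate-tracking set shared across regions can leave a second, overlapping region empty, and how starting each region with fresh state (which, under this assumption, is what extracting one region at a time amounts to) avoids that.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only: none of these types are PDFBox classes.
class DuplicateSuppressionSketch {

    /** Hypothetical stand-in for a positioned character. */
    static final class Glyph {
        final String text;
        final double x;
        final double y;

        Glyph(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Hypothetical named rectangle given as x, y, width, height. */
    static final class Region {
        final String name;
        final double x;
        final double y;
        final double width;
        final double height;

        Region(String name, double x, double y, double width, double height) {
            this.name = name;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }

        boolean contains(Glyph g) {
            return g.x >= x && g.x < x + width && g.y >= y && g.y < y + height;
        }
    }

    /**
     * Collects the text of each region while suppressing exact duplicates.
     * With resetPerRegion == false, the duplicate tracker is shared by all regions.
     */
    static Map<String, String> extract(List<Glyph> glyphs, List<Region> regions,
                                       boolean resetPerRegion) {
        Set<String> seen = new HashSet<>(); // duplicate tracker (text + position)
        Map<String, String> textByRegion = new LinkedHashMap<>();
        for (Region region : regions) {
            if (resetPerRegion) {
                seen.clear(); // fresh state for every region
            }
            StringBuilder text = new StringBuilder();
            for (Glyph g : glyphs) {
                if (!region.contains(g)) {
                    continue;
                }
                String key = g.text + "@" + g.x + "," + g.y;
                if (seen.add(key)) { // add() returns false for an already seen glyph
                    text.append(g.text);
                }
            }
            textByRegion.put(region.name, text.toString());
        }
        return textByRegion;
    }

    /** Made-up glyphs: "H" drawn twice at the same position (fake bold), then "i". */
    static List<Glyph> sampleGlyphs() {
        return List.of(new Glyph("H", 50, 325), new Glyph("H", 50, 325), new Glyph("i", 56, 325));
    }

    /** The two overlapping rectangles from the reproduction snippet in the report. */
    static List<Region> sampleRegions() {
        return List.of(new Region("A", 45, 319, 124, 19), new Region("B", 43, 319, 130, 19));
    }

    public static void main(String[] args) {
        // Shared tracker: region B finds every glyph already marked as seen.
        System.out.println(extract(sampleGlyphs(), sampleRegions(), false)); // prints {A=Hi, B=}
        // Fresh tracker per region: both regions get their text, duplicates still dropped.
        System.out.println(extract(sampleGlyphs(), sampleRegions(), true));  // prints {A=Hi, B=Hi}
    }
}
```

In both runs the doubled "H" is still collapsed to a single character inside a region, so resetting the tracker between regions in this model does not disable duplicate suppression itself.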
> h3. Reproduction
> The attached PDF file does not actually include duplicate overlapping text 
> since this is not needed to reproduce the issue.
>  
> {code:java}
> // PDDocument.load is the PDFBox 2.0.x API; PDFBox 3.0.x uses Loader.loadPDF instead.
> try (final PDDocument doc = PDDocument.load(new File("C:\\Source\\test.pdf"))) {
>     final PDFTextStripperByArea stripper = new PDFTextStripperByArea();
>     stripper.setSuppressDuplicateOverlappingText(true);
>     stripper.setPageEnd("");
>     final Rectangle2D areaA = new Rectangle2D.Double(45, 319, 124, 19);
>     final Rectangle2D areaB = new Rectangle2D.Double(43, 319, 130, 19);
>     stripper.addRegion("A", areaA);
>     stripper.addRegion("B", areaB);
>     stripper.extractRegions(doc.getPage(0));
>     System.out.println("A: " + stripper.getTextForRegion("A"));
>     System.out.println("B: " + stripper.getTextForRegion("B"));
> } {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

