[ https://issues.apache.org/jira/browse/PDFBOX-5580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17847298#comment-17847298 ]
ASF subversion and git services commented on PDFBOX-5580:
---------------------------------------------------------
Commit 1917785 from Tilman Hausherr in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1917785 ]
PDFBOX-5660, PDFBOX-5580: revert commit due to incompatibility with
PDFTextStripperByArea
> PDFTextStripperByArea ignores text for overlapping areas (regions) when
> suppressing duplicate overlapping text
> --------------------------------------------------------------------------------------------------------------
>
> Key: PDFBOX-5580
> URL: https://issues.apache.org/jira/browse/PDFBOX-5580
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.27, 3.0.0 PDFBox
> Reporter: Sebastian Holzki
> Priority: Minor
> Attachments: test.pdf
>
>
> h3. Problem
> Recently we encountered duplicate text in our clients' PDF documents. Such
> duplicates are typically created by applications to simulate bold text when
> no bold variant of a font is available. Fortunately, PDFBox's
> PDFTextStripperByArea has logic to ignore exact duplicates at the same
> positions in these situations (inherited from the normal
> PDFTextStripper), so we changed from
> setSuppressDuplicateOverlappingText(false) to true.
> However, we found that text for multiple regions is not extracted
> correctly in this case when certain conditions are met:
> When multiple regions overlap each other and would yield
> exactly the same text, the text of the first region is extracted correctly, but
> any following region with the same text remains empty.
> We believe this is a bug caused by duplicate suppression not being handled
> correctly in PDFTextStripperByArea.
> h3. Possible cause
> While investigating this problem, we found that PDFTextStripperByArea swaps
> charactersByArticle for multiple regions and interprets a single page
> multiple times (once for each region). In PDFTextStripper, a private HashMap,
> characterListMapping, keeps track of possible duplicate symbols with their
> positions. The HashMap is not reset after each region extraction, which
> leads to characters being ignored for subsequent areas.
> Since the HashMap is private, we were not able to subclass and customize
> PDFTextStripperByArea with adjusted behavior to test this finding.
> h3. Workaround
> When regions are extracted one at a time for every page, everything works fine.
> We currently don't see any performance disadvantages.
> h3. Reproduction
> The attached PDF file does not actually include duplicate overlapping text
> since this is not needed to reproduce the issue.
>
> {code:java}
> try (final PDDocument doc = PDDocument.load(new File("C:\\Source\\test.pdf"))) {
>     final PDFTextStripperByArea stripper = new PDFTextStripperByArea();
>     stripper.setSuppressDuplicateOverlappingText(true);
>     stripper.setPageEnd("");
>     final Rectangle2D areaA = new Rectangle2D.Double(45, 319, 124, 19);
>     final Rectangle2D areaB = new Rectangle2D.Double(43, 319, 130, 19);
>     stripper.addRegion("A", areaA);
>     stripper.addRegion("B", areaB);
>     stripper.extractRegions(doc.getPage(0));
>     System.out.println("A: " + stripper.getTextForRegion("A"));
>     System.out.println("B: " + stripper.getTextForRegion("B"));
> }
> {code}
>
>
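As a rough illustration of the cause suggested in the quoted report, the following self-contained sketch models duplicate suppression with a "seen" set that is shared across region passes. It is a simplified model and not PDFBox code: the class DuplicateSuppressionModel, the Glyph and Region types, and the resetPerRegion flag are hypothetical names introduced here, and the real characterListMapping logic is assumed to be more involved than a plain set lookup.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Simplified, hypothetical model of the reported behavior; not PDFBox code. */
public class DuplicateSuppressionModel {

    /** Hypothetical stand-in for a piece of text at a page position. */
    static final class Glyph {
        final String text;
        final double x;
        final double y;

        Glyph(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }

        /** Key used to detect an exact duplicate at the same position. */
        String key() {
            return text + "@" + x + "," + y;
        }
    }

    /** Hypothetical rectangular region given as x, y, width, height. */
    static final class Region {
        final double x;
        final double y;
        final double w;
        final double h;

        Region(double x, double y, double w, double h) {
            this.x = x;
            this.y = y;
            this.w = w;
            this.h = h;
        }

        boolean contains(Glyph g) {
            return g.x >= x && g.x <= x + w && g.y >= y && g.y <= y + h;
        }
    }

    /**
     * Walks the page once per region, as the report describes. When
     * resetPerRegion is false, the duplicate-tracking state leaks from one
     * region pass into the next.
     */
    static Map<String, String> extract(List<Glyph> page,
                                       Map<String, Region> regions,
                                       boolean resetPerRegion) {
        Set<String> seen = new HashSet<>(); // assumed analogue of characterListMapping
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, Region> entry : regions.entrySet()) {
            if (resetPerRegion) {
                seen.clear(); // fresh state for every region
            }
            StringBuilder text = new StringBuilder();
            for (Glyph g : page) {
                // Keep a glyph only if it lies in the region and was not seen before.
                if (entry.getValue().contains(g) && seen.add(g.key())) {
                    text.append(g.text);
                }
            }
            result.put(entry.getKey(), text.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        List<Glyph> page = new ArrayList<>();
        page.add(new Glyph("H", 50, 320));
        page.add(new Glyph("i", 56, 320));

        // Same rectangles as in the reproduction snippet of the report.
        Map<String, Region> regions = new LinkedHashMap<>();
        regions.put("A", new Region(45, 319, 124, 19));
        regions.put("B", new Region(43, 319, 130, 19));

        System.out.println(extract(page, regions, false)); // prints {A=Hi, B=}
        System.out.println(extract(page, regions, true));  // prints {A=Hi, B=Hi}
    }
}
```

With resetPerRegion set to false, the second region comes back empty, which mirrors the reported behavior. Clearing the state for each region corresponds to the workaround of extracting one region at a time.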