Praveer created PDFBOX-3176:
-------------------------------
Summary: Add a removeRegion method in PDFTextStripperByArea class
Key: PDFBOX-3176
URL: https://issues.apache.org/jira/browse/PDFBOX-3176
Project: PDFBox
Issue Type: Improvement
Components: Text extraction
Affects Versions: 1.8.10
Environment: All
Reporter: Praveer
Fix For: 1.8.10
Hi,
I am parsing a very complicated PDF for which I had to enable sorting by
position (setSortByPosition(true)); otherwise the parser cannot extract the
text in sequential order.
So I decided to use the PDFTextStripperByArea class and define rectangles to
extract text from. The problem is that if I define many rectangles on a single
page, the extracted text again has no logical sequence. To get around this, it
would be great to have a method to remove regions: we could then add a region,
extract its text, remove that region, add a new region, and so on.
I have already done a proof of concept on my local machine and it works fine.
I added this method and tested it:
    public void removeRegion(String regionName) {
        this.regions.remove(regionName);
        this.regionArea.remove(regionName);
    }
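To illustrate the add / extract / remove cycle described above, here is a
minimal, self-contained sketch. RegionRegistry and RegionCycleDemo are
hypothetical stand-ins written for this example; they are not PDFBox classes
and do no text extraction. The sketch only assumes that the stripper tracks
region names in a list (regions) and their rectangles in a map (regionArea),
as the proposed method implies.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the region bookkeeping of PDFTextStripperByArea.
// Field names follow the proposed method; everything else is an assumption.
class RegionRegistry {
    private final List<String> regions = new ArrayList<String>();
    private final Map<String, Rectangle2D> regionArea =
            new HashMap<String, Rectangle2D>();

    void addRegion(String regionName, Rectangle2D rect) {
        regions.add(regionName);
        regionArea.put(regionName, rect);
    }

    // The proposed method: drop the name and its rectangle together.
    void removeRegion(String regionName) {
        regions.remove(regionName);
        regionArea.remove(regionName);
    }

    List<String> getRegions() {
        return regions;
    }

    boolean hasArea(String regionName) {
        return regionArea.containsKey(regionName);
    }
}

class RegionCycleDemo {
    public static void main(String[] args) {
        RegionRegistry registry = new RegionRegistry();
        String[] names = { "header", "body", "footer" };
        for (int i = 0; i < names.length; i++) {
            // One region at a time, so the caller controls the reading order.
            registry.addRegion(names[i],
                    new Rectangle2D.Double(0, i * 100, 500, 100));
            // Text extraction for this single region would happen here.
            System.out.println("active regions: " + registry.getRegions());
            registry.removeRegion(names[i]);
        }
    }
}
```

With only one region registered at any time, the order of the extracted text
is decided by the order in which the caller adds the regions, not by the
stripper.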
I can contribute this code myself if you would like; just let me know.
Thanks and regards,
Praveer