PDFTextStripperByArea extracts text only from 1 region, despite several regions being defined

Ismael Hasan Tue, 21 Jul 2009 03:18:20 -0700

Hello. I have a problem with the class
"org.apache.pdfbox.util.PDFTextStripperByArea":


If I add several regions to this class, text is retrieved from only one
of them. The example I built creates two regions with the same
coordinates but different names, adds them to the text stripper, and
calls the "extractRegions" function.

I would really appreciate it if someone could tell me what I am doing
wrong, or whether this is a bug in the tool.

Please see the code that reproduces this issue at the end of the
message. The final result strings (localResult1 and localResult2) have
different content (one of them is empty), even though both regions are
identical. If you need a PDF document to reproduce this, please ask me
for it.

Thanks in advance,
Ismael



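import java.awt.geom.Rectangle2D;
import java.io.ByteArrayInputStream;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.util.PDFTextStripperByArea;

// Opening the document and getting the page
// (documentInBytes is a byte[] holding the PDF; pageNumber is a zero-based index)
PDFParser parser = new PDFParser(new ByteArrayInputStream(documentInBytes));
parser.parse();
PDDocument doc = parser.getPDDocument();
PDPage page = (PDPage) doc.getDocumentCatalog().getAllPages().get(pageNumber);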

// Creating the stripper
PDFTextStripperByArea areaStripper = new PDFTextStripperByArea();

// Creation and addition of the regions to the stripper
Rectangle2D rectangle = new Rectangle2D.Float();
rectangle.setRect(0, 0, 500, 100);
areaStripper.addRegion("1", rectangle);

Rectangle2D rectangle2 = new Rectangle2D.Float();
rectangle2.setRect(0, 0, 500, 100);
areaStripper.addRegion("2", rectangle2);

// Extracting the regions and getting the results
areaStripper.extractRegions(page);
String localResult1 = areaStripper.getTextForRegion("1");
String localResult2 = areaStripper.getTextForRegion("2");
