[
https://issues.apache.org/jira/browse/PDFBOX-2775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14514755#comment-14514755
]
Tilman Hausherr commented on PDFBOX-2775:
-----------------------------------------
A bead is a rectangle containing text, a thread is a sequence of such
rectangles, and a page can have several threads, much like a printed newspaper
page carries several articles spread across several columns.
In normal text stripping, the size of charactersByArticle is set to the number
of beads on the page * 2 + 1. There is a good comment in the source that
explains why:
{code}
* The charactersByArticle is used to extract text by article divisions. For example
* a PDF that has two columns like a newspaper, we want to extract the first column and
* then the second column. In this example the PDF would have 2 beads(or articles), one for
* each column. The size of the charactersByArticle would be 5, because not all text on the
* screen will fall into one of the articles. The five divisions are shown below
*
* Text before first article
* first article text
* text between first article and second article
* second article text
* text after second article
*
* Most PDFs won't have any beads, so charactersByArticle will contain a single entry.
{code}
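To make the arithmetic in that comment concrete, here is a minimal sketch in
plain Java. It does not use PDFBox: the class ArticleDivisions and its methods
are hypothetical names chosen for illustration, and the rule that text inside
bead i lands in slot 2 * i + 1 is a simplification inferred from the comment,
not a copy of the PDFBox source.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: models how many "divisions"
// a page with a given number of beads needs, and which one holds a bead's text.
class ArticleDivisions
{
    // One slot per bead, plus one slot before, between and after the beads.
    static int divisionCount(int beadCount)
    {
        return beadCount * 2 + 1;
    }

    // Slot that receives text lying inside bead number beadIndex (0-based).
    // Assumed rule for this sketch: odd slots belong to beads.
    static int divisionForBead(int beadIndex)
    {
        return beadIndex * 2 + 1;
    }

    // Human-readable labels for the slots, in extraction order.
    static List<String> labels(int beadCount)
    {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < divisionCount(beadCount); i++)
        {
            if (i % 2 == 1)
            {
                result.add("text of article " + (i / 2 + 1));
            }
            else
            {
                result.add("text outside articles, slot " + i);
            }
        }
        return result;
    }

    public static void main(String[] args)
    {
        // Two beads, as in the two-column newspaper example: five divisions.
        System.out.println(divisionCount(2));
        System.out.println(labels(2));
    }
}
```

With no beads at all, divisionCount(0) is 1, which matches the last line of
the comment.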
Now comes PDFTextStripperByArea. That one sets charactersByArticle to a list
built from the regions that were submitted with addRegion(). However,
PDFTextStripper.processPage() still considers the beads, and for each article
(i.e. text) it finds (now based on regions instead of on beads), it makes an
assignment as explained in the comment. But because there's only one element in
charactersByArticle (one region to strip in Andrew's code) and not 5 (the
attached PDF has two beads on the first page), the
ArrayIndexOutOfBoundsException can occur.
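The mismatch can be reproduced without PDFBox. The following is a minimal
sketch, assuming the simplified rule that text inside bead i is stored at index
2 * i + 1. BeadIndexMismatch and its method names are hypothetical and only
mimic the shape of the problem: a list sized for regions, but indexed as if it
were sized for beads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Vector;

// Hypothetical model of the mismatch, not PDFBox code.
class BeadIndexMismatch
{
    // Builds the per-division character lists; the caller decides how many.
    static Vector<List<String>> newDivisions(int count)
    {
        Vector<List<String>> divisions = new Vector<>();
        for (int i = 0; i < count; i++)
        {
            divisions.add(new ArrayList<>());
        }
        return divisions;
    }

    // Stores a glyph in the division that belongs to the given bead
    // (assumed rule: bead i maps to index 2 * i + 1).
    static void addGlyph(Vector<List<String>> divisions, int beadIndex, String glyph)
    {
        divisions.get(beadIndex * 2 + 1).add(glyph);
    }

    public static void main(String[] args)
    {
        // Sized for two beads (2 * 2 + 1 = 5): index 3 exists, so this works.
        Vector<List<String>> byBead = newDivisions(5);
        addGlyph(byBead, 1, "a");

        // Sized for one strip region, but still indexed by bead: index 3 fails.
        Vector<List<String>> byRegion = newDivisions(1);
        try
        {
            addGlyph(byRegion, 1, "a");
        }
        catch (ArrayIndexOutOfBoundsException e)
        {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

The second bead maps to index 3 in this model, which lines up with the
"index out of range: 3" in the reported stack trace.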
Combining PDF "threads" with strip regions makes no sense; it would bring all
sorts of new problems. Doing an area text extract means one already knows where
the "interesting" text is.
So what I'm doing is:
- in the constructor, call super.setShouldSeparateByBeads(false)
- PDFTextStripperByArea.setShouldSeparateByBeads() will be ignored
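
The two points above can be sketched as follows. This is a toy model in plain
Java, not the actual PDFBox patch: BaseStripper and AreaStripper are
hypothetical stand-ins for PDFTextStripper and PDFTextStripperByArea, and only
the flag handling is modelled.

```java
// Hypothetical sketch of the flag handling; none of these classes are PDFBox classes.
class BeadFlagSketch
{
    // Stand-in for PDFTextStripper: bead separation is assumed on by default.
    static class BaseStripper
    {
        private boolean shouldSeparateByBeads = true;

        public void setShouldSeparateByBeads(boolean value)
        {
            shouldSeparateByBeads = value;
        }

        public boolean getSeparateByBeads()
        {
            return shouldSeparateByBeads;
        }
    }

    // Stand-in for PDFTextStripperByArea.
    static class AreaStripper extends BaseStripper
    {
        AreaStripper()
        {
            // Regions replace beads, so switch bead handling off once, up front.
            super.setShouldSeparateByBeads(false);
        }

        @Override
        public void setShouldSeparateByBeads(boolean value)
        {
            // Deliberately ignored: beads make no sense for area extraction.
        }
    }

    public static void main(String[] args)
    {
        AreaStripper stripper = new AreaStripper();
        stripper.setShouldSeparateByBeads(true); // has no effect
        System.out.println(stripper.getSeparateByBeads()); // prints "false"
    }
}
```

Calling the setter through super in the constructor matters here: the
overridden setter in the subclass is a no-op, so it could not be used to switch
the flag off.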
> ArrayIndexOutOfBoundsException in PDFTextStripper.processTextPosition()
> -----------------------------------------------------------------------
>
> Key: PDFBOX-2775
> URL: https://issues.apache.org/jira/browse/PDFBOX-2775
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.0
> Reporter: Tilman Hausherr
> Attachments: jaf-1-150219.pdf
>
>
> Reported by Andrew M. in the user mailing list:
> {code}
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 3
> at java.util.Vector.get(Vector.java:744)
> at org.apache.pdfbox.text.PDFTextStripper.processTextPosition(PDFTextStripper.java:903)
> at org.apache.pdfbox.text.PDFTextStripperByArea.processTextPosition(PDFTextStripperByArea.java:132)
> at org.apache.pdfbox.text.PDFTextStreamEngine.showGlyph(PDFTextStreamEngine.java:229)
> at org.apache.pdfbox.contentstream.PDFStreamEngine.showText(PDFStreamEngine.java:717)
> at org.apache.pdfbox.contentstream.PDFStreamEngine.showTextStrings(PDFStreamEngine.java:627)
> at org.apache.pdfbox.contentstream.operator.text.ShowTextAdjusted.process(ShowTextAdjusted.java:38)
> at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:829)
> at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:490)
> at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:456)
> at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:167)
> at org.apache.pdfbox.text.PDFTextStreamEngine.processPage(PDFTextStreamEngine.java:117)
> at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:347)
> at org.apache.pdfbox.text.PDFTextStripperByArea.extractRegions(PDFTextStripperByArea.java:113)
> at testpdfbox20.ExtractTextError.textFromBox(ExtractTextError.java:25)
> at testpdfbox20.ExtractTextError.main(ExtractTextError.java:45)
> {code}
> {code}
> public class ExtractTextError
> {
>     static String textFromBox(PDDocument doc, int x, int y, int w, int h, int page)
>             throws IOException
>     {
>         PDFTextStripperByArea stripper = new PDFTextStripperByArea();
>         Rectangle rect = new Rectangle(x, y - h, w, h);
>         stripper.addRegion("region", rect);
>         int pageCount = doc.getDocumentCatalog().getPages().getCount();
>         System.out.println("getting text from page #" + page + " of " + pageCount + " in doc.");
>         if (page <= pageCount)
>         {
>             PDPage pp = doc.getDocumentCatalog().getPages().get(page - 1);
>             stripper.extractRegions(pp);
>             String text = stripper.getTextForRegion("region");
>             System.out.println("text=" + text);
>             return text;
>         }
>         else
>         {
>             return "No page #" + page;
>         }
>     }
>
>     public static void main(String[] args) throws IOException
>     {
>         PDDocument doc = PDDocument.load(new File("jaf-1-150219.pdf"));
>         textFromBox(doc, 33, 159, 216, 43, 1);
>     }
> }
> {code}