[ 
https://issues.apache.org/jira/browse/PDFBOX-2775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14514755#comment-14514755
 ] 

Tilman Hausherr commented on PDFBOX-2775:
-----------------------------------------

A bead is a rectangle with text, a thread is a sequence of such rectangles, and 
a page can have several threads, much as a printed newspaper page carries 
several articles spread over several columns.

In normal text stripping, the size of charactersByArticle is set to the number 
of beads on the page * 2 + 1. There is a good comment that explains why:

{code}
     * The charactersByArticle is used to extract text by article divisions.  For example
     * a PDF that has two columns like a newspaper, we want to extract the first column and
     * then the second column.  In this example the PDF would have 2 beads(or articles), one for
     * each column.  The size of the charactersByArticle would be 5, because not all text on the
     * screen will fall into one of the articles.  The five divisions are shown below
     *
     * Text before first article
     * first article text
     * text between first article and second article
     * second article text
     * text after second article
     *
     * Most PDFs won't have any beads, so charactersByArticle will contain a single entry.
{code}
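To make the sizing rule concrete, here is a minimal sketch. ArticleDivisions is a hypothetical helper written for this comment, not a PDFBox class; it only models "number of beads * 2 + 1" and the order of the divisions listed above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: models the division layout only.
final class ArticleDivisions
{
    // One division before the first bead, then two per bead:
    // the bead's own text and the text that follows it.
    static int count(int beadCount)
    {
        return beadCount * 2 + 1;
    }

    // Human-readable label for each division, in list order.
    static List<String> labels(int beadCount)
    {
        List<String> labels = new ArrayList<>();
        if (beadCount == 0)
        {
            labels.add("all text");
            return labels;
        }
        labels.add("text before bead 1");
        for (int i = 1; i <= beadCount; i++)
        {
            labels.add("bead " + i + " text");
            labels.add(i < beadCount
                    ? "text between bead " + i + " and bead " + (i + 1)
                    : "text after bead " + i);
        }
        return labels;
    }

    public static void main(String[] args)
    {
        System.out.println("divisions for 2 beads: " + count(2));
        labels(2).forEach(System.out::println);
    }
}
```

For a page with two beads this gives five divisions, and a page without beads gets a single one, as in the comment.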
Now comes PDFTextStripperByArea. That class sets charactersByArticle to a list 
of regions that were submitted with addRegion(). However, 
PDFTextStripper.processPage() still considers the beads, and for each article 
(i.e. text) it finds (now based on regions instead of on beads), it makes an 
assignment as explained in the comment. But because there is only one element 
in charactersByArticle (one region to strip in Andrew's code) and not 5 (the 
attached PDF has two beads on the first page), an 
ArrayIndexOutOfBoundsException can occur.
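
The mismatch can be reproduced without PDFBox. The sketch below is a simplified model, not the actual PDFTextStripper code: it assumes that text inside bead i (counting from 0) is assigned to division i * 2 + 1, which matches the layout in the comment and the index 3 in the reported stack trace. DivisionMismatch and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of the failure, not the PDFBox implementation.
final class DivisionMismatch
{
    // Assumption: text inside bead i (0-based) goes to division i * 2 + 1.
    static int divisionForBead(int beadIndex)
    {
        return beadIndex * 2 + 1;
    }

    // Builds a list with listSize entries and tries to store a glyph in the
    // division computed from the bead index. Returns true if the index is
    // outside the list.
    static boolean overflows(int listSize, int beadIndex)
    {
        List<List<String>> charactersByArticle = new ArrayList<>();
        for (int i = 0; i < listSize; i++)
        {
            charactersByArticle.add(new ArrayList<>());
        }
        try
        {
            charactersByArticle.get(divisionForBead(beadIndex)).add("glyph");
            return false;
        }
        catch (IndexOutOfBoundsException e)
        {
            return true;
        }
    }

    public static void main(String[] args)
    {
        // One region, text falls into the second bead: index 3, list size 1.
        System.out.println("one region:     overflow = " + overflows(1, 1));
        // List sized for two beads (5 divisions): index 3 is valid.
        System.out.println("five divisions: overflow = " + overflows(5, 1));
    }
}
```

With a list sized for one region the lookup fails as soon as a glyph lands in a bead; with the list sized for two beads the same lookup succeeds.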

Combining PDF "threads" with strip regions makes no sense; it would bring all 
sorts of new problems. Doing an area text extraction means one already knows 
where the "interesting" text is.

So what I'm doing is:
- in the PDFTextStripperByArea constructor, call 
super.setShouldSeparateByBeads(false)
- make PDFTextStripperByArea.setShouldSeparateByBeads() a no-op, so later calls 
are ignored

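The shape of that change, sketched with stand-in classes so it compiles without PDFBox. BeadAwareStripper and AreaStripper are hypothetical stand-ins for PDFTextStripper and PDFTextStripperByArea; the default value of true and the accessor name separatesByBeads() are assumptions made for the illustration, and this is not the committed patch.

```java
// Demo entry point for the sketch below.
final class BeadFlagDemo
{
    public static void main(String[] args)
    {
        AreaStripper stripper = new AreaStripper();
        stripper.setShouldSeparateByBeads(true); // ignored by the subclass
        System.out.println("separates by beads: " + stripper.separatesByBeads());
    }
}

// Stand-in for PDFTextStripper: only the bead flag is modelled.
class BeadAwareStripper
{
    // Assumption for this sketch: bead separation is on by default.
    private boolean shouldSeparateByBeads = true;

    public void setShouldSeparateByBeads(boolean value)
    {
        shouldSeparateByBeads = value;
    }

    public boolean separatesByBeads()
    {
        return shouldSeparateByBeads;
    }
}

// Stand-in for PDFTextStripperByArea.
class AreaStripper extends BeadAwareStripper
{
    AreaStripper()
    {
        // Switch bead handling off once, through the superclass setter.
        super.setShouldSeparateByBeads(false);
    }

    @Override
    public void setShouldSeparateByBeads(boolean value)
    {
        // Deliberately empty: regions and beads are not combined.
    }
}
```

Calling the superclass setter explicitly in the constructor matters, because the overridden setter in the subclass does nothing.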

> ArrayIndexOutOfBoundsException in PDFTextStripper.processTextPosition()
> -----------------------------------------------------------------------
>
>                 Key: PDFBOX-2775
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2775
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Tilman Hausherr
>         Attachments: jaf-1-150219.pdf
>
>
> Reported by Andrew M. in the user mailing list:
> {code}
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 3
>       at java.util.Vector.get(Vector.java:744)
>       at org.apache.pdfbox.text.PDFTextStripper.processTextPosition(PDFTextStripper.java:903)
>       at org.apache.pdfbox.text.PDFTextStripperByArea.processTextPosition(PDFTextStripperByArea.java:132)
>       at org.apache.pdfbox.text.PDFTextStreamEngine.showGlyph(PDFTextStreamEngine.java:229)
>       at org.apache.pdfbox.contentstream.PDFStreamEngine.showText(PDFStreamEngine.java:717)
>       at org.apache.pdfbox.contentstream.PDFStreamEngine.showTextStrings(PDFStreamEngine.java:627)
>       at org.apache.pdfbox.contentstream.operator.text.ShowTextAdjusted.process(ShowTextAdjusted.java:38)
>       at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:829)
>       at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:490)
>       at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:456)
>       at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:167)
>       at org.apache.pdfbox.text.PDFTextStreamEngine.processPage(PDFTextStreamEngine.java:117)
>       at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:347)
>       at org.apache.pdfbox.text.PDFTextStripperByArea.extractRegions(PDFTextStripperByArea.java:113)
>       at testpdfbox20.ExtractTextError.textFromBox(ExtractTextError.java:25)
>       at testpdfbox20.ExtractTextError.main(ExtractTextError.java:45)
> {code}
> {code}
> import java.awt.Rectangle;
> import java.io.File;
> import java.io.IOException;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.pdmodel.PDPage;
> import org.apache.pdfbox.text.PDFTextStripperByArea;
>
> public class ExtractTextError
> {
>     static String textFromBox(PDDocument doc, int x, int y, int w, int h, int page)
>             throws IOException
>     {
>         PDFTextStripperByArea stripper = new PDFTextStripperByArea();
>         // A single named region to extract text from
>         Rectangle rect = new Rectangle(x, y - h, w, h);
>         stripper.addRegion("region", rect);
>         int pageCount = doc.getDocumentCatalog().getPages().getCount();
>         System.out.println("getting text from page #" + page + " of " + pageCount + " in doc.");
>         if (page <= pageCount)
>         {
>             // "page" is 1-based, the page tree is 0-based
>             PDPage pp = doc.getDocumentCatalog().getPages().get(page - 1);
>             stripper.extractRegions(pp);
>             String text = stripper.getTextForRegion("region");
>             System.out.println("text=" + text);
>             return text;
>         }
>         else
>         {
>             return "No page #" + page;
>         }
>     }
>
>     public static void main(String[] args) throws IOException
>     {
>         // try-with-resources closes the document even if extraction fails
>         try (PDDocument doc = PDDocument.load(new File("jaf-1-150219.pdf")))
>         {
>             textFromBox(doc, 33, 159, 216, 43, 1);
>         }
>     }
> }
> {code}


