[ 
https://issues.apache.org/jira/browse/PDFBOX-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179965#comment-13179965
 ] 

Ilija Pavlic commented on PDFBOX-1201:
--------------------------------------

It seems the missed text is part of a larger text box that starts and 
ends outside the capture region, even though the text itself is located 
inside the capture region. 
                
> PDFTextStripperByArea y coordinate shifted "up"
> -----------------------------------------------
>
>                 Key: PDFBOX-1201
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1201
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.6.0
>            Reporter: Ilija Pavlic
>            Priority: Minor
>
> The text stripper region seems to be shifted up from the given coordinates, 
> causing lines at the bottom of the defined region to be missed and lines 
> above the defined region to be included.
> ...
> PDPage page = (PDPage) allPages.get(0);
> PDFTextStripperByArea stripper = new PDFTextStripperByArea();
> Rectangle2D.Float region = new Rectangle2D.Float(x, y, width, height);
> stripper.addRegion("test region", region);
> // Overlay the region with a cyan rectangle to check that the coordinates
> // and dimensions are right.
> PDPageContentStream contentStream = new PDPageContentStream(document, page,
>         true, true);
> contentStream.setNonStrokingColor( Color.CYAN );
> contentStream.fillRect(x, y, width, height);
> contentStream.close();
> stripper.extractRegions(page);
> String content = stripper.getTextForRegion("test region");
> ...
> document.save(...);
> ...
> The cyan rectangle overlays the desired region exactly when viewing the saved 
> output document. On the other hand, the stripper misses a couple of lines at 
> the bottom of the rectangle and includes a couple of lines above the rectangle.
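
One possible explanation for the mismatch described above is a difference in
coordinate conventions: content drawn with fillRect uses PDF user space
(origin at the bottom-left, y growing upward), while the stripper region may
be interpreted with the origin at the top-left and y growing downward. This
is an assumption, not something the issue confirms. The sketch below only
shows the arithmetic for converting a rectangle between the two conventions.
The class RegionFlip, the method toTopLeftOrigin and the parameter pageHeight
are hypothetical names, and the sketch assumes an unrotated page whose media
box starts at (0, 0).

```java
import java.awt.geom.Rectangle2D;

/** Hypothetical helper; not part of PDFBox. */
final class RegionFlip {

    private RegionFlip() {
    }

    /**
     * Converts a rectangle given in bottom-left-origin coordinates (y is the
     * lower edge, as passed to fillRect) into top-left-origin coordinates
     * (y is the upper edge, measured downward from the top of the page).
     *
     * Assumption: pageHeight is the height of the page's media box and the
     * page is not rotated.
     */
    static Rectangle2D.Float toTopLeftOrigin(float x, float y, float width,
                                             float height, float pageHeight) {
        // The upper edge sits at y + height in bottom-left coordinates;
        // measured from the top of the page that is pageHeight - (y + height).
        float flippedY = pageHeight - y - height;
        return new Rectangle2D.Float(x, flippedY, width, height);
    }
}
```

Under that assumption, the region passed to addRegion would be built with
RegionFlip.toTopLeftOrigin(x, y, width, height, pageHeight), while fillRect
keeps the original x, y, width and height. Only y changes; x, width and
height are the same in both conventions.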

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
