[ 
https://issues.apache.org/jira/browse/PDFBOX-4764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17034326#comment-17034326
 ] 

Michael Klink commented on PDFBOX-4764:
---------------------------------------

Most likely, in the first PDFs the text in each line is not at exactly the same 
height: it _looks_ as if it were, but it is actually off by a small amount. Text 
extractors usually extract such pieces as separate text lines.

In the last PDF, though, the entries of each line actually appear to be at the 
same height. Thus, text extractors usually extract them as a single text line.

(In my experience the ratio is usually different, though: most tables I find 
have all their entries at the same height; only a few don't.)
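
To illustrate the effect described above, here is a minimal, self-contained 
sketch of how grouping text pieces by height behaves. It is not PDFBox code 
and does not claim to reproduce what PDFTextStripper does internally: the 
Fragment class, the groupIntoLines method, the tolerance parameter and the 
sample coordinates are all hypothetical and chosen only for this example. It 
also assumes that y grows downwards on the page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class LineGroupingSketch {

    /** Hypothetical stand-in for a positioned piece of text; not a PDFBox type. */
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Groups fragments into text lines. A fragment joins the current line only
     * if its y differs from the y of the line's first fragment by at most
     * 'tolerance'; otherwise it starts a new line.
     */
    static List<String> groupIntoLines(List<Fragment> fragments, double tolerance) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        // Top to bottom, assuming y grows downwards.
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y));

        List<List<Fragment>> lines = new ArrayList<>();
        double lineY = 0;
        for (Fragment f : sorted) {
            if (lines.isEmpty() || Math.abs(f.y - lineY) > tolerance) {
                lines.add(new ArrayList<>());
                lineY = f.y; // the first fragment defines the height of the line
            }
            lines.get(lines.size() - 1).add(f);
        }

        List<String> result = new ArrayList<>();
        for (List<Fragment> line : lines) {
            // Left to right within a line.
            line.sort(Comparator.comparingDouble((Fragment f) -> f.x));
            StringBuilder sb = new StringBuilder();
            for (Fragment f : line) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(f.text);
            }
            result.add(sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // Two table rows whose second column sits 0.4 units lower than the first.
        List<Fragment> table = Arrays.asList(
                new Fragment("Name", 50, 100.0),
                new Fragment("Amount", 200, 100.4),
                new Fragment("Alice", 50, 120.0),
                new Fragment("42", 200, 120.4));

        // Exact height matching: every piece ends up on its own line.
        System.out.println(groupIntoLines(table, 0.0));
        // A small tolerance: the pieces of each row share one line.
        System.out.println(groupIntoLines(table, 1.0));
    }
}
```

With a tolerance of 0.0 the sample rows fall apart into four separate lines, 
while a tolerance of 1.0 keeps the two entries of each row together. Entries 
that really are at the same height end up on one line in either case.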

> When a PDF has a table with blank entries in the column the stripper just 
> ignores the column and moves to next field in the column
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-4764
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4764
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.8
>            Reporter: karthik guns
>            Priority: Major
>
> When a PDF has tables with columns with empty values, the stripper ignores the 
> field and moves to the next column which has records (if a field is blank, it 
> should still be captured)
>  
> PDFTextStripperByArea stripper = new PDFTextStripperByArea();
>  stripper.setSortByPosition(true);
> PDFTextStripper tStripper = new PDFTextStripper();
> String pdfFileInText = tStripper.getText(document);



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

