[ 
https://issues.apache.org/jira/browse/PDFBOX-4764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17034050#comment-17034050
 ] 

karthik guns commented on PDFBOX-4764:
--------------------------------------

I tested with nearly 10 different PDFs. For a table like the one below, the 
PDF stripper returns each cell on its own line.

Order#   PO Number           
11111     TL12

Extracted output

Line1:

Order#

Line2:

PO Number

Line3:

11111

Line4:

TL12

 

But in one particular PDF, the stripper output for the same table structure 
looks like this:

Line1:

Order#   PO Number

Line2:

11111     TL12

In this case, even if we split on a single space, the header "PO Number" is 
also split in two, although it is one column. Any thoughts on this?
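
One possible workaround, sketched here under the assumption that the extracted 
line keeps the run of several spaces between columns as in the output above: 
split on two or more whitespace characters instead of a single space, so that 
"PO Number" stays in one piece. The class and method names are hypothetical, 
and this is plain Java string handling, not a PDFBox API.

```java
import java.util.Arrays;
import java.util.List;

public class ColumnSplitter {

    // Splits one extracted line into columns, treating a run of two or more
    // whitespace characters as the column separator. A single space inside a
    // cell (as in "PO Number") is kept.
    // Assumption: the extracted text preserves the wider gap between columns.
    static List<String> splitColumns(String line) {
        return Arrays.asList(line.trim().split("\\s{2,}"));
    }

    public static void main(String[] args) {
        System.out.println(splitColumns("Order#   PO Number")); // prints [Order#, PO Number]
        System.out.println(splitColumns("11111     TL12"));     // prints [11111, TL12]
    }
}
```

This would not help with blank cells, since an empty column leaves nothing to 
split on; it only addresses the header being broken apart.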

 

 

> When a PDF has a table with blank entries in a column, the stripper just 
> ignores the column and moves to the next field in the column
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-4764
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4764
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.8
>            Reporter: karthik guns
>            Priority: Major
>
> When a PDF has tables whose columns contain empty values, the stripper ignores 
> the empty field and moves to the next column that has a value (a blank field 
> should still be captured).
>  
> PDFTextStripperByArea stripper = new PDFTextStripperByArea();
> stripper.setSortByPosition(true);
> PDFTextStripper tStripper = new PDFTextStripper();
> String pdfFileInText = tStripper.getText(document);



