[ https://issues.apache.org/jira/browse/PDFBOX-4764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17034050#comment-17034050 ]
karthik guns commented on PDFBOX-4764:
--------------------------------------
I tested with nearly 10 different PDFs. The PDF stripper takes the values
from a table that looks like this:
Order# PO Number
11111 TL12
Extracted output
Line1:
Order#
Line2:
PO Number
Line3:
11111
Line4:
TL12
But in one particular PDF, the stripped text comes out as below for the
same table structure:
Line1:
Order# PO Number
Line2:
11111 TL12
In this case, even if we split each line on spaces, the string "PO Number"
is split in two although it is a single column. Any thoughts on this?
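To illustrate the splitting problem described above, here is a minimal sketch in plain Java (no PDFBox calls) that keeps multi-word labels such as "PO Number" together. It assumes the column labels are known in advance; the class name `HeaderSplitter` and the method `splitRow` are hypothetical names for this example, not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PDFBox: splits an extracted line into
// cells while keeping known multi-word labels in one piece.
final class HeaderSplitter {

    // knownLabels is an assumption: the caller must already know the column titles.
    static List<String> splitRow(String line, List<String> knownLabels) {
        List<String> cells = new ArrayList<>();
        String rest = line.trim();
        while (!rest.isEmpty()) {
            String matched = null;
            // Prefer the longest known label that the remaining text starts with.
            for (String label : knownLabels) {
                if (!label.isEmpty() && rest.startsWith(label)
                        && (matched == null || label.length() > matched.length())) {
                    matched = label;
                }
            }
            if (matched == null) {
                // No known label here: fall back to the next space-delimited token.
                int space = rest.indexOf(' ');
                matched = space < 0 ? rest : rest.substring(0, space);
            }
            cells.add(matched);
            rest = rest.substring(matched.length()).trim();
        }
        return cells;
    }

    public static void main(String[] args) {
        List<String> labels = Arrays.asList("Order#", "PO Number");
        System.out.println(splitRow("Order# PO Number", labels)); // prints [Order#, PO Number]
        System.out.println(splitRow("11111 TL12", labels));       // prints [11111, TL12]
    }
}
```

This only helps when the labels are known up front. It cannot recover an empty cell from the text alone, because a blank column leaves no token in the extracted line.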
> When a PDF has a table with blank entries in a column, the stripper just
> ignores the blank entry and moves to the next field in the column
> --------------------------------------------------------------------------------------------------------------------------------
>
> Key: PDFBOX-4764
> URL: https://issues.apache.org/jira/browse/PDFBOX-4764
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.8
> Reporter: karthik guns
> Priority: Major
>
> When a PDF has tables whose columns contain empty values, the stripper ignores the
> empty field and moves to the next column that has a value (if a field is blank, it
> should still be captured).
>
> PDFTextStripperByArea stripper = new PDFTextStripperByArea();
> stripper.setSortByPosition(true);
> PDFTextStripper tStripper = new PDFTextStripper();
> String pdfFileInText = tStripper.getText(document);