[jira] [Commented] (PDFBOX-4764) When a PDF has table with blank entries in the column the stripper just ignores the column and moves to next field in the coulmn

Michael Klink (Jira) Fri, 07 Feb 2020 04:32:29 -0800


    [ 
https://issues.apache.org/jira/browse/PDFBOX-4764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032332#comment-17032332
 ]


Michael Klink commented on PDFBOX-4764:
---------------------------------------

{quote}[~madhube2...@gmail.com]> If the try to split with spaces i am not able 
to retrieve the spaces for description.{quote}

If you use a text representation that tries to reflect the original layout, 
splitting at spaces obviously is not the way to go anymore.

Neither PDF nor plain text necessarily has annotated tables, so you need to 
derive the table structure by heuristics.

E.g. you can analyze all extracted rows: character positions at which every 
row has a space can be assumed to be column limits. Multiple consecutive such 
positions have to be taken as a single column limit.
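
A minimal sketch of that heuristic in plain Java follows. It assumes the rows 
have already been extracted as layout-preserving text in which one character 
corresponds to one fixed-width position. The class {{ColumnSplitter}} and its 
methods are hypothetical helpers for illustration, not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not a PDFBox class): infers column limits from layout-preserving rows. */
final class ColumnSplitter {

    /** Returns [start, end) ranges of positions that are blank in every row. */
    static List<int[]> findColumnGaps(List<String> rows) {
        int width = 0;
        for (String row : rows) {
            width = Math.max(width, row.length());
        }
        List<int[]> gaps = new ArrayList<>();
        int gapStart = -1;
        for (int pos = 0; pos < width; pos++) {
            boolean blankEverywhere = true;
            for (String row : rows) {
                // Positions past the end of a shorter row count as blank.
                if (pos < row.length() && row.charAt(pos) != ' ') {
                    blankEverywhere = false;
                    break;
                }
            }
            if (blankEverywhere) {
                if (gapStart < 0) {
                    gapStart = pos; // a run of blank positions begins
                }
            } else if (gapStart >= 0) {
                // Consecutive blank positions form a single column limit.
                // A run starting at 0 is common indentation, not a limit.
                if (gapStart > 0) {
                    gaps.add(new int[] {gapStart, pos});
                }
                gapStart = -1;
            }
        }
        // A blank run reaching the right edge is trailing padding and is ignored.
        return gaps;
    }

    /** Cuts one row at the given gaps; blank cells come back as empty strings. */
    static List<String> splitRow(String row, List<int[]> gaps) {
        List<String> cells = new ArrayList<>();
        int cellStart = 0;
        for (int[] gap : gaps) {
            cells.add(slice(row, cellStart, gap[0]).trim());
            cellStart = gap[1];
        }
        cells.add(slice(row, cellStart, row.length()).trim());
        return cells;
    }

    /** Substring that tolerates rows shorter than the requested range. */
    private static String slice(String row, int from, int to) {
        int end = Math.min(to, row.length());
        return from >= end ? "" : row.substring(from, end);
    }
}
```

Because the limits are derived from all rows together, a row with a blank 
entry still yields an empty cell for that column, and spaces inside a 
multi-word description are kept as long as some other row has text at those 
positions. The heuristic fails if one row spans several columns, so header or 
footer lines should probably be filtered out before the analysis.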

> When a PDF has table with blank entries in the column the stripper just 
> ignores the column and moves to next field in the coulmn
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-4764
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4764
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.8
>            Reporter: karthik guns
>            Priority: Major
>
> When a PDF has tables with columns with empty values,the stripper ignores the 
> field and moves to next column which has records(if its blank it should 
> capture)
>  
> PDFTextStripperByArea stripper = new PDFTextStripperByArea();
>  stripper.setSortByPosition(true);
> PDFTextStripper tStripper = new PDFTextStripper();
> String pdfFileInText = tStripper.getText(document);



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
