[ https://issues.apache.org/jira/browse/PDFBOX-4764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032332#comment-17032332 ]
Michael Klink commented on PDFBOX-4764:
---------------------------------------

{quote}[~madhube2...@gmail.com]> If I try to split on spaces, I am not able to retrieve the spaces for the description.{quote}

If you use a text representation that tries to reflect the original layout, splitting at spaces is no longer the way to go.

Neither PDF nor plain text necessarily carries table markup, so you need to derive the table structure by heuristics. For example, you can analyze all extracted rows: character positions at which every row has a space can be assumed to separate columns. Multiple consecutive such positions have to be treated as a single column boundary.

> When a PDF has table with blank entries in the column the stripper just ignores the column and moves to next field in the column
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-4764
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4764
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.8
>            Reporter: karthik guns
>            Priority: Major
>
> When a PDF has tables whose columns contain empty values, the stripper ignores the empty field and moves on to the next column that has a value (a blank field should be captured as well).
>
> PDFTextStripperByArea stripper = new PDFTextStripperByArea();
> stripper.setSortByPosition(true);
> PDFTextStripper tStripper = new PDFTextStripper();
> String pdfFileInText = tStripper.getText(document);
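
The column heuristic described in the comment can be sketched roughly as follows. This is only an illustration, not PDFBox code: it assumes the rows have already been extracted as fixed-width text lines that reflect the original layout (how such text is obtained is outside this sketch), and the names `ColumnSplitter`, `findColumns` and `splitRow` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class ColumnSplitter {

    /** Returns the [start, end) character ranges of the columns detected in the rows. */
    public static List<int[]> findColumns(List<String> rows) {
        int width = 0;
        for (String row : rows) {
            width = Math.max(width, row.length());
        }
        // blank[i] is true if every row has a space at position i
        // (a row that has already ended counts as a space).
        boolean[] blank = new boolean[width];
        for (int i = 0; i < width; i++) {
            blank[i] = true;
            for (String row : rows) {
                if (i < row.length() && row.charAt(i) != ' ') {
                    blank[i] = false;
                    break;
                }
            }
        }
        // Each maximal run of non-blank positions is one column, so several
        // consecutive blank positions act as a single column boundary.
        List<int[]> columns = new ArrayList<>();
        int start = -1;
        for (int i = 0; i <= width; i++) {
            boolean isBlank = (i == width) || blank[i];
            if (!isBlank && start < 0) {
                start = i;
            } else if (isBlank && start >= 0) {
                columns.add(new int[] {start, i});
                start = -1;
            }
        }
        return columns;
    }

    /** Cuts one row at the detected column ranges; empty cells come back as "". */
    public static List<String> splitRow(String row, List<int[]> columns) {
        List<String> cells = new ArrayList<>();
        for (int[] col : columns) {
            int from = Math.min(col[0], row.length());
            int to = Math.min(col[1], row.length());
            cells.add(row.substring(from, to).trim());
        }
        return cells;
    }

    public static void main(String[] args) {
        // Made-up sample rows; the second row has an empty middle cell.
        List<String> rows = List.of(
                "1001  Widget   4.50",
                "1002           7.25",
                "1003  Gadget   1.00");
        List<int[]> columns = findColumns(rows);
        for (String row : rows) {
            System.out.println(splitRow(row, columns));
        }
    }
}
```

Because the boundaries are derived from all rows together, a row with a blank cell still yields an empty string in the right place. The heuristic can under-split when two columns are never separated by a position that is blank in every row, and it can over-split a column whose values all happen to have a space at the same position, so real data may need extra rules.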