[ https://issues.apache.org/jira/browse/PDFBOX-83?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712589#action_12712589 ]
George Van Treeck commented on PDFBOX-83:
-----------------------------------------

I just tried the latest version and ran into the issue described here
(jumbled text in a PDF table). I think the following algorithm might
fix the problem. First, sort all text items into sets having the same
y coordinate, i.e., assume all vertically adjacent text items with the
same y coordinate are part of one table cell. For each set, select a
text item and locate a horizontally adjacent text item. If the adjacent
text item is part of another set of text items all sharing a y
coordinate, then the adjacent item is part of a different table cell,
which means you should concatenate all the text items in the first set
and then concatenate all the text items in the adjacent set.

> Processing horizontally first then horizontally
> -----------------------------------------------
>
>                 Key: PDFBOX-83
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-83
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Text extraction
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552835&aid=1269072
> Originally submitted by tanvinguyen on 2005-08-24 13:11.
> I would like to see the implementation of coalescing
> where all words will be appended horizontally first then
> vertically. If this feature is implemented properly, all the
> fields of a table will be extracted and printed correctly
> as in the original PDF document.
> Sample: Page 2 of PDFBox References. All Content of
> column Project Name will be extracted before Column
> License.
> ===========
> Centric CRM
> (http://www.centriccrm.com)
> Free To Use But
> Restricted/Commercial
> The Most Advanced Open
> Source CRM Software.
> =============
> Thanks,
> -tan
> [attachment on SourceForge]
> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552835&aid=1269072&file_id=146953
> HtmlOutputDev.h (text/plain), 8329 bytes
> This is the header file from PDFtoHTML
> [comment on SourceForge]
> Originally sent by tanvinguyen.
> Logged In: YES
> user_id=683822
> I uploaded an RTF file converted from a PDF file using my
> application developed in C++.
> [comment on SourceForge]
> Originally sent by tanvinguyen.
> Logged In: YES
> user_id=683822
> Ben,
> Thanks for the quick response. Generally speaking, I highly
> appreciate your effort in developing such a wonderful
> open-source package.
> I am interested in developing a PDF to RTF converter. Its
> main features include keeping all text attributes such as
> strikethrough, underlining, font attributes, and spacing. In the
> past, I successfully developed an application in C++ using
> the XPDF package and added code to do what I want.
> Now I would like to implement these features using PDFBox
> to deploy the application in a J2EE environment.
> Here's the basic algorithm they use in XPDF. First, they
> build a linked list of string nodes. These string nodes contain
> the x-y coordinates of text strings, like your TextPosition
> instance. However, their string nodes also contain all
> information about their coordinates, including lower-left X,Y
> and upper-right X,Y. They call these yMin, yMax and xMin, xMax.
> They store all these string nodes sorted primarily by y, then by x.
> Then they coalesce and merge all string nodes with the
> same Y coordinate first. With that, I was able to extract and
> convert to RTF while maintaining the same content and format
> as the PDF file.
> I am trying to figure out how to add extra information to your
> TextPosition class, so that later on I will be able to traverse
> along the major y-axis and build a list of these string nodes.
> If you can provide me with the information needed to obtain all
> the coordinates or the position of a text string, I
> think I will be able to implement these features. I will
> contribute this code to your project.
> I uploaded a header file from XPDF, a sample PDF file which I
> tried to convert, and an RTF file.
> I am not trying to convert a "TABLE" from the PDF file. I
> understand that concept does not exist in PDF.
>
> Thanks,
>
> Tan V. Nguyen
> [comment on SourceForge]
> Originally sent by benlitchfield.
> Logged In: YES
> user_id=601708
> Text in a PDF document is drawn at x/y locations, which
> means there is no relationship to text drawn in a column. If
> you can propose an algorithm to determine columns of text
> then I will implement it. As a side note, there is no such
> thing as a 'table' in a PDF document, only lines drawn between
> two points and text drawn at x/y locations. The only way
> a 'column' of text could be determined is by analyzing lines on the
> PDF document, which is not an easy thing to do.
> Ben Litchfield

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
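
For reference, the "same y coordinate first, then left to right" coalescing discussed in this thread could be sketched roughly as below. This is only an illustration under stated assumptions, not PDFBox code and not the XPDF implementation: `RowCoalescer`, `TextItem`, `coalesce` and the `yTolerance` parameter are hypothetical names, the sample coordinates are made up, y is assumed to grow downward, and fragments in a row are simply joined with a single space.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class RowCoalescer {

    /** Hypothetical positioned text fragment; a stand-in, not a PDFBox class. */
    public static final class TextItem {
        final String text;
        final float x;
        final float y;

        public TextItem(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Groups items whose y lies within yTolerance of the first item of the
     * current row, then joins each row from left to right.
     */
    public static List<String> coalesce(List<TextItem> items, float yTolerance) {
        // Work on a copy so the caller's list is left untouched.
        List<TextItem> sorted = new ArrayList<>(items);
        sorted.sort(Comparator.comparingDouble((TextItem t) -> t.y));

        List<List<TextItem>> rows = new ArrayList<>();
        List<TextItem> current = new ArrayList<>();
        float rowY = 0f;
        for (TextItem item : sorted) {
            // A y jump larger than the tolerance closes the current row.
            if (!current.isEmpty() && Math.abs(item.y - rowY) > yTolerance) {
                rows.add(current);
                current = new ArrayList<>();
            }
            if (current.isEmpty()) {
                rowY = item.y; // the first item of a row fixes its baseline
            }
            current.add(item);
        }
        if (!current.isEmpty()) {
            rows.add(current);
        }

        List<String> lines = new ArrayList<>();
        for (List<TextItem> row : rows) {
            // Within a row, order fragments by x before concatenating.
            row.sort(Comparator.comparingDouble((TextItem t) -> t.x));
            StringBuilder sb = new StringBuilder();
            for (TextItem t : row) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(t.text);
            }
            lines.add(sb.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        // Made-up coordinates for a two-column, two-row layout.
        List<TextItem> items = new ArrayList<>();
        items.add(new TextItem("License", 200f, 100f));
        items.add(new TextItem("Centric CRM", 50f, 120.2f));
        items.add(new TextItem("Project Name", 50f, 100f));
        items.add(new TextItem("Free To Use", 200f, 120f));
        for (String line : coalesce(items, 0.5f)) {
            System.out.println(line);
        }
    }
}
```

A tolerance is used instead of exact equality because fragments on one visual row often differ slightly in y. Note that this sketch does nothing about cells that wrap over several lines, which is the harder problem raised in the comments above.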