Yes, but for that to work the PDF must already have been tagged. That is great if the company or individual generating the PDF produces it that way, but for PDFs in general it may not work. I am also not sure how reliable automatic tagging is.

On Mon, Mar 23, 2009 at 12:26 PM, Jeremias Maerki <[email protected]> wrote:

> Actually, there is a way, but AFAIK that's not supported yet, and for it
> to work the PDF has to be tagged (which most PDFs aren't).
>
> Tagged PDF: http://www.planetpdf.com/enterprise/article.asp?ContentID=6067
>
> On 22.03.2009 11:55:00 Dexter Mishra wrote:
> > Hi Hanan,
> > I don't think there is a way to mark data as table data. The one thing
> > you can do is use the article/bead feature in the PDFTextStripper
> > example. We have a similar requirement. We have some metadata in the
> > form of PDF comments, so I am modifying the PDFBox library to use the
> > PDF comments as user-space metadata.
> > One approach you can try is working with the x,y coordinates of the
> > text in the PDF.
> >
> > On Sat, Mar 21, 2009 at 9:01 PM, Hanan Harush <[email protected]> wrote:
> >
> > > Hi
> > >
> > > My name is Hanan and I am developing an in-house application that
> > > requires reading PDF files and extracting table text into a local
> > > database.
> > >
> > > Of course, the number of rows in a table may change from time to time.
> > >
> > > After reading a lot about PDF as well as PDFBox, I have succeeded in:
> > >
> > > Loading a PDF document
> > >
> > > Iterating through its pages
> > >
> > > My questions are:
> > >
> > > 1. Is there a way to identify a table in a PDF file?
> > >
> > > 2. What are the alternatives for extracting only table data using
> > > PDFBox?
> > >
> > > 3. How is it possible to step through a table?
> > >
> > > Best Regards,
> > >
> > > Hanan Harush
>
> Jeremias Maerki
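
For what it's worth, the x,y coordinate approach mentioned in the thread could look roughly like the sketch below. It is only an illustration, not something PDFBox provides: the Fragment class, the groupIntoRows method and the tolerance value are made up for this example, and it assumes you have already collected each piece of text together with its position, with y increasing down the page. Fragments whose y values are close together are treated as one table row, and the cells of each row are ordered by x.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TableRowGrouper {

    /** Hypothetical holder for a piece of text and its position on the page. */
    public static final class Fragment {
        final String text;
        final float x;
        final float y;

        public Fragment(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Groups fragments into rows. A fragment joins the current row when its y
     * differs from the row's first fragment by at most yTolerance; otherwise
     * it starts a new row. Rows come out in ascending y, cells in ascending x.
     */
    public static List<List<String>> groupIntoRows(List<Fragment> fragments, float yTolerance) {
        // Sort a copy by y so that fragments on the same line become neighbours.
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y));

        List<List<Fragment>> rows = new ArrayList<>();
        for (Fragment f : sorted) {
            List<Fragment> last = rows.isEmpty() ? null : rows.get(rows.size() - 1);
            if (last != null && Math.abs(f.y - last.get(0).y) <= yTolerance) {
                last.add(f);
            } else {
                List<Fragment> row = new ArrayList<>();
                row.add(f);
                rows.add(row);
            }
        }

        // Within each row, order the cells left to right and keep only the text.
        List<List<String>> result = new ArrayList<>();
        for (List<Fragment> row : rows) {
            row.sort(Comparator.comparingDouble((Fragment f) -> f.x));
            List<String> cells = new ArrayList<>();
            for (Fragment f : row) {
                cells.add(f.text);
            }
            result.add(cells);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample positions for a two-column, two-row table.
        List<Fragment> fragments = new ArrayList<>();
        fragments.add(new Fragment("Qty", 200f, 100.4f));
        fragments.add(new Fragment("Item", 50f, 100f));
        fragments.add(new Fragment("3", 200f, 120f));
        fragments.add(new Fragment("Widget", 50f, 119.8f));

        for (List<String> row : groupIntoRows(fragments, 2f)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

This only finds rows. Deciding which rows actually belong to a table, and handling cells that wrap over several lines, would still need rules specific to your documents, so the tolerance and the grouping logic should be treated as starting points.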
