Generous Pohshna wrote:
> Hi everyone,
> 
> I am a newbie to PDF files and their structure.
> 
> However, I need to parse a PDF file and read all the contents of the file.
> In fact, I need to extract and use some text from the PDF, which can be
> inside tables.

It depends on how the PDF was produced, but in general this is not easy. 
The visible contents of a PDF (text, lines, etc.) are mostly contained 
within content streams. These are sequences of graphics operations that 
describe the appearance of the page or other object.

It's not like (say) HTML, where you have marked-up structure such as:

<table><tr><td>item</td><td>value</td></tr></table>

Rather, a table in PDF would usually say something along the lines of:

Draw the text `item' at (500,500)
Draw the text `value' at (600,600)
Draw a line from (490,480) to (490,520)

... etc.
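
To make that concrete, here is a rough sketch in Python of what such 
drawing operations can look like in real PDF operator syntax (Tm, Tj, m, 
l, S) and how text and its position might be pulled out of them. The 
sample stream is hand-written and heavily simplified, and 
extract_text_items is a hypothetical helper, not part of PoDoFo. It 
assumes positions are set with a plain Tm matrix and that strings have no 
escapes or font-specific encodings, which real PDFs often violate.

```python
import re

# Hand-written, simplified content stream: two text items and one line.
SAMPLE_STREAM = b"""
BT /F1 12 Tf 1 0 0 1 500 500 Tm (item) Tj ET
BT /F1 12 Tf 1 0 0 1 600 600 Tm (value) Tj ET
490 480 m 490 520 l S
"""

# Simplified tokens: literal strings (no escapes or nesting), names,
# numbers and operators. Real content streams need a full tokenizer.
TOKEN_RE = re.compile(
    rb"\((?P<str>[^()\\]*)\)"
    rb"|(?P<name>/[^\s()/\[\]<>]+)"
    rb"|(?P<num>[-+]?\d*\.?\d+)"
    rb"|(?P<op>[A-Za-z*']+)"
)


def extract_text_items(stream):
    """Return (x, y, text) for each Tj, using the last Tm as its position."""
    items = []
    operands = []            # operands precede their operator in PDF
    position = (0.0, 0.0)
    for match in TOKEN_RE.finditer(stream):
        kind = match.lastgroup
        token = match.group(kind)
        if kind == "num":
            operands.append(float(token))
        elif kind in ("str", "name"):
            operands.append(token)
        else:
            # An operator consumes the operands collected so far.
            if token == b"Tm" and len(operands) == 6:
                position = (operands[4], operands[5])
            elif (token == b"Tj" and len(operands) == 1
                  and isinstance(operands[0], bytes)):
                # Assumption: bytes map directly to characters.
                text = operands[0].decode("latin-1")
                items.append((position[0], position[1], text))
            operands = []
    return items


for x, y, text in extract_text_items(SAMPLE_STREAM):
    print(x, y, text)
```

Note that nothing in the stream says "this is a table cell"; the only 
clues are the coordinates and the order of the operations.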

To obtain the table data you might have to process the content stream 
and extract what you want based on location in the stream, on-page 
position, or other factors. The PdfContentsTokenizer class in PoDoFo will 
help with this; have a look at test/ContentParser/ for an example/test 
program.
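
As an illustration of extracting by on-page position, here is a minimal 
Python sketch that groups already-extracted text items into table rows. 
It assumes you have obtained (x, y, text) tuples by some means; 
group_into_rows and the y_tolerance value are hypothetical and would need 
tuning for a real document.

```python
def group_into_rows(items, y_tolerance=2.0):
    """Group (x, y, text) items into rows, top to bottom, left to right."""
    rows = []  # each entry: (row_y, [(x, text), ...])
    # PDF y coordinates normally grow upwards, so the largest y comes first.
    for x, y, text in sorted(items, key=lambda item: -item[1]):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))   # same baseline: same row
        else:
            rows.append((y, [(x, text)]))   # start a new row
    # Within each row, order the cells by x position.
    return [[text for _, text in sorted(cells, key=lambda c: c[0])]
            for _, cells in rows]


# Made-up sample data, deliberately out of order.
sample_items = [
    (600.0, 700.5, "value"),
    (500.0, 700.0, "item"),
    (500.0, 680.0, "apples"),
    (600.0, 680.0, "3"),
]
print(group_into_rows(sample_items))
```

Grouping by baseline with a tolerance is only one heuristic; tables with 
wrapped cell text or rotated pages would need more work.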

Alternatively, you could use an existing package for extracting text from 
PDF and process the resulting text.
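
For example, if a tool such as `pdftotext -layout` (from Xpdf or Poppler) 
manages to preserve the column layout, the table can sometimes be 
recovered by splitting each line on runs of spaces. A small Python 
sketch, assuming columns are separated by at least two spaces; the sample 
text is made up and split_columns is a hypothetical helper.

```python
import re


def split_columns(layout_text):
    """Split layout-preserved text into rows of cells.

    Assumes cells are separated by two or more whitespace characters,
    so single spaces inside a cell are kept.
    """
    rows = []
    for line in layout_text.splitlines():
        line = line.strip()
        if line:  # skip blank lines
            rows.append(re.split(r"\s{2,}", line))
    return rows


# Made-up sample of what layout-preserving extraction might produce.
SAMPLE_TEXT = """
item          value
green apples  3
pears         12
"""

for row in split_columns(SAMPLE_TEXT):
    print(row)
```

This only works when the extraction tool keeps cells visually aligned; 
empty cells or narrow column gaps will confuse it.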

> I found this library PoDoFo.
> 
> So how should I use this library to achieve what I intended?

As noted above, the nature of the PDF format makes what you want 
potentially rather tricky.

Occasionally a PDF will also contain data that's meant for extraction 
and processing by other software. Generally the data will be something 
like an embedded copy of the document the PDF was made from, some 
application-specific data (Illustrator info, XML snippets from various 
apps, etc.) or additional metadata like JDF information. If your PDF was 
intended for machine processing, it could potentially contain the data 
you need outside a content stream. If you provide some more information, 
such as the program used to create the PDF, I can tell you whether I'm 
leading you down the wrong path.

--
Craig Ringer
