Generous Pohshna wrote:
> Hi Craig,
>
> Thanks a lot for replying. That was an informative mail. At least illl
> know where to get started.

No problem. Wish I could be more helpful.

By the way, please use "reply to all" so that the list gets a copy of the mail. It might help someone else with a similar question who's searching for information later.

> Well regarding ur questions.
> like the program
> used to create the PDF. Right now i dont have information on the program
> that generated the PDF.

You must have one of the PDFs you need to work with, though, right? If so, you should be able to see what program created it. Just open it in Adobe Reader or Adobe Acrobat and view the document properties in the File menu. The Creator and Producer fields tell you which software was used to make the PDF.

> Also getting the position of the table data too would be difficult.

It's not that you need the co-ordinates of the data. Rather, the trouble is that the data in a PDF content stream just isn't structured in a way that makes it easy to extract particular pieces of information. It's more like PostScript or really badly written old-style HTML, in that the formatting is completely mixed up with the data being formatted. For the uses PDF is designed for that's just fine, but it does make it hard to get data out if you need to.

If you want to see what I mean, use podofobrowser to examine a PDF content stream, or use podofouncompress to make a human-readable version of the PDF and view that in a text editor. The PDF is structured as a bunch of objects, each of which contains various data structures - usually dictionaries - and possibly a data stream. PDF content streams appear as data streams.

Here's an informative quote from the PDF reference:

----
Example 5.1 illustrates the most straightforward use of a font. The text ABC is placed 10 inches from the bottom of the page and 4 inches from the left edge, using 12-point Helvetica.

    BT
    /F13 12 Tf
    288 720 Td
    (ABC) Tj
    ET

The five lines of this example perform the following steps:

1. Begin a text object.
2. Set the font and font size to use, installing them as parameters in the text state. (The font resource identified by the name F13 specifies the font externally known as Helvetica.)
3. Specify a starting position on the page, setting parameters in the text object.
4. Paint the glyphs for a string of characters at that position.
5. End the text object.
----

(That's an example from the PDF reference, section 5.1.1, from http://www.adobe.com/devnet/pdf/pdf_reference.html, which you REALLY need to download and use as a reference).

As you can see, the string (ABC) is surrounded by a bunch of formatting operators. To extract it, you need to process the content stream. There's no guarantee that strings will appear in reading order, or as whole words or phrases. For example, instead of (PoDoFo) a PDF could contain (Po)...blah...(DoF)....blah...(o), say, if there was some per-character layout control being applied. In fact, especially in the presence of columns or other complex layout, there's sometimes little resemblance between the order of the content stream data and how it renders. (I don't know much about PDF content streams, but anyone who's had to suffer through trying to get text out of a PDF learns that much pretty quickly).

In your case, even if your table looks like:

    ----------------
    KEY | VALUE
    ----------------
    k1  | v1
    k2  | v2
    ...

the actual text elements could appear in the PDF in all sorts of orders, surrounded by various positioning and formatting operators. There isn't even a guarantee that they're part of the same content stream - the app could use XObjects or just multiple content streams per page.

They might not even be text. Sometimes software will convert text to outlines - essentially to a mathematical description of the shape of each character. In that case there's no longer any text string in the PDF at all.

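
To make the ordering problem concrete, here's a rough Python sketch. It is not PoDoFo code and not anything from the PDF reference, just an illustration under heavy assumptions: the content stream is already uncompressed, text is positioned only with Td and painted only with Tj (as in the quoted example), and strings contain no nested parentheses. The names text_fragments, rows and SAMPLE are made up for this example. It pulls each string out together with the position set by the preceding Td, then groups the fragments into rows by position, since stream order can't be trusted.

```python
import re

# A literal string "(...)" with backslash escapes (no nested parentheses),
# or any other run of non-whitespace characters.
TOKEN = re.compile(r"\((?:\\.|[^\\()])*\)|[^\s()]+")
NUMBER = re.compile(r"[+-]?(?:\d+\.?\d*|\.\d+)$")


def is_operand(tok):
    # Strings, names (/F13) and numbers are operands; anything else is
    # treated as an operator keyword.
    return tok[0] in "(/" or NUMBER.match(tok) is not None


def text_fragments(content):
    """Return (x, y, text) for every Tj in an uncompressed content stream.

    Assumption: only Td moves the text position. The horizontal advance
    caused by painting glyphs is ignored, so x is the start of the line.
    """
    fragments = []
    operands = []
    x = y = 0.0
    for tok in TOKEN.findall(content):
        if is_operand(tok):
            operands.append(tok)
            continue
        if tok == "BT":                      # new text object: reset position
            x = y = 0.0
        elif tok == "Td" and len(operands) >= 2:
            x += float(operands[-2])
            y += float(operands[-1])
        elif tok == "Tj" and operands:
            fragments.append((x, y, operands[-1][1:-1]))  # strip the parens
        operands = []                        # every operator consumes its operands
    return fragments


def rows(fragments):
    """Group fragments by y into rows, top to bottom, then left to right."""
    table = {}
    for x, y, text in fragments:
        table.setdefault(y, []).append((x, text))
    return [[text for _, text in sorted(cells)]
            for _, cells in sorted(table.items(), reverse=True)]


# Hypothetical stream: the table cells are deliberately out of reading order.
SAMPLE = """
BT /F13 12 Tf 200 700 Td (v1) Tj ET
BT /F13 12 Tf 100 720 Td (KEY) Tj ET
BT /F13 12 Tf 100 700 Td (k1) Tj ET
BT /F13 12 Tf 200 720 Td (VALUE) Tj ET
"""

print(text_fragments(SAMPLE))  # fragments in stream order
print(rows(text_fragments(SAMPLE)))  # fragments regrouped by position
```

With this sample the first print shows v1 ahead of KEY, and the second gives [['KEY', 'VALUE'], ['k1', 'v1']]. A real PDF would also need the other text operators (TJ, Tm, T*, ' and "), font encodings, the graphics state and exact-match tolerance on y handled before anything like this could work.
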
If you only have to handle data from one particular application you can probably figure out how it arranges it and extract it by processing the content stream. It'll take some work, though, and there's no guarantee it'll be reliable.

If you have another way of obtaining the same data, consider looking into it.

-- 
Craig Ringer
