Re: [Podofo-users] Locating/reading a string in a PDF

Craig Ringer Mon, 24 Nov 2008 19:56:42 -0800

Virgiliu Craciun wrote:
> Hi
> 
> I work in Healthcare and we are now developing an in-house (non-commercial)
> application that, at some point, has to read a string in some PDF medical
> records.
> As we don't have an interest and knowledge in PDF-related coding, I've come
> across PoDoFo as a possible solution for us. I had a quick look at the library
> but it's difficult to find the time to try/experiment with various things.
> 
> Looking in the tools provided with the package, it seems that there's nothing
> to help us quickly write code to locate a text string in a PDF and to read a
> text string as well. I am particularly interested in locating a heading word
> ('Doc:') and reading the string that follows immediately after (till a blank
> or whatever other separator is encountered).


Pierre pointed out podofotxtextract, which is probably a good start.

Note, however, that in PDF there are two very different concepts of 
"follows" when it comes to text. First, there's the order in which content 
appears when rendered on screen, e.g.:

Field1: Value1     Field3: value3
Field2: Value2     Field4: Value4

Second, there's the order in which the text appears in the PDF content 
stream, which is not necessarily the same as the on-screen order. In fact, 
there's no guarantee that the field labels are even in the same content 
stream as the associated values.
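
To make the difference concrete, here is a minimal Python sketch. It 
assumes you have already obtained text fragments together with their page 
coordinates from some extractor. The Fragment tuple, the reading_order 
helper and the sample coordinates are all made up for illustration; none 
of them is part of PoDoFo.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    """A piece of drawn text and where it was drawn (hypothetical structure)."""
    x: float
    y: float  # PDF user space: larger y is higher up the page
    text: str


def reading_order(fragments, line_tolerance=2.0):
    """Group fragments into visual lines, top to bottom, then left to right."""
    lines = []  # each entry: [baseline_y, [fragments on that line]]
    for frag in sorted(fragments, key=lambda f: -f.y):
        if lines and abs(lines[-1][0] - frag.y) <= line_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append([frag.y, [frag]])
    return [
        " ".join(f.text for f in sorted(members, key=lambda f: f.x))
        for _, members in lines
    ]


# One order a generator might emit: all labels first, then all values.
stream_order = [
    Fragment(50, 700, "Field1:"), Fragment(50, 680, "Field2:"),
    Fragment(300, 700, "Field3:"), Fragment(300, 680, "Field4:"),
    Fragment(110, 700, "Value1"), Fragment(110, 680, "Value2"),
    Fragment(360, 700, "value3"), Fragment(360, 680, "Value4"),
]

# Text joined in stream order: labels and values are not adjacent.
print(" ".join(f.text for f in stream_order))

# Text rebuilt from positions: matches what a reader sees on screen.
for line in reading_order(stream_order):
    print(line)
```

Grouping by baseline with a small tolerance is only one possible 
heuristic; rotated text, multiple columns or tables would need more care.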

It is quite common to overlay filled values on top of a static 
"background" containing field labels, etc. This background is often a 
separate content stream, perhaps an XObject referenced by the main 
content stream or just another stream prepended to the list of content 
streams for the page.

It's also not impossible that the field values might be stored as filled 
PDF form fields, which are different again.

This means that you can't expect to look for "Doc: " and just pull the 
text after it out. It's nice if you can, but that will NOT be a robust 
solution unless your PDF is always generated by one particular source. 
Even then, it might be broken when the generator is upgraded to a newer 
software version that does things differently.
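
As a rough illustration of why the simple approach is fragile, here is a 
sketch of it in Python. It assumes the page text has already been 
extracted into a plain string in content-stream order; the value_after_doc 
helper and both sample strings are invented for the example.

```python
import re

# Matches the label, optional whitespace, then a run of non-blank characters.
DOC_PATTERN = re.compile(r"Doc:\s*(\S+)")


def value_after_doc(text):
    """Return the token following 'Doc:' in extracted text, or None."""
    match = DOC_PATTERN.search(text)
    return match.group(1) if match else None


# Label and value are adjacent in the extracted text: the search works.
print(value_after_doc("Name: Smith Doc: A12345 Ward: 7"))

# Labels drawn first, values later: the token after 'Doc:' is another label.
print(value_after_doc("Name: Doc: Ward: Smith A12345 7"))
```

The second call returns a label rather than the value, which is the 
failure you get when labels and values live in different places in the 
stream.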

First, you need to decide just how robust this must be. Will you try to 
handle all potential ways the input might be filled, or will you target 
just the particular structure the generating application uses? If the 
latter, you need to examine the PDF (PoDoFoBrowser is likely to be 
useful here) to determine just how the document is structured and how 
the field values are stored.

Personally, I'd start with making it work for your particular 
application, even if the end goal is to accept forms filled using a 
variety of methods. You might land up having to try to process the 
content stream to determine what text is drawn within a certain set of 
co-ordinates. You might just be able to look for text after a known 
static string. You might instead land up having to read PDF form 
annotations. If you're really lucky, and the generating app is well 
written, there will be an embedded XML document containing a 
machine-readable representation of the form that you can just extract 
and load into a DOM for processing. It really depends on the 
construction of the document you're processing.
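
For the "text drawn within a certain set of co-ordinates" case, the 
selection step might look something like the Python sketch below. It 
assumes you already have text fragments with positions from whatever 
extraction you end up using. Fragment, text_in_region and the rectangle 
values are hypothetical names and numbers, not PoDoFo API.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    """A piece of drawn text and its origin on the page (hypothetical)."""
    x: float
    y: float
    text: str


def text_in_region(fragments, left, bottom, right, top):
    """Join the text of fragments whose origin lies inside the rectangle.

    Fragments are ordered top to bottom, then left to right.
    """
    inside = [
        f for f in fragments
        if left <= f.x <= right and bottom <= f.y <= top
    ]
    return " ".join(f.text for f in sorted(inside, key=lambda f: (-f.y, f.x)))


fragments = [
    Fragment(50, 700, "Doc:"), Fragment(110, 700, "A12345"),
    Fragment(50, 680, "Ward:"), Fragment(110, 680, "7"),
]

# Rectangle covering the area where the 'Doc:' value is expected to be drawn.
print(text_in_region(fragments, left=100, bottom=695, right=250, top=710))
```

This only tests each fragment's origin point. A real form would probably 
need the rectangle measured from a sample document, and some allowance for 
text that starts outside the box but extends into it.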

Once you know how it's structured and how the values are stored, you can 
look at extracting them and isolating the value(s) of interest.

Feel free to post a dummy form here if you're stumped. This is a public 
mailing list, so of course it should contain only sample data.

In case you're wondering why this is so needlessly complex: It's because 
people don't write applications that generate PDF very well. They don't 
tend to consider machine processing of forms, so they don't embed a 
machine-readable representation of the form data or use proper PDF form 
annotations like they should. They just use text drawing operations to 
produce something that looks right and makes sense to a human on screen 
or when printed, but isn't anything much more than plain graphical data 
to a computer.

--
Craig Ringer
