Virgiliu Craciun wrote:
> Hi
>
> I work in Healthcare and we are now developing an in-house (non-commercial)
> application that, at some point, has to read a string in some PDF medical
> records. As we don't have an interest and knowledge in PDF-related coding,
> I've come across PoDoFo as a possible solution for us. I had a quick look on
> the library but it's difficult to find the time to try/experiment various
> things.
>
> Looking in the tools provided with the package, it seems that there's nothing
> to help us writing quickly a code to locate a text string into a PDF and to
> read a text string as well. I am particularly interested in locating a heading
> word ('Doc:') and to read the string that follows immediately after (till a
> blank or whatever other separator is encountered).

Pierre pointed out podofotxtextract, which is probably a good start.

Note, however, that in PDF there are two very different concepts of
"follows" when it comes to text. First, there's the order that content
appears rendered on screen, e.g.:

    Field1: Value1    Field3: Value3
    Field2: Value2    Field4: Value4

However, this is not necessarily the order it appears in the PDF content
stream. In fact, there's no guarantee that the field labels are even in
the same content stream as the associated values.
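
To make the difference between the two orderings concrete, here is a
minimal Python sketch. It assumes you have already obtained text
fragments together with their page positions from some extraction step.
The Fragment type, the coordinates and the reading_order helper are all
made up for illustration; none of them are part of PoDoFo.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    x: float   # left edge of the text, in PDF user-space units
    y: float   # baseline; PDF y coordinates grow upwards
    text: str


# Hypothetical fragments in content-stream order: the static labels were
# drawn first, the filled-in values afterwards.
stream_order = [
    Fragment(50, 700, "Field1:"),
    Fragment(300, 700, "Field3:"),
    Fragment(50, 680, "Field2:"),
    Fragment(300, 680, "Field4:"),
    Fragment(110, 700, "Value1"),
    Fragment(360, 700, "Value3"),
    Fragment(110, 680, "Value2"),
    Fragment(360, 680, "Value4"),
]


def reading_order(fragments, line_tolerance=2.0):
    """Group fragments into lines by baseline, top to bottom, then left to right."""
    lines = []
    for frag in sorted(fragments, key=lambda f: -f.y):
        # Fragments whose baselines are within the tolerance share a line.
        if lines and abs(lines[-1][0].y - frag.y) <= line_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    return [
        " ".join(f.text for f in sorted(line, key=lambda f: f.x))
        for line in lines
    ]


stream_text = " ".join(f.text for f in stream_order)
rendered_lines = reading_order(stream_order)

print(stream_text)      # labels first, values last
for line in rendered_lines:
    print(line)         # label/value pairs as seen on screen
```

In stream order nothing useful follows "Field1:"; only after sorting by
position does "Value1" end up next to its label. Real documents add
rotated text, multiple columns and varying font sizes, so treat the
fixed tolerance as a placeholder.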

It is quite common to overlay filled values on top of a static
"background" containing field labels, etc. This background is often a
separate content stream, perhaps an XObject referenced by the main
content stream or just another stream prepended to the list of content
streams for the page.

It's also not impossible that the field values might be stored as filled
PDF form fields, which are different again.

This means that you can't expect to look for "Doc: " and just pull the
text after it out. It's nice if you can, but that will NOT be a robust
solution unless your PDF is always generated by one particular source.
Even then, it might be broken when the generator is upgraded to a newer
software version that does things differently.
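
For reference, the naive approach looks roughly like the Python sketch
below. It assumes the page text has already been extracted to a plain
string by some other tool; find_doc_value and both sample strings are
hypothetical. The second sample shows what happens when the labels are
emitted before the values.

```python
import re

# "Doc:", optional whitespace, then a run of non-blank characters.
DOC_PATTERN = re.compile(r"Doc:\s*(\S+)")


def find_doc_value(extracted_text):
    """Return the token that follows 'Doc:', or None if the label is absent."""
    match = DOC_PATTERN.search(extracted_text)
    return match.group(1) if match else None


# Label and value happen to be adjacent in the extracted text.
adjacent = "Patient: J.Doe Doc: A12345 Ward: 7"
# Same page, but all labels were drawn before all values.
labels_first = "Patient: Doc: Ward: J.Doe A12345 7"

print(find_doc_value(adjacent))      # A12345
print(find_doc_value(labels_first))  # Ward:  (silently wrong)
```

Note that the failure is silent: the function still returns something,
just not the value you wanted, so some sanity check on the result is
worth having.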

First, you need to decide just how robust this must be. Will you try to
handle all potential ways the input might be filled, or will you target
just the particular structure the generating application uses? If the
latter, you need to examine the PDF (PoDoFoBrowser is likely to be
useful here) to determine just how the document is structured and how
the field values are stored.

Personally, I'd start with making it work for your particular
application, even if the end goal is to accept forms filled using a
variety of methods. You might land up having to try to process the
content stream to determine what text is drawn within a certain set of
co-ordinates. You might just be able to look for text after a known
static string. You might instead land up having to read PDF form
annotations. If you're really lucky, and the generating app is well
written, there will be an embedded XML document containing a
machine-readable representation of the form that you can just extract
and load into a DOM for processing. It really depends on the
construction of the document you're processing.
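
As an illustration of the co-ordinate based option, here is a small
Python sketch. It assumes you already have positioned text fragments
from processing the content stream, and that you have measured where
the value is drawn in your form. The Fragment type, the text_in_box
helper, the positions and the box are all invented for the example.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    x: float   # left edge of the text, in PDF user-space units
    y: float   # baseline; PDF y coordinates grow upwards
    text: str


def text_in_box(fragments, left, bottom, right, top):
    """Join the fragments whose origin lies inside the box, in reading order."""
    inside = [
        f for f in fragments
        if left <= f.x <= right and bottom <= f.y <= top
    ]
    return " ".join(f.text for f in sorted(inside, key=lambda f: (-f.y, f.x)))


# Hypothetical page: labels drawn first, values afterwards.
fragments = [
    Fragment(50, 700, "Doc:"),
    Fragment(50, 680, "Ward:"),
    Fragment(110, 700, "A12345"),
    Fragment(110, 680, "7"),
]

# Assumed region to the right of the "Doc:" label, measured by hand.
doc_value = text_in_box(fragments, left=100, bottom=695, right=250, top=705)
print(doc_value)  # A12345
```

This only tests each fragment's origin, not its full extent, and it
depends on the layout staying put, so it is best suited to forms from a
single known generator.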

Once you know how it's structured and how the values are stored, you can
look at extracting them and isolating the value(s) of interest.

Feel free to post a dummy form here if you're stumped. This is a public
mailing list, so of course it should contain only sample data.

In case you're wondering why this is so needlessly complex: It's because
people don't write applications that generate PDF very well. They don't
tend to consider machine processing of forms, so they don't embed a
machine-readable representation of the form data or use proper PDF form
annotations like they should. They just use text drawing operations to
produce something that looks right and makes sense to a human on screen
or when printed, but isn't anything much more than plain graphical data
to a computer.

--
Craig Ringer