Re: [Podofo-users] Locating/reading a string in a PDF

Dominik Seichter Tue, 25 Nov 2008 07:44:41 -0800

Hi,

The text extraction features of PoDoFo (as well as the podofotxtextract tool) 
are currently only available in SVN, so you might want to try SVN. Current 
trunk also contains more bug fixes than the last release, and a release of 
current SVN is planned for sometime before Christmas (as soon as I find some 
free time).
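
Once the text is out of the PDF, picking up the value after 'Doc:' could look 
roughly like the sketch below. It is only an illustration and does not call 
PoDoFo at all: it assumes the extracted text has already been saved as a plain 
string, and the digit counts in the pattern are inferred from the single 
example identifier quoted further down. The names DOC_ID and find_doc_ids are 
made up for this example.

```python
import re

# Assumption: the text was already extracted from the PDF by some other tool.
# Assumption: identifiers look like the one example in this thread,
# i.e. 11 digits, a dot, 6 digits, a dot, 3 digits.
DOC_ID = re.compile(r"Doc:\s*(\d{11}\.\d{6}\.\d{3})")


def find_doc_ids(text):
    """Return every identifier that directly follows a 'Doc:' label."""
    return DOC_ID.findall(text)


sample = "Patient record\nDoc: 17220080930.121655.008 Page 1\n"
print(find_doc_ids(sample))  # prints ['17220080930.121655.008']
```

This only works if the label and the value end up next to each other in the 
extracted text, which is not guaranteed for every PDF.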


You can find a build of PoDoFoBrowser for Windows here:
http://downloads.sourceforge.net/podofo/podofobrowser-0.5-r1-win32-bin.zip?modtime=1184101010&big_mirror=0

Please note that this version is about a year old. I do not have a more recent 
Windows binary at the moment.

best regards,
        Dom

On Tuesday 25 November 2008, Virgiliu Craciun wrote:
> Many thanks for your advice and explanations.
>
> Unfortunately, it seems that there is no 'podofotxtextract' tool included
> in the package, and so far we've been unsuccessful in finding it.
>
> Any help in this regard would be greatly appreciated (we use Windows 2000).
>
> We had the feeling that this wouldn't be straightforward at all.
> Fortunately, the PDFs we're interested in are generated by one single
> application. I will try to use PoDoFoBrowser to understand the structure
> (are there any binaries for Windows?).
>
> Another good thing is that the string always has the same format
> (pattern, number of digits). Please see this example:
> Doc: 17220080930.121655.008
> So we may be able to distinguish it from the crowd.
>
> If I get stuck, I will post an anonymised file. We are working under time
> pressure with this (and it's just a small bit of the project!), so maybe
> someone could help us with podofotxtextract or another code example to get
> the text out of a PDF.
>
> Many thanks.
>
> Virgiliu
>
> Quoting Craig Ringer <[EMAIL PROTECTED]>:
> > Virgiliu Craciun wrote:
> >> Hi
> >>
> >> I work in healthcare and we are now developing an in-house
> >> (non-commercial) application that, at some point, has to read a string
> >> in some PDF medical records. As we have neither an interest in nor
> >> knowledge of PDF-related coding, I've come across PoDoFo as a possible
> >> solution for us. I had a quick look at the library, but it's difficult
> >> to find the time to try out various things.
> >>
> >> Looking at the tools provided with the package, it seems that there's
> >> nothing to help us quickly write code to locate a text string in a PDF
> >> and to read a text string as well. I am particularly interested in
> >> locating a heading word ('Doc:') and reading the string that follows
> >> immediately after (until a blank or whatever other separator is
> >> encountered).
> >
> > Pierre pointed out podofotxtextract, which is probably a good start.
> >
> > Note, however, that in PDF there are two very different concepts of
> > "follows" when it comes to text. First, there's the order that content
> > appears rendered on screen, e.g.:
> >
> > Field1: Value1     Field3: Value3
> > Field2: Value2     Field4: Value4
> >
> > However, this is not necessarily the order it appears in the PDF content
> > stream. In fact, there's no guarantee that the field labels are even in
> > the same content stream as the associated values.
> >
> > It is quite common to overlay filled values on top of a static
> > "background" containing field labels, etc. This background is often a
> > separate content stream, perhaps an XObject referenced by the main
> > content stream or just another stream prepended to the list of content
> > streams for the page.
> >
> > It's also not impossible that the field values might be stored as filled
> > PDF form fields, which are different again.
> >
> > This means that you can't expect to look for "Doc: " and just pull the
> > text after it out. It's nice if you can, but that will NOT be a robust
> > solution unless your PDF is always generated by one particular source.
> > Even then, it might be broken when the generator is upgraded to a newer
> > software version that does things differently.
> >
> > First, you need to decide just how robust this must be. Will you try to
> > handle all potential ways the input might be filled, or will you target
> > just the particular structure the generating application uses? If the
> > latter, you need to examine the PDF (PoDoFoBrowser is likely to be
> > useful here) to determine just how the document is structured and how
> > the field values are stored.
> >
> > Personally, I'd start with making it work for your particular
> > application, even if the end goal is to accept forms filled using a
> > variety of methods. You might land up having to try to process the
> > content stream to determine what text is drawn within a certain set of
> > co-ordinates. You might just be able to look for text after a known
> > static string. You might instead land up having to read PDF form
> > annotations. If you're really lucky, and the generating app is well
> > written, there will be an embedded XML document containing a
> > machine-readable representation of the form that you can just extract
> > and load into a DOM for processing. It really depends on the
> > construction of the document you're processing.
> >
> > Once you know how it's structured and how the values are stored, you can
> > look at extracting them and isolating the value(s) of interest.
> >
> > Feel free to post a dummy form here if you're stumped. This is a public
> > mailing list, so of course it should contain only sample data.
> >
> > In case you're wondering why this is so needlessly complex: It's because
> > people don't write applications that generate PDF very well. They don't
> > tend to consider machine processing of forms, so they don't embed a
> > machine-readable representation of the form data or use proper PDF form
> > annotations like they should. They just use text drawing operations to
> > produce something that looks right and makes sense to a human on screen
> > or when printed, but isn't anything much more than plain graphical data
> > to a computer.
> >
> > --
> > Craig Ringer
>
> -------------------------------------------------------------------------
> This SF.Net email is sponsored by the Moblin Your Move Developer's
> challenge Build the coolest Linux based applications with Moblin SDK & win
> great prizes Grand prize is a trip for two to an Open Source event anywhere
> in the world http://moblin-contest.org/redirect.php?banner_id=100&url=/
> _______________________________________________
> Podofo-users mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/podofo-users
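
To illustrate Craig's point above about working from co-ordinates rather than 
from the order of the content stream: the sketch below is not PoDoFo code. It 
assumes that some earlier step has already produced a list of text fragments 
with page positions. The Fragment type, the value_right_of function and the 
y_tol tolerance are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Fragment:
    """One piece of drawn text with its page position (hypothetical type)."""
    x: float
    y: float
    text: str


def value_right_of(fragments: List[Fragment], label: str,
                   y_tol: float = 2.0) -> Optional[str]:
    """Return the nearest fragment to the right of `label` on the same line."""
    anchors = [f for f in fragments if f.text.strip() == label]
    if not anchors:
        return None
    anchor = anchors[0]
    # Same line means: vertical position within y_tol units of the label.
    candidates = [
        f for f in fragments
        if f is not anchor and abs(f.y - anchor.y) <= y_tol and f.x > anchor.x
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda f: f.x).text.strip()


# Fragments deliberately listed out of reading order, as they might appear
# when labels and values come from different content streams.
fragments = [
    Fragment(300.0, 700.0, "17220080930.121655.008"),
    Fragment(50.0, 650.0, "Name:"),
    Fragment(250.0, 700.0, "Doc:"),
    Fragment(120.0, 650.0, "Example Patient"),
]
print(value_right_of(fragments, "Doc:"))  # prints 17220080930.121655.008
```

Matching on position rather than on stream order is what makes this tolerant 
of a value being drawn in a different stream from its label. The tolerance 
would need tuning for real documents.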



-- 
**********************************************************************
Dominik Seichter - [EMAIL PROTECTED]
KRename  - http://www.krename.net  - Powerful batch renamer for KDE
KBarcode - http://www.kbarcode.net - Barcode and label printing
PoDoFo - http://podofo.sf.net - PDF generation and parsing library
SchafKopf - http://schafkopf.berlios.de - Schafkopf, a card game,  for KDE
Alan - http://alan.sf.net - A Turing Machine in Java
**********************************************************************


