On 26 April 2010 22:29, Richard Boulton <[email protected]> wrote:
> PDFs vary massively in how hard it is to automatically extract
> structure from them. I've been playing with this quite a bit lately
> (really must write a blog post about it soon, and document some sample
> code I've been working on). Meanwhile, if you can point me at a few
> example PDFs and tell me what you'd like to automatically extract from
> them, I'd be happy to have a good go at them.
Hmm - I found the PDFs via Francis' blog post. The official draft is
reasonably manageable, and I reckon a couple of hours of work could
pull out the sections of text, with the footnotes attached to the
appropriate sections.
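
Something like this is the kind of extraction I have in mind - very much a
sketch, using pdfminer.six, and the filename, the "number at the start of a
line means footnote" rule and the "all-caps line means section heading" rule
are just guesses about how the official draft is laid out:

# Sketch only: pull text out of the official draft and group footnotes with
# the section they appear under. Heuristics are assumptions, not tested rules.
import re
from pdfminer.high_level import extract_text  # pip install pdfminer.six

text = extract_text("official_draft.pdf")

sections = []        # list of (section_text, [footnotes])
current_lines = []
footnotes = []

for line in text.splitlines():
    line = line.strip()
    if not line:
        continue
    if re.match(r"^\d+\s+\S", line):
        # Guess: a line starting with a bare number is a footnote.
        footnotes.append(line)
    elif line.isupper():
        # Guess: an all-caps line starts a new section.
        if current_lines:
            sections.append(("\n".join(current_lines), footnotes))
        current_lines, footnotes = [line], []
    else:
        current_lines.append(line)

if current_lines:
    sections.append(("\n".join(current_lines), footnotes))

for body, notes in sections:
    print(body.splitlines()[0], "-", len(notes), "footnotes")

The heuristics would obviously need adjusting once you look at what pdfminer
actually spits out for this particular document.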
Unfortunately, the leaked draft is just a set of scanned images copied
into PDF form, with no attached text version. This means that any
automated analysis is going to have to start by OCRing the text (which
might be a hard job in itself, since much of the text is in fuzzy grey
with a heavy "watermark" ("EU and member states") across the
background). Once you've OCRed it, it would be a case of writing some
rules to determine where footnotes start on the page, and to look for
the cross references. An interesting project, but probably much more
suitable for crowdsourcing (or just sitting at a computer for a couple
of days to type it all in).
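
For the OCR step, a first attempt might look like the sketch below -
pdf2image plus pytesseract, with a simple threshold to push the pale
watermark to white before OCRing. The filename and the threshold value are
assumptions and would need tuning against the real scans:

# Sketch only: rasterise the scanned pages, suppress the light grey
# watermark by thresholding, then OCR each cleaned page to a text file.
import pytesseract                        # pip install pytesseract (needs tesseract)
from pdf2image import convert_from_path   # pip install pdf2image (needs poppler)

pages = convert_from_path("leaked_draft.pdf", dpi=300)

for i, page in enumerate(pages, start=1):
    grey = page.convert("L")
    # Keep the darker body text; anything lighter than the cutoff goes white.
    cleaned = grey.point(lambda p: 0 if p < 200 else 255)
    text = pytesseract.image_to_string(cleaned)
    with open(f"page_{i:03d}.txt", "w") as f:
        f.write(text)

Whether thresholding alone is enough to separate fuzzy grey text from the
watermark is exactly the part I'd expect to be hard.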
Once you have the two documents, I assume you'd want, for each piece
of text marked by a footnote in the official document, to find similar
sections in the leaked document, and the footnotes from those pieces
of text. This could probably be done with a fair degree of success by a
standard text similarity search algorithm (and then checked manually).
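
By way of illustration, the matching step could be as simple as the
following (pure standard library, cosine similarity over word counts; TF-IDF
weighting would be an obvious refinement). The input lists are hypothetical
and assumed to come from the extraction steps above:

# Sketch only: for each footnoted passage in the official text, find the
# most similar section of the leaked text by cosine similarity.
import math
import re
from collections import Counter

def vectorise(text):
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def best_match(passage, leaked_sections):
    pv = vectorise(passage)
    return max(((cosine(pv, vectorise(s)), s) for s in leaked_sections),
               key=lambda pair: pair[0])

# Usage (hypothetical inputs):
# score, match = best_match(official_passages[0], leaked_sections)

Anything scoring below some cutoff would get flagged for the manual check.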
--
Richard