Hi,
Okay, so I've finally got around to getting this up and public, and I
can only apologise that this didn't happen sooner:
http://parlvid.mysociety.org:81/parliament/bills/
These are all the current Bills from parliament.uk, in an XML format
generated by mif2xml (MIF being the format used by FrameMaker, which is
presumably what they use to produce their Bills).
From a brief look at the XML, it's more than possible that working with
the HTML would be saner(!), but it's there in case you can make sense of
it and maybe produce better XML. Somewhere in the midst of all that cruft
are <ParaLine>s and <String>s, honest. As an example, there's an excerpt
from 11001.xml, with *lots* removed, at the end of this email. But
seriously, the HTML might be nicer to work with: these aren't semantic
XML documents, they're DTP-level layout XML documents.
You can see something manually done years ago at
http://www.theyworkforyou.com/bills/2005-06/legislative-and-regulatory-reform/
- what you really need is to be able to take the current amendments of a
bill and see what effect each one would have. I'll see if I can get
amendment paper XML up too.
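To make "see what effect each one would have" a bit more concrete, here is a rough Python sketch. It assumes the bill has already been split into printed pages and lines, and that each amendment has been boiled down to a page, a line, the text to leave out and the text to put in. The Amendment class, apply_amendment function and the page and line numbers are all invented for illustration; none of this is existing code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Amendment:
    """Hypothetical, simplified amendment: swap some text on one printed line."""
    page: int
    line: int
    old: str  # text to be left out
    new: str  # text to be inserted in its place


def apply_amendment(bill, amendment):
    """Return a copy of bill with one amendment applied.

    bill maps (page, line) to the text printed on that line.
    """
    key = (amendment.page, amendment.line)
    if key not in bill:
        raise KeyError(f"no page {amendment.page}, line {amendment.line}")
    text = bill[key]
    if amendment.old not in text:
        raise ValueError(f"{amendment.old!r} not found at {key}")
    amended = dict(bill)  # leave the original untouched so the two can be compared
    amended[key] = text.replace(amendment.old, amendment.new, 1)
    return amended


# Made-up page and line numbers; the sentence is from the 11001.xml excerpt.
bill = {(1, 1): "The Identity Cards Act 2006 is repealed."}
amended = apply_amendment(bill, Amendment(page=1, line=1, old="repealed", new="amended"))
print(bill[(1, 1)])
print(amended[(1, 1)])
```

Returning a copy rather than editing in place means each amendment can be tried against the same starting text, which is what you want if the aim is to show the effect of each one separately.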
As someone said, amendments are referred to by page and line numbers,
but this has been discussed before, and they follow a regular enough
pattern that you might be able to do something clever with them.
Alternatively, Julian has suggested in the past a tool that gives you
the bill text, lets you amend it, and then creates
the "On page N, line M, from FOO replace with BAR." amendment text for
you, which would be pretty nice.
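A very rough sketch of that last idea in Python, assuming the original and edited bill are both held as a mapping from (page, line) to the text on that line. The function names and the page and line numbers are made up, and it only copes with replacing words within a single line; insertions, deletions and edits that reflow lines would need more thought.

```python
def changed_span(old_words, new_words):
    """Trim the common leading and trailing words, returning the parts that differ."""
    start = 0
    while (start < len(old_words) and start < len(new_words)
           and old_words[start] == new_words[start]):
        start += 1
    end = 0
    while (end < len(old_words) - start and end < len(new_words) - start
           and old_words[-1 - end] == new_words[-1 - end]):
        end += 1
    return (old_words[start:len(old_words) - end],
            new_words[start:len(new_words) - end])


def amendment_text(original, edited):
    """Yield one amendment sentence for each (page, line) whose text changed."""
    for (page, line), old in sorted(original.items()):
        new = edited.get((page, line), old)
        if new == old:
            continue
        foo, bar = changed_span(old.split(), new.split())
        foo_text = " ".join(foo)
        bar_text = " ".join(bar)
        yield f"On page {page}, line {line}, from {foo_text} replace with {bar_text}."


# Made-up page and line numbers; the wording is borrowed from the 11001.xml excerpt.
original = {(1, 8): "sections 25 and 26 of that Act"}
edited = {(1, 8): "sections 25 and 27 of that Act"}
for sentence in amendment_text(original, edited):
    print(sentence)
```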
ATB,
Matthew
--------------------------------------
<Para>
<PgfNumString>\t(1)\t</PgfNumString>
<ParaLine>
<ElementBegin><ETag>ClauseText</ETag></ElementBegin>
<ElementBegin><ETag>SubSection</ETag></ElementBegin>
<String>The Identity Cards Act 2006 is repealed.</String>
<ElementEnd>SubSection</ElementEnd>
</ParaLine>
</Para>
<Para>
<PgfNumString>\t(2)\t</PgfNumString>
<ParaLine>
<ElementBegin><ETag>SubSection</ETag></ElementBegin>
<String>But</String>
<Char>EmDash</Char>
</ParaLine>
</Para>
<Para>
<PgfNumString>\t(a)\t</PgfNumString>
<ParaLine>
<ElementBegin><ETag>Para</ETag></ElementBegin>
<String>sections 25 and 26 of that Act (possession of false
identity documents</String>
</ParaLine>
<ParaLine>
<String>etc), and</String>
<ElementEnd>Para</ElementEnd>
</ParaLine>
</Para>
--------------------------------------
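For what it's worth, here is a minimal Python sketch of pulling the readable text back out of an excerpt like the one above. It assumes each <Para> holds a <PgfNumString> plus one or more <ParaLine>s, that only <String> and <Char> carry text, and that EmDash is the only <Char> value that matters. The <Bill> wrapper element and the CHARS table are my own additions for the example, and the real files will certainly contain more than this handles.

```python
import xml.etree.ElementTree as ET

# Excerpt from 11001.xml, wrapped in a made-up <Bill> root so it parses on its own.
SAMPLE = r"""<Bill>
<Para>
<PgfNumString>\t(1)\t</PgfNumString>
<ParaLine>
<ElementBegin><ETag>ClauseText</ETag></ElementBegin>
<ElementBegin><ETag>SubSection</ETag></ElementBegin>
<String>The Identity Cards Act 2006 is repealed.</String>
<ElementEnd>SubSection</ElementEnd>
</ParaLine>
</Para>
<Para>
<PgfNumString>\t(2)\t</PgfNumString>
<ParaLine>
<ElementBegin><ETag>SubSection</ETag></ElementBegin>
<String>But</String>
<Char>EmDash</Char>
</ParaLine>
</Para>
<Para>
<PgfNumString>\t(a)\t</PgfNumString>
<ParaLine>
<ElementBegin><ETag>Para</ETag></ElementBegin>
<String>sections 25 and 26 of that Act (possession of false
identity documents</String>
</ParaLine>
<ParaLine>
<String>etc), and</String>
<ElementEnd>Para</ElementEnd>
</ParaLine>
</Para>
</Bill>"""

# Assumed mapping from <Char> names to characters; only EmDash appears in the excerpt.
CHARS = {"EmDash": "\u2014"}


def paragraphs(root):
    """Yield (number, text) for each <Para>, joining the text of its <ParaLine>s."""
    for para in root.iter("Para"):
        # The number is stored with literal backslash-t sequences around it.
        number = (para.findtext("PgfNumString") or "").replace("\\t", "").strip()
        lines = []
        for para_line in para.iter("ParaLine"):
            pieces = []
            for child in para_line:
                if child.tag == "String":
                    pieces.append(child.text or "")
                elif child.tag == "Char":
                    pieces.append(CHARS.get(child.text, ""))
                # ElementBegin/ElementEnd markers are ignored in this sketch.
            lines.append("".join(pieces))
        # Collapse layout line breaks into single spaces.
        yield number, " ".join(" ".join(lines).split())


result = list(paragraphs(ET.fromstring(SAMPLE)))
for number, text in result:
    print(number, text)
```

This throws away the ElementBegin/ElementEnd markers, which are the nearest thing to semantic structure in the file, so anything aiming at clause-level XML would want to track those rather than skip them.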
On 18/11/2010 14:38, Mark Wrangham wrote:
If anyone has attempted scraping Bills before, I would love to hear
from them. There might be some reusable code....
I'm working on this at the moment. I've got some PHP code that reads the
HTML of the bills and outputs XML with markup for the clauses,
subsections, paragraphs, pages and line numbers. I'm hoping to be able
to set this up as a web service. My initial idea for a quick web
interface was to feed the XML files into Subversion so that changes
between the stages of a bill can be seen using the visual diff tools in
Trac (a web frontend for viewing a Subversion repository).