On 18/11/10 14:17, Tom Kaneko wrote:
For the consensus wiki, wouldn't the text be enough?
On looking further at the HTML for the Bills, I think I can put
something together, thank you. Last time I looked, I thought there
was no standard format to the documents I found, but the Bills on
their RSS feed seem to follow some sort of standard.
Looking at the markup, it looks like it's going to be a lot of work to
piece each document together. For example, each line of text sits
in its own table row, even where the lines form one
paragraph. I can put some code together to create a sensible itemized
structure - it is a matter of time.
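A minimal Python sketch of the row-merging step, using only the standard
library. It assumes (this is a guess, not something checked against the
real Bill markup) that each line of text is one <tr> and that an empty
row separates paragraphs. RowTextParser and rows_to_paragraphs are
made-up names for illustration.

```python
from html.parser import HTMLParser


class RowTextParser(HTMLParser):
    """Collect the text of each <tr> as one whitespace-normalised string."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._current = None  # text chunks of the row being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._current = []

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)

    def handle_endtag(self, tag):
        if tag == "tr" and self._current is not None:
            text = " ".join("".join(self._current).split())
            self.rows.append(text)
            self._current = None


def rows_to_paragraphs(html):
    """Join consecutive non-empty rows; an empty row ends a paragraph.

    The empty-row rule is an assumption about the markup.
    """
    parser = RowTextParser()
    parser.feed(html)
    parser.close()
    paragraphs, lines = [], []
    for row in parser.rows:
        if row:
            lines.append(row)
        elif lines:
            paragraphs.append(" ".join(lines))
            lines = []
    if lines:
        paragraphs.append(" ".join(lines))
    return paragraphs
```

If the real documents mark paragraph breaks some other way (indentation,
a CSS class, numbering), only the loop in rows_to_paragraphs would need
to change.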
If anyone has attempted scraping Bills before, I would love to hear
from them. There might be some reusable code....
Tom,
I'm working on this at the moment. I've got some PHP code that reads the
HTML of the bills and outputs XML with markup for the clauses,
subsections, paragraphs, pages and line numbers. I'm hoping to be able
to set this up as a web-service. My initial idea for a quick web
interface was to feed the XML files into subversion so that changes
between the stages of a bill can be seen using visual diff tools in Trac
(web frontend for viewing a subversion repository).
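As a rough illustration of comparing two stages of a bill, here is a
small Python sketch that uses the standard library's difflib in place
of Subversion and Trac. The XML element and attribute names in the usage
note are invented for the example and are not the actual output of the
PHP code; diff_stages and the file names are likewise hypothetical.

```python
import difflib


def diff_stages(earlier, later,
                earlier_name="stage-1.xml", later_name="stage-2.xml"):
    """Return a unified diff (list of lines) between two XML strings.

    The default file names are placeholders used only in the diff header.
    """
    return list(difflib.unified_diff(
        earlier.splitlines(),
        later.splitlines(),
        fromfile=earlier_name,
        tofile=later_name,
        lineterm="",
    ))
```

For example, calling diff_stages on two versions of a clause that differ
in one <line> element gives a diff in which the old line is prefixed
with "-" and the new one with "+". This works best when the XML is
written with one element per line.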
The reason for marking up the pages and line numbers separately is the
rather archaic way that amendments are worded (e.g. "Clause 1, page 4,
line 15, replace foo with bar"). Often it's not possible to tell from
reading the amendment which paragraph is being changed, because the
amendment refers only to pages and lines.
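To show why keeping page and line numbers pays off, here is a minimal
Python sketch that maps an amendment's page/line reference back to the
paragraph containing it. The BillLine record, the paragraph identifiers
and the reference pattern are all assumptions made for the example; real
amendment wording will vary more than this regex allows.

```python
import re
from typing import NamedTuple


class BillLine(NamedTuple):
    """One printed line of a bill (hypothetical record layout)."""
    page: int
    line: int
    clause: str
    paragraph: str  # illustrative identifier, e.g. "1(2)(a)"
    text: str


# Matches "page 4, line 15" anywhere in an amendment reference.
REF = re.compile(r"page\s+(\d+),\s*line\s+(\d+)", re.IGNORECASE)


def build_index(lines):
    """Index bill lines by (page, line) for direct lookup."""
    return {(entry.page, entry.line): entry for entry in lines}


def resolve(reference, index):
    """Return the BillLine an amendment refers to, or None if not found."""
    match = REF.search(reference)
    if match is None:
        return None
    return index.get((int(match.group(1)), int(match.group(2))))
```

With an index built from the marked-up bill, resolve("Clause 1, page 4,
line 15", index) returns the record for that line, and its paragraph
field says which paragraph the amendment touches.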
Please get in touch if you want to know more. I'd be happy to share the
code that I've got so far.
Cheers
Mark Wrangham
_______________________________________________
Mailing list [email protected]
Archive, settings, or unsubscribe:
https://secure.mysociety.org/admin/lists/mailman/listinfo/developers-public