On 18/11/10 14:17, Tom Kaneko wrote:
For the consensus wiki, wouldn't the text be enough?
On looking further at the HTML for the Bills, I think I can put something together, thank you. Last time I looked, I thought there was no standard format to the documents I found, but the Bills on their RSS feed seem to follow some sort of standard.

Looking at the markup, it looks like it's going to be a lot of work to piece each document together. For example, each line of text sits in its own table row - even where several lines form one paragraph. I can put some code together to create a sensible itemized structure - it is a matter of time.
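
As a rough illustration of the row-joining step, here is a minimal Python sketch. It assumes (hypothetically, since the real Bill markup is not shown here) that each printed line is one <tr> and that an empty row marks the end of a paragraph. The sample table and the names RowTextParser and rows_to_paragraphs are made up for the example.

```python
from html.parser import HTMLParser


class RowTextParser(HTMLParser):
    """Collect the text of each <tr> as one whitespace-normalised string."""

    def __init__(self):
        super().__init__()
        self.rows = []        # text of every completed row
        self._current = None  # text fragments of the row being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._current = []

    def handle_data(self, data):
        # Only keep text that appears inside a row.
        if self._current is not None:
            self._current.append(data)

    def handle_endtag(self, tag):
        if tag == "tr" and self._current is not None:
            self.rows.append(" ".join("".join(self._current).split()))
            self._current = None


def rows_to_paragraphs(html):
    """Join consecutive non-empty rows; an empty row ends the paragraph.

    The empty-row rule is an assumption, not taken from the real markup.
    """
    parser = RowTextParser()
    parser.feed(html)
    parser.close()
    paragraphs, pending = [], []
    for row in parser.rows:
        if row:
            pending.append(row)
        elif pending:
            paragraphs.append(" ".join(pending))
            pending = []
    if pending:
        paragraphs.append(" ".join(pending))
    return paragraphs


# Invented sample, shaped like "one table row per printed line".
SAMPLE = """
<table>
  <tr><td>The Secretary of State may by order</td></tr>
  <tr><td>make provision for the purposes</td></tr>
  <tr><td>of this Act.</td></tr>
  <tr><td></td></tr>
  <tr><td>An order under this section</td></tr>
  <tr><td>must be made by statutory instrument.</td></tr>
</table>
"""

for paragraph in rows_to_paragraphs(SAMPLE):
    print(paragraph)
```

If the real pages signal paragraph breaks differently (indentation, a CSS class, a numbering cell), only the test inside the loop in rows_to_paragraphs would need to change.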

If anyone has attempted scraping Bills before, I would love to hear from them. There might be some reusable code....

Tom,

I'm working on this at the moment. I've got some PHP code that reads the HTML of the bills and outputs XML with markup for the clauses, subsections, paragraphs, pages and line numbers. I'm hoping to be able to set this up as a web service. My initial idea for a quick web interface was to feed the XML files into Subversion so that changes between the stages of a bill can be seen using visual diff tools in Trac (a web frontend for viewing a Subversion repository).
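
To give a feel for what such XML might look like, here is a small Python sketch (not the PHP code mentioned above). The element and attribute names (bill, clause, line, page, n) are hypothetical and are not the actual schema; subsections and paragraphs are left out to keep it short, and the sample lines are invented.

```python
import xml.etree.ElementTree as ET


def build_bill_xml(lines):
    """Build a <bill> tree from (clause, page, line_no, text) tuples.

    Every printed line keeps its page and line number as attributes,
    so it can still be found from a page/line reference later.
    """
    bill = ET.Element("bill")
    clauses = {}  # clause number -> <clause> element
    for clause, page, line_no, text in lines:
        if clause not in clauses:
            clauses[clause] = ET.SubElement(bill, "clause", number=str(clause))
        line = ET.SubElement(
            clauses[clause], "line", page=str(page), n=str(line_no)
        )
        line.text = text
    return bill


# Invented sample data.
SAMPLE_LINES = [
    (1, 4, 15, "the foo shall be kept by the registrar"),
    (1, 4, 16, "for a period of five years."),
    (2, 5, 1, "This Act extends to England and Wales."),
]

print(ET.tostring(build_bill_xml(SAMPLE_LINES), encoding="unicode"))
```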

The reason for marking up the pages and line numbers separately is the rather archaic way that amendments are worded (e.g. "Clause 1, page 4, line 15 replace foo with bar"). Often it's not possible to tell from reading the amendment which paragraph is being changed, because the amendment refers only to pages and lines.
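
A minimal Python sketch of why the page and line markup helps: once every line is indexed by (page, line), an amendment's reference can be resolved back to a paragraph. The regular expression only covers the wording of the example above, and the index contents and the names parse_reference and locate are hypothetical.

```python
import re

# Matches references shaped like "Clause 1, page 4, line 15".
# Real amendments may be worded differently; this is an assumption.
REFERENCE = re.compile(
    r"Clause\s+(?P<clause>\d+),\s*page\s+(?P<page>\d+),\s*line\s+(?P<line>\d+)",
    re.IGNORECASE,
)


def parse_reference(amendment):
    """Return (clause, page, line) as integers, or None if nothing matches."""
    match = REFERENCE.search(amendment)
    if match is None:
        return None
    return int(match["clause"]), int(match["page"]), int(match["line"])


def locate(index, amendment):
    """Look up the paragraph that contains the referenced page and line."""
    reference = parse_reference(amendment)
    if reference is None:
        return None
    _clause, page, line = reference
    return index.get((page, line))


# Invented index: (page, line) -> (paragraph label, text of that line).
INDEX = {
    (4, 15): ("1(2)(a)", "the foo shall be kept by the registrar"),
    (4, 16): ("1(2)(a)", "for a period of five years."),
}

print(locate(INDEX, "Clause 1, page 4, line 15 replace foo with bar"))
```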

Please get in touch if you want to know more. I'd be happy to share the code that I've got so far.

Cheers

Mark Wrangham



_______________________________________________
Mailing list [email protected]
Archive, settings, or unsubscribe:
https://secure.mysociety.org/admin/lists/mailman/listinfo/developers-public
