Francis
I'm keen to move this forward. Do you have any suggestions for the next step, and have you had confirmation from Alex/John?
Thanks
C

Francis Irving wrote:
On Fri, Jun 19, 2009 at 02:01:07PM +0100, CountCulture wrote:
Do you have all local authorities in there, or do you just set up a record when there's a request for a new body that you've not come across before?

We have them all.

Also, what do you do when there's a change, e.g. when an authority is split or merged, or when a govt department is partly renamed and partly subsumed -- create a new record (with 'related to' links), or rename the old one? If it's the former I don't see a reason why we couldn't use your primary IDs (though it is problematic if it's the latter).

I think we tend to create a new record - Alex or John can confirm.

However, a simpler solution could be either for public bodies in the WDTK DB to have a :snac_code field (or similar), so that you could call them with a URL of (something like):

http://www.whatdotheyknow.com/body?snac_code=AB23


and do something like:

@public_body = PublicBody.find_by_url_name_with_historic(params[:url_name]) || PublicBody.find_by_snac_code(params[:snac_code])


or alternatively have a :common_uid field with a format of la_[snac_code] which would allow you to use other common uids for other public bodies as and when you see fit. I'd have no problem prepending 'la_' to wdtk requests and wdtk urls
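
To illustrate the two options above, here is a minimal sketch in plain Ruby rather than the actual WDTK Rails code. The PublicBody struct, the BODIES list, the find_body and common_uid_for helpers and the sample values (the code "AB23" is borrowed from the example URL) are all assumptions for illustration; the real model and its finders may well differ.

```ruby
# Hypothetical stand-in for the WDTK public body model (not the real schema).
PublicBody = Struct.new(:url_name, :snac_code, :common_uid)

# Sample data: the values are made up for illustration.
BODIES = [
  PublicBody.new("example_council", "AB23", "la_AB23")
].freeze

# Second option: build a common uid by prefixing the SNAC code with "la_".
def common_uid_for(snac_code)
  "la_#{snac_code}"
end

# Look a body up by whichever identifier the request supplies, trying
# url_name first and then falling back to snac_code and common_uid.
def find_body(params, bodies = BODIES)
  [:url_name, :snac_code, :common_uid].each do |key|
    value = params[key]
    next if value.nil?

    match = bodies.find { |body| body[key] == value }
    return match if match
  end
  nil
end
```

With this, find_body({ snac_code: "AB23" }) and find_body({ common_uid: common_uid_for("AB23") }) would both return the same record, which is the point of having a stable code alongside the url_name.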

Cheers
C
p.s. By the way, I'm guessing the govt asset register (can't remember what it's called off the top of my head) doesn't have a central record of current and past public bodies.

Francis Irving wrote:
For councils there are other ids we could use, so I agree.

But in general for WDTK, there is no website other than Wikipedia with
anything approaching identifiers for the 3000-odd authorities we have
in there.

Francis

On Fri, Jun 19, 2009 at 01:20:48PM +0100, CountCulture wrote:
Redirects are fine for viewing a webpage, but somewhat problematic as a
canonical, immutable id that can be used to get data from a number of
sources, which is what we're after, I reckon. If I wanted the WDTK
data on Cheshire West and Chester, for example, would I be able to get
it via any of these?

    * http://www.whatdotheyknow.com/body/Cheshire_West_and_Chester
    * http://www.whatdotheyknow.com/body/West_Cheshire_and_Chester
    * http://www.whatdotheyknow.com/body/City_of_Chester_and_West_Cheshire

Seems a lot of work from the developer point of view (ignoring probs
caused by rogue edits).

Also, while the redirects do provide a partial history, as I understand
it they are only a one-step history, i.e. the Wikipedia article at
http://en.wikipedia.org/wiki/City_of_Chester_and_West_Cheshire redirects
straight to Cheshire_West_and_Chester and not via West_Cheshire_and_Chester.
(I'm not saying that's a big prob, just that it's not a full history; it's
also not a history of the official name changes, just of the wiki
editing process.)
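
To give a feel for the extra work on the developer side, here is a rough Ruby sketch of resolving any of the names above to a single current title. The REDIRECTS table and the resolve_title helper are hypothetical, and the chain of renames is assumed purely for illustration; it is not how Wikipedia stores its redirects, which go straight to the current title.

```ruby
# Hypothetical alias table: old title => title it was renamed to.
# The titles come from the examples above; the chain itself is assumed.
REDIRECTS = {
  "City_of_Chester_and_West_Cheshire" => "West_Cheshire_and_Chester",
  "West_Cheshire_and_Chester" => "Cheshire_West_and_Chester"
}.freeze

# Follow renames until a title with no further redirect is reached.
# Returns [current_title, titles_visited]; raises on a redirect loop.
def resolve_title(title, redirects = REDIRECTS)
  visited = [title]
  while redirects.key?(title)
    title = redirects[title]
    raise ArgumentError, "redirect loop at #{title}" if visited.include?(title)

    visited << title
  end
  [title, visited]
end
```

Every site consuming the names would need to keep a table like this up to date, whereas a code that doesn't change needs no such lookup.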

All we're after here is a common code that doesn't change (while the
local authority or other public body doesn't change) that various
websites can support (with minimum coding) to provide the data without
ambiguity. Wikipedia article URLs, much as we love them, don't really
work in that respect IMHO.
Cheers
C



Francis Irving wrote:
Yes, I mean the article name, probably in the form it appears in the
URI.

Although Wikipedia titles do change, they always provide a redirect.

The nice thing about it is that the redirects become part of the
structured information.

Francis

On Fri, Jun 19, 2009 at 12:02:49PM +0100, CountCulture wrote:
Francis
Think we should investigate Alex's suggestion of SNAC codes. Not sure about Wikipedia ids -- do you mean URIs, or do they have numerical ids too? I'd prefer numerical (or possibly alphanumerical) unique ids rather than strings, and Wikipedia page titles change too often to be canonical IMHO.
Cheers
C


Francis Irving wrote:
(copied to WhatDoTheyKnow team)

Anyone here know about identifiers for local authorities?

I'm inclined to use Wikipedia article ids, as that will extend to
other authorities as well.

Francis

On Thu, Jun 18, 2009 at 11:44:12AM +0100, CountCulture wrote:
Francis
Thought it might be useful if twfylocal could show the status of WDTK requests (total, recent, not answered, outstanding late etc), with basic details of requests (though it prob makes sense to go to the WDTK site for full details of a request).

Re id system, it's something I've been struggling with as everywhere uses a different system, so at the moment each twfylocal council record stores the following ids/refs:

:id (integer, twfy_local internal primary id. WON'T CHANGE)
:name (string, as scraped from eGR, though with some minor edits)
:wikipedia_url (string, as scraped from eGR, though have already found one mistake)
:ons_url (string)
:egr_id (integer, this is most useful as it gives links to loads of other things -- e.g. various gov pages -- doesn't change AFAIK even if the authority name does)
:wdtk_name (string, from scraping WDTK and trying to match against shortened version of name -- successful about 80% of the time)

Had a look at the WDTK code and I seem to remember the internal primary id is exposed in at least one place, but that it didn't help as you couldn't do queries by it. What we could really do with is a canonical id for each authority.

FWIW you can use the eGR ids on twfylocal, though it adds an extra step (if you go to theyworkforyoulocal.com/councils.xml it returns all the councils together with their ids and the eGR ids). If you could match WDTK with eGR ids (for example) and make the match available programmatically, we would have the beginnings of a makeshift common id.
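
As a rough sketch of what that match might look like, assuming WDTK could expose an eGR id for each body (it doesn't at the moment, as far as I know): the sample records, the :egr_id field on the WDTK side and the match_by_egr_id helper below are all made up for illustration.

```ruby
# Hypothetical records: field names follow the list above, values are made up.
TWFY_COUNCILS = [
  { id: 1, name: "Example Borough Council", egr_id: 101 },
  { id: 2, name: "Sample City Council", egr_id: 102 }
].freeze

# Assumes WDTK bodies carry an egr_id, which is not the case today.
WDTK_BODIES = [
  { url_name: "example_borough_council", egr_id: 101 }
].freeze

# Build a lookup from eGR id to WDTK url_name, then pair each council with
# its WDTK match (nil where no WDTK body carries that eGR id).
def match_by_egr_id(councils, bodies)
  by_egr = bodies.to_h { |body| [body[:egr_id], body[:url_name]] }
  councils.map do |council|
    {
      twfy_id: council[:id],
      egr_id: council[:egr_id],
      wdtk_name: by_egr[council[:egr_id]]
    }
  end
end
```

Calling match_by_egr_id(TWFY_COUNCILS, WDTK_BODIES) gives one row per council, with wdtk_name left as nil for any council that has no match, so the gaps are easy to spot and fix by hand.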

Thoughts?


Francis Irving wrote:
There are RSS feeds of latest responses, including quite fancy ones if
you use advanced search keywords. They only give extracts from the new
messages though. What exact information are you trying to get?

There is no structured way to get status or similar out of the site.

Finally, we could agree an id system for name matching. I'd quite like
in a way to mark every authority with, say, its identifier in
Wikipedia, to aid merging with other databases.

What identifiers are you using in your system?

Francis

On Wed, Jun 17, 2009 at 03:05:26PM +0200, Tom Steinberg wrote:
Hi,

I'm afraid I don't know, but I've CCed the team who look after WDTK to ask.

Tom

2009/6/17 CountCulture <[email protected]>:
Tom
Follow up question. At the moment I've got a link to the What Do They Know
page for the council. Any probs with including more info from WDTK such as
status and latest responses, and is there a good way to get that other than
scraping the data (I had a look at the code and there didn't really seem to
be)?
Cheers
C

-------- Original Message --------

Tom

Digging deeper is actually where I'd intended to go first, but when I
started to explore some of the council websites I found that even shallow
data was problematic and reckoned I needed an API and structure that at the
very least could cope with those variants (and reuse the scrapers/parsers
once written) -- hence the proof-of-concept nature.

However, now I've got the basics worked out (though there's still tweaking
to be done and issues to sort out there), delving deeper's the next step. In particular,
working out the best way of finding/storing/parsing council docs (which are
often unstructured PDFs, sometimes even just PDFs which are just scans), and
also working out an elegant way of linking with other data sources.

Thanks for the kind words, I'll keep the list updated with major
developments, or you can always watch the github repository.

Cheers
C

Tom Steinberg wrote:
Hi there,

Cool - great to see people hacking on councils, it's been something
I've wanted to see for ages.

I see you've gone straight for getting the councillors of several
different councils, but I'd actually suggest going deeper rather than
wider. Why not just dive deep into one council and see if you can get
transcripts or other documents nicely scraped and parsed? I'd love to
see at least a handful of councils in TheyWorkForYou proper by the end
of the year.

Well done anyway!

Tom

2009/6/16 CountCulture <[email protected]>:

Quick note about something I've been working on in my spare time:

http://theyworkforyoulocal.com -- a small app to scrape and parse local
authority info.

At the moment, it's barely more than a proof of concept, with only about
20 or so councils parsed, and even then only current councillors,
committees, committee membership and forthcoming meetings are parsed.

On the upside, it's fairly quick for me to add new parsers for councils
(and reuse ones already written if they use same CMS), there's an API
built in (basically just add .json or .xml to get the info as json or
XML), and there's lots of potential.
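
As a small illustration of the format suffix, here is a Ruby sketch that builds such URLs. The api_url helper is hypothetical, and "councils" is just an example path; only the general "add .json or .xml" pattern comes from the description above.

```ruby
# Base URL of the app as given above.
BASE_URL = "http://theyworkforyoulocal.com"

# Append .json or .xml to a resource path to ask for that representation.
# Hypothetical helper: it only builds the URL and does not fetch anything.
def api_url(path, format)
  unless %w[json xml].include?(format.to_s)
    raise ArgumentError, "unsupported format: #{format}"
  end

  "#{BASE_URL}/#{path.sub(%r{\A/}, '')}.#{format}"
end
```

For example, api_url("councils", :xml) builds the URL for the XML version of the councils list, and swapping :xml for :json asks for the same data as json.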

Getting this far has also been an education in understanding what a
full-blown twfy_local might look like (in general there seems no way to
see how councillors voted, for example), the need for such a resource
(there's no publicly available central repository for council election
results, for example), and the sorry state of local authority websites
(just finding a list of councillors is a challenge on some, and don't
get me started on the HTML markup).

Comments welcome. Code is at
http://github.com/CountCulture/twfy_local_parser/ (I'll probably GPL it
soon). Bug reports at
http://github.com/CountCulture/twfy_local_parser/issues and offers of
help to countculture at googlemail dot com.

I'd especially be interested in hearing from anyone who's got any
knowledge about local authority CMSs (e.g. there seem to be several
different versions of Modern.Gov producing different URLs), or sources
for more data other than the local authority websites (e.g. eGR,
info4local).

Cheers

C

_______________________________________________
Mailing list [email protected]
Archive, settings, or unsubscribe:

https://secure.mysociety.org/admin/lists/mailman/listinfo/developers-public

