Do you have all local authorities in there, or do you just set up a record when there's a request for a new body that you've not come across before?
Also, what do you do when there's a change, e.g. when an authority is split or merged, or when a govt department is partly renamed and partly subsumed: do you create a new record (with "related to" links), or rename the old one? If it's the former, I don't see a reason why we couldn't use your primary IDs (though that's problematic if it's the latter).

However, a simpler solution could be for public bodies in the WDTK DB to have a :snac_code field (or similar), so that you could call them with a URL of (something like):

    http://www.whatdotheyknow.com/body?snac_code=AB23

and do something like:

    @public_body = PublicBody.find_by_url_name_with_historic(params[:url_name]) ||
                   PublicBody.find_by_snac_code(params[:snac_code])

or alternatively to have a :common_uid field with a format of la_[snac_code], which would allow you to use other common uids for other public bodies as and when you see fit. I'd have no problem prepending 'la_' to WDTK requests and WDTK URLs.

Cheers
C

p.s. By the way, I'm guessing the govt asset register (can't remember what it's called off the top of my head) doesn't have a central record of current and past public bodies.

Francis Irving wrote:
> For councils there are other ids we could use, so I agree.
>
> But in general for WDTK, there is no website other than Wikipedia with
> anything approaching identifiers for the 3000-odd authorities we have
> in there.
>
> Francis
>
> On Fri, Jun 19, 2009 at 01:20:48PM +0100, CountCulture wrote:
>
>> Redirects are fine for viewing a webpage, but somewhat problematic as a
>> canonical, immutable id that can be used to get data from a number of
>> sources (which is what we're after, I reckon -- if I wanted the WDTK
>> data on Cheshire West and Chester, for example, would I be able to get
>> it via:
>>
>> * http://www.whatdotheyknow.com/body/Cheshire_West_and_Chester
>> * http://www.whatdotheyknow.com/body/West_Cheshire_and_Chester
>> * http://www.whatdotheyknow.com/body/City_of_Chester_and_West_Cheshire
>>
>> Seems a lot of work from the developer point of view (ignoring probs
>> caused by rogue edits).
>>
>> Also while the redirects do provide a partial history, as I understand
>> it are only a one-step history, i.e. though the wikipedia article on
>> http://en.wikipedia.org/wiki/City_of_Chester_and_West_Cheshire redirects
>> to Cheshire_West_and_Chester and not via West_Cheshire_and_Chester (I'm
>> not saying that's a big prob, just that it's not a full history; it's
>> also not a history of the official name changes, just of the wiki
>> editing process).
>>
>> All we're after here is a common code that doesn't change (while the
>> local authority or other public body doesn't change) that various
>> websites can support (with minimum coding) to provide the data without
>> ambiguity. Wikipedia article URLs, much as we love them, doesn't really
>> work in that respect IMHO.
>> Cheers
>> C
>>
>> Francis Irving wrote:
>>
>>> Yes, I mean the article name, probably in the form it appears in the
>>> URI.
>>>
>>> Although Wikipedia titles do change, they always provide a redirect.
>>>
>>> The nice thing about it, is that the redirects become part of the
>>> structured information.
>>>
>>> Francis
>>>
>>> On Fri, Jun 19, 2009 at 12:02:49PM +0100, CountCulture wrote:
>>>
>>>> Francis
>>>> Think we should investigate Alex's suggestions of SNAC codes.
>>>> Not sure about Wikipedia ids -- do you mean uris, or do they have
>>>> numerical ids too; prefer numerical/poss alphanumerical unique ids
>>>> rather than strings, and Wikipedia page titles change too often to
>>>> be canonical IMHO.
>>>> Cheers
>>>> C
>>>>
>>>> Francis Irving wrote:
>>>>
>>>>> (copied to WhatDoTheyKnow team)
>>>>>
>>>>> Anyone here know about identifiers for local authorities?
>>>>>
>>>>> I'm inclined to use Wikipedia article ids, as that will extend to
>>>>> other authorities as well.
>>>>>
>>>>> Francis
>>>>>
>>>>> On Thu, Jun 18, 2009 at 11:44:12AM +0100, CountCulture wrote:
>>>>>
>>>>>> Francis
>>>>>> Thought it might be useful if twfylocal could show status of WDTK
>>>>>> requests (total, recent, no answered, outstanding late etc), with
>>>>>> basic details of requests (though prob makes sense to go to WDTK
>>>>>> site for full details of request).
>>>>>>
>>>>>> Re id system, it's something I've been struggling with as everywhere
>>>>>> uses a different system, so at the moment each twfylocal council
>>>>>> record stores the following ids/refs:
>>>>>>
>>>>>> :id (integer, twfy_local internal primary id. WON'T CHANGE)
>>>>>> :name (string, as scraped from eGR, though with some minor edits)
>>>>>> :wikipedia_url (string, as scraped from eGR, though have already
>>>>>>   found one mistake)
>>>>>> :ons_url (string)
>>>>>> :egr_id (integer, this is most useful as it gives links to loads of
>>>>>>   other things -- e.g. various gov pages -- doesn't change AFAIK even
>>>>>>   if the authority name does)
>>>>>> :wdtk_name (string, from scraping WDTK and trying to match against
>>>>>>   shortened version of name -- successful about 80% of the time)
>>>>>>
>>>>>> Had a look at the WDTK code and I seem to remember the internal
>>>>>> primary id is exposed in at least one place, but that it didn't help
>>>>>> as you couldn't do queries by it. What we could really do with is a
>>>>>> canonical id for each authority.
>>>>>>
>>>>>> FWIW you can use the eGR on twfylocal, though it adds an extra step
>>>>>> (if you go to theyworkforyoulocal.com/councils.xml it returns all
>>>>>> the councils together with their ids and the eGR ids. If you could
>>>>>> match WDTK with eGR ids (for example) and make the match available
>>>>>> programmatically would have the beginnings of a makeshift common id.
>>>>>>
>>>>>> Thoughts?
>>>>>>
>>>>>> Francis Irving wrote:
>>>>>>
>>>>>>> There are RSS feeds of latest responses, including quite fancy ones if
>>>>>>> you use advanced search keywords. They only give extracts from the new
>>>>>>> messages though. What exact information are you trying to get?
>>>>>>>
>>>>>>> There is no structured way to get status or similar out of the site.
>>>>>>>
>>>>>>> Finally, we could agree an id system for name matching. I'd quite like
>>>>>>> in a way to mark every authority with, say, its identifier in
>>>>>>> Wikipedia, to aid merging with other databases.
>>>>>>>
>>>>>>> What identifiers are you using in your system?
>>>>>>>
>>>>>>> Francis
>>>>>>>
>>>>>>> On Wed, Jun 17, 2009 at 03:05:26PM +0200, Tom Steinberg wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I'm afraid I don't know, but I've CCed the team who look after WDTK to
>>>>>>>> ask.
>>>>>>>>
>>>>>>>> Tom
>>>>>>>>
>>>>>>>> 2009/6/17 CountCulture <[email protected]>:
>>>>>>>>
>>>>>>>>> Tom
>>>>>>>>> Follow up question. At the moment I've got a link to the What Do
>>>>>>>>> They Know page for the council. Any probs with including more info
>>>>>>>>> from WDTK such as status, and latest responses, and is there a good
>>>>>>>>> way to get that other than scraping the data (had a look at the
>>>>>>>>> code and there didn't really seem to be)?
>>>>>>>>> Cheers
>>>>>>>>> C
>>>>>>>>>
>>>>>>>>> -------- Original Message --------
>>>>>>>>>
>>>>>>>>> Tom
>>>>>>>>>
>>>>>>>>> Digging deeper is actually where I'd intended to go first, but when
>>>>>>>>> I started to explore some of the council websites I found that even
>>>>>>>>> shallow data was problematic and reckoned I needed a API and
>>>>>>>>> structure that at the very least could cope with those variants
>>>>>>>>> (and reuse the scrapers/parsers once written) -- hence the
>>>>>>>>> proof-of-concept nature.
>>>>>>>>>
>>>>>>>>> However, now I've got the basics worked out (though there's still
>>>>>>>>> tweaking and issues to be done there), delving deeper's the next
>>>>>>>>> step. In particular, working out the best way of
>>>>>>>>> finding/storing/parsing council docs (which are often unstructured
>>>>>>>>> PDFs, sometimes even just PDFs which are just scans), and also
>>>>>>>>> working out an elegant way of linking with other data sources.
>>>>>>>>>
>>>>>>>>> Thanks for the kind words, I'll keep the list updated with major
>>>>>>>>> developments, or you can always watch the github repository.
>>>>>>>>>
>>>>>>>>> Cheers
>>>>>>>>> C
>>>>>>>>>
>>>>>>>>> Tom Steinberg wrote:
>>>>>>>>>
>>>>>>>>>> Hi there,
>>>>>>>>>>
>>>>>>>>>> Cool - great to see people hacking on councils, it's been something
>>>>>>>>>> I've wanted to see for ages.
>>>>>>>>>>
>>>>>>>>>> I see you've gone straight for getting the councillors of several
>>>>>>>>>> different councils, but I'd actually suggest going deeper rather than
>>>>>>>>>> wider. Why not just dive deep into one council and see if you can get
>>>>>>>>>> transcripts or other documents nicely scraped and parsed? I'd love to
>>>>>>>>>> see at least a handful of councils in TheyWorkForYou proper by the
>>>>>>>>>> end of the year.
>>>>>>>>>>
>>>>>>>>>> Well done anyway!
>>>>>>>>>>
>>>>>>>>>> Tom
>>>>>>>>>>
>>>>>>>>>> 2009/6/16 CountCulture <[email protected]>:
>>>>>>>>>>
>>>>>>>>>>> Quick note about something I've been working on in my spare time:
>>>>>>>>>>>
>>>>>>>>>>> http://theyworkforyoulocal.com -- a small app to scrape and parse
>>>>>>>>>>> local authority info.
>>>>>>>>>>>
>>>>>>>>>>> At the moment, it's barely more than a proof of concept, with only
>>>>>>>>>>> about 20 or so councils parsed, and even then only current
>>>>>>>>>>> councillors, committees, committee membership and forthcoming
>>>>>>>>>>> meetings are parsed.
>>>>>>>>>>>
>>>>>>>>>>> On the upside, it's fairly quick for me to add new parsers for
>>>>>>>>>>> councils (and reuse ones already written if they use same CMS),
>>>>>>>>>>> there's an API built in (basically just add .json or .xml to get
>>>>>>>>>>> the info as json or XML), and there's lots of potential.
>>>>>>>>>>>
>>>>>>>>>>> Getting this far has also been an education in understanding what a
>>>>>>>>>>> full-blown twfy_local might look like (in general there seems no
>>>>>>>>>>> way to see how councillors voted, for example), the need for such a
>>>>>>>>>>> resource (there's no publicly available central repository for
>>>>>>>>>>> council election results, for example), and the sorry state of
>>>>>>>>>>> local authority websites (just finding a list of councillors is a
>>>>>>>>>>> challenge on some, and don't get me started on the HTML markup).
>>>>>>>>>>>
>>>>>>>>>>> Comments welcome. Code is at
>>>>>>>>>>> http://github.com/CountCulture/twfy_local_parser/ (I'll probably
>>>>>>>>>>> GPL it soon). Bug reports at
>>>>>>>>>>> http://github.com/CountCulture/twfy_local_parser/issues and offers
>>>>>>>>>>> of help to countculture at googlemail dot com.
>>>>>>>>>>>
>>>>>>>>>>> I'd especially be interested in hearing from anyone who's got any
>>>>>>>>>>> knowledge about local authority CMSs (e.g. there seem to be several
>>>>>>>>>>> different versions of Modern.Gov producing different URLs), or
>>>>>>>>>>> sources for more data other than the local authority websites
>>>>>>>>>>> (e.g. eGR, info4local).
>>>>>>>>>>>
>>>>>>>>>>> Cheers
>>>>>>>>>>>
>>>>>>>>>>> C
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> Mailing list [email protected]
>>>>>>>>>>> Archive, settings, or unsubscribe:
>>>>>>>>>>> https://secure.mysociety.org/admin/lists/mailman/listinfo/developers-public
