On Fri, Jun 19, 2009 at 02:01:07PM +0100, CountCulture wrote:
> Do you have all local authorities in there, or do you just set up a
> record when there's a request for a new body that you've not come across
> before?

We have them all.

> Also, what do you do when there's a change e.g. when an authority is
> split or merged, or when a govt department is partly renamed and partly
> subsumed -- create a new record (with "related to" links), or rename the
> old one. If it's the former I don't see a reason why we couldn't use
> your primary IDs (though it is problematic if it's the latter).

I think we tend to create a new record - Alex or John can confirm.

> However, a simpler solution could be either for public bodies in the
> WDTK DB to have a :snac_code field (or similar) and then you could call
> them with a url of (something like):
>
> http://www.whatdotheyknow.com/body?snac_code=AB23
>
> and do something like:
>
> @public_body =
>   PublicBody.find_by_url_name_with_historic(params[:url_name]) ||
>   PublicBody.find_by_snac_code(params[:snac_code])
>
> or alternatively have a :common_uid field with a format of
> la_[snac_code] which would allow you to use other common uids for other
> public bodies as and when you see fit. I'd have no problem prepending
> 'la_' to wdtk requests and wdtk urls
>
> Cheers
> C
> p.s. By the way, I'm guessing the govt asset register (can't remember
> what it's called off the top of my head) doesn't have a central record of
> current and past public bodies
>
> Francis Irving wrote:
>> For councils there are other ids we could use, so I agree.
>>
>> But in general for WDTK, there is no website other than Wikipedia with
>> anything approaching identifiers for the 3000-odd authorities we have
>> in there.
>>
>> Francis
>>
>> On Fri, Jun 19, 2009 at 01:20:48PM +0100, CountCulture wrote:
>>
>>> Redirects are fine for viewing a webpage, but somewhat problematic as a
>>> canonical, immutable id that can be used to get data from a number of
>>> sources (which is what we're after, I reckon) -- if I wanted the WDTK
>>> data on Cheshire West and Chester, for example, would I be able to get
>>> it via:
>>>
>>> * http://www.whatdotheyknow.com/body/Cheshire_West_and_Chester
>>> * http://www.whatdotheyknow.com/body/West_Cheshire_and_Chester
>>> * http://www.whatdotheyknow.com/body/City_of_Chester_and_West_Cheshire
>>>
>>> Seems a lot of work from the developer point of view (ignoring probs
>>> caused by rogue edits).
>>>
>>> Also while the redirects do provide a partial history, as I understand
>>> it they are only a one-step history, i.e. the wikipedia article on
>>> http://en.wikipedia.org/wiki/City_of_Chester_and_West_Cheshire redirects
>>> to Cheshire_West_and_Chester and not via West_Cheshire_and_Chester (I'm
>>> not saying that's a big prob, just that it's not a full history; it's
>>> also not a history of the official name changes, just of the wiki
>>> editing process).
>>>
>>> All we're after here is a common code that doesn't change (while the
>>> local authority or other public body doesn't change) that various
>>> websites can support (with minimum coding) to provide the data without
>>> ambiguity. Wikipedia article URLs, much as we love them, don't really
>>> work in that respect IMHO.
>>> Cheers
>>> C
>>>
>>> Francis Irving wrote:
>>>
>>>> Yes, I mean the article name, probably in the form it appears in the
>>>> URI.
>>>>
>>>> Although Wikipedia titles do change, they always provide a redirect.
>>>>
>>>> The nice thing about it is that the redirects become part of the
>>>> structured information.
>>>>
>>>> Francis
>>>>
>>>> On Fri, Jun 19, 2009 at 12:02:49PM +0100, CountCulture wrote:
>>>>
>>>>> Francis
>>>>> Think we should investigate Alex's suggestions of SNAC codes. Not
>>>>> sure about Wikipedia ids -- do you mean uris, or do they have
>>>>> numerical ids too; prefer numerical/poss alphanumerical unique
>>>>> ids rather than strings, and Wikipedia page titles change too
>>>>> often to be canonical IMHO.
>>>>> Cheers
>>>>> C
>>>>>
>>>>> Francis Irving wrote:
>>>>>
>>>>>> (copied to WhatDoTheyKnow team)
>>>>>>
>>>>>> Anyone here know about identifiers for local authorities?
>>>>>>
>>>>>> I'm inclined to use Wikipedia article ids, as that will extend to
>>>>>> other authorities as well.
>>>>>>
>>>>>> Francis
>>>>>>
>>>>>> On Thu, Jun 18, 2009 at 11:44:12AM +0100, CountCulture wrote:
>>>>>>
>>>>>>> Francis
>>>>>>> Thought it might be useful if twfylocal could show status of
>>>>>>> WDTK requests (total, recent, no answered, outstanding late
>>>>>>> etc), with basic details of requests (though prob makes
>>>>>>> sense to go to WDTK site for full details of request).
>>>>>>>
>>>>>>> Re id system, it's something I've been struggling with as
>>>>>>> everywhere uses a different system, so at the moment each
>>>>>>> twfylocal council record stores the following ids/refs:
>>>>>>>
>>>>>>> :id (integer, twfy_local internal primary id. WON'T CHANGE)
>>>>>>> :name (string, as scraped from eGR, though with some minor edits)
>>>>>>> :wikipedia_url (string, as scraped from eGR, though have
>>>>>>> already found one mistake)
>>>>>>> :ons_url (string)
>>>>>>> :egr_id (integer, this is most useful as it gives links to
>>>>>>> loads of other things -- e.g. various gov pages -- doesn't
>>>>>>> change AFAIK even if the authority name does)
>>>>>>> :wdtk_name (string, from scraping WDTK and trying to match
>>>>>>> against shortened version of name -- successful about 80%
>>>>>>> of the time)
>>>>>>>
>>>>>>> Had a look at the WDTK code and I seem to remember the
>>>>>>> internal primary id is exposed in at least one place, but
>>>>>>> that it didn't help as you couldn't do queries by it. What
>>>>>>> we could really do with is a canonical id for each
>>>>>>> authority.
>>>>>>>
>>>>>>> FWIW you can use the eGR on twfylocal, though it adds an
>>>>>>> extra step (if you go to
>>>>>>> theyworkforyoulocal.com/councils.xml it returns all the
>>>>>>> councils together with their ids and the eGR ids). If you
>>>>>>> could match WDTK with eGR ids (for example) and make the
>>>>>>> match available programmatically we would have the beginnings
>>>>>>> of a makeshift common id.
>>>>>>>
>>>>>>> Thoughts?
>>>>>>>
>>>>>>> Francis Irving wrote:
>>>>>>>
>>>>>>>> There are RSS feeds of latest responses, including quite fancy ones if
>>>>>>>> you use advanced search keywords. They only give extracts from the new
>>>>>>>> messages though. What exact information are you trying to get?
>>>>>>>>
>>>>>>>> There is no structured way to get status or similar out of the site.
>>>>>>>>
>>>>>>>> Finally, we could agree an id system for name matching. I'd quite like
>>>>>>>> in a way to mark every authority with, say, its identifier in
>>>>>>>> Wikipedia, to aid merging with other databases.
>>>>>>>>
>>>>>>>> What identifiers are you using in your system?
>>>>>>>>
>>>>>>>> Francis
>>>>>>>>
>>>>>>>> On Wed, Jun 17, 2009 at 03:05:26PM +0200, Tom Steinberg wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I'm afraid I don't know, but I've CCed the team who look after WDTK
>>>>>>>>> to ask.
>>>>>>>>>
>>>>>>>>> Tom
>>>>>>>>>
>>>>>>>>> 2009/6/17 CountCulture <[email protected]>:
>>>>>>>>>
>>>>>>>>>> Tom
>>>>>>>>>> Follow up question. At the moment I've got a link to the What Do
>>>>>>>>>> They Know page for the council. Any probs with including more info
>>>>>>>>>> from WDTK such as status, and latest responses, and is there a good
>>>>>>>>>> way to get that other than scraping the data (had a look at the code
>>>>>>>>>> and there didn't really seem to be)?
>>>>>>>>>> Cheers
>>>>>>>>>> C
>>>>>>>>>>
>>>>>>>>>> -------- Original Message --------
>>>>>>>>>>
>>>>>>>>>> Tom
>>>>>>>>>>
>>>>>>>>>> Digging deeper is actually where I'd intended to go first, but when I
>>>>>>>>>> started to explore some of the council websites I found that even
>>>>>>>>>> shallow data was problematic and reckoned I needed an API and
>>>>>>>>>> structure that at the very least could cope with those variants (and
>>>>>>>>>> reuse the scrapers/parsers once written) -- hence the
>>>>>>>>>> proof-of-concept nature.
>>>>>>>>>>
>>>>>>>>>> However, now I've got the basics worked out (though there's still
>>>>>>>>>> tweaking and issues to be done there), delving deeper's the next
>>>>>>>>>> step. In particular, working out the best way of
>>>>>>>>>> finding/storing/parsing council docs (which are often unstructured
>>>>>>>>>> PDFs, sometimes even just PDFs which are just scans), and also
>>>>>>>>>> working out an elegant way of linking with other data sources.
>>>>>>>>>>
>>>>>>>>>> Thanks for the kind words, I'll keep the list updated with major
>>>>>>>>>> developments, or you can always watch the github repository.
>>>>>>>>>>
>>>>>>>>>> Cheers
>>>>>>>>>> C
>>>>>>>>>>
>>>>>>>>>> Tom Steinberg wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi there,
>>>>>>>>>>>
>>>>>>>>>>> Cool - great to see people hacking on councils, it's been something
>>>>>>>>>>> I've wanted to see for ages.
>>>>>>>>>>>
>>>>>>>>>>> I see you've gone straight for getting the councillors of several
>>>>>>>>>>> different councils, but I'd actually suggest going deeper rather
>>>>>>>>>>> than wider. Why not just dive deep into one council and see if you
>>>>>>>>>>> can get transcripts or other documents nicely scraped and parsed?
>>>>>>>>>>> I'd love to see at least a handful of councils in TheyWorkForYou
>>>>>>>>>>> proper by the end of the year.
>>>>>>>>>>>
>>>>>>>>>>> Well done anyway!
>>>>>>>>>>>
>>>>>>>>>>> Tom
>>>>>>>>>>>
>>>>>>>>>>> 2009/6/16 CountCulture <[email protected]>:
>>>>>>>>>>>
>>>>>>>>>>>> Quick note about something I've been working on in my spare time:
>>>>>>>>>>>>
>>>>>>>>>>>> http://theyworkforyoulocal.com -- a small app to scrape and parse
>>>>>>>>>>>> local authority info.
>>>>>>>>>>>>
>>>>>>>>>>>> At the moment, it's barely more than a proof of concept, with only
>>>>>>>>>>>> about 20 or so councils parsed, and even then only current
>>>>>>>>>>>> councillors, committees, committee membership and forthcoming
>>>>>>>>>>>> meetings are parsed.
>>>>>>>>>>>>
>>>>>>>>>>>> On the upside, it's fairly quick for me to add new parsers for
>>>>>>>>>>>> councils (and reuse ones already written if they use same CMS),
>>>>>>>>>>>> there's an API built in (basically just add .json or .xml to get
>>>>>>>>>>>> the info as json or XML), and there's lots of potential.
>>>>>>>>>>>>
>>>>>>>>>>>> Getting this far has also been an education in understanding what a
>>>>>>>>>>>> full-blown twfy_local might look like (in general there seems no
>>>>>>>>>>>> way to see how councillors voted, for example), the need for such a
>>>>>>>>>>>> resource (there's no publicly available central repository for
>>>>>>>>>>>> council election results, for example), and the sorry state of
>>>>>>>>>>>> local authority websites (just finding a list of councillors is a
>>>>>>>>>>>> challenge on some, and don't get me started on the HTML markup).
>>>>>>>>>>>>
>>>>>>>>>>>> Comments welcome. Code is at
>>>>>>>>>>>> http://github.com/CountCulture/twfy_local_parser/ (I'll probably
>>>>>>>>>>>> GPL it soon). Bug reports at
>>>>>>>>>>>> http://github.com/CountCulture/twfy_local_parser/issues and offers
>>>>>>>>>>>> of help to countculture at googlemail dot com.
>>>>>>>>>>>>
>>>>>>>>>>>> I'd especially be interested in hearing from anyone who's got any
>>>>>>>>>>>> knowledge about local authority CMSs (e.g. there seem to be several
>>>>>>>>>>>> different versions of Modern.Gov producing different URLs), or
>>>>>>>>>>>> sources for more data other than the local authority websites
>>>>>>>>>>>> (e.g. eGR, info4local).
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers
>>>>>>>>>>>>
>>>>>>>>>>>> C
>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> Mailing list [email protected]
>>>>>>>>>>>> Archive, settings, or unsubscribe:
>>>>>>>>>>>> https://secure.mysociety.org/admin/lists/mailman/listinfo/developers-public
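
A minimal sketch of the lookup fallback discussed in this thread, in plain Ruby rather than Rails so it runs on its own. Everything here is an assumption for illustration: the Struct only stands in for the WDTK PublicBody model, :snac_code and the la_-prefixed common uid are the proposed fields (not ones known to exist in WDTK), find_public_body is a hypothetical helper, and "AB23" is the placeholder code from the example URL, not a real SNAC code.

```ruby
# Hypothetical stand-in for the WDTK PublicBody model. The real model is an
# ActiveRecord class; these fields are the ones proposed in the thread.
PublicBody = Struct.new(:url_name, :historic_url_names, :snac_code) do
  # Proposed common id: a type prefix plus the SNAC code, e.g. "la_AB23".
  # Bodies without a SNAC code have no common uid in this sketch.
  def common_uid
    snac_code && "la_#{snac_code}"
  end
end

# Made-up sample records, for illustration only.
BODIES = [
  PublicBody.new("cheshire_west_and_chester", ["west_cheshire_and_chester"], "AB23"),
  PublicBody.new("cabinet_office", [], nil)
].freeze

# Try the URL name (current or historic) first, then the SNAC code, then the
# prefixed common uid. Returns nil when nothing matches.
def find_public_body(params, bodies = BODIES)
  url_name   = params[:url_name]
  snac_code  = params[:snac_code]
  common_uid = params[:common_uid]

  bodies.find { |b| url_name && (b.url_name == url_name || b.historic_url_names.include?(url_name)) } ||
    bodies.find { |b| snac_code && b.snac_code == snac_code } ||
    bodies.find { |b| common_uid && b.common_uid == common_uid }
end

body = find_public_body({ snac_code: "AB23" })
puts body.url_name if body
```

Trying the URL name first keeps existing links working; the SNAC code and the la_-prefixed uid only come into play when no name is supplied or matched, which is the order suggested in the quoted snippet.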
