For councils there are other ids we could use, so I agree. But in general for WDTK, there is no website other than Wikipedia with anything approaching identifiers for the 3000-odd authorities we have in there.
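
For what it's worth, here is a rough sketch (Python, purely illustrative) of the kind of lookup being argued about in this thread: several name variants resolving to one id that does not change. The id "auth-0001", the alias table and the function names are all made up for the example; they are not real eGR, SNAC or WDTK identifiers or code.

```python
from typing import Optional

# Hypothetical data: one invented stable id with the three name variants
# mentioned in the thread. A real table would come from an agreed source.
CANONICAL = {
    "auth-0001": [
        "Cheshire_West_and_Chester",
        "West_Cheshire_and_Chester",
        "City_of_Chester_and_West_Cheshire",
    ],
}


def normalise(name: str) -> str:
    """Fold case and treat underscores and runs of whitespace as one space."""
    return " ".join(name.replace("_", " ").lower().split())


# Reverse index: normalised alias -> stable id.
ALIASES = {
    normalise(alias): authority_id
    for authority_id, aliases in CANONICAL.items()
    for alias in aliases
}


def resolve(name: str) -> Optional[str]:
    """Return the stable id for any known name variant, or None if unknown."""
    return ALIASES.get(normalise(name))
```

Any site that kept such a table could answer a request for any of the three names with the same id, which is roughly what redirects give you for free, minus the stability of the key itself.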

Francis

On Fri, Jun 19, 2009 at 01:20:48PM +0100, CountCulture wrote:
> Redirects are fine for viewing a webpage, but somewhat problematic as a
> canonical, immutable id that can be used to get data from a number of
> sources (which is what we're after, I reckon) -- if I wanted the WDTK
> data on Cheshire West and Chester, for example, would I be able to get
> it via:
>
> * http://www.whatdotheyknow.com/body/Cheshire_West_and_Chester
> * http://www.whatdotheyknow.com/body/West_Cheshire_and_Chester
> * http://www.whatdotheyknow.com/body/City_of_Chester_and_West_Cheshire
>
> Seems a lot of work from the developer point of view (ignoring probs
> caused by rogue edits).
>
> Also while the redirects do provide a partial history, as I understand
> it they are only a one-step history, i.e. the Wikipedia article on
> http://en.wikipedia.org/wiki/City_of_Chester_and_West_Cheshire redirects
> to Cheshire_West_and_Chester and not via West_Cheshire_and_Chester (I'm
> not saying that's a big prob, just that it's not a full history; it's
> also not a history of the official name changes, just of the wiki
> editing process).
>
> All we're after here is a common code that doesn't change (while the
> local authority or other public body doesn't change) that various
> websites can support (with minimum coding) to provide the data without
> ambiguity. Wikipedia article URLs, much as we love them, don't really
> work in that respect IMHO.
> Cheers
> C
>
> Francis Irving wrote:
> > Yes, I mean the article name, probably in the form it appears in the
> > URI.
> >
> > Although Wikipedia titles do change, they always provide a redirect.
> >
> > The nice thing about it, is that the redirects become part of the
> > structured information.
> >
> > Francis
> >
> > On Fri, Jun 19, 2009 at 12:02:49PM +0100, CountCulture wrote:
> >
> >> Francis
> >> Think we should investigate Alex's suggestions of SNAC codes.
> >> Not sure about Wikipedia ids -- do you mean uris, or do they have
> >> numerical ids too; prefer numerical/poss alphanumerical unique ids
> >> rather than strings, and Wikipedia page titles change too often to be
> >> canonical IMHO.
> >> Cheers
> >> C
> >>
> >> Francis Irving wrote:
> >>
> >>> (copied to WhatDoTheyKnow team)
> >>>
> >>> Anyone here know about identifiers for local authorities?
> >>>
> >>> I'm inclined to use Wikipedia article ids, as that will extend to
> >>> other authorities as well.
> >>>
> >>> Francis
> >>>
> >>> On Thu, Jun 18, 2009 at 11:44:12AM +0100, CountCulture wrote:
> >>>
> >>>> Francis
> >>>> Thought it might be useful if twfylocal could show status of WDTK
> >>>> requests (total, recent, no answered, outstanding late etc), with
> >>>> basic details of requests (though prob makes sense to go to WDTK
> >>>> site for full details of request).
> >>>>
> >>>> Re id system, it's something I've been struggling with as everywhere
> >>>> uses a different system, so at the moment each twfylocal council
> >>>> record stores the following ids/refs:
> >>>>
> >>>> :id (integer, twfy_local internal primary id. WON'T CHANGE)
> >>>> :name (string, as scraped from eGR, though with some minor edits)
> >>>> :wikipedia_url (string, as scraped from eGR, though have already
> >>>> found one mistake)
> >>>> :ons_url (string)
> >>>> :egr_id (integer, this is most useful as it gives links to loads of
> >>>> other things -- e.g. various gov pages -- doesn't change AFAIK even
> >>>> if the authority name does)
> >>>> :wdtk_name (string, from scraping WDTK and trying to match against
> >>>> shortened version of name -- successful about 80% of the time)
> >>>>
> >>>> Had a look at the WDTK code and I seem to remember the internal
> >>>> primary id is exposed in at least one place, but that it didn't help
> >>>> as you couldn't do queries by it. What we could really do with is a
> >>>> canonical id for each authority.
> >>>>
> >>>> FWIW you can use the eGR on twfylocal, though it adds an extra step
> >>>> (if you go to theyworkforyoulocal.com/councils.xml it returns all
> >>>> the councils together with their ids and the eGR ids). If you could
> >>>> match WDTK with eGR ids (for example) and make the match available
> >>>> programmatically, we would have the beginnings of a makeshift common
> >>>> id.
> >>>>
> >>>> Thoughts?
> >>>>
> >>>> Francis Irving wrote:
> >>>>
> >>>>> There are RSS feeds of latest responses, including quite fancy ones if
> >>>>> you use advanced search keywords. They only give extracts from the new
> >>>>> messages though. What exact information are you trying to get?
> >>>>>
> >>>>> There is no structured way to get status or similar out of the site.
> >>>>>
> >>>>> Finally, we could agree an id system for name matching. I'd quite like
> >>>>> in a way to mark every authority with, say, its identifier in
> >>>>> Wikipedia, to aid merging with other databases.
> >>>>>
> >>>>> What identifiers are you using in your system?
> >>>>>
> >>>>> Francis
> >>>>>
> >>>>> On Wed, Jun 17, 2009 at 03:05:26PM +0200, Tom Steinberg wrote:
> >>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> I'm afraid I don't know, but I've CCed the team who look after WDTK to
> >>>>>> ask.
> >>>>>>
> >>>>>> Tom
> >>>>>>
> >>>>>> 2009/6/17 CountCulture <[email protected]>:
> >>>>>>
> >>>>>>> Tom
> >>>>>>> Follow up question. At the moment I've got a link to the What Do They
> >>>>>>> Know page for the council. Any probs with including more info from
> >>>>>>> WDTK such as status, and latest responses, and is there a good way to
> >>>>>>> get that other than scraping the data (had a look at the code and
> >>>>>>> there didn't really seem to be)?
> >>>>>>> Cheers
> >>>>>>> C
> >>>>>>>
> >>>>>>> -------- Original Message --------
> >>>>>>>
> >>>>>>> Tom
> >>>>>>>
> >>>>>>> Digging deeper is actually where I'd intended to go first, but when I
> >>>>>>> started to explore some of the council websites I found that even
> >>>>>>> shallow data was problematic and reckoned I needed an API and
> >>>>>>> structure that at the very least could cope with those variants (and
> >>>>>>> reuse the scrapers/parsers once written) -- hence the
> >>>>>>> proof-of-concept nature.
> >>>>>>>
> >>>>>>> However, now I've got the basics worked out (though there's still
> >>>>>>> tweaking and issues to be done there), delving deeper's the next
> >>>>>>> step. In particular, working out the best way of
> >>>>>>> finding/storing/parsing council docs (which are often unstructured
> >>>>>>> PDFs, sometimes even just PDFs which are just scans), and also
> >>>>>>> working out an elegant way of linking with other data sources.
> >>>>>>>
> >>>>>>> Thanks for the kind words, I'll keep the list updated with major
> >>>>>>> developments, or you can always watch the github repository.
> >>>>>>>
> >>>>>>> Cheers
> >>>>>>> C
> >>>>>>>
> >>>>>>> Tom Steinberg wrote:
> >>>>>>>
> >>>>>>>> Hi there,
> >>>>>>>>
> >>>>>>>> Cool - great to see people hacking on councils, it's been something
> >>>>>>>> I've wanted to see for ages.
> >>>>>>>>
> >>>>>>>> I see you've gone straight for getting the councillors of several
> >>>>>>>> different councils, but I'd actually suggest going deeper rather than
> >>>>>>>> wider. Why not just dive deep into one council and see if you can get
> >>>>>>>> transcripts or other documents nicely scraped and parsed? I'd love to
> >>>>>>>> see at least a handful of councils in TheyWorkForYou proper by the
> >>>>>>>> end of the year.
> >>>>>>>>
> >>>>>>>> Well done anyway!
> >>>>>>>>
> >>>>>>>> Tom
> >>>>>>>>
> >>>>>>>> 2009/6/16 CountCulture <[email protected]>:
> >>>>>>>>
> >>>>>>>>> Quick note about something I've been working on in my spare time:
> >>>>>>>>>
> >>>>>>>>> http://theyworkforyoulocal.com -- a small app to scrape and parse
> >>>>>>>>> local authority info.
> >>>>>>>>>
> >>>>>>>>> At the moment, it's barely more than a proof of concept, with only
> >>>>>>>>> about 20 or so councils parsed, and even then only current
> >>>>>>>>> councillors, committees, committee membership and forthcoming
> >>>>>>>>> meetings are parsed.
> >>>>>>>>>
> >>>>>>>>> On the upside, it's fairly quick for me to add new parsers for
> >>>>>>>>> councils (and reuse ones already written if they use same CMS),
> >>>>>>>>> there's an API built in (basically just add .json or .xml to get the
> >>>>>>>>> info as json or XML), and there's lots of potential.
> >>>>>>>>>
> >>>>>>>>> Getting this far has also been an education in understanding what a
> >>>>>>>>> full-blown twfy_local might look like (in general there seems no way
> >>>>>>>>> to see how councillors voted, for example), the need for such a
> >>>>>>>>> resource (there's no publicly available central repository for
> >>>>>>>>> council election results, for example), and the sorry state of local
> >>>>>>>>> authority websites (just finding a list of councillors is a
> >>>>>>>>> challenge on some, and don't get me started on the HTML markup).
> >>>>>>>>>
> >>>>>>>>> Comments welcome. Code is at
> >>>>>>>>> http://github.com/CountCulture/twfy_local_parser/ (I'll probably GPL
> >>>>>>>>> it soon). Bug reports at
> >>>>>>>>> http://github.com/CountCulture/twfy_local_parser/issues and offers
> >>>>>>>>> of help to countculture at googlemail dot com.
> >>>>>>>>>
> >>>>>>>>> I'd especially be interested in hearing from anyone who's got any
> >>>>>>>>> knowledge about local authority CMSs (e.g. there seem to be several
> >>>>>>>>> different versions of Modern.Gov producing different URLs), or
> >>>>>>>>> sources for more data other than the local authority websites (e.g.
> >>>>>>>>> eGR, info4local).
> >>>>>>>>>
> >>>>>>>>> Cheers
> >>>>>>>>>
> >>>>>>>>> C
> >>>>>>>>>
> >>>>>>>>> _______________________________________________
> >>>>>>>>> Mailing list [email protected]
> >>>>>>>>> Archive, settings, or unsubscribe:
> >>>>>>>>> https://secure.mysociety.org/admin/lists/mailman/listinfo/developers-public
