Francis

Think we should investigate Alex's suggestion of SNAC codes. Not sure about Wikipedia ids -- do you mean URIs, or do they have numerical ids too? I'd prefer numerical (or possibly alphanumerical) unique ids rather than strings, and Wikipedia page titles change too often to be canonical IMHO.

Cheers
C
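
To make the id question a bit more concrete, here is a rough sketch (Python, not taken from the twfy_local or WDTK code) of matching WDTK names against shortened council names and keying the result on the eGR id. The field names follow the list in the quoted message below; the noise-word list, the sample councils and the eGR id values are all made up for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Words dropped before comparing names. This list is an assumption for the
# sketch, not the rule set twfy_local actually uses.
NOISE_WORDS = {"council", "borough", "city", "county", "district",
               "metropolitan", "london", "royal", "of", "the"}


@dataclass
class Council:
    id: int                          # twfy_local internal primary id
    name: str                        # name as scraped from eGR
    egr_id: int                      # eGR id (values below are invented)
    wdtk_name: Optional[str] = None  # filled in when a match is found


def short_name(name: str) -> str:
    """Lower-case the name, strip punctuation and drop noise words."""
    words = re.findall(r"[a-z]+", name.lower().replace("&", " and "))
    return " ".join(w for w in words if w not in NOISE_WORDS)


def match_wdtk_names(councils, wdtk_names):
    """Return ({egr_id: wdtk_name}, [WDTK names with no match])."""
    by_short = {short_name(c.name): c for c in councils}
    matched, unmatched = {}, []
    for wdtk_name in wdtk_names:
        council = by_short.get(short_name(wdtk_name))
        if council is None:
            unmatched.append(wdtk_name)
        else:
            council.wdtk_name = wdtk_name
            matched[council.egr_id] = wdtk_name
    return matched, unmatched


# Hypothetical sample data, for illustration only.
councils = [
    Council(id=1, name="Brighton & Hove City Council", egr_id=101),
    Council(id=2, name="London Borough of Camden", egr_id=102),
    Council(id=3, name="Kingston upon Hull City Council", egr_id=103),
]
wdtk_names = [
    "Brighton and Hove City Council",
    "Camden Borough Council",
    "Hull City Council",
]

matched, unmatched = match_wdtk_names(councils, wdtk_names)
print(matched)
print(unmatched)
```

With this sample data the first two names match and "Hull City Council" is left over, which is the sort of case that would need a manual mapping. Keying the result on a numeric id rather than on the name means the mapping should survive a council being renamed.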
Francis Irving wrote:
> (copied to WhatDoTheyKnow team)
>
> Anyone here know about identifiers for local authorities?
>
> I'm inclined to use Wikipedia article ids, as that will extend to
> other authorities as well.
>
> Francis
>
> On Thu, Jun 18, 2009 at 11:44:12AM +0100, CountCulture wrote:
>
>> Francis
>> Thought it might be useful if twfylocal could show status of WDTK
>> requests (total, recent, not answered, outstanding late etc), with basic
>> details of requests (though prob makes sense to go to WDTK site for full
>> details of request).
>>
>> Re id system, it's something I've been struggling with as everywhere
>> uses a different system, so at the moment each twfylocal council record
>> stores the following ids/refs:
>>
>> :id (integer, twfy_local internal primary id. WON'T CHANGE)
>> :name (string, as scraped from eGR, though with some minor edits)
>> :wikipedia_url (string, as scraped from eGR, though have already found
>> one mistake)
>> :ons_url (string)
>> :egr_id (integer, this is most useful as it gives links to loads of
>> other things -- e.g. various gov pages -- doesn't change AFAIK even if
>> the authority name does)
>> :wdtk_name (string, from scraping WDTK and trying to match against
>> shortened version of name -- successful about 80% of the time)
>>
>> Had a look at the WDTK code and I seem to remember the internal primary
>> id is exposed in at least one place, but that it didn't help as you
>> couldn't do queries by it. What we could really do with is a canonical
>> id for each authority.
>>
>> FWIW you can use the eGR on twfylocal, though it adds an extra step (if
>> you go to theyworkforyoulocal.com/councils.xml it returns all the
>> councils together with their ids and the eGR ids). If you could match
>> WDTK with eGR ids (for example) and make the match available
>> programmatically we would have the beginnings of a makeshift common id.
>>
>> Thoughts?
>>
>> Francis Irving wrote:
>>
>>> There are RSS feeds of latest responses, including quite fancy ones if
>>> you use advanced search keywords. They only give extracts from the new
>>> messages though. What exact information are you trying to get?
>>>
>>> There is no structured way to get status or similar out of the site.
>>>
>>> Finally, we could agree an id system for name matching. I'd quite like
>>> in a way to mark every authority with, say, its identifier in
>>> Wikipedia, to aid merging with other databases.
>>>
>>> What identifiers are you using in your system?
>>>
>>> Francis
>>>
>>> On Wed, Jun 17, 2009 at 03:05:26PM +0200, Tom Steinberg wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm afraid I don't know, but I've CCed the team who look after WDTK to ask.
>>>>
>>>> Tom
>>>>
>>>> 2009/6/17 CountCulture <[email protected]>:
>>>>
>>>>> Tom
>>>>> Follow up question. At the moment I've got a link to the What Do They Know
>>>>> page for the council. Any probs with including more info from WDTK such as
>>>>> status, and latest responses, and is there a good way to get that other
>>>>> than scraping the data (had a look at the code and there didn't really
>>>>> seem to be)?
>>>>> Cheers
>>>>> C
>>>>>
>>>>> -------- Original Message --------
>>>>>
>>>>> Tom
>>>>>
>>>>> Digging deeper is actually where I'd intended to go first, but when I
>>>>> started to explore some of the council websites I found that even shallow
>>>>> data was problematic and reckoned I needed an API and structure that at the
>>>>> very least could cope with those variants (and reuse the scrapers/parsers
>>>>> once written) -- hence the proof-of-concept nature.
>>>>>
>>>>> However, now I've got the basics worked out (though there's still tweaking
>>>>> and issues to be done there), delving deeper's the next step.
>>>>> In particular, working out the best way of finding/storing/parsing
>>>>> council docs (which are often unstructured PDFs, sometimes even just
>>>>> PDFs which are just scans), and also working out an elegant way of
>>>>> linking with other data sources.
>>>>>
>>>>> Thanks for the kind words, I'll keep the list updated with major
>>>>> developments, or you can always watch the github repository.
>>>>>
>>>>> Cheers
>>>>> C
>>>>>
>>>>> Tom Steinberg wrote:
>>>>>
>>>>>> Hi there,
>>>>>>
>>>>>> Cool - great to see people hacking on councils, it's been something
>>>>>> I've wanted to see for ages.
>>>>>>
>>>>>> I see you've gone straight for getting the councillors of several
>>>>>> different councils, but I'd actually suggest going deeper rather than
>>>>>> wider. Why not just dive deep into one council and see if you can get
>>>>>> transcripts or other documents nicely scraped and parsed? I'd love to
>>>>>> see at least a handful of councils in TheyWorkForYou proper by the end
>>>>>> of the year.
>>>>>>
>>>>>> Well done anyway!
>>>>>>
>>>>>> Tom
>>>>>>
>>>>>> 2009/6/16 CountCulture <[email protected]>:
>>>>>>
>>>>>>> Quick note about something I've been working on in my spare time:
>>>>>>>
>>>>>>> http://theyworkforyoulocal.com -- a small app to scrape and parse local
>>>>>>> authority info.
>>>>>>>
>>>>>>> At the moment, it's barely more than a proof of concept, with only about
>>>>>>> 20 or so councils parsed, and even then only current councillors,
>>>>>>> committees, committee membership and forthcoming meetings are parsed.
>>>>>>>
>>>>>>> On the upside, it's fairly quick for me to add new parsers for councils
>>>>>>> (and reuse ones already written if they use same CMS), there's an API
>>>>>>> built in (basically just add .json or .xml to get the info as json or
>>>>>>> XML), and there's lots of potential.
>>>>>>>
>>>>>>> Getting this far has also been an education in understanding what a
>>>>>>> full-blown twfy_local might look like (in general there seems no way to
>>>>>>> see how councillors voted, for example), the need for such a resource
>>>>>>> (there's no publicly available central repository for council election
>>>>>>> results, for example), and the sorry state of local authority websites
>>>>>>> (just finding a list of councillors is a challenge on some, and don't
>>>>>>> get me started on the HTML markup).
>>>>>>>
>>>>>>> Comments welcome. Code is at
>>>>>>> http://github.com/CountCulture/twfy_local_parser/ (I'll probably GPL it
>>>>>>> soon). Bug reports at
>>>>>>> http://github.com/CountCulture/twfy_local_parser/issues and offers of
>>>>>>> help to countculture at googlemail dot com.
>>>>>>>
>>>>>>> I'd especially be interested in hearing from anyone who's got any
>>>>>>> knowledge about local authority CMSs (e.g. there seem to be several
>>>>>>> different versions of Modern.Gov producing different URLs), or sources
>>>>>>> for more data other than the local authority websites (e.g. eGR,
>>>>>>> info4local).
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>> C
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Mailing list [email protected]
>>>>>>> Archive, settings, or unsubscribe:
>>>>>>>
>>>>>>> https://secure.mysociety.org/admin/lists/mailman/listinfo/developers-public
