Re: [mySociety:public] [Fwd: Re: TheyWorkForYou Local]

Francis Irving Fri, 19 Jun 2009 07:44:50 -0700

On Fri, Jun 19, 2009 at 02:01:07PM +0100, CountCulture wrote:
> Do you have all local authorities in there, or do you just set up a  
> record when there's a request for a new body that you've not come across  
> before?


We have them all.

> Also, what do you do when there's a change e.g. when an authority is
> split or merged, or when a govt department is partly renamed and partly
> subsumed -- create a new record (with related-to links), or rename the
> old one? If it's the former I don't see a reason why we couldn't use
> your primary IDs (though it is problematic if it's the latter).

I think we tend to create a new record - Alex or John can confirm.

> However, a simpler solution could be for public bodies in the
> WDTK DB to have a :snac_code field (or similar) and then you could call
> them with a url of (something like):
>
> http://www.whatdotheyknow.com/body?snac_code=AB23
>
>
> and do something like:
>
> @public_body =
>   PublicBody.find_by_url_name_with_historic(params[:url_name]) ||
>   PublicBody.find_by_snac_code(params[:snac_code])
>
>
> or alternatively have a :common_uid field with a format of  
> la_[snac_code] which would allow you to use other common uids for other  
> public bodies as and when you see fit. I'd have no problem prepending  
> 'la_' to wdtk requests and wdtk urls
>
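A minimal sketch of the fallback lookup quoted above, in plain Ruby with no Rails dependency. The PublicBody struct, the BodyDirectory class, the historic_url_names field and the example record are all hypothetical stand-ins rather than WDTK's actual schema; only the method names find_by_url_name_with_historic and find_by_snac_code, the la_[snac_code] format and the AB23 code come from the message.

```ruby
# Hypothetical stand-in for the WDTK PublicBody model (not the real schema).
PublicBody = Struct.new(:id, :url_name, :historic_url_names, :common_uid) do
  # True if the body uses, or has ever used, the given url_name.
  def known_as?(name)
    url_name == name || historic_url_names.include?(name)
  end
end

# Hypothetical in-memory replacement for the ActiveRecord finders.
class BodyDirectory
  def initialize(bodies)
    @bodies = bodies
  end

  def find_by_url_name_with_historic(name)
    return nil if name.nil?
    @bodies.find { |body| body.known_as?(name) }
  end

  # SNAC codes are stored as a common uid of the form la_[snac_code].
  def find_by_snac_code(code)
    return nil if code.nil? || code.empty?
    find_by_common_uid("la_#{code}")
  end

  def find_by_common_uid(uid)
    @bodies.find { |body| body.common_uid == uid }
  end

  # Same order as the quoted snippet: url_name first, then snac_code.
  def lookup(params)
    find_by_url_name_with_historic(params[:url_name]) ||
      find_by_snac_code(params[:snac_code])
  end
end

# Made-up record for illustration only.
DIRECTORY = BodyDirectory.new([
  PublicBody.new(1, "example_council", ["old_example_council"], "la_AB23")
])
```

With this in place, DIRECTORY.lookup(snac_code: "AB23") and DIRECTORY.lookup(url_name: "old_example_council") would both resolve to the same record, which is the property a common id is meant to give.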
> Cheers
> C
> p.s. By the way, I'm guessing the govt asset register (can't remember  
> what it's called off the top of my head) doesn't have a central record of
> current and past public bodies
>
> Francis Irving wrote:
>> For councils there are other ids we could use, so I agree.
>>
>> But in general for WDTK, there is no website other than Wikipedia with
>> anything approaching identifiers for the 3000-odd authorities we have
>> in there.
>>
>> Francis
>>
>> On Fri, Jun 19, 2009 at 01:20:48PM +0100, CountCulture wrote:
>>   
>>> Redirects are fine for viewing a webpage, but somewhat problematic as a
>>> canonical, immutable id that can be used to get data from a number of
>>> sources (which is what we're after, I reckon). If I wanted the WDTK
>>> data on Cheshire West and Chester, for example, would I be able to get
>>> it via:
>>>
>>>     * http://www.whatdotheyknow.com/body/Cheshire_West_and_Chester
>>>     * http://www.whatdotheyknow.com/body/West_Cheshire_and_Chester
>>>     * http://www.whatdotheyknow.com/body/City_of_Chester_and_West_Cheshire
>>>
>>> Seems a lot of work from the developer point of view (ignoring probs
>>> caused by rogue edits).
>>>
>>> Also, while the redirects do provide a partial history, as I understand
>>> it they are only a one-step history, i.e. the wikipedia article at
>>> http://en.wikipedia.org/wiki/City_of_Chester_and_West_Cheshire redirects
>>> straight to Cheshire_West_and_Chester and not via West_Cheshire_and_Chester
>>> (I'm not saying that's a big prob, just that it's not a full history; it's
>>> also not a history of the official name changes, just of the wiki
>>> editing process).
>>>
>>> All we're after here is a common code that doesn't change (while the
>>> local authority or other public body doesn't change) that various
>>> websites can support (with minimum coding) to provide the data without
>>> ambiguity. Wikipedia article URLs, much as we love them, don't really
>>> work in that respect IMHO.
>>> Cheers
>>> C
>>>
>>>
>>>
>>> Francis Irving wrote:
>>>     
>>>> Yes, I mean the article name, probably in the form it appears in the
>>>> URI.
>>>>
>>>> Although Wikipedia titles do change, they always provide a redirect.
>>>>
>>>> The nice thing about it is that the redirects become part of the
>>>> structured information.
>>>>
>>>> Francis
>>>>
>>>> On Fri, Jun 19, 2009 at 12:02:49PM +0100, CountCulture wrote:
>>>>         
>>>>> Francis
>>>>> Think we should investigate Alex's suggestion of SNAC codes. Not sure
>>>>> about Wikipedia ids -- do you mean uris, or do they have numerical ids
>>>>> too? I'd prefer numerical/poss alphanumerical unique ids rather than
>>>>> strings, and Wikipedia page titles change too often to be canonical IMHO.
>>>>> Cheers
>>>>> C
>>>>>
>>>>>
>>>>> Francis Irving wrote:
>>>>>             
>>>>>> (copied to WhatDoTheyKnow team)
>>>>>>
>>>>>> Anyone here know about identifiers for local authorities?
>>>>>>
>>>>>> I'm inclined to use Wikipedia article ids, as that will extend to
>>>>>> other authorities as well.
>>>>>>
>>>>>> Francis
>>>>>>
>>>>>> On Thu, Jun 18, 2009 at 11:44:12AM +0100, CountCulture wrote:
>>>>>>                   
>>>>>>> Francis
>>>>>>> Thought it might be useful if twfylocal could show status of WDTK
>>>>>>> requests (total, recent, no answered, outstanding late etc), with
>>>>>>> basic details of requests (though prob makes sense to go to WDTK
>>>>>>> site for full details of request).
>>>>>>>
>>>>>>> Re id system, it's something I've been struggling with as everywhere
>>>>>>> uses a different system, so at the moment each twfylocal council
>>>>>>> record stores the following ids/refs:
>>>>>>>
>>>>>>> :id (integer, twfy_local internal primary id. WON'T CHANGE)
>>>>>>> :name (string, as scraped from eGR, though with some minor edits)
>>>>>>> :wikipedia_url (string, as scraped from eGR, though have already
>>>>>>>   found one mistake)
>>>>>>> :ons_url (string)
>>>>>>> :egr_id (integer, this is most useful as it gives links to loads of
>>>>>>>   other things -- e.g. various gov pages -- doesn't change AFAIK even
>>>>>>>   if the authority name does)
>>>>>>> :wdtk_name (string, from scraping WDTK and trying to match against
>>>>>>>   shortened version of name -- successful about 80% of the time)
>>>>>>>
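The record described above could be sketched roughly as follows, together with one possible way of matching a council against WDTK names by shortened name. This is an assumption-laden illustration, not the twfy_local_parser code: the Council struct, the STOP_WORDS list, the short_name and match_wdtk_name helpers and all example values are hypothetical, and the actual shortening rule behind the roughly 80% match rate is not given in the message.

```ruby
# Fields follow the list in the message; the struct itself is hypothetical.
Council = Struct.new(:id, :name, :wikipedia_url, :ons_url, :egr_id, :wdtk_name,
                     keyword_init: true)

# Assumed set of generic words to drop when shortening a name.
STOP_WORDS = %w[council borough city county district metropolitan of the].freeze

# Lower-case the name, strip punctuation, drop generic words and join the
# remaining words with underscores.
def short_name(name)
  words = name.downcase.gsub(/[^a-z0-9\s]/, " ").split
  (words - STOP_WORDS).join("_")
end

# Return the first WDTK url-style name whose shortened form equals the
# council's shortened name, or nil when nothing matches.
def match_wdtk_name(council, wdtk_names)
  target = short_name(council.name)
  wdtk_names.find { |candidate| short_name(candidate.tr("_", " ")) == target }
end

# Made-up record for illustration only.
EXAMPLE_COUNCIL = Council.new(id: 1, name: "Example Borough Council", egr_id: 1001)
```

A matcher like this fails whenever two sites abbreviate a name differently, which is the kind of miss a shared canonical id would avoid.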
>>>>>>> Had a look at the WDTK code and I seem to remember the internal
>>>>>>> primary id is exposed in at least one place, but that it didn't help
>>>>>>> as you couldn't do queries by it. What we could really do with is a
>>>>>>> canonical id for each authority.
>>>>>>>
>>>>>>> FWIW you can use the eGR ids on twfylocal, though it adds an extra
>>>>>>> step (if you go to theyworkforyoulocal.com/councils.xml it returns
>>>>>>> all the councils together with their ids and the eGR ids). If you
>>>>>>> could match WDTK with eGR ids (for example) and make the match
>>>>>>> available programmatically, we would have the beginnings of a
>>>>>>> makeshift common id.
>>>>>>>
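As a rough sketch of the makeshift common id suggested above, the two sides could be joined on the eGR id once each has it. The row layouts, the build_crosswalk helper and every value below are hypothetical; the real councils.xml structure is not shown in the message, and WDTK did not expose eGR ids at the time of writing.

```ruby
# Hypothetical rows, as might be extracted from councils.xml
# (twfylocal id plus eGR id).
TWFY_COUNCILS = [
  { id: 1, egr_id: 1001 },
  { id: 2, egr_id: 1002 }
].freeze

# Hypothetical WDTK rows, assuming each body had been tagged with an eGR id.
WDTK_BODIES = [
  { url_name: "example_council", egr_id: 1001 },
  { url_name: "unmatched_body",  egr_id: nil }
].freeze

# Join the two lists on egr_id.
# Returns a hash of egr_id => { twfy_id:, wdtk_url_name: } for matched pairs.
def build_crosswalk(twfy_councils, wdtk_bodies)
  wdtk_by_egr = wdtk_bodies.reject { |body| body[:egr_id].nil? }
                           .to_h { |body| [body[:egr_id], body[:url_name]] }

  twfy_councils.each_with_object({}) do |council, map|
    url_name = wdtk_by_egr[council[:egr_id]]
    next if url_name.nil?
    map[council[:egr_id]] = { twfy_id: council[:id], wdtk_url_name: url_name }
  end
end

CROSSWALK = build_crosswalk(TWFY_COUNCILS, WDTK_BODIES)
```

Councils with no counterpart on the other side simply drop out of the crosswalk, so the unmatched remainder is easy to list and fix by hand.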
>>>>>>> Thoughts?
>>>>>>>
>>>>>>>
>>>>>>> Francis Irving wrote:
>>>>>>>                         
>>>>>>>> There are RSS feeds of latest responses, including quite fancy ones if
>>>>>>>> you use advanced search keywords. They only give extracts from the new
>>>>>>>> messages though. What exact information are you trying to get?
>>>>>>>>
>>>>>>>> There is no structured way to get status or similar out of the site.
>>>>>>>>
>>>>>>>> Finally, we could agree an id system for name matching. I'd quite
>>>>>>>> like, in a way, to mark every authority with, say, its identifier in
>>>>>>>> Wikipedia, to aid merging with other databases.
>>>>>>>>
>>>>>>>> What identifiers are you using in your system?
>>>>>>>>
>>>>>>>> Francis
>>>>>>>>
>>>>>>>> On Wed, Jun 17, 2009 at 03:05:26PM +0200, Tom Steinberg wrote:
>>>>>>>>                                 
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I'm afraid I don't know, but I've CCed the team who look after WDTK 
>>>>>>>>> to ask.
>>>>>>>>>
>>>>>>>>> Tom
>>>>>>>>>
>>>>>>>>> 2009/6/17 CountCulture <[email protected]>:
>>>>>>>>>                                         
>>>>>>>>>> Tom
>>>>>>>>>> Follow up question. At the moment I've got a link to the What Do They
>>>>>>>>>> Know page for the council. Any probs with including more info from
>>>>>>>>>> WDTK such as status, and latest responses, and is there a good way to
>>>>>>>>>> get that other than scraping the data (had a look at the code and
>>>>>>>>>> there didn't really seem to be)?
>>>>>>>>>> Cheers
>>>>>>>>>> C
>>>>>>>>>>
>>>>>>>>>> -------- Original Message --------
>>>>>>>>>>
>>>>>>>>>> Tom
>>>>>>>>>>
>>>>>>>>>> Digging deeper is actually where I'd intended to go first, but when I
>>>>>>>>>> started to explore some of the council websites I found that even
>>>>>>>>>> shallow data was problematic and reckoned I needed an API and
>>>>>>>>>> structure that at the very least could cope with those variants (and
>>>>>>>>>> reuse the scrapers/parsers once written) -- hence the
>>>>>>>>>> proof-of-concept nature.
>>>>>>>>>>
>>>>>>>>>> However, now I've got the basics worked out (though there's still
>>>>>>>>>> tweaking and issues to be done there), delving deeper's the next
>>>>>>>>>> step. In particular, working out the best way of
>>>>>>>>>> finding/storing/parsing council docs (which are often unstructured
>>>>>>>>>> PDFs, sometimes even PDFs which are just scans), and also working
>>>>>>>>>> out an elegant way of linking with other data sources.
>>>>>>>>>>
>>>>>>>>>> Thanks for the kind words, I'll keep the list updated with major
>>>>>>>>>> developments, or you can always watch the github repository.
>>>>>>>>>>
>>>>>>>>>> Cheers
>>>>>>>>>> C
>>>>>>>>>>
>>>>>>>>>> Tom Steinberg wrote:
>>>>>>>>>>                                                 
>>>>>>>>>>> Hi there,
>>>>>>>>>>>
>>>>>>>>>>> Cool - great to see people hacking on councils, it's been something
>>>>>>>>>>> I've wanted to see for ages.
>>>>>>>>>>>
>>>>>>>>>>> I see you've gone straight for getting the councillors of several
>>>>>>>>>>> different councils, but I'd actually suggest going deeper rather
>>>>>>>>>>> than wider. Why not just dive deep into one council and see if you
>>>>>>>>>>> can get transcripts or other documents nicely scraped and parsed?
>>>>>>>>>>> I'd love to see at least a handful of councils in TheyWorkForYou
>>>>>>>>>>> proper by the end of the year.
>>>>>>>>>>>
>>>>>>>>>>> Well done anyway!
>>>>>>>>>>>
>>>>>>>>>>> Tom
>>>>>>>>>>>
>>>>>>>>>>> 2009/6/16 CountCulture <[email protected]>:
>>>>>>>>>>>
>>>>>>>>>>>                                                       
>>>>>>>>>>>   
>>>>>>>>>>>> Quick note about something I've been working on in my spare time:
>>>>>>>>>>>>
>>>>>>>>>>>> http://theyworkforyoulocal.com -- a small app to scrape and parse
>>>>>>>>>>>> local authority info.
>>>>>>>>>>>>
>>>>>>>>>>>> At the moment, it's barely more than a proof of concept, with only
>>>>>>>>>>>> about 20 or so councils parsed, and even then only current
>>>>>>>>>>>> councillors, committees, committee membership and forthcoming
>>>>>>>>>>>> meetings are parsed.
>>>>>>>>>>>>
>>>>>>>>>>>> On the upside, it's fairly quick for me to add new parsers for
>>>>>>>>>>>> councils (and reuse ones already written if they use the same CMS),
>>>>>>>>>>>> there's an API built in (basically just add .json or .xml to get the
>>>>>>>>>>>> info as JSON or XML), and there's lots of potential.
>>>>>>>>>>>>
>>>>>>>>>>>> Getting this far has also been an education in understanding what a
>>>>>>>>>>>> full-blown twfy_local might look like (in general there seems no way
>>>>>>>>>>>> to see how councillors voted, for example), the need for such a
>>>>>>>>>>>> resource (there's no publicly available central repository for
>>>>>>>>>>>> council election results, for example), and the sorry state of local
>>>>>>>>>>>> authority websites (just finding a list of councillors is a challenge
>>>>>>>>>>>> on some, and don't get me started on the HTML markup).
>>>>>>>>>>>>
>>>>>>>>>>>> Comments welcome. Code is at
>>>>>>>>>>>> http://github.com/CountCulture/twfy_local_parser/ (I'll probably GPL
>>>>>>>>>>>> it soon). Bug reports at
>>>>>>>>>>>> http://github.com/CountCulture/twfy_local_parser/issues and offers of
>>>>>>>>>>>> help to countculture at googlemail dot com.
>>>>>>>>>>>>
>>>>>>>>>>>> I'd especially be interested in hearing from anyone who's got any
>>>>>>>>>>>> knowledge about local authority CMSs (e.g. there seem to be several
>>>>>>>>>>>> different versions of Modern.Gov producing different URLs), or
>>>>>>>>>>>> sources for more data other than the local authority websites (e.g.
>>>>>>>>>>>> eGR, info4local).
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers
>>>>>>>>>>>>
>>>>>>>>>>>> C
>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> Mailing list [email protected]
>>>>>>>>>>>> Archive, settings, or unsubscribe:
>>>>>>>>>>>>
>>>>>>>>>>>> https://secure.mysociety.org/admin/lists/mailman/listinfo/developers-public
>>>>>>>>>>>>
>>>>>>>>>>>>
