Redirects are fine for viewing a webpage, but somewhat problematic as a
canonical, immutable id that can be used to get data from a number of
sources (which is what we're after, I reckon). If I wanted the WDTK
data on Cheshire West and Chester, for example, would I be able to get
it via:
* http://www.whatdotheyknow.com/body/Cheshire_West_and_Chester
* http://www.whatdotheyknow.com/body/West_Cheshire_and_Chester
* http://www.whatdotheyknow.com/body/City_of_Chester_and_West_Cheshire
That seems a lot of work from the developer's point of view (ignoring
problems caused by rogue edits).
Also, while the redirects do provide a partial history, as I understand
it they are only a one-step history: the Wikipedia article at
http://en.wikipedia.org/wiki/City_of_Chester_and_West_Cheshire redirects
straight to Cheshire_West_and_Chester, not via West_Cheshire_and_Chester.
(I'm not saying that's a big problem, just that it's not a full history;
it's also not a history of the official name changes, just of the wiki
editing process.)
All we're after here is a common code that doesn't change (as long as
the local authority or other public body doesn't change) and that
various websites can support (with minimum coding) to provide the data
without ambiguity. Wikipedia article URLs, much as we love them, don't
really work in that respect IMHO.
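
To make the problem concrete, here's a minimal Ruby sketch of what every consuming site would have to maintain just to resolve aliased slugs to one body. The redirect table and slugs below are invented for illustration (WDTK doesn't publish such a table); it's a sketch of the workload, not an actual API.

```ruby
# Hypothetical table mapping old/alternative body slugs to their
# replacement. Each consumer would have to keep this up to date itself.
REDIRECTS = {
  "City_of_Chester_and_West_Cheshire" => "West_Cheshire_and_Chester",
  "West_Cheshire_and_Chester"         => "Cheshire_West_and_Chester",
}.freeze

# Follow redirects until we hit a slug that doesn't redirect anywhere,
# guarding against loops introduced by rogue edits.
def canonical_slug(slug)
  seen = []
  while REDIRECTS.key?(slug)
    raise "redirect loop at #{slug}" if seen.include?(slug)
    seen << slug
    slug = REDIRECTS[slug]
  end
  slug
end
```

A single unchanging code per body would make all of this bookkeeping unnecessary.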
Cheers
C
Francis Irving wrote:
> Yes, I mean the article name, probably in the form it appears in the
> URI.
>
> Although Wikipedia titles do change, they always provide a redirect.
>
> The nice thing about it, is that the redirects become part of the
> structured information.
>
> Francis
>
> On Fri, Jun 19, 2009 at 12:02:49PM +0100, CountCulture wrote:
>
>> Francis
>> Think we should investigate Alex's suggestion of SNAC codes. Not sure
>> about Wikipedia ids -- do you mean URIs, or do they have numerical ids
>> too? I'd prefer numerical (or possibly alphanumerical) unique ids
>> rather than strings, and Wikipedia page titles change too often to be
>> canonical IMHO.
>> Cheers
>> C
>>
>>
>> Francis Irving wrote:
>>
>>> (copied to WhatDoTheyKnow team)
>>>
>>> Anyone here know about identifiers for local authorities?
>>>
>>> I'm inclined to use Wikipedia article ids, as that will extend to
>>> other authorities as well.
>>>
>>> Francis
>>>
>>> On Thu, Jun 18, 2009 at 11:44:12AM +0100, CountCulture wrote:
>>>
>>>
>>>> Francis
>>>> Thought it might be useful if twfylocal could show the status of WDTK
>>>> requests (total, recent, number answered, outstanding/late, etc.),
>>>> with basic details of each request (though it probably makes sense to
>>>> go to the WDTK site for full details).
>>>>
>>>> Re id system, it's something I've been struggling with as everywhere
>>>> uses a different system, so at the moment each twfylocal council
>>>> record stores the following ids/refs:
>>>>
>>>> :id (integer, twfy_local internal primary id. WON'T CHANGE)
>>>> :name (string, as scraped from eGR, though with some minor edits)
>>>> :wikipedia_url (string, as scraped from eGR, though have already
>>>> found one mistake)
>>>> :ons_url (string)
>>>> :egr_id (integer, this is most useful as it gives links to loads of
>>>> other things -- e.g. various gov pages -- doesn't change AFAIK even
>>>> if the authority name does)
>>>> :wdtk_name (string, from scraping WDTK and trying to match against
>>>> shortened version of name -- successful about 80% of the time)
>>>>
>>>> Had a look at the WDTK code and I seem to remember the internal
>>>> primary id is exposed in at least one place, but that it didn't help
>>>> as you couldn't do queries by it. What we could really do with is a
>>>> canonical id for each authority.
>>>>
>>>> FWIW you can use the eGR ids on twfylocal, though it adds an extra
>>>> step: if you go to theyworkforyoulocal.com/councils.xml it returns
>>>> all the councils together with their twfylocal ids and their eGR ids.
>>>> If you could match WDTK bodies with eGR ids (for example) and make
>>>> the match available programmatically, we'd have the beginnings of a
>>>> makeshift common id.
>>>>
>>>> Thoughts?
>>>>
>>>>
>>>> Francis Irving wrote:
>>>>
>>>>
>>>>> There are RSS feeds of latest responses, including quite fancy ones if
>>>>> you use advanced search keywords. They only give extracts from the new
>>>>> messages though. What exact information are you trying to get?
>>>>>
>>>>> There is no structured way to get status or similar out of the site.
>>>>>
>>>>> Finally, we could agree an id system for name matching. I'd quite like
>>>>> in a way to mark every authority with, say, its identifier in
>>>>> Wikipedia, to aid merging with other databases.
>>>>>
>>>>> What identifiers are you using in your system?
>>>>>
>>>>> Francis
>>>>>
>>>>> On Wed, Jun 17, 2009 at 03:05:26PM +0200, Tom Steinberg wrote:
>>>>>
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm afraid I don't know, but I've CCed the team who look after WDTK to
>>>>>> ask.
>>>>>>
>>>>>> Tom
>>>>>>
>>>>>> 2009/6/17 CountCulture <[email protected]>:
>>>>>>
>>>>>>
>>>>>>> Tom
>>>>>>> Follow up question. At the moment I've got a link to the What Do They
>>>>>>> Know
>>>>>>> page for the council. Any probs with including more info from WDTK such
>>>>>>> as
>>>>>>> status, and latest responses, and is there a good way to get that other
>>>>>>> than
>>>>>>> scraping the data ( had a look at the code and there didn't really seem
>>>>>>> to
>>>>>>> be)?
>>>>>>> Cheers
>>>>>>> C
>>>>>>>
>>>>>>> -------- Original Message --------
>>>>>>>
>>>>>>> Tom
>>>>>>>
>>>>>>> Digging deeper is actually where I'd intended to go first, but
>>>>>>> when I started to explore some of the council websites I found
>>>>>>> that even shallow data was problematic, and reckoned I needed an
>>>>>>> API and structure that could at the very least cope with those
>>>>>>> variants (and reuse the scrapers/parsers once written) -- hence
>>>>>>> the proof-of-concept nature.
>>>>>>>
>>>>>>> However, now I've got the basics worked out (though there's still
>>>>>>> tweaking to do and issues to resolve), delving deeper is the next
>>>>>>> step. In particular: working out the best way of
>>>>>>> finding/storing/parsing council docs (often unstructured PDFs,
>>>>>>> sometimes just scans), and working out an elegant way of linking
>>>>>>> with other data sources.
>>>>>>>
>>>>>>> Thanks for the kind words, I'll keep the list updated with major
>>>>>>> developments, or you can always watch the github repository.
>>>>>>>
>>>>>>> Cheers
>>>>>>> C
>>>>>>>
>>>>>>> Tom Steinberg wrote:
>>>>>>>
>>>>>>>
>>>>>>>> Hi there,
>>>>>>>>
>>>>>>>> Cool - great to see people hacking on councils, it's been something
>>>>>>>> I've wanted to see for ages.
>>>>>>>>
>>>>>>>> I see you've gone straight for getting the councillors of several
>>>>>>>> different councils, but I'd actually suggest going deeper rather than
>>>>>>>> wider. Why not just dive deep into one council and see if you can get
>>>>>>>> transcripts or other documents nicely scraped and parsed? I'd love to
>>>>>>>> see at least a handful of councils in TheyWorkForYou proper by the end
>>>>>>>> of the year.
>>>>>>>>
>>>>>>>> Well done anyway!
>>>>>>>>
>>>>>>>> Tom
>>>>>>>>
>>>>>>>> 2009/6/16 CountCulture <[email protected]>:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> Quick note about something I've been working on in my spare time:
>>>>>>>>>
>>>>>>>>> http://theyworkforyoulocal.com -- a small app to scrape and
>>>>>>>>> parse local authority info.
>>>>>>>>>
>>>>>>>>> At the moment, it's barely more than a proof of concept, with
>>>>>>>>> only about 20 or so councils parsed, and even then only current
>>>>>>>>> councillors, committees, committee membership and forthcoming
>>>>>>>>> meetings are parsed.
>>>>>>>>>
>>>>>>>>> On the upside, it's fairly quick for me to add new parsers for
>>>>>>>>> councils (and reuse ones already written if they use the same
>>>>>>>>> CMS), there's an API built in (basically just add .json or .xml
>>>>>>>>> to get the info as JSON or XML), and there's lots of potential.
>>>>>>>>>
>>>>>>>>> Getting this far has also been an education in understanding what a
>>>>>>>>> full-blown twfy_local might look like (in general there seems no way
>>>>>>>>> to
>>>>>>>>> see how councillors voted, for example), the need for such a resource
>>>>>>>>> (there's no publicly available central repository for council election
>>>>>>>>> results, for example), and the sorry state of local authority websites
>>>>>>>>> (just finding a list of councillors is a challenge on some, and don't
>>>>>>>>> get me started on the HTML markup).
>>>>>>>>>
>>>>>>>>> Comments welcome. Code is at
>>>>>>>>> http://github.com/CountCulture/twfy_local_parser/ (I'll probably
>>>>>>>>> GPL it soon). Bug reports at
>>>>>>>>> http://github.com/CountCulture/twfy_local_parser/issues and
>>>>>>>>> offers of help to countculture at googlemail dot com.
>>>>>>>>>
>>>>>>>>> I'd especially be interested in hearing from anyone who's got any
>>>>>>>>> knowledge about local authority CMSs (e.g. there seem to be several
>>>>>>>>> different versions of Modern.Gov producing different URLs), or sources
>>>>>>>>> for more data other than the local authority websites (e.g. eGR,
>>>>>>>>> info4local).
>>>>>>>>>
>>>>>>>>> Cheers
>>>>>>>>>
>>>>>>>>> C
>>>>>>>>>
_______________________________________________
Mailing list [email protected]
Archive, settings, or unsubscribe:
https://secure.mysociety.org/admin/lists/mailman/listinfo/developers-public