Hi Graeme,

The robots.txt file was accidentally dropped from the new release, and we will be re-introducing it. It is needed because of the concerns and complaints raised when the service was launched about personal data appearing in external search engines.

On the subject of scraping the data, I've asked the catalogue.bbc.co.uk team to clarify the terms of use for the data, which may help answer your question. If you have a specific request, I would recommend using the Contact Us page: http://catalogue.bbc.co.uk/catalogue/infax/contact

Regards,
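For anyone writing their own crawler rather than relying on Wget's built-in handling, a robots.txt check can be sketched roughly as below. This is only an illustration using Python's standard urllib.robotparser: the rules in EXAMPLE_ROBOTS_TXT and the example.org URLs are made up, since the actual contents of the catalogue's robots.txt are not given in this thread.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration only; the real file's
# rules are not quoted anywhere in this thread.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://example.org/public/page", "http://example.org/private/page"):
        print(url, is_allowed(EXAMPLE_ROBOTS_TXT, "Wget", url))
```

Note that an empty rule set permits everything, which matches the usual reading that a site with no robots.txt has not asked crawlers to stay away. It says nothing about licensing or terms of use.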
________________________________
From: [EMAIL PROTECTED] on behalf of Graeme West
Sent: Tue 7/24/2007 20:39
To: backstage@lists.bbc.co.uk
Subject: Re: Uploading the BBC programme catalogue to freebase (was RE: [backstage] Programme Catalogue vs. Freebase (was: BBC Programme Catalogue -any APIs yet?))

Hi all,

Sorry to re-open an old thread - just wondering what the position is on scraping the catalogue.bbc.co.uk test site?

I say this because I'm trying a little experiment - ingesting the whole catalogue into our Fedora repository ( http://www.fedora.info ) to be cross-referenced with the 200+ hours of BBC audio and video which we legally hold in our legacy repository as per our deposit agreement with the BBC ( http://www.spokenword.ac.uk/using-audio-video/copyright/ ).

The reason I ask is that I've constructed a set of scripts which scrape the catalogue.bbc.co.uk archive's RDF files. I've already got a 'master' list of all programme URLs (the script to generate that took a pretty long time on a JANET connection), but having started the crawler grabbing the actual RDF streams for each programme, I can see that this is going to involve a pretty large amount of data transfer.

FYI, my crawler uses Wget and respects robots.txt files. There's no robots.txt file on catalogue.bbc.co.uk so it seems to be fair game, but there is one on open.bbc.co.uk - I'm scraping from the former obviously.

Clearly there's a licensing issue with copying the content but I'm only trying this as a technical experiment at this stage anyway - it will not be publicly available.

--
Graeme West
Spoken Word Services
Glasgow Caledonian University
Email: [EMAIL PROTECTED]
Project web site: http://www.spokenword.ac.uk/

On 9 Jul 2007, at 21:30, Brendan Quinn wrote:

I was considering entering a hack for Hack Day around that very thing. But then they went and made me one of the judges ;-) Wanna help?
A simple set of scripts that scrape the archive (er I mean "call that big RESTful API") and post entries/updates to the freebase sandbox server would be an interesting experiment.

I agree that freebase is an amazing resource, especially when the programme data is curated properly: compare http://www.freebase.com/view/?id=%239202a8c04000641f8000000000012406 with http://open.bbc.co.uk/catalogue/infax/series/DOCTOR+WHO !

There may be some rights issues around what would basically amount to opening up the programme catalogue under the Creative Commons Attribution license, where the attribution wouldn't go to the BBC but to Freebase...

Brendan.

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Oliver Cole
Sent: 09 July 2007 20:51
To: backstage@lists.bbc.co.uk
Subject: [backstage] Programme Catalogue vs. Freebase (was: BBC Programme Catalogue -any APIs yet?)

I've been following the Programme Catalogue since it was announced, and it's pretty interesting. I do however have a question for the BBC people on the list - have you considered simply uploading all the information to Freebase[1]?

I can understand that you might want to keep it in house, but if you merged it with the wealth of information on Freebase you could do exponentially more. For example, if it was properly integrated you could run a query that would tell me how many of the contributors to Spooks series 2 were born in London.

Regards,
Oli

[1] http://www.freebase.com - A very cool structured database, currently handling 2.3 million instances of 870 'types'

-
Sent via the backstage.bbc.co.uk discussion group. To unsubscribe, please visit http://backstage.bbc.co.uk/archives/2005/01/mailing_list.html.
Unofficial list archive: http://www.mail-archive.com/backstage@lists.bbc.co.uk/
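To make the "contributors to Spooks series 2 born in London" idea concrete, the kind of cross-referencing query Oli describes can be sketched over two small in-memory tables. This is a toy illustration only, not the Freebase query language or the catalogue's real schema: the Contribution record, the field names and the "Person A/B/C" entries are all placeholders invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contribution:
    """One catalogue-style credit (hypothetical shape, not the real schema)."""
    series: str
    season: int
    contributor: str


# Placeholder records, not real catalogue or Freebase data.
CONTRIBUTIONS = [
    Contribution("Spooks", 2, "Person A"),
    Contribution("Spooks", 2, "Person B"),
    Contribution("Spooks", 2, "Person A"),  # same person credited twice
    Contribution("Spooks", 1, "Person C"),
]

# Placeholder biographical data of the kind a merged database might hold.
BIRTHPLACES = {"Person A": "London", "Person B": "Glasgow", "Person C": "London"}


def count_born_in(contributions, birthplaces, series, season, place):
    """Count distinct contributors to a series/season whose birthplace is place."""
    people = {
        c.contributor
        for c in contributions
        if c.series == series and c.season == season
    }
    return sum(1 for person in people if birthplaces.get(person) == place)


if __name__ == "__main__":
    print(count_born_in(CONTRIBUTIONS, BIRTHPLACES, "Spooks", 2, "London"))
```

The point of the sketch is the join: the credits come from one source and the birthplaces from another, and the query only becomes possible once both refer to the same people by a shared identifier.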