Hi Graeme,
 
The robots.txt file was accidentally dropped from the new release and we 
will be re-introducing it. We are doing this because of the concerns and 
complaints raised, when the service was launched, about personal data 
appearing in external search engines.
 
On the subject of scraping the data, I've asked the catalogue.bbc.co.uk team to 
clarify the terms of use for the data, which may help answer your question. If 
you have a specific request, I would recommend using the Contact Us page: 
http://catalogue.bbc.co.uk/catalogue/infax/contact

Regards,
 

________________________________

From: [EMAIL PROTECTED] on behalf of Graeme West
Sent: Tue 7/24/2007 20:39
To: backstage@lists.bbc.co.uk
Subject: Re: Uploading the BBC programme catalogue to freebase (was RE: 
[backstage] Programme Catalogue vs. Freebase (was: BBC Programme Catalogue -any 
APIs yet?))


Hi all, 
Sorry to re-open an old thread - just wondering what the position is on 
scraping the catalogue.bbc.co.uk test site? I say this because I'm trying a 
little experiment - ingesting the whole catalogue into our Fedora repository ( 
http://www.fedora.info ) to be cross-referenced with the 200+ hours of BBC 
audio and video which we legally hold in our legacy repository as per our 
deposit agreement with the BBC ( 
http://www.spokenword.ac.uk/using-audio-video/copyright/ ).

The reason I ask is that I've constructed a set of scripts which scrape the 
catalogue.bbc.co.uk archive's RDF files. I've already got a 'master' list of 
all programme URLs (the script to generate that took a pretty long time on a 
JANET connection), but having started the crawler grabbing the actual RDF 
streams for each programme, I can see that this is going to involve a pretty 
large amount of data transfer.

FYI, my crawler uses Wget and respects robots.txt files. There's no robots.txt 
file on catalogue.bbc.co.uk so it seems to be fair game, but there is one on 
open.bbc.co.uk - I'm scraping from the former obviously. Clearly there's a 
licensing issue with copying the content but I'm only trying this as a 
technical experiment at this stage anyway - it will not be publicly available.
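
As a rough illustration of what "respects robots.txt" means for a crawler like 
the one described above, here is a minimal Python sketch using the standard 
library's urllib.robotparser. It is not the Wget-based script from this 
message. The function name filter_allowed, the example.org URLs and the 
Disallow rule are all assumptions made up for the example.

```python
from urllib.robotparser import RobotFileParser


def filter_allowed(robots_txt, urls, user_agent="*"):
    """Return the URLs that the given robots.txt text lets user_agent fetch."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Hypothetical URLs and rules, for illustration only.
URLS = [
    "http://example.org/programme/1.rdf",
    "http://example.org/contributor/1.rdf",
]
RULES = "User-agent: *\nDisallow: /contributor/\n"

if __name__ == "__main__":
    # An empty robots.txt excludes nothing, so both URLs are kept.
    print(filter_allowed("", URLS))
    # With the hypothetical rule, the /contributor/ URL is dropped.
    print(filter_allowed(RULES, URLS))
```

A crawler built this way would change behaviour as soon as a robots.txt file 
reappears on the site, which is why it is worth re-checking the file on each 
run rather than only once at the start.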

--
Graeme West
Spoken Word Services
Glasgow Caledonian University

Email: [EMAIL PROTECTED]
Project web site: http://www.spokenword.ac.uk/


On 9 Jul 2007, at 21:30, Brendan Quinn wrote:


        I was considering entering a hack for Hack Day around that very thing.
        But then they went and made me one of the judges ;-)

        Wanna help? A simple set of scripts that scrape the archive (er I mean
        "call that big RESTful API") and post entries/updates to the freebase
        sandbox server would be an interesting experiment.

        I agree that freebase is an amazing resource, especially when the
        programme data is curated properly:

        compare
        http://www.freebase.com/view/?id=%239202a8c04000641f8000000000012406 
        with
        http://open.bbc.co.uk/catalogue/infax/series/DOCTOR+WHO
        !

        There may be some rights issues around what would basically amount to
        opening up the programme catalogue under the creative commons
        attribution license, where the attribution wouldn't go to the BBC but to
        Freebase...

        Brendan.

        -----Original Message-----
        From: [EMAIL PROTECTED]
        [mailto:[EMAIL PROTECTED] On Behalf Of Oliver Cole
        Sent: 09 July 2007 20:51
        To: backstage@lists.bbc.co.uk
        Subject: [backstage] Programme Catalogue vs. Freebase (was: BBC
        Programme Catalogue -any APIs yet?)

        I've been following the Programme Catalogue since it was announced, and
        it's pretty interesting.

        I do however have a question for the BBC people on the list - have you
        considered simply uploading all the information to Freebase[1]? I can
        understand that you might want to keep it in house, but if you merged it
        with the wealth of information on Freebase you can do exponentially
        more.

        For example, if it was properly integrated you could run a query that
        would tell me how many of the contributors to Spooks series 2 were born
        in London.

        Regards,
        Oli

        [1] http://www.freebase.com - A very cool structured database, currently
        handling 2.3 million instances of 870 'types'

        -
        Sent via the backstage.bbc.co.uk discussion group.  To unsubscribe, 
please visit http://backstage.bbc.co.uk/archives/2005/01/mailing_list.html.  
Unofficial list archive: http://www.mail-archive.com/backstage@lists.bbc.co.uk/


