Hi all,
Sorry to re-open an old thread, but what is the position on scraping the catalogue.bbc.co.uk test site? I'm trying a little experiment: ingesting the whole catalogue into our Fedora repository ( http://www.fedora.info ) so that it can be cross-referenced with the 200+ hours of BBC audio and video which we legally hold in our legacy repository under our deposit agreement with the BBC ( http://www.spokenword.ac.uk/using-audio-video/copyright/ ).

The reason I ask is that I've constructed a set of scripts which scrape the catalogue.bbc.co.uk archive's RDF files. I've already got a 'master' list of all programme URLs (the script to generate that took a pretty long time on a JANET connection), but having started the crawler grabbing the actual RDF streams for each programme, I can see that this is going to involve a pretty large amount of data transfer.

FYI, my crawler uses Wget and respects robots.txt files. There's no robots.txt file on catalogue.bbc.co.uk so it seems to be fair game, but there is one on open.bbc.co.uk - I'm scraping from the former obviously. Clearly there's a licensing issue with copying the content but I'm only trying this as a technical experiment at this stage anyway - it will not be publicly available.
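
As a rough illustration of the robots.txt check described above (not the actual Wget-based scripts), here is a minimal Python sketch using the standard library's urllib.robotparser. The user-agent string, the example rules and the example.org URLs are all made up for illustration. A site with no robots.txt is modelled here as an empty rule set, which permits everything.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file; an empty string stands for 'no file found'."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(urls, robots_txt, user_agent="example-crawler"):
    """Return only the URLs that the given robots.txt rules permit for user_agent."""
    parser = build_parser(robots_txt)
    return [url for url in urls if parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    # Hypothetical URLs and rules, for illustration only.
    urls = [
        "http://example.org/catalogue/programme/1.rdf",
        "http://example.org/private/programme/2.rdf",
    ]
    rules = "User-agent: *\nDisallow: /private/\n"
    print(allowed_urls(urls, rules))  # rules present: filtered list
    print(allowed_urls(urls, ""))     # no rules: everything permitted
```

A real crawl over the full list of programme URLs would also want a pause between requests to keep the load on the server down; that part is left out of the sketch.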

--
Graeme West
Spoken Word Services
Glasgow Caledonian University

Email: [EMAIL PROTECTED]
Project web site:
http://www.spokenword.ac.uk/


On 9 Jul 2007, at 21:30, Brendan Quinn wrote:

I was considering entering a hack for Hack Day around that very thing.
But then they went and made me one of the judges ;-)

Wanna help? A simple set of scripts that scrape the archive (er I mean
"call that big RESTful API") and post entries/updates to the freebase
sandbox server would be an interesting experiment.

I agree that freebase is an amazing resource, especially when the
programme data is curated properly:

compare
http://www.freebase.com/view/?id=%239202a8c04000641f8000000000012406
with
http://open.bbc.co.uk/catalogue/infax/series/DOCTOR+WHO
!

There may be some rights issues around what would basically amount to
opening up the programme catalogue under the Creative Commons
Attribution license, where the attribution wouldn't go to the BBC but to
Freebase...

Brendan.

-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Oliver Cole
Sent: 09 July 2007 20:51
To: backstage@lists.bbc.co.uk
Subject: [backstage] Programme Catalogue vs. Freebase (was: BBC
Programme Catalogue -any APIs yet?)

I've been following the Programme Catalogue since it was announced, and
it's pretty interesting.

I do however have a question for the BBC people on the list - have you
considered simply uploading all the information to Freebase[1]? I can
understand that you might want to keep it in house, but if you merged it
with the wealth of information on Freebase you could do exponentially
more.

For example, if it was properly integrated you could run a query that
would tell me how many of the contributors to Spooks series 2 were born
in London.
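
To make the idea concrete, here is a toy sketch in Python of that kind of cross-referenced query. Everything in it is invented for illustration: the record shapes, the placeholder names and birthplaces, and the function name. Neither the Programme Catalogue nor Freebase is actually queried.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    name: str
    birthplace: str  # the kind of fact that would come from the Freebase side


def count_born_in(credits, people, programme, place):
    """Count distinct contributors credited on `programme` who were born in `place`.

    `credits` is an iterable of (programme, contributor_name) pairs, standing in
    for catalogue data; `people` is an iterable of Person records.
    """
    credited = {name for prog, name in credits if prog == programme}
    return sum(1 for p in people if p.name in credited and p.birthplace == place)


if __name__ == "__main__":
    # Placeholder data only; not real catalogue or Freebase entries.
    credits = [
        ("Example Series 2", "Person A"),
        ("Example Series 2", "Person B"),
        ("Example Series 1", "Person C"),
    ]
    people = [
        Person("Person A", "London"),
        Person("Person B", "Glasgow"),
        Person("Person C", "London"),
    ]
    print(count_born_in(credits, people, "Example Series 2", "London"))
```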

Regards,
Oli

[1] http://www.freebase.com - A very cool structured database, currently
handling 2.3 million instances of 870 'types'

-
Sent via the backstage.bbc.co.uk discussion group. To unsubscribe, please visit http://backstage.bbc.co.uk/archives/2005/01/mailing_list.html. Unofficial list archive: http://www.mail-archive.com/backstage@lists.bbc.co.uk/
