Re: Uploading the BBC programme catalogue to freebase (was RE: [backstage] Programme Catalogue vs. Freebase (was: BBC Programme Catalogue -any APIs yet?))

Graeme West Tue, 24 Jul 2007 12:45:08 -0700

Hi all,

Sorry to re-open an old thread - just wondering what the position is on scraping the catalogue.bbc.co.uk test site? I say this because I'm trying a little experiment - ingesting the whole catalogue into our Fedora repository ( http://www.fedora.info ) to be cross-referenced with the 200+ hours of BBC audio and video which we legally hold in our legacy repository as per our deposit agreement with the BBC ( http://www.spokenword.ac.uk/using-audio-video/copyright/ ).

The reason I ask is that I've constructed a set of scripts which scrape the catalogue.bbc.co.uk archive's RDF files. I've already got a 'master' list of all programme URLs (the script to generate that took a pretty long time on a JANET connection), but having started the crawler grabbing the actual RDF streams for each programme, I can see that this is going to involve a pretty large amount of data transfer.
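For anyone wanting to try something similar, here is a minimal Python sketch of the second stage (walking a master list of URLs and saving one RDF document per programme). It is not the actual set of scripts, which are built around Wget; the names `crawl`, `local_path`, `fetch` and `delay_seconds` are all made up for illustration, and the fetch function is passed in so the loop can be tried without touching the network.

```python
import time
from pathlib import Path
from typing import Callable, Iterable
from urllib.parse import quote


def local_path(out_dir: Path, url: str) -> Path:
    """Map a programme URL to a flat, filesystem-safe file name."""
    return out_dir / (quote(url, safe="") + ".rdf")


def crawl(urls: Iterable[str],
          fetch: Callable[[str], bytes],
          out_dir: Path,
          delay_seconds: float = 1.0) -> int:
    """Fetch each URL once, skipping files already on disk.

    Returns the number of documents actually downloaded.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    downloaded = 0
    for url in urls:
        target = local_path(out_dir, url)
        if target.exists():
            continue  # already fetched on an earlier run
        target.write_bytes(fetch(url))
        downloaded += 1
        time.sleep(delay_seconds)  # pause so the server is not hammered
    return downloaded
```

Skipping files that already exist means an interrupted crawl can be restarted without repeating transfers, which matters when the total volume is large. A real `fetch` could be as simple as `lambda u: urllib.request.urlopen(u).read()`.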

FYI, my crawler uses Wget and respects robots.txt files. There's no robots.txt file on catalogue.bbc.co.uk so it seems to be fair game, but there is one on open.bbc.co.uk - I'm scraping from the former obviously. Clearly there's a licensing issue with copying the content but I'm only trying this as a technical experiment at this stage anyway - it will not be publicly available.
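To illustrate the robots.txt check (Wget does this itself during recursive retrieval, so this is only a sketch of the idea), Python's standard `urllib.robotparser` can answer the same question. The helper name `may_fetch` and the example.org rules in the usage note are hypothetical; they are not the contents of any BBC robots.txt file.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser


def may_fetch(robots_txt: Optional[str], user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    robots_txt is the file's text, or None when the site serves no
    robots.txt at all, which by convention means no restrictions.
    """
    if robots_txt is None:
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With made-up rules such as `"User-agent: *\nDisallow: /private/"`, a URL under `/private/` is refused while any other path on the same host is allowed. Passing `None` models a site with no robots.txt, where every URL is allowed.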


--
Graeme West
Spoken Word Services
Glasgow Caledonian University

Email: [EMAIL PROTECTED]
Project web site:
http://www.spokenword.ac.uk/


On 9 Jul 2007, at 21:30, Brendan Quinn wrote:

I was considering entering a hack for Hack Day around that very thing.
But then they went and made me one of the judges ;-)

Wanna help? A simple set of scripts that scrape the archive (er I mean
"call that big RESTful API") and post entries/updates to the freebase
sandbox server would be an interesting experiment.

I agree that freebase is an amazing resource, especially when the
programme data is curated properly:

compare
http://www.freebase.com/view/?id=%239202a8c04000641f8000000000012406
with
http://open.bbc.co.uk/catalogue/infax/series/DOCTOR+WHO
!

There may be some rights issues around what would basically amount to
opening up the programme catalogue under the creative commons
attribution license, where the attribution wouldn't go to the BBC but to
Freebase...

Brendan.

-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Oliver Cole
Sent: 09 July 2007 20:51
To: backstage@lists.bbc.co.uk
Subject: [backstage] Programme Catalogue vs. Freebase (was: BBC
Programme Catalogue -any APIs yet?)
I've been following the Programme Catalogue since it was announced, and
it's pretty interesting.

I do however have a question for the BBC people on the list - have you
considered simply uploading all the information to Freebase[1]? I can
understand that you might want to keep it in house, but if you merged it
with the wealth of information on Freebase you can do exponentially
more.

For example, if it was properly integrated you could run a query that
would tell me how many of the contributors to Spooks series 2 were born
in London.
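
As a rough illustration of what such a merged query amounts to, here is a small Python sketch that joins two tables in memory. It does not use Freebase's query interface or the catalogue's real schema; every record, field name and contributor name below is a made-up placeholder.

```python
# Hypothetical credits, as a catalogue export might list them.
credits = [
    {"series": "Spooks", "series_number": 2, "contributor": "Contributor A"},
    {"series": "Spooks", "series_number": 2, "contributor": "Contributor A"},
    {"series": "Spooks", "series_number": 2, "contributor": "Contributor B"},
    {"series": "Spooks", "series_number": 1, "contributor": "Contributor C"},
]

# Hypothetical birthplaces, as a second data source might hold them.
birthplaces = {
    "Contributor A": "London",
    "Contributor B": "Glasgow",
    "Contributor C": "London",
}


def count_born_in(credits, birthplaces, series, series_number, place):
    """Count distinct contributors to one series who were born in place."""
    people = {
        c["contributor"]
        for c in credits
        if c["series"] == series and c["series_number"] == series_number
    }
    return sum(1 for person in people if birthplaces.get(person) == place)
```

The join only works because both tables refer to a contributor by the same identifier, which is exactly what proper integration would have to provide.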

Regards,
Oli
[1] http://www.freebase.com - A very cool structured database, currently
handling 2.3 million instances of 870 'types'

-
Sent via the backstage.bbc.co.uk discussion group. To unsubscribe, please visit http://backstage.bbc.co.uk/archives/2005/01/mailing_list.html. Unofficial list archive: http://www.mail-archive.com/backstage@lists.bbc.co.uk/
