I can report my progress on this front.

I’ve got a system in place that moves Freebase dumps, recompresses them, and
stores them in the Amazon cloud. I can pull DBpedia data in the same way.
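
In rough outline, the mover is just this (a minimal sketch: the bucket and key
names are hypothetical, and I happen to use boto3 and bzip2 here, but any S3
client would do; bzip2 matters because Hadoop can split it, unlike gzip):

    import bz2
    import gzip
    import shutil

    import boto3  # assumes AWS credentials are already configured

    def recompress_and_upload(local_gz, bucket, key):
        """Recompress a gzip dump to bzip2 (splittable in Hadoop), push to S3."""
        local_bz2 = local_gz[:-3] + ".bz2"
        with gzip.open(local_gz, "rb") as src, bz2.open(local_bz2, "wb") as dst:
            shutil.copyfileobj(src, dst, length=16 * 1024 * 1024)
        boto3.client("s3").upload_file(local_bz2, bucket, key)

    # Hypothetical names; the real layout is up to you.
    recompress_and_upload("freebase-rdf-latest.gz", "my-dump-bucket",
                          "freebase/latest.nt.bz2")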

I’m Hadoopifying my Infovore tools so I can do my preprocessing, run parallel
super-eyeball, and produce basic reports. The plan is to keep most of the
results in requester-pays S3 buckets, which can be read for free from within
AWS, particularly from Elastic MapReduce.
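
For anyone who hasn’t used requester-pays buckets: reads are free from inside
AWS, you just have to flag the request explicitly. A minimal sketch with boto3
(the bucket and key names here are made up):

    import boto3

    s3 = boto3.client("s3")

    # Requester-pays buckets reject unflagged requests; from an EC2 or EMR
    # node in the same region the transfer itself costs nothing.
    obj = s3.get_object(
        Bucket="infovore-results",      # hypothetical bucket
        Key="dbpedia/filtered.nt.bz2",  # hypothetical key
        RequestPayer="requester",
    )
    with open("filtered.nt.bz2", "wb") as out:
        out.write(obj["Body"].read())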

The first release of the system will focus on rules that apply to individual
triples, but it’s not a difficult extension to build something that copies only
the records whose subjects are kings and queens, sealing wax, whatever.
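
To make “rules that apply to individual triples” concrete, here is the kind of
mapper I have in mind, written as a Hadoop Streaming job in Python. It is only
a sketch: the type URI is Dan’s MusicalWork case, and the N-Triples parsing is
deliberately naive line splitting.

    import sys

    RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
    TARGET = "<http://dbpedia.org/ontology/MusicalWork>"

    # Pass 1: emit the subjects that have the wanted rdf:type. A second pass
    # of the same shape would keep every triple whose subject is in that set.
    for line in sys.stdin:
        parts = line.split(None, 2)  # naive split: subject, predicate, rest
        if len(parts) == 3 and parts[1] == RDF_TYPE and parts[2].startswith(TARGET):
            sys.stdout.write(parts[0] + "\n")

Because each triple is judged on its own, that job is embarrassingly parallel;
the subject-set join for the second pass is where it stops being a one-liner.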

As a rough idea of the costs and time involved: it takes around two hours,
about $2 in transfer costs, and about $1 in CPU to package the dump for EMR.
Cleaning the data up, and probably compressing it to speed up your queries,
will add further EMR costs.

A somewhat tuned system could deliver a custom subset of DBpedia in an hour or
two on a cluster that costs about as much to run as a minimum-wage employee.
You might then need to pay to transfer the files out of AWS, but TANSTAAFL.

From: Dan Gravell 
Sent: Monday, July 15, 2013 9:34 AM
To: [email protected] 
Subject: [Dbpedia-discussion] Strategies to download subsets of DBPedia

What is the most efficient way (in CPU and network time) of extracting subsets
of DBpedia?

I am only interested in <http://dbpedia.org/ontology/MusicalWork> and the first 
level of relationships.

First, I want to work on the data dumps provided either by DBpedia or Wikipedia
(via the extraction framework, maybe). I realise I could do what I want via
http://dbpedia.org/sparql (a sketch of that approach follows the list below),
but there are a number of problems with this:

- It adds load on dbpedia.org
- dbpedia.org often appears to have maintenance periods
- There are limits placed on the number of results from dbpedia.org
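
For concreteness, this is roughly what a full extraction over the endpoint
would look like (a sketch only; I believe the endpoint caps result-set sizes,
so this means paging through a great many requests):

    import requests

    ENDPOINT = "http://dbpedia.org/sparql"
    QUERY = """
    SELECT ?work ?p ?o WHERE {
      ?work a <http://dbpedia.org/ontology/MusicalWork> ;
            ?p ?o .
    }
    ORDER BY ?work
    LIMIT 1000 OFFSET %d
    """

    offset = 0
    while True:
        resp = requests.get(ENDPOINT, params={
            "query": QUERY % offset,
            "format": "application/sparql-results+json",
        })
        rows = resp.json()["results"]["bindings"]
        if not rows:
            break
        for row in rows:
            print(row["work"]["value"], row["p"]["value"], row["o"]["value"])
        offset += 1000

...which is exactly the kind of load on dbpedia.org I would rather not
generate.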

However, the DBpedia dumps themselves have one big problem: they are so massive
it appears to take days to do anything with them. Loading them into Apache
Jena, for instance, takes ages. I also tried a little sed'ing and awk'ing of
the files, but with little success.

How is everyone else dealing with subsets of the data dumps? Is it possible to
configure the extraction framework to ignore input records, or maybe to output
something other than N-Triples text that would be easier to slice and dice
(e.g. output to SQL, then perform a query)?

Thanks,
Dan

