I’ll keep this use case in mind.
Really, there is no need to use Hadoop to handle English DBpedia data with
Infovore, because DBpedia-en isn’t as big as Freebase. It may be possible to
design the cutter so at least parts of it can be run w/o Hadoop.
On the other hand, I think there is also going to be more interest in the
other-language DBpedias, so now “metamemomic” work means facing up to bigger
databases. For instance, en-Wikipedia has pictures of perhaps half of the
“pretty places” in the world that one might want photos of, travel to,
whatever, and it would be great to call on them too.
From: Dan Gravell
Sent: Tuesday, July 16, 2013 4:26 AM
To: Paul A. Houle
Cc: [email protected]
Subject: Re: [Dbpedia-discussion] Strategies to download subsets of DBPedia
Thanks Paul. The end goal of this data is import into AWS SimpleDB and
CloudSearch (for the strings), as a matter of fact.
What I was doing though was having all of my data sources (also: Discogs,
MusicBrainz) export to a common-ish JSON structure which then gets uploaded to
the above services.
I was keen on ways of just working on the dbpedia tuples from the download. I'm
still looking at the feasibility of this. One grep of the .nt file on a
consumer SSD gets through the file in just over two minutes, which bodes well.
I will continue with this line of investigation.
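For what it’s worth, a two-pass line filter over the N-Triples dump is one way to get MusicalWork plus its first-level relationships without loading a triple store. This is only a sketch under my own assumptions (whitespace-splitting of well-formed N-Triples lines; decompression and literal edge cases are left out):

```python
TYPE_PRED = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
TARGET = "<http://dbpedia.org/ontology/MusicalWork>"

def musical_work_subjects(lines):
    """Pass 1: collect every subject typed as MusicalWork."""
    subjects = set()
    for line in lines:
        parts = line.split(None, 2)  # subject, predicate, rest-of-line
        if len(parts) == 3 and parts[1] == TYPE_PRED and parts[2].startswith(TARGET):
            subjects.add(parts[0])
    return subjects

def triples_about(lines, subjects):
    """Pass 2: keep every triple whose subject is in the set."""
    for line in lines:
        if line.split(None, 1)[0] in subjects:
            yield line
```

Two sequential scans of the dump cost roughly two of those two-minute greps, which still beats a multi-day Jena load for this kind of one-off subset.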
The other thing to investigate is writing custom formatters (I think they're
called) for the extraction framework... not sure how 'pluggable' that is yet
though.
On Mon, Jul 15, 2013 at 5:01 PM, Paul A. Houle <[email protected]> wrote:
I can report my progress on this front.
I’ve got a system in place that moves Freebase dumps, recompresses them and
stores them in the AMZN cloud. I can suck in DBpedia data the same way.
I’m hadoopifying my Infovore tools so I can do my preprocessing, parallel
super eyeball and be able to run basic reports. The plan is to keep most of
the results in requester-pays S3 buckets, which can be accessed for free
from within AWS, particularly with Elastic MapReduce.
The first release of the system will focus on rules that apply to
individual triples, but it’s not a difficult extension of that to build
something that only copies records where the subjects are kings and queens,
about sealing wax, whatever.
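To illustrate what a rule on an individual triple looks like (a hypothetical example, not Infovore’s actual rule set): since each rule sees one line at a time, the filter is a pure function and maps cleanly onto a Hadoop Streaming mapper:

```python
import sys

def keep(triple_line):
    """Per-triple rule: a pure function of one N-Triples line, so it
    parallelizes trivially under MapReduce."""
    line = triple_line.strip()
    if not line or line.startswith("#"):
        return False
    # Example rule (hypothetical): keep plain URI objects and
    # English-tagged literals, dropping other language tags.
    if '"@' in line and '"@en' not in line:
        return False
    return True

def main():
    # Works unchanged as a `hadoop jar hadoop-streaming.jar -mapper ...` script.
    for line in sys.stdin:
        if keep(line):
            sys.stdout.write(line)

if __name__ == "__main__":
    main()
```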
As a rough idea of costs and time involved, it takes around two hours, $2
in transfer cost and about $1 in CPU to package the dump for EMR. It will take
more EMR costs to clean the data up and probably compress it to speed up your
queries.
A somewhat tuned system could deliver you a custom subset of DBpedia in an
hour or two on a cluster that costs about as much to run as a minimum wage
employee. You might then need to transfer the files out of AMZN, but TANSTAAFL.
From: Dan Gravell
Sent: Monday, July 15, 2013 9:34 AM
To: [email protected]
Subject: [Dbpedia-discussion] Strategies to download subsets of DBPedia
What is the most efficient way (in CPU and network time) of extracting subsets
of DBPedia?
I am only interested in <http://dbpedia.org/ontology/MusicalWork> and the
first level of relationships.
First, I want to work on the data dumps provided either by DBPedia or
Wikipedia (via the extraction framework, maybe). I realise I could do what I
want via http://dbpedia.org/sparql but there are a number of problems with this:
- It adds load on dbpedia.org
- dbpedia.org often appears to have maintenance periods
- There are limits placed on the number of results from dbpedia.org
However, the DBPedia dumps themselves have one big problem: they are so
massive it appears to take days to do anything with them. Loading them into
Apache Jena for instance takes ages. I also tried a little sed'ing and awk'ing
of the file but with little success.
How is everyone else dealing with subsets of the data dumps? Is it possible
to configure the extraction framework to ignore input records, or maybe to
output to something other than N-Triples text, which would then be easier to
slice and dice (e.g. output to SQL, then run a query)?
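A middle ground between raw sed/awk and a full Jena load is to stream the compressed dump line by line, keeping only the predicates of interest. A minimal sketch (the function name, and the assumption that the dump is a gzipped N-Triples file, are mine):

```python
import gzip

def filter_by_predicate(dump_path, predicate_uri, out_path):
    """Stream a gzipped N-Triples dump, writing only triples whose
    predicate matches; never holds the dump in memory."""
    needle = " " + predicate_uri + " "
    with gzip.open(dump_path, "rt", encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            if needle in line:
                dst.write(line)
```

Because it never decompresses the whole file to disk, it is mostly I/O-bound and should finish in the same ballpark as the grep timings mentioned elsewhere in this thread.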
Thanks,
Dan
------------------------------------------------------------------------------
See everything from the browser to the database with AppDynamics
Get end-to-end visibility with application monitoring from AppDynamics
Isolate bottlenecks and diagnose root cause in seconds.
Start your free trial of AppDynamics Pro today!
http://pubads.g.doubleclick.net/gampad/clk?id=48808831&iu=/4140/ostg.clktrk
------------------------------------------------------------------------------
_______________________________________________
Dbpedia-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion