I’ll keep this use case in mind.
Really, there is no need to use Hadoop to handle English DBpedia data with
Infovore, because DBpedia-en isn’t as big as Freebase. It may be possible to
design the cutter so at least parts of it can be run w/o Hadoop.
On the other hand, I think there is also going to be more interest in the
other-language DBpedias, so now “metamemomic” work means facing up to bigger
databases. For instance, en-Wikipedia has pictures of perhaps half of the
“pretty places” in the world that one might want photos of, travel to,
whatever, and it would be great to call on them too.
From: Dan Gravell
Sent: Tuesday, July 16, 2013 4:26 AM
To: Paul A. Houle
Cc: [email protected]
Subject: Re: [Dbpedia-discussion] Strategies to download subsets of DBPedia
Thanks Paul. The end goal of this data is import into AWS SimpleDB and
CloudSearch (for the strings), as a matter of fact.
What I was doing though was having all of my data sources (also: Discogs,
MusicBrainz) export to a common-ish JSON structure which then gets uploaded to
the above services.
I was keen on ways of just working on the dbpedia tuples from the download. I'm
still looking at the feasibility of this. One grep of the .nt file on a
consumer SSD gets through the file in just over two minutes, which bodes well.
I will continue with this line of investigation.
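For what it’s worth, a two-pass line filter over the N-Triples dump is one way to get MusicalWork plus its first-level relationships without loading a triple store. This is only a sketch under my own assumptions (whitespace-splitting of well-formed N-Triples lines; decompression and literal edge cases are left out):

```python
TYPE_PRED = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
TARGET = "<http://dbpedia.org/ontology/MusicalWork>"

def musical_work_subjects(lines):
    """Pass 1: collect every subject typed as MusicalWork."""
    subjects = set()
    for line in lines:
        parts = line.split(None, 2)  # subject, predicate, rest-of-line
        if len(parts) == 3 and parts[1] == TYPE_PRED and parts[2].startswith(TARGET):
            subjects.add(parts[0])
    return subjects

def triples_about(lines, subjects):
    """Pass 2: keep every triple whose subject is in the set."""
    for line in lines:
        if line.split(None, 1)[0] in subjects:
            yield line
```

Two sequential scans of the dump cost roughly two of those two-minute greps, which still beats a multi-day Jena load for this kind of one-off subset.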
The other thing to investigate is writing custom formatters (I think they're
called) for the extraction framework... not sure how 'pluggable' that is yet
though.
On Mon, Jul 15, 2013 at 5:01 PM, Paul A. Houle <[email protected]> wrote:
I can report my progress on this front.
I’ve got a system in place that moves Freebase dumps, recompresses them and
stores them in the AMZN cloud. I can suck in DBpedia data the same way.
I’m hadoopifying my Infovore tools so I can do my preprocessing, parallel
super eyeball and be able to run basic reports. The plan is to keep most of
the results in requester-pays S3 buckets, which can be accessed for free
from within AWS, particularly with Elastic MapReduce.
The first release of the system will focus on rules that apply to
individual triples, but it’s not a difficult extension of that to build
something that only copies records where the subjects are kings and queens,
about sealing wax, whatever.
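To illustrate what a rule on an individual triple looks like (a hypothetical example, not Infovore’s actual rule set): since each rule sees one line at a time, the filter is a pure function and maps cleanly onto a Hadoop Streaming mapper:

```python
import sys

def keep(triple_line):
    """Per-triple rule: a pure function of one N-Triples line, so it
    parallelizes trivially under MapReduce."""
    line = triple_line.strip()
    if not line or line.startswith("#"):
        return False
    # Example rule (hypothetical): keep plain URI objects and
    # English-tagged literals, dropping other language tags.
    if '"@' in line and '"@en' not in line:
        return False
    return True

def main():
    # Works unchanged as a `hadoop jar hadoop-streaming.jar -mapper ...` script.
    for line in sys.stdin:
        if keep(line):
            sys.stdout.write(line)

if __name__ == "__main__":
    main()
```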
As a rough idea of costs and time involved, it takes around two hours, $2
in transfer cost and about $1 in CPU to package the dump for EMR. It will take
more EMR costs to clean the data up and probably compress it to speed up your
queries.
A somewhat tuned system could deliver you a custom subset of DBpedia in an
hour or two on a cluster that costs about as much to run as a minimum wage
employee. You might then need to transfer the files out of AMZN, but TANSTAAFL.
From: Dan Gravell
Sent: Monday, July 15, 2013 9:34 AM
To: [email protected]
Subject: [Dbpedia-discussion] Strategies to download subsets of DBPedia
What is the most efficient way (in CPU and network time) of extracting subsets
of DBPedia?
I am only interested in <http://dbpedia.org/ontology/MusicalWork> and the
first level of relationships.
First, I want to work on the data dumps provided either by DBPedia or
Wikipedia (via the extraction framework, maybe). I realise I could do what I
want via http://dbpedia.org/sparql but there are a number of problems with this:
- It adds load on dbpedia.org
- dbpedia.org often appears to have maintenance periods
- There are limits placed on the number of results from dbpedia.org
However, the DBPedia dumps themselves have one big problem: they are so
massive it appears to take days to do anything with them. Loading them into
Apache Jena for instance takes ages. I also tried a little sed'ing and awk'ing
of the file but with little success.
How is everyone else dealing with subsets of the data dumps? Is it possible
to configure the extraction framework to ignore input records, or maybe to
output to something other than N-Triples text, which would then be easier to
slice and dice (e.g. output to SQL, then run a query)?
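A middle ground between raw sed/awk and a full Jena load is to stream the compressed dump line by line, keeping only the predicates of interest. A minimal sketch (the function name, and the assumption that the dump is a gzipped N-Triples file, are mine):

```python
import gzip

def filter_by_predicate(dump_path, predicate_uri, out_path):
    """Stream a gzipped N-Triples dump, writing only triples whose
    predicate matches; never holds the dump in memory."""
    needle = " " + predicate_uri + " "
    with gzip.open(dump_path, "rt", encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            if needle in line:
                dst.write(line)
```

Because it never decompresses the whole file to disk, it is mostly I/O-bound and should finish in the same ballpark as the grep timings mentioned elsewhere in this thread.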
Thanks,
Dan
------------------------------------------------------------------------------
See everything from the browser to the database with AppDynamics
Get end-to-end visibility with application monitoring from AppDynamics
Isolate bottlenecks and diagnose root cause in seconds.
Start your free trial of AppDynamics Pro today!
http://pubads.g.doubleclick.net/gampad/clk?id=48808831&iu=/4140/ostg.clktrk
------------------------------------------------------------------------------
_______________________________________________
Dbpedia-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion