What is the most efficient way (in CPU and network time) to extract subsets
of DBpedia?
I am only interested in instances of
<http://dbpedia.org/ontology/MusicalWork> and their first level of
relationships.
Ideally, I want to work from the data dumps provided either by DBpedia or
Wikipedia (via the extraction framework, perhaps). I realise I could get
what I want via the SPARQL endpoint at http://dbpedia.org/sparql, but there
are a number of problems with that:
- It adds load to dbpedia.org
- dbpedia.org often appears to have maintenance periods
- There are limits placed on the number of results from dbpedia.org
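(For what it's worth, the result limit can be worked around by paging with
LIMIT/OFFSET, though that only makes the load problem worse. A minimal
sketch of building such paged queries; the page size of 10000 is an
assumption, the endpoint's actual cap may differ:)

```python
# Build paged SPARQL CONSTRUCT queries for MusicalWork triples.
# PAGE_SIZE is an assumption; the real cap on dbpedia.org may differ,
# and paging like this just multiplies the load on the endpoint.

PAGE_SIZE = 10000

def paged_query(offset, limit=PAGE_SIZE):
    """Return the CONSTRUCT query for one page of MusicalWork triples."""
    return (
        "CONSTRUCT { ?work ?p ?o } WHERE {\n"
        "  ?work a <http://dbpedia.org/ontology/MusicalWork> ;\n"
        "        ?p ?o .\n"
        "}\n"
        f"ORDER BY ?work LIMIT {limit} OFFSET {offset}"
    )
```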
However, the DBpedia dumps themselves have one big problem: they are so
massive that it seems to take days to do anything with them. Loading them
into Apache Jena, for instance, takes ages. I also tried some sed and awk
filtering on the files, but with little success.
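(The closest I've got is a streaming two-pass filter over the N-Triples
files, rather than loading anything into a triple store. This is only a
sketch, and the dump file names here are hypothetical; it assumes there is
an instance-types dump mapping resources to ontology classes, which is what
I'd filter against:)

```python
# Two-pass streaming filter over DBpedia N-Triples dumps.
# File names are hypothetical: an instance-types dump maps resources to
# ontology classes; a properties dump holds the relationship triples.

MUSICAL_WORK = "<http://dbpedia.org/ontology/MusicalWork>"
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"

def collect_subjects(instance_types_path):
    """Pass 1: gather every subject typed as MusicalWork."""
    subjects = set()
    with open(instance_types_path, encoding="utf-8") as f:
        for line in f:
            # N-Triples: subject predicate object, whitespace-separated.
            parts = line.split(None, 2)
            if (len(parts) == 3 and parts[1] == RDF_TYPE
                    and parts[2].startswith(MUSICAL_WORK)):
                subjects.add(parts[0])
    return subjects

def filter_triples(dump_path, subjects, out_path):
    """Pass 2: keep only triples whose subject is a MusicalWork,
    i.e. the first level of relationships."""
    with open(dump_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue
            subject = line.split(None, 1)[0]
            if subject in subjects:
                dst.write(line)
```

This never holds more than the subject set in memory, so it streams through
the multi-gigabyte dumps in one read each.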
How is everyone else dealing with subsets of the data dumps? Is it possible
to configure the extraction framework to ignore input records, or perhaps
to output to something other than N-Triples text that would be easier to
slice and dice (e.g. output to SQL, then query that)?
Thanks,
Dan
_______________________________________________
Dbpedia-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion