Hi Dan,

On Tue, Jul 16, 2013 at 11:26 AM, Dan Gravell <[email protected]> wrote:

> Thanks Paul. The end goal of this data is to import it into AWS SimpleDB
> and CloudSearch (for the strings), as a matter of fact.
>
> What I was doing though was having all of my data sources (also: Discogs,
> MusicBrainz) export to a common-ish JSON structure which then gets uploaded
> to the above services.
>
> I was keen on ways of just working on the DBpedia tuples from the
> download. I'm still looking at the feasibility of this. One grep of the .nt
> file on a consumer SSD gets through the file in just over two minutes,
> which bodes well. I will continue with this line of investigation.
>
> The other thing to investigate is writing custom formatters (I think
> they're called) for the extraction framework... not sure how 'pluggable'
> that is yet though.
>

They're pretty pluggable already. There are two extra formatters for DBpedia
Live [1], but both are currently invoked manually in the code.
You can adapt the PolicyParser [2] class to enable them in the
configuration file for the dump extraction.
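On the grep side of things, here is a minimal sketch of the two-pass filtering you describe: first collect the subjects typed as MusicalWork, then keep only the triples whose subject is on that list. The file names and sample data below are made up for illustration; the real inputs would be the .nt dumps from the DBpedia downloads page (e.g. the instance-types and mapping-based-properties files).

```shell
# Tiny made-up samples standing in for the real DBpedia .nt dumps.
cat > instance_types_sample.nt <<'EOF'
<http://dbpedia.org/resource/Abbey_Road> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/MusicalWork> .
<http://dbpedia.org/resource/Paris> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/City> .
EOF
cat > properties_sample.nt <<'EOF'
<http://dbpedia.org/resource/Abbey_Road> <http://dbpedia.org/ontology/artist> <http://dbpedia.org/resource/The_Beatles> .
<http://dbpedia.org/resource/Paris> <http://dbpedia.org/ontology/country> <http://dbpedia.org/resource/France> .
EOF

# Pass 1: subjects typed as MusicalWork. N-Triples is one triple per line,
# space-separated, so the subject is simply the first field.
grep ' <http://dbpedia.org/ontology/MusicalWork> ' instance_types_sample.nt \
  | cut -d' ' -f1 | sort -u > musicalwork_subjects.txt

# Pass 2: keep only triples whose subject is on that list.
# Note: -F matches the fixed strings anywhere on the line; a stricter
# version would anchor the match at the start of the line.
grep -F -f musicalwork_subjects.txt properties_sample.nt > musicalwork_triples.nt
```

For the real dumps you would point the same pipeline at the full files; the first-level relationships you want then come out of pass 2, and a second iteration over the collected objects would get you one hop further.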

Best,
Dimitris

[1]
https://github.com/dbpedia/extraction-framework/tree/master/live/src/main/scala/org/dbpedia/extraction/destinations/formatters
[2]
https://github.com/dbpedia/extraction-framework/blob/master/dump/src/main/scala/org/dbpedia/extraction/dump/extract/PolicyParser.scala

>
>
> On Mon, Jul 15, 2013 at 5:01 PM, Paul A. Houle <[email protected]> wrote:
>
>>   I can report my progress on this front.
>>
>> I’ve got a system in place that moves Freebase dumps,  recompresses them
>> and stores them in the AMZN cloud.  I can suck in DBpedia data the same way.
>>
>> I’m hadoopifying my Infovore tools so I can do my preprocessing,
>> parallel super eyeball and be able to run basic reports.  The plan is to
>> keep most of the results in requester-pays S3 buckets,  which can be
>> accessed for free in the AWS,  particularly with Elastic MapReduce.
>>
>> The first release of the system will focus on rules that apply to
>> individual triples,  but it’s not a difficult extension of that to build
>> something that only copies records where the subjects are kings and
>> queens,  about sealing wax,  whatever.
>>
>> As a rough idea of costs and time involved,  it takes around two hours,
>> $2 in transfer cost and about $1 in CPU to package the dump for EMR.  It
>> will take more EMR costs to clean the data up and probably compress it to
>> speed up your queries.
>>
>> A somewhat tuned system could deliver you a custom subset of DBpedia in
>> an hour or two on a cluster that costs about as much to run as a minimum
>> wage employee.  You might then need to transfer the files out of AMZN but
>> TANSTAAFL.
>>
>>   *From:* Dan Gravell <[email protected]>
>> *Sent:* Monday, July 15, 2013 9:34 AM
>> *To:* [email protected]
>> *Subject:* [Dbpedia-discussion] Strategies to download subsets of DBPedia
>>
>>  What is the most efficient way (in CPU and network time) of extracting
>> subsets of DBpedia?
>>
>> I am only interested in <http://dbpedia.org/ontology/MusicalWork> and
>> the first level of relationships.
>>
>> First, I want to work on the data dumps provided either by DBPedia or
>> Wikipedia (via the extraction framework, maybe). I realise I could do what
>> I want via http://dbpedia.org/sparql but there are a number of problems
>> with this:
>>
>> - It adds load on dbpedia.org
>> - dbpedia.org often appears to have maintenance periods
>> - There are limits placed on the number of results from dbpedia.org
>>
>> However, the DBPedia dumps themselves have one big problem: they are so
>> massive it appears to take days to do anything with them. Loading them into
>> Apache Jena for instance takes ages. I also tried a little sed'ing and
>> awk'ing of the file but with little success.
>>
>> How is everyone else dealing with subsets of the data dumps? Is it
>> possible to configure the extraction framework to ignore input records, or
>> maybe output to something other than N-Triples text, which would then be
>> easier to slice and dice (e.g. output to SQL, then perform a query)?
>>
>> Thanks,
>> Dan
>>
>> ------------------------------
>>
>> ------------------------------------------------------------------------------
>> See everything from the browser to the database with AppDynamics
>> Get end-to-end visibility with application monitoring from AppDynamics
>> Isolate bottlenecks and diagnose root cause in seconds.
>> Start your free trial of AppDynamics Pro today!
>>
>> http://pubads.g.doubleclick.net/gampad/clk?id=48808831&iu=/4140/ostg.clktrk
>>
>> ------------------------------
>> _______________________________________________
>> Dbpedia-discussion mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
>>
>>
>
>
>
>


-- 
Kontokostas Dimitris
