Thanks Paul. The end goal of this data is import into AWS SimpleDB and
CloudSearch (for the strings), as a matter of fact.

What I was doing though was having all of my data sources (also: Discogs,
MusicBrainz) export to a common-ish JSON structure which then gets uploaded
to the above services.

I was keen on ways of just working on the dbpedia tuples from the download.
I'm still looking at the feasibility of this. One grep of the nt file on a
consumer SSD gets through the file in just over two minutes, which bodes
well. I will continue with this line of investigation.
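
A minimal sketch of that line of investigation, in Python, assuming the
type statements and the property statements are both available as plain
N-Triples files. It streams the file twice: once to collect the subjects
typed as MusicalWork, once to keep only the triples about those subjects.
The function names and the file paths passed to extract() are hypothetical,
and the line splitting is a simplification rather than a full N-Triples
parser.

```python
from typing import Iterable, Iterator, Optional, Set, Tuple

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
MUSICAL_WORK = "<http://dbpedia.org/ontology/MusicalWork>"


def split_triple(line: str) -> Optional[Tuple[str, str, str]]:
    """Split one N-Triples line into (subject, predicate, object).

    Returns None for blank lines, comments, and lines that do not split
    into three parts. Subjects and predicates never contain whitespace,
    so splitting twice leaves the whole object (literal or IRI) intact.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    parts = line.split(None, 2)
    if len(parts) != 3:
        return None
    subject, predicate, obj = parts
    if obj.endswith("."):
        obj = obj[:-1].rstrip()  # drop the statement terminator
    return subject, predicate, obj


def subjects_of_type(lines: Iterable[str], rdf_class: str = MUSICAL_WORK) -> Set[str]:
    """First pass: collect every subject declared to be of rdf_class."""
    found = set()
    for line in lines:
        triple = split_triple(line)
        if triple and triple[1] == RDF_TYPE and triple[2] == rdf_class:
            found.add(triple[0])
    return found


def triples_about(lines: Iterable[str], subjects: Set[str]) -> Iterator[str]:
    """Second pass: yield the original lines whose subject is in subjects."""
    for line in lines:
        triple = split_triple(line)
        if triple and triple[0] in subjects:
            yield line


def extract(types_path: str, data_path: str, out_path: str) -> None:
    """Write the MusicalWork subset of data_path to out_path (paths are hypothetical)."""
    with open(types_path, encoding="utf-8") as types_file:
        wanted = subjects_of_type(types_file)
    with open(data_path, encoding="utf-8") as data, \
            open(out_path, "w", encoding="utf-8") as out:
        out.writelines(triples_about(data, wanted))
```

Only the set of matching subjects is held in memory; everything else is
read line by line, so the cost should stay close to that of the grep pass.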

The other thing to investigate is writing custom formatters (I think
they're called) for the extraction framework... not sure how 'pluggable'
that is yet though.


On Mon, Jul 15, 2013 at 5:01 PM, Paul A. Houle <[email protected]> wrote:

>   I can report my progress on this front.
>
> I’ve got a system in place that moves Freebase dumps,  recompresses them
> and stores them in the AMZN cloud.  I can suck in DBpedia data the same way.
>
> I’m hadoopifying my Infovore tools so I can do my preprocessing,  parallel
> super eyeball and be able to run basic reports.  The plan is to keep most
> of the results in requester-pays S3 buckets,  which can be accessed for
> free in the AWS,  particularly with Elastic MapReduce.
>
> The first release of the system will focus on rules that apply to
> individual triples,  but it’s not a difficult extension of that to build
> something that only copies records where the subjects are kings and
> queens,  about sealing wax,  whatever.
>
> As a rough idea of costs and time involved,  it takes around two hours,
> $2 in transfer cost and about $1 in CPU to package the dump for EMR.  It
> will take more EMR costs to clean the data up and probably compress it to
> speed up your Q’s
>
> A somewhat tuned system could deliver you a custom subset of DBpedia in an
> hour or two on a cluster that costs about as much to run as a minimum wage
> employee.  You might then need to transfer the files out of AMZN but
> TANSTAAFL.
>
>   *From:* Dan Gravell <[email protected]>
> *Sent:* Monday, July 15, 2013 9:34 AM
> *To:* [email protected]
> *Subject:* [Dbpedia-discussion] Strategies to download subsets of DBPedia
>
>  What is the most efficient way (in CPU and network time) of extracting subsets
> of DBPedia?
>
> I am only interested in <http://dbpedia.org/ontology/MusicalWork> and the
> first level of relationships.
>
> First, I want to work on the data dumps provided either by DBPedia or
> Wikipedia (via the extraction framework, maybe). I realise I could do what
> I want via http://dbpedia.org/sparql but there are a number of problems
> with this:
>
> - It adds load on dbpedia.org
> - dbpedia.org often appears to have maintenance periods
> - There are limits placed on the number of results from dbpedia.org
>
> However, the DBPedia dumps themselves have one big problem: they are so
> massive it appears to take days to do anything with them. Loading them into
> Apache Jena for instance takes ages. I also tried a little sed'ing and
> awk'ing of the file but with little success.
>
> How is everyone else dealing with subsets of the data dumps? Is it
> possible to configure the extraction framework to ignore input records, or
> maybe output to something other than text n-tuples which would then be
> easier to slice and dice (e.g. output to SQL, then perform a query?).
>
> Thanks,
> Dan
>
> ------------------------------
>
> ------------------------------------------------------------------------------
> See everything from the browser to the database with AppDynamics
> Get end-to-end visibility with application monitoring from AppDynamics
> Isolate bottlenecks and diagnose root cause in seconds.
> Start your free trial of AppDynamics Pro today!
> http://pubads.g.doubleclick.net/gampad/clk?id=48808831&iu=/4140/ostg.clktrk
>
> ------------------------------
> _______________________________________________
> Dbpedia-discussion mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
>
>
