Thanks Dimitris. I'm making fairly good progress right now by simply brute
forcing a few scans over the .nt file and filtering out the lines of
interest. On a consumer-grade SSD this takes about 7 minutes per scan, and
as this is a batch job, neither interactive nor user-facing, that is
acceptable. I hope to write up and maybe publish what I've done (my awk/sed
skills fall short, so I ended up scripting something in Scala).
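
A minimal sketch of this kind of multi-pass scan, written in Python rather
than the Scala mentioned above, and not the actual script. It assumes the
type triples and the property triples live in the same N-Triples file, that
subjects and predicates are plain IRIs (blank-node subjects are skipped), and
that "first level of relationships" means every triple whose subject is a
MusicalWork. The function names and file names are made up for illustration.

```python
import re
from typing import Iterable, Iterator, Optional, Set, Tuple

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
MUSICAL_WORK = "<http://dbpedia.org/ontology/MusicalWork>"

# subject IRI, predicate IRI, then everything up to the terminating " ."
TRIPLE = re.compile(r"^(<[^>]*>)\s+(<[^>]*>)\s+(.*?)\s*\.\s*$")


def parse(line: str) -> Optional[Tuple[str, str, str]]:
    """Split one N-Triples line into (subject, predicate, object), or None."""
    m = TRIPLE.match(line)
    return (m.group(1), m.group(2), m.group(3)) if m else None


def subjects_of_type(lines: Iterable[str], rdf_class: str = MUSICAL_WORK) -> Set[str]:
    """Pass 1: collect every subject declared to be of the given class."""
    subjects = set()
    for line in lines:
        triple = parse(line)
        if triple and triple[1] == RDF_TYPE and triple[2] == rdf_class:
            subjects.add(triple[0])
    return subjects


def triples_about(lines: Iterable[str], subjects: Set[str]) -> Iterator[str]:
    """Pass 2: yield, unchanged, each line whose subject was collected."""
    for line in lines:
        triple = parse(line)
        if triple and triple[0] in subjects:
            yield line


def extract_subset(path: str, out_path: str) -> int:
    """Run both passes over one file; return the number of subjects found."""
    with open(path, encoding="utf-8") as f:
        wanted = subjects_of_type(f)
    with open(path, encoding="utf-8") as f, \
            open(out_path, "w", encoding="utf-8") as out:
        out.writelines(triples_about(f, wanted))
    return len(wanted)


# Hypothetical file names:
# extract_subset("dbpedia.nt", "musical_works.nt")
```

Each `open` is one full pass over the file, so this costs two scans, and only
the set of matching subjects is held in memory.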
Dan
On Thu, Jul 18, 2013 at 9:44 AM, Dimitris Kontokostas <[email protected]> wrote:
> Hi Dan,
>
> On Tue, Jul 16, 2013 at 11:26 AM, Dan Gravell <[email protected]> wrote:
>
>> Thanks Paul. The end goal of this data is import into AWS SimpleDB and
>> CloudSearch (for the strings), as a matter of fact.
>>
>> What I was doing though was having all of my data sources (also: Discogs,
>> MusicBrainz) export to a common-ish JSON structure which then gets uploaded
>> to the above services.
>>
>> I was keen on ways of just working on the DBpedia tuples from the
>> download. I'm still looking at the feasibility of this. One grep of the .nt
>> file on a consumer SSD gets through the file in just over two minutes,
>> which bodes well. I will continue with this line of investigation.
>>
>> The other thing to investigate is writing custom formatters (I think
>> they're called) for the extraction framework... not sure how 'pluggable'
>> that is yet though.
>>
>
> They're pretty pluggable already. There are 2 extra formatters for
> DBpedia Live [1], but both are used manually in the code.
> You can adapt the PolicyParser [2] class to enable them in the
> configuration file for the dump extraction.
>
> Best,
> Dimitris
>
> [1]
> https://github.com/dbpedia/extraction-framework/tree/master/live/src/main/scala/org/dbpedia/extraction/destinations/formatters
> [2]
> https://github.com/dbpedia/extraction-framework/blob/master/dump/src/main/scala/org/dbpedia/extraction/dump/extract/PolicyParser.scala
>
>>
>>
>> On Mon, Jul 15, 2013 at 5:01 PM, Paul A. Houle <[email protected]> wrote:
>>
>>> I can report my progress on this front.
>>>
>>> I’ve got a system in place that moves Freebase dumps, recompresses them
>>> and stores them in the AMZN cloud. I can suck in DBpedia data the same way.
>>>
>>> I’m hadoopifying my Infovore tools so I can do my preprocessing,
>>> parallel super eyeball and be able to run basic reports. The plan is to
>>> keep most of the results in requester-pays S3 buckets, which can be
>>> accessed for free from within AWS, particularly with Elastic MapReduce.
>>>
>>> The first release of the system will focus on rules that apply to
>>> individual triples, but it’s not a difficult extension of that to build
>>> something that only copies records where the subjects are kings and
>>> queens, sealing wax, whatever.
>>>
>>> As a rough idea of costs and time involved, it takes around two hours,
>>> $2 in transfer cost and about $1 in CPU to package the dump for EMR. It
>>> will cost more in EMR time to clean the data up and probably compress it to
>>> speed up your queries.
>>>
>>> A somewhat tuned system could deliver you a custom subset of DBpedia in
>>> an hour or two on a cluster that costs about as much to run as a minimum
>>> wage employee. You might then need to transfer the files out of AMZN but
>>> TANSTAAFL.
>>>
>>> *From:* Dan Gravell <[email protected]>
>>> *Sent:* Monday, July 15, 2013 9:34 AM
>>> *To:* [email protected]
>>> *Subject:* [Dbpedia-discussion] Strategies to download subsets of
>>> DBPedia
>>>
>>> What is the most efficient way (in CPU and network time) of extracting
>>> subsets of DBPedia?
>>>
>>> I am only interested in <http://dbpedia.org/ontology/MusicalWork> and
>>> the first level of relationships.
>>>
>>> First, I want to work on the data dumps provided either by DBPedia or
>>> Wikipedia (via the extraction framework, maybe). I realise I could do what
>>> I want via http://dbpedia.org/sparql but there are a number of problems
>>> with this:
>>>
>>> - It adds load on dbpedia.org
>>> - dbpedia.org often appears to have maintenance periods
>>> - There are limits placed on the number of results from dbpedia.org
>>>
>>> However, the DBPedia dumps themselves have one big problem: they are so
>>> massive it appears to take days to do anything with them. Loading them into
>>> Apache Jena for instance takes ages. I also tried a little sed'ing and
>>> awk'ing of the file but with little success.
>>>
>>> How is everyone else dealing with subsets of the data dumps? Is it
>>> possible to configure the extraction framework to ignore input records, or
>>> maybe output to something other than N-Triples text which would then be
>>> easier to slice and dice (e.g. output to SQL, then perform a query)?
>>>
>>> Thanks,
>>> Dan
>>>
>>> ------------------------------
>>>
>>> ------------------------------------------------------------------------------
>>> See everything from the browser to the database with AppDynamics
>>> Get end-to-end visibility with application monitoring from AppDynamics
>>> Isolate bottlenecks and diagnose root cause in seconds.
>>> Start your free trial of AppDynamics Pro today!
>>>
>>> http://pubads.g.doubleclick.net/gampad/clk?id=48808831&iu=/4140/ostg.clktrk
>>>
>>> ------------------------------
>>> _______________________________________________
>>> Dbpedia-discussion mailing list
>>> [email protected]
>>> https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
>>>
>>>
>>
>>
>>
>>
>
>
> --
> Kontokostas Dimitris
>