Hi Nilesh & welcome to the DBpedia community,

You seem familiar with Wikipedia dump processing, and wikihadoop might fit
here too. However, we already have our own library that can download,
process, and split (zipped) dumps, so let's see whether it fits the
requirements.

As you probably noticed in previous threads, we are waiting for a mentor
with MapReduce experience to join, but the general workflow for this idea
is described here:
http://wiki.dbpedia.org/gsoc2014/ideas/ExtractionwithMapReduce/
The rationale for choosing Spark sounds reasonable, but I don't have the
experience to comment on it.

To answer your questions:
- The abstract extractor is already the bottleneck of the extraction
process on a single machine, and it will not scale. With a limited number
of parallel nodes (2-3) it might still work, albeit quite slowly, but if we
want to go full speed we must skip it.
- As for the next tasks, get a little familiar with the extraction
process, run some sample extractions to see how things work, and then
focus on your proposal. You can share it privately through the Melange
system or publicly through the mailing list; it's up to you.

Cheers,
Dimitris




On Wed, Mar 5, 2014 at 2:33 PM, Nilesh Chakraborty <nil...@nileshc.com> wrote:

> Hi Andrea, Dimitris and everyone!
>
> I'm a senior-year B.Tech undergraduate majoring in Computer Science.
> Machine learning and data science excite me like nothing else. I've got
> quite some experience with Hadoop, and after studying the details of the
> *Extraction using MapReduce* project idea I figured that this would be a
> good match for my skill set and should be fun for me too.
>
> First, a bit of background:
>
> Among other things, I worked on peta-scale graph centrality computation
> using MapReduce during a research internship a year ago: I built a Hadoop
> implementation for computing PageRank on huge graphs (it's ongoing, with
> some WIP code at [1]), aiming for considerably better performance than
> Pegasus [2].
>
> Last year I was hacking on an entity suggester for Wikidata, so I've got a
> good idea of the structure of wiki dumps. I needed to build a feature
> matrix to feed into a collaborative filtering engine, so in essence I had
> to generate (row, col, value) tuples (sparse matrix data) from the big
> Wikidata dumps. You can find the Hadoop Streaming Python scripts at [4]. I
> used the lxml and json libraries in the Python code to parse the raw
> dumps, so the task could easily be parallelized without needing to run a
> separate MediaWiki instance on LAMP.
>
> The Hadoop code I just mentioned also has a custom InputFormat to split
> the wiki dump XML into <page>...</page> chunks. An even better idea would
> be to use the wikihadoop [5] project, which is aimed at providing custom
> InputFormats that split Wikipedia dumps into per-page chunks. The
> splitting would happen inside Hadoop itself, automated and parallelized,
> and we wouldn't even need to extract the bz2 files, since wikihadoop
> decompresses them on the fly.
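>
> To make that concrete, here's a rough sketch of the read step, spark-shell
> style (assuming an `sc` SparkContext is around), since Spark comes up again
> below. I'm assuming wikihadoop exposes its splitter as something like
> org.wikimedia.wikihadoop.StreamWikiDumpInputFormat with Text/Text records
> where the value holds the page XML; the exact class and types may differ:
>
>     import org.apache.hadoop.io.Text
>     // Assumption: wikihadoop's splitter class; name/package may differ.
>     import org.wikimedia.wikihadoop.StreamWikiDumpInputFormat
>
>     // Each record is one <page>...</page> chunk, split by wikihadoop
>     // directly out of the .bz2 dump, with no manual decompression step.
>     val pages = sc.hadoopFile(
>         "hdfs:///dumps/enwiki-latest-pages-articles.xml.bz2",
>         classOf[StreamWikiDumpInputFormat],
>         classOf[Text],
>         classOf[Text])
>       .map { case (_, page) => page.toString } // copy out of the reused Writable
>
>     println("page chunks: " + pages.count())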
>
> Writing MapReduce jobs for extracting redirects would be trivial; we can
> do all sorts of things in the reducer here, such as building redirect
> lists in an adjacency-list style, and store them on HDFS in a format
> that's useful for the next step. Parsing each page and generating RDF
> triples in the mappers, then aggregating/joining them in the reducers,
> the whole thing should take 2-3 MapReduce jobs.
>
> Also, since the data isn't peta-scale and we're still talking GBs here, I
> think using Spark instead of Hadoop could be a good option too. Spark has
> a native Scala API, and its disk-backed in-memory computation is often
> faster. In any case, we'll stick to Scala or Java.
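>
> Here's roughly what I have in mind for the redirect step, building on the
> `pages` RDD from the sketch above. The regex and the output layout are just
> placeholders (the real framework knows the per-language redirect keywords),
> so treat it as an illustration of the shape of the job, not the final logic:
>
>     import scala.xml.XML
>
>     // Placeholder: English-only redirect detection on the raw wikitext.
>     val Redirect = """(?is)#REDIRECT\s*\[\[([^\]|#]+)""".r
>
>     // (redirect target, redirecting page title) pairs
>     val redirects = pages.flatMap { chunk =>
>       val page  = XML.loadString(chunk)
>       val title = (page \ "title").text
>       val text  = (page \ "revision" \ "text").text
>       Redirect.findFirstMatchIn(text).map(m => (m.group(1).trim, title))
>     }
>
>     // One adjacency-list-style record per redirect target, stored on HDFS
>     // in a form the RDF-generation job can join against later.
>     redirects
>       .groupByKey()
>       .map { case (target, sources) => target + "\t" + sources.mkString(",") }
>       .saveAsTextFile("hdfs:///dbpedia/intermediate/redirects")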
>
> It would be great if you can help me out with these questions:
>
>    - Looks like org.dbpedia.extraction.mappings.AbstractExtractor calls
>    the API of a local MediaWiki instance. We could have a single MW
>    instance on a LAMP server, use it to answer API queries from all the
>    mappers, and do all the processing (the stuff that AbstractExtractor
>    does) in the mappers. That would keep things parallel and would be
>    faster than simple sequential extraction, but the slow MediaWiki node
>    may turn out to be a bottleneck. Thoughts? Also, what is the problem
>    with having an automated script set up MediaWiki+MySQL instances, one
>    on each of the Hadoop machines? (A rough sketch of the mapper-side
>    call I have in mind follows this list.)
>
>    - Could you give me some pointers as to what my next steps should be?
>    Should I start working on a prototype, draft my proposal on Google
>    Melange, or share it here first?
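>
> For the first question above, here's the kind of mapper-side call I have in
> mind, again building on the `pages` RDD. The host name is hypothetical and
> I'm using the stock api.php parse action, which may not be exactly what
> AbstractExtractor calls, so this is illustrative only:
>
>     import java.net.URLEncoder
>     import scala.io.Source
>     import scala.xml.XML
>
>     // Hypothetical shared MediaWiki instance; stock parse API assumed.
>     val mwApi = "http://mediawiki-node.example/w/api.php"
>
>     def renderPage(title: String): String = {
>       val url = mwApi + "?action=parse&format=json&prop=text&page=" +
>         URLEncoder.encode(title, "UTF-8")
>       Source.fromURL(url, "UTF-8").mkString // JSON with the rendered HTML
>     }
>
>     // Every task hits the same MediaWiki node in parallel; this is exactly
>     // where that single node could become the bottleneck I'm worried about.
>     val rendered = pages.map { chunk =>
>       val title = (XML.loadString(chunk) \ "title").text
>       (title, renderPage(title))
>     }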
>
>
> Please let me know if you have any questions and I'll be glad to clarify
> them for you. :)
>
> Cheers,
> Nilesh
>
> [1] : https://github.com/nilesh-c/graphfu
> [2] : http://www.cs.cmu.edu/~ukang/papers/PegasusICDM2009.pdf
> [3] : http://www10.org/cdrom/papers/pdf/p577.pdf
> [4] : https://github.com/nilesh-c/wes/tree/master/wikiparser
> [5] : https://github.com/whym/wikihadoop
>
>
>
> A quest eternal, a life so small! So don't just play the guitar, build one.
> You can also email me at cont...@nileshc.com or visit my website at
> http://www.nileshc.com/
>


-- 
Kontokostas Dimitris