Hi Dimitris and Andrea,

I heard that you are still looking for a mentor experienced in MapReduce.
There are Hadoop User Group (HUG) communities and mailing lists around
the world, and many skilled people are members of these communities. You
could send a Call for Mentors to these lists. The following link is the
official Hadoop wiki page that lists the user communities around the
world:

http://wiki.apache.org/hadoop/HadoopUserGroups

Good luck finding a mentor before the deadline!
Best regards,
Amine Mouhoub


On Wed, Feb 26, 2014 at 11:58 AM, Andrea Di Menna <[email protected]> wrote:

> Hi Mohamed,
> thanks for writing back and for your questions :-)
> I am going to try and answer your questions (inline) + share my view about
> this GSoC idea.
>
> 2014-02-26 3:46 GMT+01:00 Mohamed Amine MOUHOUB <[email protected]>:
>
> Hello Dimitris, Andrea, and everybody
>>
>> Thanks for providing a better description of the project idea.
>> I have been very busy lately, so I haven't had enough time for the
>> subject. First, a question: what do you mean by extracting redirects
>> from the dumps? I tried to find its meaning in the documentation, and it
>> seems to me that you mean extracting internal links from the infoboxes
>> that would serve as nodes for the parser. Have I understood it correctly?
>>
>
> What Dimitris meant by "redirects" is Wikipedia redirects (i.e. article
> A redirects to article B).
> The DBpedia framework makes use of this information in all steps of the
> extraction phase.
> If you haven't had the chance yet, I would suggest you take a look at the
> main publication that describes DBpedia and how it works [1].
>
> [1] http://svn.aksw.org/papers/2013/SWJ_DBpedia/public.pdf
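>
> As a side note, here is a minimal, hypothetical sketch of how a redirect
> can be recognized in the raw wikitext of a dump page (the framework has
> its own parser; the regex below only illustrates the MediaWiki #REDIRECT
> convention, and the class name is made up):
>
> import java.util.regex.Matcher;
> import java.util.regex.Pattern;
>
> public class RedirectSniffer {
>     // MediaWiki marks a redirect page by starting its wikitext with
>     // "#REDIRECT [[Target title]]" (case-insensitive).
>     private static final Pattern REDIRECT = Pattern.compile(
>             "^\\s*#REDIRECT\\s*\\[\\[([^\\]|#]+)", Pattern.CASE_INSENSITIVE);
>
>     // Returns the redirect target title, or null for a normal article.
>     public static String redirectTarget(String wikitext) {
>         Matcher m = REDIRECT.matcher(wikitext);
>         return m.find() ? m.group(1).trim() : null;
>     }
>
>     public static void main(String[] args) {
>         System.out.println(redirectTarget("#REDIRECT [[Barack Obama]]")); // Barack Obama
>         System.out.println(redirectTarget("'''Plain''' article text"));   // null
>     }
> }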
>
>
>>  I also have some ideas for parallelization through MapReduce.
>> 1) Download: downloading isn't really a MapReduce task; besides, it
>> costs a lot of bandwidth and time. It can surely be parallelized
>> manually/automatically, but not through MapReduce. For test purposes,
>> though, the Wikipedia 2009 dumps are available on Amazon AWS as a public
>> dataset.
>>
>
> Agreed, it would be interesting to investigate using parallel download
> tools to speed this process up a bit (e.g. axel).
>
>
>>  2) Splitting dumps: unless there is a particular need, Hadoop
>> automatically splits the input into chunks and distributes them across
>> the machines, depending mainly on the configuration used. If there are
>> any non-configuration-related needs, we can write a splitting policy of
>> our own. In any case, I think it is about converting the dumps into
>> key-value sequences, with the pageId (for example) as the key and the
>> rest of the content and metadata as the value, so that they can easily
>> be consumed by MapReduce (see SequenceFile -
>> http://wiki.apache.org/hadoop/SequenceFile).
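>>
>> As a rough sketch (all names here are made up for illustration), writing
>> such a key-value sequence with the classic Hadoop API could look like
>> this:
>>
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.fs.FileSystem;
>> import org.apache.hadoop.fs.Path;
>> import org.apache.hadoop.io.LongWritable;
>> import org.apache.hadoop.io.SequenceFile;
>> import org.apache.hadoop.io.Text;
>>
>> public class DumpToSequenceFile {
>>     public static void main(String[] args) throws Exception {
>>         Configuration conf = new Configuration();
>>         FileSystem fs = FileSystem.get(conf);
>>         Path out = new Path("wikipedia-pages.seq");
>>
>>         // Key = page id, value = raw page source.
>>         SequenceFile.Writer writer = SequenceFile.createWriter(
>>                 fs, conf, out, LongWritable.class, Text.class);
>>         try {
>>             // In reality the (id, text) pairs would come from streaming
>>             // through the dump XML; here we append a single dummy page.
>>             writer.append(new LongWritable(12L),
>>                           new Text("{{Infobox ...}} page text"));
>>         } finally {
>>             writer.close();
>>         }
>>     }
>> }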
>>
>
> How does this "automatic splitting of dumps" work in Hadoop?
>
> I see this GSoC idea in two different ways:
> a) Splitting dumps into chunks + running parsing and extraction separately
> and independently on multiple nodes + reducing outputs from nodes (simple
> batch merging of multiple archives)
> b) Reading dumps on one node + sending wiki pages to multiple nodes using
> a MapReduce framework + reducing outputs from the extraction of single
> pages (see the sketch below)
>
> Probably a) is easier to implement (through scripts? I am not even sure
> we need a MapReduce framework), but b) is more interesting, extensible,
> scalable, etc.
> What do you think?
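>
> To make b) a bit more concrete, here is a minimal, hypothetical skeleton
> (extractTriples is a made-up hook where the DBpedia extractors would plug
> in): each map task receives one (pageId, wikitext) pair and emits the
> extracted triples, and the reduce step simply collects them per page.
>
> import java.io.IOException;
> import java.util.Collections;
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.Mapper;
> import org.apache.hadoop.mapreduce.Reducer;
>
> public class ExtractionJob {
>
>     // Map: one wiki page in, zero or more N-Triples lines out.
>     public static class ExtractMapper
>             extends Mapper<LongWritable, Text, LongWritable, Text> {
>         @Override
>         protected void map(LongWritable pageId, Text wikitext, Context ctx)
>                 throws IOException, InterruptedException {
>             for (String triple : extractTriples(wikitext.toString())) {
>                 ctx.write(pageId, new Text(triple));
>             }
>         }
>
>         // Placeholder for the actual extraction framework call.
>         private Iterable<String> extractTriples(String wikitext) {
>             return Collections.emptyList();
>         }
>     }
>
>     // Reduce: write out the triples extracted for each page.
>     public static class MergeReducer
>             extends Reducer<LongWritable, Text, LongWritable, Text> {
>         @Override
>         protected void reduce(LongWritable pageId, Iterable<Text> triples,
>                               Context ctx)
>                 throws IOException, InterruptedException {
>             for (Text t : triples) {
>                 ctx.write(pageId, t);
>             }
>         }
>     }
> }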
>
>
>>  3) It seems to me that the abstract extraction can be done jointly with
>> the main extraction from the infoboxes, or even separately without
>> relying on local copies of MediaWiki and MySQL, but I need to learn more
>> about how the abstract extraction is actually done.
>>
>
> I think it would be extremely difficult to run the abstract extraction
> phase without using a MediaWiki instance, as one is needed (at least) to
> resolve templates in order to extract the text from the article source.
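>
> Just to illustrate the dependency (a sketch only; the localhost URL and
> class name are assumptions, and the real abstract extractor drives a
> local MediaWiki installation), resolving templates "by hand" would mean
> calling MediaWiki's expandtemplates API action for every single page:
>
> import java.io.BufferedReader;
> import java.io.InputStreamReader;
> import java.net.URL;
> import java.net.URLEncoder;
>
> public class TemplateExpander {
>     // Asks a (local) MediaWiki instance to expand all templates in the
>     // given wikitext and returns the raw XML response.
>     public static String expand(String wikitext) throws Exception {
>         String url = "http://localhost/mediawiki/api.php"
>                 + "?action=expandtemplates&format=xml&text="
>                 + URLEncoder.encode(wikitext, "UTF-8");
>         StringBuilder out = new StringBuilder();
>         try (BufferedReader in = new BufferedReader(
>                 new InputStreamReader(new URL(url).openStream(), "UTF-8"))) {
>             String line;
>             while ((line = in.readLine()) != null) {
>                 out.append(line).append('\n');
>             }
>         }
>         return out.toString();
>     }
> }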
>
>
>> I hope you will find a Hadoop mentor soon.
>>
>> Cheers!
>> Amine Mouhoub
>>
>>
>>
> Cheers,
> Andrea
>
>
>> On Sun, Feb 23, 2014 at 12:06 PM, Dimitris Kontokostas <[email protected]> wrote:
>>
>>> Hello Amine,
>>>
>>> I created a better description of the project idea here:
>>> http://wiki.dbpedia.org/gsoc2014/ideas/ExtractionwithMapReduce?v=5uv
>>>
>>> We are still looking for a (Hadoop) mentor, but I am very confident
>>> that we will find one in the next few days.
>>>
>>> Best,
>>> Dimitris
>>>
>>>
>>>> On Sat, Feb 15, 2014 at 5:40 PM, Andrea Di Menna <[email protected]> wrote:
>>>
>>>> Hello Mohamed and welcome!
>>>> Unfortunately I have very little experience with Hadoop myself, but I
>>>> would love to help with this task.
>>>> Looking forward to discussing your suggestions.
>>>> Cheers
>>>> Andrea
>>>> On 15 Feb 2014 08:42, "Dimitris Kontokostas" <[email protected]> wrote:
>>>>
>>>> Hello Mouhoub and welcome to the community
>>>>>
>>>>> I thought of this idea after Andrea Di Menna committed the cool
>>>>> dump-split feature, so I am a possible (co-)mentor for this project.
>>>>> The reason why I haven't put any mentor here yet (or a full
>>>>> description) is that we don't have any mentor at the moment (including
>>>>> me) with experience in MapReduce.
>>>>> I have a good idea of how it works but have never had any hands-on
>>>>> experience.
>>>>>
>>>>> We will try to find someone by the application start period, but in
>>>>> the meantime I can set out some of the requirements. You are also
>>>>> welcome to suggest a mentor for this project.
>>>>> You seem familiar with the DBpedia extraction framework, so in your
>>>>> application you can suggest your own extensions to the idea.
>>>>>
>>>>> DBpedia is not accepted as an organization yet, so you cannot use the
>>>>> Melange system at the moment. We can continue on the public mailing
>>>>> list if you are comfortable with it, or otherwise wait. Either is fine
>>>>> for us.
>>>>> I will be traveling next week but I will try to find some time to
>>>>> extend the idea description.
>>>>>
>>>>> Best,
>>>>> Dimitris
>>>>>
>>>>>
>>>>> On Sat, Feb 15, 2014 at 7:54 AM, Mohamed Amine MOUHOUB <[email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am a PhD student at the University of Paris Dauphine; I work on the
>>>>>> search of linked data and linked services. Basically, I am interested
>>>>>> in searching and integrating data from the LOD, and DBpedia is at the
>>>>>> center of my interests: it is at the core of the LOD graph and, in my
>>>>>> opinion, a starting point to the rest of the LOD.
>>>>>>
>>>>>> Anyway, I am particularly interested in Hadoop. I started
>>>>>> experimenting with Hadoop in 2011, and then with Amazon EMR in 2012,
>>>>>> and I have worked on several data mining projects with Hadoop. As a
>>>>>> trainee at an open data company in Paris (Data Publica), I worked on a
>>>>>> project to discover open data sources in France using Hadoop and the
>>>>>> Common Crawl web archive (120 TB of web documents to be analyzed,
>>>>>> clustered, etc.).
>>>>>> In September 2012, my project won the Common Crawl Code Contest.
>>>>>>
>>>>>> I am very interested in the proposed idea of extraction using
>>>>>> MapReduce. I think it would be a very valuable contribution to the
>>>>>> performance of the extraction framework. Moreover, the nature of the
>>>>>> Wikipedia input data, the nature of the output (RDF triples), and the
>>>>>> independence of the processing for each entry make the extraction
>>>>>> highly parallelizable using MapReduce. I am ready to submit a proposal
>>>>>> for this idea, but I don't see any mentors assigned to it. The idea is
>>>>>> not described in much detail on the wiki page, but in the coming days
>>>>>> I can provide a brief but detailed proposal for an implementation. I
>>>>>> am also interested in co-authoring a conference paper about the
>>>>>> project.
>>>>>> Is any mentor interested? Should I send my proposal via this mailing
>>>>>> list or submit it directly on the Google Summer of Code page?
>>>>>>
>>>>>> Best regards,
>>>>>> Amine Mouhoub
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Kontokostas Dimitris
>>>>>
>>>>>
>>>>>
>>>>>
>>>
>>>
>>> --
>>> Kontokostas Dimitris
>>>
>>
>>
>