We already contacted 5 people who couldn't join in the end, and this set
us back a little.
I will make a public call in the linked-data/SW communities first and then
try the Hadoop groups.

Best regards,
Dimitris


On Thu, Mar 6, 2014 at 2:11 PM, Mohamed Amine MOUHOUB
<[email protected]>wrote:

> Hi Dimitris and Andrea
>
> I heard that you are still looking for a mentor experienced in MapReduce.
> There are some Hadoop User Group (HUG) communities and mailing lists
> around the world, and lots of skilled people are members of these
> communities. You could send a Call for Mentors to these lists.
> The following link is an official Hadoop page that lists the user
> groups around the world:
>
> http://wiki.apache.org/hadoop/HadoopUserGroups
>
> Good luck finding a mentor before the deadline.
> Best regards
> Amine Mouhoub
>
>
> On Wed, Feb 26, 2014 at 11:58 AM, Andrea Di Menna <[email protected]> wrote:
>
>> Hi Mohamed,
>> thanks for writing back and for your questions :-)
>> I am going to try and answer your questions (inline) + share my view
>> about this GSoC idea.
>>
>> 2014-02-26 3:46 GMT+01:00 Mohamed Amine MOUHOUB <[email protected]>:
>>
>> Hello Dimitris, Andrea, and everybody
>>>
>>> Thanks for providing a better description of the project idea.
>>> I have been very busy lately, so I didn't have enough time for the
>>> subject.
>>> First, I have a question: what do you mean by extracting redirects from
>>> dumps? I tried to find out its meaning in the documentation, and it
>>> seems to me that you mean extracting internal links from the infoboxes
>>> that would serve as nodes for the parser. Have I understood it right?
>>>
>>
>> What Dimitris meant by "redirects" is "Wikipedia redirects" (i.e.
>> article A redirects to article B).
>> The DBpedia framework makes use of this information in all the steps of
>> the extraction phase.
>> If you haven't had the chance yet, I would suggest you take a look at the
>> main publication that describes DBpedia and how it works [1].
>>
>> [1] http://svn.aksw.org/papers/2013/SWJ_DBpedia/public.pdf
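>>
>> Just to make that a bit more concrete: the redirects are simply (source ->
>> target) pairs, and chains like A -> B -> C have to be resolved to the final
>> target so that links to redirect pages end up pointing at the actual
>> article. A toy sketch (not actual DBpedia code; the class and method names
>> are made up):
>>
>> // Toy sketch: resolve redirect chains to their final target article.
>> import java.util.HashMap;
>> import java.util.Map;
>>
>> public class RedirectResolver {
>>     private final Map<String, String> redirects = new HashMap<String, String>();
>>
>>     public void add(String from, String to) { redirects.put(from, to); }
>>
>>     // Follow the chain until a non-redirect title; bound the loop to
>>     // avoid spinning forever on redirect cycles.
>>     public String resolve(String title) {
>>         String current = title;
>>         for (int i = 0; i < 100 && redirects.containsKey(current); i++) {
>>             current = redirects.get(current);
>>         }
>>         return current;
>>     }
>> }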
>>
>>
>>> I also have some ideas for the parallelization through MapReduce.
>>> 1) Download: Downloading isn't really a MapReduce thing; besides, it
>>> costs lots of bandwidth and time. It can surely be parallelized
>>> manually/automatically, but not through MapReduce. For test purposes,
>>> Wikipedia 2009 dumps are available on Amazon AWS as a public dataset.
>>>
>>
>> Agreed, it would be interesting to investigate the usage of parallel
>> download tools to speed this process up a bit (e.g. Axel).
>>
>>
>>> 2) Splitting dumps: Unless there is a particular need, Hadoop
>>> automatically splits the dumps into chunks and distributes them on the
>>> machines, depending mainly on the configuration used. If there are any
>>> non-configuration-related needs, we can write a splitting policy of our
>>> own. Anyway, I think it is mostly about converting the dumps into
>>> key-value sequences, with the pageId (for example) as the key and the
>>> rest of the content and metadata as the values, so that they can easily
>>> be consumed by MapReduce. (See SequenceFile -
>>> http://wiki.apache.org/hadoop/SequenceFile).
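>>>
>>> For example, something along these lines (a rough, untested sketch;
>>> DumpToSequenceFile and parsePages() are made-up names, and parsePages()
>>> is only a placeholder for a real dump reader):
>>>
>>> // Sketch: write (pageId -> page content) pairs into a Hadoop SequenceFile
>>> // so that MapReduce jobs can split and distribute the data natively.
>>> import java.util.Collections;
>>> import java.util.Map;
>>>
>>> import org.apache.hadoop.conf.Configuration;
>>> import org.apache.hadoop.fs.FileSystem;
>>> import org.apache.hadoop.fs.Path;
>>> import org.apache.hadoop.io.LongWritable;
>>> import org.apache.hadoop.io.SequenceFile;
>>> import org.apache.hadoop.io.Text;
>>>
>>> public class DumpToSequenceFile {
>>>
>>>     // Placeholder: a real implementation would stream <page> elements
>>>     // from the pages-articles XML dump instead of returning an empty map.
>>>     static Map<Long, String> parsePages(String dumpPath) {
>>>         return Collections.emptyMap();
>>>     }
>>>
>>>     public static void main(String[] args) throws Exception {
>>>         Configuration conf = new Configuration();
>>>         FileSystem fs = FileSystem.get(conf);
>>>         SequenceFile.Writer writer = SequenceFile.createWriter(
>>>                 fs, conf, new Path(args[1]), LongWritable.class, Text.class);
>>>         try {
>>>             for (Map.Entry<Long, String> page : parsePages(args[0]).entrySet()) {
>>>                 writer.append(new LongWritable(page.getKey()),
>>>                               new Text(page.getValue()));
>>>             }
>>>         } finally {
>>>             writer.close();
>>>         }
>>>     }
>>> }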
>>>
>>
>> How does this "automatic splitting of dumps" work in Hadoop?
>>
>> I see this GSoC idea in two different ways:
>> a) Splitting dumps into chunks + running parsing and extraction
>> separately and independently on multiple nodes + reducing outputs from
>> nodes (simple batch merging of multiple archives)
>> b) Reading dumps on one node + sending wiki pages to multiple nodes using
>> a MapReduce framework + reducing outputs from the extraction of single pages
>>
>> Probably a) is easier to implement (through scripts? I am not even sure
>> we need a MapReduce framework), but b) is more interesting, extensible,
>> scalable, etc.
>> What do you think?
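>>
>> To make b) a bit more concrete, I was imagining a mapper along these lines
>> (just a rough sketch; ExtractionMapper and extractTriples() are made-up
>> names, and extractTriples() stands in for the existing extractors, it is
>> not real framework code):
>>
>> // Sketch for option b): each map() call receives one wiki page (e.g. from
>> // a SequenceFile) and emits the extracted triples as output records.
>> import java.io.IOException;
>> import java.util.Collections;
>> import java.util.List;
>>
>> import org.apache.hadoop.io.LongWritable;
>> import org.apache.hadoop.io.NullWritable;
>> import org.apache.hadoop.io.Text;
>> import org.apache.hadoop.mapreduce.Mapper;
>>
>> public class ExtractionMapper
>>         extends Mapper<LongWritable, Text, Text, NullWritable> {
>>
>>     // Placeholder for running the existing DBpedia extractors on one page.
>>     static List<String> extractTriples(String pageSource) {
>>         return Collections.emptyList();
>>     }
>>
>>     @Override
>>     protected void map(LongWritable pageId, Text pageSource, Context context)
>>             throws IOException, InterruptedException {
>>         for (String triple : extractTriples(pageSource.toString())) {
>>             // One N-Triples line per record; the reduce phase (or a simple
>>             // merge of the part files) produces the final dump.
>>             context.write(new Text(triple), NullWritable.get());
>>         }
>>     }
>> }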
>>
>>
>>> 3) It seems to me that the abstract extraction can be done jointly
>>> with the main extraction from infoboxes, or even separately without
>>> relying on local copies of MediaWiki and MySQL, but I need to learn more
>>> about the abstract extraction and how it is actually done.
>>>
>>
>> I think it would be extremely difficult to run the abstract extraction
>> phase without using a MediaWiki instance, as one is needed to resolve
>> templates (at least) in order to extract text from the article source.
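>>
>> As far as I understand, that step boils down to asking a local MediaWiki
>> installation to expand the templates and render the page, roughly like
>> this (only an illustrative sketch; the URL and parameters below are made
>> up, not what the framework actually uses):
>>
>> // Sketch: fetch a rendered page from a local MediaWiki via its api.php,
>> // which takes care of template expansion before any text is extracted.
>> import java.io.BufferedReader;
>> import java.io.InputStreamReader;
>> import java.net.URL;
>> import java.net.URLEncoder;
>>
>> public class AbstractFetcher {
>>     public static String renderPage(String title) throws Exception {
>>         String url = "http://localhost/mediawiki/api.php"
>>                 + "?action=parse&format=json&page="
>>                 + URLEncoder.encode(title, "UTF-8");
>>         BufferedReader in = new BufferedReader(
>>                 new InputStreamReader(new URL(url).openStream(), "UTF-8"));
>>         StringBuilder body = new StringBuilder();
>>         for (String line; (line = in.readLine()) != null; ) {
>>             body.append(line).append('\n');
>>         }
>>         in.close();
>>         return body.toString(); // JSON containing the rendered HTML
>>     }
>> }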
>>
>>
>>> I hope you will find a Hadoop mentor soon.
>>>
>>> Cheers !
>>> Amine Mouhoub
>>>
>>>
>>>
>> Cheers,
>> Andrea
>>
>>
>>> On Sun, Feb 23, 2014 at 12:06 PM, Dimitris Kontokostas <
>>> [email protected]> wrote:
>>>
>>>> Hello Amine,
>>>>
>>>> I created a better description of the project idea here:
>>>> http://wiki.dbpedia.org/gsoc2014/ideas/ExtractionwithMapReduce?v=5uv
>>>>
>>>> We are still looking for a (Hadoop) mentor, but I am very confident
>>>> that we will find one in the coming days.
>>>>
>>>> Best,
>>>> Dimitris
>>>>
>>>>
>>>> On Sat, Feb 15, 2014 at 5:40 PM, Andrea Di Menna <[email protected]> wrote:
>>>>
>>>>> Hello Mohamed and welcome!
>>>>> Unfortunately I have very little experience with Hadoop myself but I
>>>>> would love to help with this task.
>>>>> Looking forward to discussing your suggestions.
>>>>> Cheers
>>>>> Andrea
>>>>> On 15 Feb 2014 08:42, "Dimitris Kontokostas" <[email protected]>
>>>>> wrote:
>>>>>
>>>>> Hello Mouhoub and welcome to the community
>>>>>>
>>>>>> I thought of this idea after Andrea Di Menna committed the cool
>>>>>> dump-split feature, so I am a possible (co-)mentor for this project.
>>>>>> The reason I haven't put any mentor (and a full description) here yet
>>>>>> is that we don't have any mentor at the moment (including me) with
>>>>>> experience in MapReduce.
>>>>>> I have a good idea about it but never got any hands-on experience.
>>>>>>
>>>>>> We will try to find someone by the application start period, but in
>>>>>> the meantime I can set some of the requirements. You are also welcome
>>>>>> to suggest a mentor for this project.
>>>>>> You seem familiar with the DBpedia extraction framework, so in your
>>>>>> application you can suggest your own extensions to the idea.
>>>>>>
>>>>>> DBpedia has not been accepted as an organization yet, so you cannot
>>>>>> use the Melange system at the moment. We can continue on the public
>>>>>> mailing list if you are comfortable with it, or otherwise wait. Either
>>>>>> is fine for us.
>>>>>> I will be traveling next week, but I will try to find some time and
>>>>>> extend the idea description.
>>>>>>
>>>>>> Best,
>>>>>> Dimitris
>>>>>>
>>>>>>
>>>>>> On Sat, Feb 15, 2014 at 7:54 AM, Mohamed Amine MOUHOUB <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am a PhD student at the University of Paris Dauphine, where I work
>>>>>>> on the search of linked data and linked services. Basically, I am
>>>>>>> interested in searching and integrating data from the LOD, and
>>>>>>> DBpedia is at the center of my interests: it is at the core of the
>>>>>>> LOD graph and, in my opinion, a starting point to the rest of the LOD.
>>>>>>>
>>>>>>> Anyway, I am particularly interested in Hadoop. I started
>>>>>>> experimenting with Hadoop in 2011, and then with Amazon EMR in 2012.
>>>>>>> I have worked on some data mining projects with Hadoop. As a trainee
>>>>>>> at an open data company in Paris (Data Publica), I worked on a
>>>>>>> project to discover open data sources in France using Hadoop and the
>>>>>>> Common Crawl web archive (120 TB of web documents to be analyzed,
>>>>>>> clustered, etc.).
>>>>>>> In September 2012, my project won Common Crawl's Code Contest.
>>>>>>>
>>>>>>> I am very interested in the proposed idea of extraction using
>>>>>>> MapReduce. I think it would be a very interesting contribution to the
>>>>>>> performance of the extraction framework. Moreover, the nature of the
>>>>>>> Wikipedia input data, the nature of the output (RDF triples), and the
>>>>>>> independence of the processing for each entry make the extraction
>>>>>>> highly parallelizable using MapReduce. I am ready to submit a proposal
>>>>>>> for this idea, but I don't see any mentors assigned to it. The idea is
>>>>>>> not very well described on the wiki page, but I can provide a brief
>>>>>>> but detailed proposal for an implementation in the upcoming days. I am
>>>>>>> also interested in co-authoring a conference paper about the project.
>>>>>>> Is any mentor interested? Should I send my proposal via this mailing
>>>>>>> list or directly submit it to the Google Summer of Code page?
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Amine Mouhoub
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Kontokostas Dimitris
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>
>>>> --
>>>> Kontokostas Dimitris
>>>>
>>>
>>>
>>
>


-- 
Kontokostas Dimitris
