Hello Dimitris, Andrea, and everybody

Thanks for providing a better description of the project idea.
I have been very busy lately, so I haven't had enough time for the subject.
First, I have a question: what do you mean by extracting redirects from
dumps? I tried to work out the meaning from the documentation, and
it seems to me that you mean extracting internal links from the infoboxes
that would serve as nodes for the parser. Have I understood correctly?
I also have some ideas for the parallelization through MapReduce.
1) Download: Downloading isn't really a MapReduce task, and it costs
a lot of bandwidth and time. It can certainly be parallelized
manually or automatically, but not through MapReduce. For test purposes,
the Wikipedia 2009 dumps are available on Amazon AWS as a public dataset.
2) Splitting dumps: Unless there is a particular need, Hadoop
automatically splits the dumps into chunks and distributes them across the
machines, depending mainly on the configuration used. If there are any
needs that configuration alone cannot cover, we can write a splitting
policy of our own. In any case, I think the main task is converting the
dumps into key-value sequences with, for example, the pageId as the key and
the rest of the content and metadata as the value, so that they can be
easily consumed by MapReduce.
(See SequenceFile - http://wiki.apache.org/hadoop/SequenceFile).
3) It seems to me that the abstract extraction can be done jointly with the
main extraction from infoboxes, or even separately without relying on local
copies of MediaWiki and MySQL, but I need to learn more about how the
abstract extraction is actually done.
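
To make point 2 more concrete, here is a minimal sketch in Python of the
key-value conversion I have in mind. It is only an illustration under my own
assumptions: the sample XML, the record fields and the function names
(pages_as_key_values, has_infobox) are made up for this example and are not
part of the extraction framework or of Hadoop. In a real job the pairs would
be written to a SequenceFile rather than kept in memory.

```python
import io
import xml.etree.ElementTree as ET

# Tiny made-up sample in the general shape of a MediaWiki XML dump.
SAMPLE_DUMP = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">
  <page>
    <title>Berlin</title>
    <id>3354</id>
    <revision>
      <id>1001</id>
      <text>{{Infobox settlement | name = Berlin}}</text>
    </revision>
  </page>
  <page>
    <title>Paris</title>
    <id>22989</id>
    <revision>
      <id>1002</id>
      <text>No infobox on this sample page.</text>
    </revision>
  </page>
</mediawiki>"""


def local(tag):
    """Drop an XML namespace prefix such as '{http://...}page'."""
    return tag.rsplit("}", 1)[-1]


def pages_as_key_values(stream):
    """Yield one (page_id, record) pair per <page> element in the stream."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "page":
            continue
        page_id = None
        record = {"title": None, "text": ""}
        # Only direct children of <page> are inspected here, so the <id>
        # found at this level is the page id, not the revision id.
        for child in elem:
            name = local(child.tag)
            if name == "id":
                page_id = int(child.text)
            elif name == "title":
                record["title"] = child.text
            elif name == "revision":
                for sub in child:
                    if local(sub.tag) == "text":
                        record["text"] = sub.text or ""
        yield page_id, record
        elem.clear()  # release the page subtree once it has been emitted


def has_infobox(page_id, record):
    """Toy map step: each page is processed independently of the others."""
    return page_id, "{{Infobox" in record["text"]


if __name__ == "__main__":
    stream = io.BytesIO(SAMPLE_DUMP.encode("utf-8"))
    for key, value in pages_as_key_values(stream):
        print(has_infobox(key, value))
```

Since every pair is self-contained, the map step needs no shared state,
which is what should make the extraction easy to distribute.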

I hope you will find a Hadoop mentor soon.

Cheers!
Amine Mouhoub


On Sun, Feb 23, 2014 at 12:06 PM, Dimitris Kontokostas <[email protected]> wrote:

> Hello Amine,
>
> I created a better description of the project idea here:
> http://wiki.dbpedia.org/gsoc2014/ideas/ExtractionwithMapReduce?v=5uv
>
> We are still looking for a (Hadoop) mentor but I am very confident that we
> will find one in the following days.
>
> Best,
> Dimitris
>
>
> On Sat, Feb 15, 2014 at 5:40 PM, Andrea Di Menna <[email protected]> wrote:
>
>> Hello Mohamed and welcome!
>> Unfortunately I have very little experience with Hadoop myself but I
>> would love to help with this task.
>> Looking forward to discussing your suggestions.
>> Cheers
>> Andrea
>> On 15 Feb 2014 08:42, "Dimitris Kontokostas" <[email protected]>
>> wrote:
>>
>>> Hello Mouhoub and welcome to the community
>>>
>>> I thought of this idea after Andrea di Menna committed the cool
>>> dump-split feature. So I am a possible (co-)mentor for this project.
>>> The reason why I didn't put any mentor here yet (And a full description)
>>> is because we don't have any mentor at the moment (including me) with
>>> experience in MapReduce.
>>> I have a good idea about it but never got any hands-on experience.
>>>
>>> We will try to find someone by the application start period, but in the
>>> meantime I can set some of the requirements. You are also welcome to
>>> suggest a mentor for this project.
>>> You seem familiar with the DBpedia extraction framework, so in your
>>> application you can suggest your own extensions to the idea.
>>>
>>> DBpedia is not accepted yet as an organization so you cannot use the
>>> melange system at the moment. We can continue with the public mailing list
>>> if you are comfortable with it or otherwise wait. Either is fine for us.
>>> I will be traveling next week but I will try to find some time and
>>> extend the idea description.
>>>
>>> Best,
>>> Dimitris
>>>
>>>
>>> On Sat, Feb 15, 2014 at 7:54 AM, Mohamed Amine MOUHOUB <
>>> [email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am a PhD Student from University of Paris Dauphine, I work on the
>>>> search of linked data and linked services. Basically I am interested in
>>>> searching and integrating data from the LOD, and DBpedia is at the center
>>>> of my interests as it is at the core of the LOD graph, and is considered,
>>>> in my opinion as a starting point to the rest of the LOD.
>>>>
>>>> Anyway, I am particularly interested in Hadoop. I started experimenting
>>>> with Hadoop in 2011, and then with Amazon EMR in 2012. I have worked on
>>>> some data mining projects with Hadoop. As a trainee at an Open Data company
>>>> in Paris (Data Publica) I worked on a project to discover open data sources
>>>> in France using Hadoop and the internet archive of Common Crawl. (120 Tb of
>>>> web documents to be analyzed, clustered, etc).
>>>> In September 2012, my project won the Common Crawl's Code Contest.
>>>>
>>>> I am very interested in the proposed idea of extraction using
>>>> MapReduce. I think it would be a valuable contribution to the performance
>>>> of the extraction framework. Moreover, the nature of the Wikipedia input
>>>> data, the nature of the output (RDF triples), and the independence of the
>>>> processing for each entry make the extraction highly parallelisable using
>>>> MapReduce. I am ready to submit a proposal for this idea, but I don't see
>>>> any mentors assigned to it. The idea is not very well described on
>>>> the wiki page, but in the coming days I can provide a briefly detailed
>>>> proposal for an implementation. I am also interested in co-authoring a
>>>> conference paper about the project.
>>>> Is any mentor interested? Should I send my proposal via
>>>> this mailing list or submit it directly to the Google Summer of Code page?
>>>>
>>>> Best regards,
>>>> Amine Mouhoub
>>>>
>>>>
>>>> ------------------------------------------------------------------------------
>>>> Android apps run on BlackBerry 10
>>>> Introducing the new BlackBerry 10.2.1 Runtime for Android apps.
>>>> Now with support for Jelly Bean, Bluetooth, Mapview and more.
>>>> Get your Android app in front of a whole new audience.  Start now.
>>>>
>>>> http://pubads.g.doubleclick.net/gampad/clk?id=124407151&iu=/4140/ostg.clktrk
>>>> _______________________________________________
>>>> Dbpedia-gsoc mailing list
>>>> [email protected]
>>>> https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc
>>>>
>>>>
>>>
>>>
>>> --
>>> Kontokostas Dimitris
>>>
>>>
>>>
>>>
>
>
> --
> Kontokostas Dimitris
>
