Hi Mohamed,
thanks for writing back and for your questions :-)
I am going to try to answer your questions inline and share my view on
this GSoC idea.
2014-02-26 3:46 GMT+01:00 Mohamed Amine MOUHOUB <[email protected]>:
> Hello Dimitris, Andrea, and everybody
>
> Thanks for providing a better description of the project idea.
> I have been very busy lately, so I didn't have enough time for the subject.
> First, I have a question: what do you mean by extracting redirects from
> dumps? I was trying to find out its meaning through the documentation, and
> it seems to me that you mean extracting internal links from the infoboxes
> that would serve as nodes for the parser. Have I understood it right?
>
What Dimitris meant by "redirects" is "Wikipedia redirects" (i.e. article
A redirects to article B).
The DBpedia framework makes use of this information in all the steps of the
extraction phase.
If you haven't had the chance yet, I would suggest you take a look at the
main publication that describes DBpedia and how it works [1].
[1] http://svn.aksw.org/papers/2013/SWJ_DBpedia/public.pdf
> I also have some ideas for the parallelization through MapReduce.
> 1) Download: downloading isn't really a MapReduce task; besides, it costs
> a lot of bandwidth and time. It can surely be parallelized
> manually/automatically, but not through MapReduce. For test purposes,
> though, the Wikipedia 2009 dumps are available on Amazon AWS as a public
> dataset.
>
Agreed; it would be interesting to investigate parallel download tools
(e.g. Axel) to speed this process up a bit.
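To make the idea concrete, here is a minimal Python sketch of downloading several dump files in parallel. The URLs and file layout are hypothetical; a tool like Axel additionally splits a single file into byte ranges, while this only parallelizes across files:

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlretrieve


def partition(urls, n):
    """Round-robin the dump URLs over n download slots (or machines)."""
    return [urls[i::n] for i in range(n)]


def download_all(urls, workers=4):
    """Fetch several dump files concurrently, one thread per file."""
    def fetch(url):
        local_name = url.rsplit("/", 1)[-1]  # save under the file's own name
        urlretrieve(url, local_name)
        return local_name

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))
```

A call like download_all(["http://example.org/dumps/enwiki-part1.xml.bz2"]) would fetch the parts concurrently; partition() shows how the same URL list could instead be spread across separate machines.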
> 2) Splitting dumps: unless there is a particular need, Hadoop
> automatically splits the dumps into chunks and distributes them across the
> machines, depending mainly on the configuration used. If there are any
> non-configuration-related needs, we can write a splitting policy of our
> own. Anyway, I think it is about converting the dumps into key-value
> sequences with pageId (for example) as the key, and the rest of the
> content and metadata as values, so that they can easily be consumed by
> MapReduce (see SequenceFile - http://wiki.apache.org/hadoop/SequenceFile).
>
How does this "automatic splitting of dumps" work in Hadoop?
I see this GSoC idea in two different ways:
a) splitting dumps into chunks, running parsing and extraction separately
and independently on multiple nodes, then reducing the outputs from the
nodes (a simple batch merge of multiple archives);
b) reading dumps on one node, sending wiki pages to multiple nodes via a
MapReduce framework, then reducing the outputs from the extraction of
single pages.
Option a) is probably easier to implement (through scripts? I am not even
sure we need a MapReduce framework), but b) is more interesting,
extensible, scalable, etc.
What do you think?
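To illustrate the shape of option b), here is a toy Python sketch. map_page and its "key = value" extraction are made-up stand-ins for the real DBpedia extractors, and the "nodes" are a plain loop; it is only meant to show the map/reduce structure, not an actual implementation:

```python
def map_page(page_id, wikitext):
    """Map step: turn one wiki page into (subject, predicate, object) triples."""
    triples = []
    for line in wikitext.splitlines():
        if "=" in line:  # pretend these are infobox "key = value" properties
            key, value = (part.strip() for part in line.split("=", 1))
            triples.append((page_id, key, value))
    return triples


def reduce_outputs(per_page_triples):
    """Reduce step: merge the per-page outputs into one dataset."""
    merged = []
    for triples in per_page_triples:
        merged.extend(triples)
    return merged


# One node reads the dump and hands pages out; here that is just iteration.
pages = {"Berlin": "population = 3500000", "Paris": "population = 2200000"}
dataset = reduce_outputs(map_page(pid, text) for pid, text in pages.items())
```

In a real setup the map step would run the framework's extractors on each page and the reduce step would merge (or simply concatenate) the resulting N-Triples files.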
> 3) It seems to me that the abstract extraction can be done jointly with
> the main extraction from infoboxes, or even separately, without relying on
> local copies of MediaWiki and MySQL, but I need to learn more about the
> abstract extraction and how it is actually done.
>
I think it would be extremely difficult to run the abstract extraction
phase without a MediaWiki instance, as one is needed to resolve templates
(at least) in order to extract text from the article source.
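For context, "resolving templates" here essentially means asking a MediaWiki instance to render the page. A hedged sketch of what such a call could look like, using MediaWiki's action=parse API against a hypothetical local endpoint (the endpoint URL and article title are assumptions):

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def parse_request_url(api_base, title):
    """Build a MediaWiki action=parse request; the server expands templates."""
    query = urlencode({
        "action": "parse",
        "page": title,
        "prop": "text",
        "format": "json",
    })
    return f"{api_base}?{query}"


def rendered_page(api_base, title):
    """Fetch the template-expanded rendering of one article (network call)."""
    with urlopen(parse_request_url(api_base, title)) as response:
        return response.read().decode("utf-8")
```

Something like rendered_page("http://localhost/mediawiki/api.php", "Berlin") would return the JSON-wrapped rendered text from which an abstract could then be cut.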
> I hope you will find a Hadoop mentor soon
>
> Cheers !
> Amine Mouhoub
>
>
>
Cheers,
Andrea
> On Sun, Feb 23, 2014 at 12:06 PM, Dimitris Kontokostas
> <[email protected]> wrote:
>
>> Hello Amine,
>>
>> I created a better description of the project idea here:
>> http://wiki.dbpedia.org/gsoc2014/ideas/ExtractionwithMapReduce?v=5uv
>>
>> We are still looking for a (Hadoop) mentor but I am very confident that
>> we will find one in the following days.
>>
>> Best,
>> Dimitris
>>
>>
>> On Sat, Feb 15, 2014 at 5:40 PM, Andrea Di Menna <[email protected]> wrote:
>>
>>> Hello Mohamed and welcome!
>>> Unfortunately I have very little experience with Hadoop myself but I
>>> would love to help with this task.
>>> Looking forward to discussing your suggestions.
>>> Cheers
>>> Andrea
>>> On 15 Feb 2014 at 08:42, "Dimitris Kontokostas" <[email protected]>
>>> wrote:
>>>
>>>> Hello Mouhoub and welcome to the community
>>>>
>>>> I thought of this idea after Andrea Di Menna committed the cool
>>>> dump-split feature, so I am a possible (co-)mentor for this project.
>>>> The reason I didn't put any mentor (and a full description) here yet
>>>> is that we don't have any mentor at the moment (myself included) with
>>>> experience in MapReduce.
>>>> I have a good idea of it but never got any hands-on experience.
>>>>
>>>> We will try to find someone by the application start period, but in
>>>> the meantime I can set out some of the requirements. You are also
>>>> welcome to suggest a mentor for this project.
>>>> You seem familiar with the DBpedia extraction framework, so in your
>>>> application you can suggest your own extensions to the idea.
>>>>
>>>> DBpedia is not accepted as an organization yet, so you cannot use the
>>>> Melange system at the moment. We can continue on the public mailing
>>>> list if you are comfortable with it, or otherwise wait. Either is
>>>> fine for us.
>>>> I will be traveling next week but I will try to find some time and
>>>> extend the idea description.
>>>>
>>>> Best,
>>>> Dimitris
>>>>
>>>>
>>>> On Sat, Feb 15, 2014 at 7:54 AM, Mohamed Amine MOUHOUB <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am a PhD student at Université Paris Dauphine, working on the
>>>>> search of linked data and linked services. Basically I am interested
>>>>> in searching and integrating data from the LOD, and DBpedia is at
>>>>> the center of my interests: it is at the core of the LOD graph and,
>>>>> in my opinion, a starting point to the rest of the LOD.
>>>>>
>>>>> Anyway, I am particularly interested in Hadoop. I started
>>>>> experimenting with Hadoop in 2011, and then with Amazon EMR in 2012.
>>>>> I have worked on some data mining projects with Hadoop. As a trainee
>>>>> at an open data company in Paris (Data Publica) I worked on a
>>>>> project to discover open data sources in France using Hadoop and the
>>>>> Common Crawl web archive (120 TB of web documents to be analyzed,
>>>>> clustered, etc.).
>>>>> In September 2012, my project won the Common Crawl Code Contest.
>>>>>
>>>>> I am very interested in the proposed idea of extraction using
>>>>> MapReduce. I think it would be a very valuable contribution to the
>>>>> performance of the extraction framework. Moreover, the nature of the
>>>>> Wikipedia input data, the nature of the output (RDF triples), and
>>>>> the independence of the processing for each entry make the
>>>>> extraction highly parallelizable with MapReduce. I am ready to
>>>>> submit a proposal for this idea, but I don't see any mentors
>>>>> assigned to it. The idea is not very well described on the wiki
>>>>> page, but I can provide a short but detailed implementation proposal
>>>>> in the coming days. I am also interested in co-authoring a
>>>>> conference paper about the project.
>>>>> Is any mentor interested? Should I send my proposal via this mailing
>>>>> list, or submit it directly on the Google Summer of Code page?
>>>>>
>>>>> Best regards,
>>>>> Amine Mouhoub
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Kontokostas Dimitris
>>>>
>>>>
>>>>
>>>>
>>
>>
>> --
>> Kontokostas Dimitris
>>
>
>
_______________________________________________
Dbpedia-gsoc mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc