Hi gsoc2014 fellows,

I will paste my thoughts from my proposal about 4.3 -- Extraction
Parallelization here. Maybe they are of some use to whoever will work on
this task.


      1. download step:

You mention that Wikimedia "are capping the number of per-IP
connections to 2". This means that if the compute nodes don't have
individual IP addresses, this step does not need to be parallelized: one
node can easily handle 2 download threads. If the nodes have different
IP addresses, the download could actually be mapped to multiple nodes.
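
To illustrate, here is a minimal sketch in Scala (untested; the dump
URLs only follow the dumps.wikimedia.org naming pattern and are not a
definitive list) that respects the 2-connection cap by driving all
downloads through a fixed pool of 2 threads:

    import java.net.URL
    import java.nio.file.{Files, Paths}
    import java.util.concurrent.Executors
    import scala.concurrent.duration.Duration
    import scala.concurrent.{Await, ExecutionContext, Future}

    object DumpDownloader {
      // Wikimedia caps per-IP connections at 2, so the pool that drives
      // the downloads is capped at 2 as well.
      implicit val ec: ExecutionContext =
        ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(2))

      def download(url: String): Unit = {
        val in = new URL(url).openStream()
        try Files.copy(in, Paths.get(url.split('/').last))
        finally in.close()
      }

      def main(args: Array[String]): Unit = {
        val dumps = Seq(
          "http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2",
          "http://dumps.wikimedia.org/dewiki/latest/dewiki-latest-pages-articles.xml.bz2")
        Await.result(Future.sequence(dumps.map(u => Future(download(u)))), Duration.Inf)
      }
    }

With more nodes (and IPs), the master would simply hand each node its
own subset of the dump list.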


      2. pre-processing:

From the description it remains unclear what "calculating the redirects"
means, whether this step can be parallelized, and whether it introduces
cross-dependencies between the chunks. I did not have time to read more
about how DBpedia and the extraction work, so this could originate from
my lack of knowledge. I assume the redirects are used to build up the
relations between classes/articles.
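
If "calculating the redirects" just means building a map from redirect
page to target page, then the cross-dependency would come from
transitive chains (A -> B -> C) that can span chunk borders. A sketch of
resolving such chains, assuming the raw source -> target pairs have
already been parsed out of the dump:

    object Redirects {
      // Collapse transitive redirect chains (A -> B -> C becomes A -> C)
      // while guarding against cycles. Since a chain can cross chunk
      // borders, this map must be built globally, before chunking.
      def resolve(raw: Map[String, String]): Map[String, String] = {
        def follow(title: String, seen: Set[String]): String =
          raw.get(title) match {
            case Some(target) if !seen(target) => follow(target, seen + target)
            case _ => title
          }
        raw.map { case (source, _) => source -> follow(source, Set(source)) }
      }
    }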


      3. chunking:

Some organizational work has to be done to distribute the work parts.
How this is done depends, for example, on whether the nodes share a
filesystem. I assume that the chunks must be supplied with their
respective redirect "part" in order to facilitate RDF extraction.
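
As a rough sketch (chunk size and file naming are my own assumptions,
and the XML envelope is ignored for brevity), one could split a
decompressed dump at </page> boundaries so that no article is torn
apart:

    import java.io.PrintWriter
    import scala.io.Source

    object DumpChunker {
      // Split a decompressed XML dump into chunks of pagesPerChunk
      // pages, cutting only at </page> boundaries.
      def chunk(dumpPath: String, pagesPerChunk: Int = 10000): Unit = {
        var chunkId = 0
        var pages = 0
        var out = new PrintWriter(f"chunk-$chunkId%04d.xml")
        for (line <- Source.fromFile(dumpPath).getLines()) {
          out.println(line)
          if (line.trim == "</page>") {
            pages += 1
            if (pages == pagesPerChunk) {
              out.close(); chunkId += 1; pages = 0
              out = new PrintWriter(f"chunk-$chunkId%04d.xml")
            }
          }
        }
        out.close()
      }
    }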


      4. extraction:

From a top-level point of view this step can be executed using a
master/worker approach to distribute work evenly. Depending on the
number of nodes/cores available, one could also try to parallelize the
extraction algorithm itself.
The data-parallel part should be as easy as having a worker on each node
execute the extraction script for the chunks it gets assigned.
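
On a single node the same pattern looks roughly like this (a sketch;
runExtraction stands in for invoking the real extraction framework on
one chunk):

    import java.util.concurrent.{ConcurrentLinkedQueue, Executors, TimeUnit}

    object MasterWorker {
      // Placeholder for running the real extraction on one chunk.
      def runExtraction(chunk: String): Unit = println(s"extracting $chunk")

      def main(args: Array[String]): Unit = {
        // The master fills a queue with chunk names; every worker keeps
        // taking the next chunk, so fast workers simply pick up more
        // work and the load evens out by itself.
        val queue = new ConcurrentLinkedQueue[String]()
        (0 until 64).foreach(i => queue.add(f"chunk-$i%04d.xml"))

        val cores = Runtime.getRuntime.availableProcessors()
        val pool = Executors.newFixedThreadPool(cores)
        (1 to cores).foreach { _ =>
          pool.execute(new Runnable {
            def run(): Unit = {
              var chunk = queue.poll()
              while (chunk != null) {
                runExtraction(chunk)
                chunk = queue.poll()
              }
            }
          })
        }
        pool.shutdown()
        pool.awaitTermination(1, TimeUnit.HOURS)
      }
    }

Across nodes the queue would live on the master and the workers would
pull chunks over the network, but the load-balancing idea is the same.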


      5. reduction:

In this step the extraction results for the individual languages must be
joined. The influence of the redirects is not clear to me here either,
but one should be able to read up on this.
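
Assuming the per-chunk results are N-Triples files (the naming scheme
below is my own invention), the join for one language could be as simple
as concatenating the parts while dropping duplicate triples:

    import java.io.{File, PrintWriter}
    import scala.collection.mutable
    import scala.io.Source

    object Reducer {
      // Merge the per-chunk N-Triples results of one language into a
      // single file, dropping duplicate triples along the way.
      def merge(lang: String): Unit = {
        val parts = new File(".").listFiles()
          .filter(_.getName.matches(lang + "-chunk-.*\\.nt"))
          .sortBy(_.getName)
        val seen = mutable.HashSet[String]()
        val out = new PrintWriter(s"$lang-merged.nt")
        for (part <- parts; triple <- Source.fromFile(part).getLines())
          if (seen.add(triple)) out.println(triple)
        out.close()
      }
    }

Of course, keeping all triples in memory will not scale to the big
languages; in a real MapReduce setting one would let the sort/shuffle
phase handle the deduplication instead.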


Greetz!
Simon


On 20.03.2014 19:19, Dimitris Kontokostas wrote:
> On Mar 19, 2014 4:41 PM, "Abhijit Pratap Singh Tomar" <apt...@nyu.edu>
> wrote:
>> Hi Dimitris,
>>
>> I was looking over the github code mentioned in the discussion. First
> of all, can you give me an idea of how big a handicap it would be to not
> know Scala for this project? I am not familiar with Scala at all, but if we
> will be using only a specific subset of the language then I think I can
> pick it up. Also, the map-reduce programming that I know is all based on
> Java. It was mentioned in the discussion that
>> "We would prefer a Scala implementation but depending on the application
> we might fall back to Java too."
>> So, is it easy to interchange the Java code and the Scala code? In the
> github code I can see several instances where Java classes have been
> imported and Java code is used in tandem with Scala. Again, as I am not
> familiar with Scala, you might find these queries trivial.
>
> Java and Scala are binary compatible and can coexist in the same project.
> So we would prefer Scala code, but Java should do as well if the student
> application is good.
>
>> Now, regarding the feasibility of map-reduce for our project:
>>
>> In the Download task, I could make out that we are downloading some
> information and then doing updates of some sort, possibly on a data store
> somewhere. Am I correct? If so, we can utilize map-reduce with ease as
> long as the information being downloaded, for separate updates let's say, is
> relatively independent. We can download different stuff on different nodes,
> do whatever processing necessary and then push the updates as required.
>> I could not quite figure out what was going on in the Extraction task.
> What are we extracting, and from where? Is it conceptually very different
> from the download task? Does extraction also involve pulling some data,
> doing some processing and then pushing it back? Is this the same as
> extracting RDF triples from Wikipedia? If so, then what does the Dump
> Splitter task do?
>
> I guess you did try the warm-up tasks ;) "experiment with extraction
> configurations"
> You should also read the latest DBpedia article under
> dbpedia.org/publications to understand how dbpedia works.
>
> Best,
> Dimitris
>
>> Thanks,
>>
>> Abhijit
>>
>>
>> On Mon, Mar 17, 2014 at 2:11 AM, Dimitris Kontokostas <jimk...@gmail.com>
> wrote:
>>> Hello Abhijit,
>>>
>>> (ccing the gsoc list)
>>>
>>> As I mentioned in my previous mail to the list [1], we have a list of
> mentors (4 at the moment). After the selection period, one will be the *main*
> mentor.
>>> With dumps, we refer to wikipedia language dumps [2]. This is what
> DBpedia processes to extract RDF.
>>> Regarding tutorials / materials, I mention some tasks in the update
> email [1] and you can also read the latest DBpedia paper [3] (DBpedia -- A
> Large-scale, Multilingual Knowledge Base Extracted from Wikipedia. To
> appear in the Semantic Web Journal.)
>>> Best,
>>> Dimitris
>>>
>>> [1] http://sourceforge.net/p/dbpedia/mailman/message/32088549/
>>> [2] dumps.wikimedia.org
>>> [3] wiki.dbpedia.org/Publications
>>>
>>>
>>> On Mon, Mar 17, 2014 at 5:12 AM, Abhijit Pratap Singh Tomar <
> apt...@nyu.edu> wrote:
>>>> Hi Dimitris,
>>>>
>>>> I must apologize for not responding sooner. Have you finalized a mentor
> with Map Reduce experience for this project? I was going over the links
> you sent me and I would like to know more specifically about the
> parallelization step. Could you shed further light on what is meant by
>>>> 'For every splitted dump of a language, we are given the redirects from
> the previous step and process them to get the RDF'
>>>> What does a dump comprise? What is the processing that you need to
> apply to each dump?
>>>> Finally, please provide any tutorials or material that I need to
> go over in order to write my proposal.
>>>> Thanks,
>>>>
>>>> Abhijit
>>>>
>>>>
>>>> On Thu, Mar 6, 2014 at 4:57 AM, Dimitris Kontokostas <jimk...@gmail.com>
> wrote:
>>>>> Hello Abhijit and welcome to the DBpedia community,
>>>>>
>>>>> please take a look at the following pages for details and feel free to
> ask questions
>>>>> http://wiki.dbpedia.org/gsoc2014/ideas/ExtractionwithMapReduce/
>>>>>
> http://sourceforge.net/p/dbpedia/mailman/dbpedia-gsoc/thread/CA%2Bu4%2Ba3_VayThCxW%2Bj2ODsGY06mj7asvKH3pFxPhNQEEqmMOLQ%40mail.gmail.com/#msg31980399
>>>>> Please note that we already set the requirements of this project but
> we are waiting for someone with MapReduce experience to join the mentor team.
>>>>> Best,
>>>>> Dimitris
>>>>>
>>>>>
>>>>> On Thu, Mar 6, 2014 at 12:08 AM, Abhijit Pratap Singh Tomar <
> apt...@nyu.edu> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> My name is Abhijit. I am a Computer Science graduate student at NYU.
> I am really interested in knowing more about the project Extraction using
> Map Reduce.
>>>>>> I have studied Big Data Analysis using Hadoop and Pig in my last
> semester. This semester I am taking two courses: Machine Learning and
> Computational Geometry. I would like to implement those techniques, if
> needed.
>>>>>> Below is a link to my work on Big Data Analysis on my github profile.
>>>>>>
>>>>>> https://github.com/abtpst/Big-Data-Analytics
>>>>>>
>>>>>> Kindly provide me with more information about this project and please
> let me know if you need some more information on my background.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Abhijit
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Kontokostas Dimitris
>>>>
>>>
>>>
>>> --
>>> Kontokostas Dimitris
>>
>
