Re: [Dbpedia-gsoc] Apply for Dbpedia GSoC project

Jona Christopher Sahnwaldt Wed, 24 Apr 2013 11:23:31 -0700

Hi Haiyang,

On 24 April 2013 18:31, haiyang liu <[email protected]> wrote:
> Hi Dimitris,
> Thanks for the info, I would like to start on the continuous extraction
> project.
> Based on the info on the idea page, there are a lot of sub-projects,

Just for the record, here's the link again:
http://wiki.dbpedia.org/gsoc2013/ideas/ContinuousExtraction

> do they need to be done in order?

Not necessarily, but it would probably make things easier for you. For
example, a unified configuration does not depend on any other steps,
but without it, implementing the other steps will be difficult.

> Can you help point me to a place to start, like a warm-up task for this
> project?

A first warm up task would be to simply get acquainted with the
DBpedia extraction framework, i.e. mostly with the core, dump and
scripts modules, but also to some extent with the others, at least to
know what they do and which parts of the core and/or dump modules they
depend on. Then use the download script to download Wikipedia dumps.
Run some extractions.

Then look at the configuration files: there are several different
.properties files with different syntax and semantics, there are
parameters in pom.xml, some parameters must be given on the command
line, and so on. Then look at Spring and some other dependency
injection (DI) frameworks. In a perfect world, someone would have
written a DI framework that leverages Scala's abilities. Find that
perfect framework. Or build it. ;-)
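
To illustrate what a unified configuration could look like, here is a
minimal sketch in Java. The class name UnifiedConfig, the setting names
and the key=value override format are all made up for illustration;
nothing like this exists in the framework yet. The idea is simply one
lookup object that merges file defaults with command-line overrides.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical sketch: UnifiedConfig is not a DBpedia framework class.
final class UnifiedConfig {
    private final Properties values;

    private UnifiedConfig(Properties values) {
        this.values = values;
    }

    // Parses defaults written in .properties syntax, then applies
    // "key=value" overrides such as those given on the command line.
    static UnifiedConfig load(String propertiesText, String... overrides) {
        Properties merged = new Properties();
        try {
            merged.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        for (String arg : overrides) {
            int eq = arg.indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("expected key=value, got: " + arg);
            }
            merged.setProperty(arg.substring(0, eq).trim(), arg.substring(eq + 1).trim());
        }
        return new UnifiedConfig(merged);
    }

    // Fails loudly on a missing setting instead of returning null.
    String get(String key) {
        String value = values.getProperty(key);
        if (value == null) {
            throw new IllegalArgumentException("missing setting: " + key);
        }
        return value;
    }
}
```

Every module would then ask the same object for its settings, no matter
whether a value came from a file or from the command line.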

You may want to think about an architecture for this project: should
all the different extraction steps run in the same JVM, or should they
run in separate processes? This decision has huge implications for
memory consumption, modularity, communication protocols between the
extraction steps, etc. Should different languages use the same
extractor objects? And so on.
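
One way to keep that decision open for a while is to hide each step
behind a small interface. The following is only a sketch under my own
assumptions: ExtractionStep, InProcessStep and Pipeline are hypothetical
names, and records are plain strings to keep it short. It shows the
same-JVM variant; a separate-process variant could implement the same
interface and forward records over a pipe or socket.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical sketch: these types are not DBpedia framework classes.
interface ExtractionStep {
    // Transforms a batch of records; callers do not see where the work runs.
    List<String> process(List<String> input);
}

// Runs the step inside the calling JVM: shared heap, no serialization.
final class InProcessStep implements ExtractionStep {
    private final UnaryOperator<String> transform;

    InProcessStep(UnaryOperator<String> transform) {
        this.transform = transform;
    }

    @Override
    public List<String> process(List<String> input) {
        List<String> out = new ArrayList<>();
        for (String record : input) {
            out.add(transform.apply(record));
        }
        return out;
    }
}

// Feeds the output of each step into the next one, in order.
final class Pipeline {
    private final List<ExtractionStep> steps;

    Pipeline(List<? extends ExtractionStep> steps) {
        this.steps = new ArrayList<>(steps);
    }

    List<String> run(List<String> input) {
        List<String> current = input;
        for (ExtractionStep step : steps) {
            current = step.process(current);
        }
        return current;
    }
}
```

With this shape, the pipeline code stays the same if a step later moves
to its own process; only the implementation of that step changes.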

>>> I am a big fan of machine learning and data analysis and just did a
>>> research project in news processing.

It would be great if you took on the continuous extraction project,
and it is an interesting and challenging project - you will need, or
acquire, knowledge about large multi-threaded Java server applications
with complex interactions between modules. But I think that it
probably doesn't have as much to do with machine learning etc. as
some other project ideas. At least that's my point of view, maybe the
other developers can shed a different light on this question. I'm just
telling you this so you're not disappointed when you find out that
there are no statistics or heuristics involved in this project.

Remember that project ideas are just that - ideas. In this case, I
have been thinking about this idea for several months, so I have
relatively precise thoughts about how it could or should be
implemented. But in the end, you are welcome to come up with some
totally new ideas that we have never thought of.

Cheers,
Christopher

> Thanks a lot!
>
>
> On Wed, Apr 24, 2013 at 2:03 AM, Dimitris Kontokostas <[email protected]>
> wrote:
>>
>> Hi Haiyang & welcome,
>>
>> Coming late gives you less time to prepare, but we still have ~10 days left.
>>
>> We have a few students interested in ideas #1 & #2, but this doesn't
>> necessarily mean that they will apply in the end. All discussions here are
>> public and happen on this mailing list, so you can read the archives and
>> judge for yourself.
>> Depending on what idea(s) you choose, there can be different warm-up tasks,
>> so feel free to ask :)
>>
>> Cheers,
>> Dimitris
>>
>>
>> On Wed, Apr 24, 2013 at 1:38 AM, Haiyang Liu <[email protected]>
>> wrote:
>>>
>>> Hi,
>>> My name is Haiyang, and I am a rising senior at Rice University.
>>> I am a big fan of machine learning and data analysis and just did a
>>> research project in news processing.
>>> I am very interested in the DBpedia project and just heard from a friend
>>> that it has a GSoC project that I can join.
>>> I know it is kind of late since the deadline is approaching, so I want to
>>> know if it is still possible for me to apply at this time.
>>> I am really interested in these 3 topics in the idea list:
>>> 1) Massive extraction of triples from Media wikis
>>> 2) Wiktionary 2 RDF Assistance GUI
>>> 3) Continuous Extraction
>>> I am wondering whether anyone has already started working on these
>>> projects and whether I should choose less competitive ones to apply for.
>>> Right now I am looking through the docs and warm-up exercises and hope
>>> someone can help me with my questions.
>>> Thanks a lot!
>>>
>>> Haiyang Liu
>>> Rice University
>>> CS 2014
>>>
>>>
>>>
>>>
>>> ------------------------------------------------------------------------------
>>> Try New Relic Now & We'll Send You this Cool Shirt
>>> New Relic is the only SaaS-based application performance monitoring
>>> service
>>> that delivers powerful full stack analytics. Optimize and monitor your
>>> browser, app, & servers with just a few lines of code. Try New Relic
>>> and get this awesome Nerd Life shirt!
>>> http://p.sf.net/sfu/newrelic_d2d_apr
>>> _______________________________________________
>>> Dbpedia-gsoc mailing list
>>> [email protected]
>>> https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc
>>>
>>
>>
>>
>> --
>> Kontokostas Dimitris

