Hi Wei,
Thanks for your interest. Can you share with us (e.g. via links to GitHub)
the results from your warm-up tasks so far?
I have another warm-up task proposal. If you know anything about Chinese (I
have Chinese friends with family name Wang, so sorry if I make incorrect
assumptions), you could try to run the Indexing (DB core) for the Chinese
language, or identify the reasons why this process would not work.
https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Internationalization-(DB-backed-core)
You could send us questions that you may have, so that we can improve the
documentation on the page above.
Please note that this does not guarantee your acceptance to GSoC. However,
doing well in open source community participation is usually a great plus
during the selection phase. So the advice to start participating early goes
for all prospective applicants.
Cheers,
Pablo
On Sun, Apr 14, 2013 at 4:00 PM, Wang Wei <[email protected]> wrote:
> Hi all,
>
> I am Wei Wang. After trying some warm-up tasks, I think it is time to make
> some noise here.
>
> A few months ago, I worked on a project in which I annotated 6 million
> Tweets and Facebook posts with Wikipedia concepts (each Wikipedia article
> is regarded as a concept). Specifically, with the help of Wikipedia-Miner
> (http://wikipedia-miner.cms.waikato.ac.nz/), I processed the October 2012
> English dump of Wikipedia on a Hadoop cluster to extract some metadata.
> Then I built a concept dictionary (with 9 million entities) which maps
> phrases (concept mentions) to target concepts (Wikipedia articles). For each
> tweet and post, the concept mentions were recognized by looking them up in
> the dictionary. Then disambiguation was conducted by context analysis.
>
> In fact, what I did is just one of the functions provided by DBpedia
> Spotlight. But, through this project, I realized the importance and
> challenges of DBpedia Spotlight. For example, by annotating text with
> concepts, computers are able to understand the semantics of the text.
> However, there are two challenges for this annotation work. Firstly, how to
> recognize the concept mentions? Phrases are often falsely recognized as
> concept mentions (false positives). E.g., given the sentence "It's late, I have
> to go now", since there is an article about a song named "It's late" in
> Wikipedia, it is likely the phrase in this sentence would be linked to that
> article. It is also possible that some true concept mentions are not
> recognized due to the dictionary coverage and text noise. Secondly, it is
> well known that some mentions are ambiguous. Thus, how to disambiguate them
> accurately and efficiently is another challenge, especially for short text.
> This is still an ongoing research topic.
>
> Regarding idea 3.1 (Google mention corpus), I think the overlap between
> the Google mention corpus and the Wikipedia dump is a point that should be
> considered. Otherwise we may index some redundant data. (Please correct me
> if I am wrong, since I have little knowledge about this part and I am still
> reading the related code.)
>
> For 3.2, I think it's really important. From my experience, the
> disambiguation procedure is time-consuming, because content analysis is
> usually involved.
>
> So far, I have done some warm-up tasks like documentation. I am trying out
> the software and learning Scala. I will share my thoughts regarding the two
> ideas later. Thanks.
>
> Best Regards,
> Wei Wang
>
>
>
> ------------------------------------------------------------------------------
> Precog is a next-generation analytics platform capable of advanced
> analytics on semi-structured data. The platform includes APIs for building
> apps and a phenomenal toolset for data science. Developers can use
> our toolset for easy data analysis & visualization. Get a free account!
> http://www2.precog.com/precogplatform/slashdotnewsletter
> _______________________________________________
> Dbpedia-gsoc mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc
>
>
--
Pablo N. Mendes
http://pablomendes.com