Hi all,
I am Wei Wang. After trying some warm-up tasks, I think it is time to make
some noise here.
A few months ago, I worked on a project in which I annotated 6 million
tweets and Facebook posts with Wikipedia concepts (each Wikipedia article
is regarded as a concept). Specifically, with the help of Wikipedia-Miner
(http://wikipedia-miner.cms.waikato.ac.nz/), I processed the English
Wikipedia dump of October 2012 on a Hadoop cluster to extract some
metadata. Then I built a concept dictionary (with 9 million entries) that
maps phrases (concept mentions) to target concepts (Wikipedia articles).
For each tweet and post, concept mentions were recognized by looking them
up in the dictionary, and disambiguation was then performed by context
analysis.
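To make the lookup step concrete, here is a minimal sketch in Scala (the
language Spotlight itself uses) of greedy longest-match mention spotting.
The dictionary contents and the tokenization are simplified assumptions of
mine, not the actual code from my project:

object MentionSpotter {
  // Toy stand-in for the phrase -> concept dictionary (assumed data;
  // the real one had millions of entries).
  val dictionary: Map[String, String] = Map(
    "it's late" -> "It's_Late",        // the song article
    "new york"  -> "New_York_City"
  )

  val maxPhraseLen = 2  // longest phrase (in tokens) in the dictionary

  // Greedy longest-match: at each position, try the longest window
  // first, then shrink it until a dictionary entry matches.
  def spot(text: String): Seq[(String, String)] = {
    val tokens = text.toLowerCase.replaceAll("[^a-z0-9' ]", " ")
      .trim.split("\\s+").toVector
    val found = scala.collection.mutable.Buffer[(String, String)]()
    var i = 0
    while (i < tokens.length) {
      val hit = (math.min(maxPhraseLen, tokens.length - i) to 1 by -1)
        .iterator
        .map(n => (n, tokens.slice(i, i + n).mkString(" ")))
        .collectFirst { case (n, p) if dictionary.contains(p) =>
          (n, p, dictionary(p)) }
      hit match {
        case Some((n, phrase, concept)) =>
          found += ((phrase, concept)); i += n  // skip the matched phrase
        case None =>
          i += 1
      }
    }
    found.toSeq
  }

  def main(args: Array[String]): Unit =
    spot("It's late, I have to go now").foreach(println)
    // prints (it's late,It's_Late): a pure surface match, which is
    // exactly the kind of false positive discussed below.
}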
In fact, what I did is just one of the functions provided by DBpedia
Spotlight. But through this project, I realized both the importance and
the challenges of DBpedia Spotlight. For example, by annotating text with
concepts, computers are able to understand the semantics of the text.
However, there are two challenges in this annotation work. First, how do
we recognize concept mentions? Phrases are often falsely recognized as
concept mentions: e.g., given the sentence "It's late, I have to go now",
since there is a Wikipedia article about a song named "It's Late", the
phrase in this sentence would likely be linked to that article. It is
also possible that some true concept mentions go unrecognized due to
limited dictionary coverage and noise in the text. Second, it is well
known that some mentions are ambiguous, so disambiguating them accurately
and efficiently is another challenge, especially for short text. This is
still an ongoing research topic.
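For the second challenge, a common baseline is to score each candidate
concept by how well the words around the mention match a context profile
of that concept (my own context analysis was in this spirit, though the
sketch below is simplified; the profiles are made up by me, while in
practice they would be derived from link contexts in the Wikipedia dump):

object Disambiguator {
  // Assumed toy profiles: words that typically co-occur with each
  // candidate concept.
  val profiles: Map[String, Set[String]] = Map(
    "Apple_Inc." -> Set("iphone", "mac", "software", "company"),
    "Apple"      -> Set("fruit", "tree", "pie", "juice")
  )

  // Jaccard overlap between the mention's context words and a profile.
  def score(context: Set[String], profile: Set[String]): Double = {
    val union = (context union profile).size.toDouble
    if (union == 0) 0.0 else (context intersect profile).size / union
  }

  // Pick the candidate whose profile best matches the observed context.
  def disambiguate(context: Set[String],
                   candidates: Seq[String]): Option[String] =
    candidates
      .map(c => c -> score(context, profiles.getOrElse(c, Set.empty)))
      .sortBy(-_._2)
      .headOption
      .collect { case (c, s) if s > 0 => c }

  def main(args: Array[String]): Unit = {
    val ctx = "i baked an apple pie from the tree in our garden"
      .split("\\s+").toSet
    println(disambiguate(ctx, Seq("Apple_Inc.", "Apple")))  // Some(Apple)
  }
}

For short text like tweets this is exactly where it gets hard: the context
set is tiny, so the overlap signal is weak.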
Regarding idea 3.1 (the Google mention corpus), I think the overlap
between the Google mention corpus and the Wikipedia dump may be a point
worth considering; otherwise we may index some redundant data. (Please
correct me, since I have little knowledge about this part, and I am still
reading the related code.)
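To illustrate what I mean by redundancy (the corpus format below is
purely my assumption; I have not yet checked how the real data is
structured): if both sources map (surface form, concept) pairs to counts,
a merge should combine the counts instead of indexing two copies of the
same pair:

object CorpusMerge {
  type Counts = Map[(String, String), Int]  // (surface form, concept) -> count

  // Sum counts for pairs present in both corpora, so overlapping
  // entries are indexed once with combined evidence.
  def merge(wikipedia: Counts, google: Counts): Counts =
    (wikipedia.keySet ++ google.keySet).map { key =>
      key -> (wikipedia.getOrElse(key, 0) + google.getOrElse(key, 0))
    }.toMap

  def main(args: Array[String]): Unit = {
    val wiki = Map(("big apple", "New_York_City") -> 120)
    val goog = Map(("big apple", "New_York_City") -> 300,
                   ("big apple", "Apple") -> 4)
    merge(wiki, goog).foreach(println)  // shared pair appears once, count 420
  }
}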
For 3.2, I think it's really important. From my experience, the
disambiguation procedure is time-consuming, because context analysis is
usually involved.
So far, I have done some warm-up tasks like documentation. I am trying
out the software and learning Scala. I will share my thoughts regarding
the two ideas later. Thanks.
Best Regards,
Wei Wang