Re: [Dbpedia-gsoc] Experience and thoughts for DBpedia Spotlight Ideas

Jona Christopher Sahnwaldt Mon, 15 Apr 2013 07:41:17 -0700

On 15 April 2013 16:31, Wang Wei <[email protected]> wrote:
> Hi Pablo,
> I have updated the Internationalization-(DB-backed-core) page. There are
> some inconsistencies between the webpage and the index_db.sh script, e.g.,
> the paths. I think there are also some problems with index_db.sh itself. I'll
> check it after I finish downloading the Wikipedia dump. I have already set
> up the clusters and environment.
> https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Internationalization-(DB-backed-core)
>
> I know the Chinese language. But the current getRedirectPatterns() in
> https://github.com/dbpedia-spotlight/pignlproc/blob/master/src/main/java/pignlproc/markup/AnnotatingMarkupParser.java
> does not support Chinese. Anyway, I will try to add it.


Hi everyone @Spotlight,

if you want, you could use
https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/wikiparser/impl/wikipedia/Redirect.scala
for a list of redirect tags. That class is generated
semi-automatically from downloaded Wikipedia settings. I have no idea
how much effort it would be to integrate that class (or the generating
process) into DBpedia Spotlight and if it would be worth the effort.
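
To illustrate the idea of per-language redirect tags, here is a minimal
sketch in Python. It is not taken from Redirect.scala or from
getRedirectPatterns(); the names REDIRECT_TAGS, redirect_pattern and
redirect_target are hypothetical, and the tag lists are a small illustrative
subset that should be checked against the generated class.

```python
import re

# Illustrative subset only (assumption): the authoritative per-language
# lists are the ones generated into Redirect.scala.
REDIRECT_TAGS = {
    "en": ["#REDIRECT"],
    "de": ["#REDIRECT", "#WEITERLEITUNG"],
    "zh": ["#REDIRECT", "#重定向"],
}


def redirect_pattern(lang):
    """Compile a regex that matches a redirect page for the given language."""
    # Fall back to the generic tag for languages missing from the table.
    tags = REDIRECT_TAGS.get(lang, ["#REDIRECT"])
    alternatives = "|".join(re.escape(tag) for tag in tags)
    # Tag, optional colon and whitespace, then the start of a [[Target]] link.
    return re.compile(
        r"^\s*(?:" + alternatives + r")\s*:?\s*\[\[([^\]|#]+)",
        re.IGNORECASE,
    )


def redirect_target(lang, wikitext):
    """Return the redirect target title, or None if the text is not a redirect."""
    match = redirect_pattern(lang).match(wikitext)
    return match.group(1).strip() if match else None
```

With this table, redirect_target("zh", "#重定向 [[北京市]]") yields the target
title, while the same text checked against the "en" tags is not treated as a
redirect.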

Cheers,
JC

>
> I also moved the user's manual page from the wiki to GitHub:
> https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/User's-manual.
> But there is a lot of overlap between this page and the web service
> page (https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Web-service).
> In fact, this page seems a little thin for a user's manual. Some
> details should be added, e.g., the programmatic usage part.
>
> I will report my progress later.
> Thanks for your guidance! People from the open source community are really
> nice. It would be my pleasure to contribute to this community.
>
> Best Regards,
> Wei Wang
>
>
> On Mon, Apr 15, 2013 at 6:28 PM, Pablo N. Mendes <[email protected]>
> wrote:
>>
>> Hi Wei,
>> Thanks for your interest. Can you share with us (e.g. via links to github)
>> the results from your warm up tasks so far?
>>
>> I have another WarmUp task proposal. If you know anything about Chinese (I
>> have Chinese friends with family name Wang, so sorry if I make incorrect
>> assumptions), you could try to run the Indexing (DB core) for Chinese
>> language, or identify the reasons why this process would not work.
>>
>> https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Internationalization-(DB-backed-core)
>>
>> You could send us questions that you may have, so that we can improve the
>> documentation on the page above.
>>
>> Please note that this does not guarantee your acceptance to GSoC. However,
>> doing well in open source community participation is usually a great plus
>> during the selection phase. So the advice to start participating early goes
>> for all prospective applicants.
>>
>> Cheers,
>> Pablo
>>
>>
>> On Sun, Apr 14, 2013 at 4:00 PM, Wang Wei <[email protected]> wrote:
>>>
>>> Hi all,
>>>
>>> I am Wei Wang. After trying some warm up tasks, I think it is time to
>>> make some noise here.
>>>
>>> A few months ago, I worked on a project in which I annotated 6 million
>>> tweets and Facebook posts with Wikipedia concepts (each Wikipedia article
>>> is regarded as a concept). Specifically, with the help of
>>> Wikipedia-Miner (http://wikipedia-miner.cms.waikato.ac.nz/), I processed the
>>> October 2012 English Wikipedia dump on a Hadoop cluster to extract
>>> some metadata. Then I built a concept dictionary (with 9 million entities)
>>> which maps phrases (concept mentions) to target concepts (Wikipedia
>>> articles). For each tweet and post, the concept mentions were recognized by
>>> looking them up in the dictionary. Disambiguation was then done by context
>>> analysis.
>>>
>>> In fact, what I did is just one of the functions provided by DBpedia
>>> Spotlight. But, through this project, I realized the importance and
>>> challenges of DBpedia Spotlight. For example, by annotating text with
>>> concepts, computers are able to understand the semantics of the text.
>>> However, there are two challenges for this annotation work. Firstly, how to
>>> recognize the concept mentions? Phrases are often wrongly recognized as
>>> concept mentions (false positives). E.g., given the sentence "It's late, I
>>> have to go now", since there is an article about a song named "It's late" in
>>> Wikipedia, it is likely the phrase in this sentence would be linked to that
>>> article. It is also possible that some true concept mentions are not
>>> recognized due to limited dictionary coverage and noise in the text.
>>> Secondly, it is well known that some mentions are ambiguous. Thus, how to
>>> disambiguate them accurately and efficiently is another challenge,
>>> especially for short text. This is still an ongoing research topic.
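
The dictionary lookup described above, and the "It's late" false positive,
can be reproduced with a minimal sketch. This is not Wikipedia-Miner or
DBpedia Spotlight code; the dictionary entries, the article titles and the
names CONCEPT_DICTIONARY, MAX_PHRASE_WORDS and spot_mentions are hypothetical,
and greedy longest-match is only one possible lookup strategy.

```python
# Toy dictionary (hypothetical entries): surface form -> candidate concepts.
CONCEPT_DICTIONARY = {
    "it's late": ["It's Late (song)"],
    "new york": ["New York City", "New York (state)"],
}

# Longest phrase, in words, that the lookup will try.
MAX_PHRASE_WORDS = 3


def spot_mentions(text, dictionary=CONCEPT_DICTIONARY):
    """Return (phrase, candidates) pairs found by greedy longest-match lookup."""
    # Very rough tokenization: lowercase, drop commas, split on whitespace.
    words = text.lower().replace(",", " ").split()
    mentions = []
    i = 0
    while i < len(words):
        # Try the longest phrase starting at position i first.
        for n in range(min(MAX_PHRASE_WORDS, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            if phrase in dictionary:
                mentions.append((phrase, dictionary[phrase]))
                i += n  # skip past the matched phrase
                break
        else:
            i += 1  # no phrase starts here; move on one word
    return mentions


# The lookup alone cannot tell that this sentence is not about the song.
print(spot_mentions("It's late, I have to go now"))
```

A phrase with more than one candidate, such as "new york" in this toy
dictionary, is what the disambiguation step would then have to resolve from
context.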
>>>
>>> Regarding idea 3.1 (Google mention corpus), I think the overlap between
>>> the Google mention corpus and the Wikipedia dump is a point that should be
>>> considered. Otherwise we may index some redundant data. (Please correct me,
>>> since I have little knowledge about this part and am still reading the
>>> related code.)
>>>
>>> For 3.2, I think it's really important. From my experience, the
>>> disambiguation procedure is time consuming, because content analysis is
>>> usually involved.
>>>
>>> So far, I have done some warm up tasks like documentation. I am trying
>>> the software and learning Scala. I will share my thoughts regarding the two
>>> ideas later. Thanks.
>>>
>>> Best Regards,
>>> Wei Wang
>>>
>>>
>>>
>>> ------------------------------------------------------------------------------
>>> Precog is a next-generation analytics platform capable of advanced
>>> analytics on semi-structured data. The platform includes APIs for
>>> building
>>> apps and a phenomenal toolset for data science. Developers can use
>>> our toolset for easy data analysis & visualization. Get a free account!
>>> http://www2.precog.com/precogplatform/slashdotnewsletter
>>> _______________________________________________
>>> Dbpedia-gsoc mailing list
>>> [email protected]
>>> https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc
>>>
>>
>>
>>
>> --
>>
>> Pablo N. Mendes
>> http://pablomendes.com
>
>
>
>

