On 15 April 2013 16:31, Wang Wei <[email protected]> wrote:
> Hi Pablo,
> I have updated the Internationalization-(DB-backed-core) page. There are
> some inconsistencies between the web page and the index_db.sh script, e.g.,
> in the paths. I think there are also some problems with index_db.sh itself.
> I'll check it after the Wikipedia dump finishes downloading. I have already
> set up the clusters and environment.
> https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Internationalization-(DB-backed-core)
>
> I know Chinese, but the current getRedirectPatterns() in
> https://github.com/dbpedia-spotlight/pignlproc/blob/master/src/main/java/pignlproc/markup/AnnotatingMarkupParser.java
> does not support Chinese. Anyway, I will try to add it.
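
For illustration only, here is a minimal Python sketch of what language-aware redirect detection could look like. It is not the pignlproc implementation (which is Java). The tag table, the function names, and in particular the Chinese alias `#重定向` are assumptions that would need to be checked against the actual Wikipedia settings, for example the Redirect.scala list from the DBpedia extraction framework mentioned in this thread.

```python
import re

# Hypothetical table of redirect magic words per language.
# The "zh" alias is an assumption, not taken from pignlproc.
REDIRECT_TAGS = {
    "en": ["#REDIRECT"],
    "zh": ["#REDIRECT", "#重定向"],
}


def redirect_pattern(lang):
    """Build a regex that matches a redirect line for the given language."""
    tags = REDIRECT_TAGS.get(lang, REDIRECT_TAGS["en"])
    alternatives = "|".join(re.escape(tag) for tag in tags)
    # Capture the link target up to the first ']', '|' or '#'.
    return re.compile(
        r"^\s*(?:" + alternatives + r")\s*:?\s*\[\[([^\]|#]+)",
        re.IGNORECASE,
    )


def redirect_target(wikitext, lang="en"):
    """Return the redirect target title, or None if the text is not a redirect."""
    match = redirect_pattern(lang).match(wikitext)
    return match.group(1).strip() if match else None
```

With this table, `redirect_target("#重定向 [[北京市]]", "zh")` yields the target title, while the same text checked with the English-only tags is not recognized as a redirect, which is the kind of gap described above.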
Hi everyone @Spotlight,

if you want, you could use
https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/wikiparser/impl/wikipedia/Redirect.scala
for a list of redirect tags. That class is generated semi-automatically from
downloaded Wikipedia settings. I have no idea how much effort it would take to
integrate that class (or the generating process) into DBpedia Spotlight, or
whether it would be worth the effort.

Cheers,
JC

>
> I also moved the user's manual page from the wiki to GitHub:
> https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/User's-manual.
> But there is a lot of overlap between this page and the web service page
> (https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Web-service).
> In fact, this page seems a little thin for a user's manual. Some
> details should be added, e.g., the programmatic usage part.
>
> I will report my progress later.
> Thanks for your guidance! People from the open source community are really
> nice. It would be my pleasure to contribute to this community.
>
> Best Regards,
> Wei Wang
>
>
> On Mon, Apr 15, 2013 at 6:28 PM, Pablo N. Mendes <[email protected]>
> wrote:
>>
>> Hi Wei,
>> Thanks for your interest. Can you share with us (e.g. via links to GitHub)
>> the results from your warm up tasks so far?
>>
>> I have another WarmUp task proposal. If you know anything about Chinese (I
>> have Chinese friends with family name Wang, so sorry if I make incorrect
>> assumptions), you could try to run the Indexing (DB core) for the Chinese
>> language, or identify the reasons why this process would not work.
>>
>> https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Internationalization-(DB-backed-core)
>>
>> You could send us questions that you may have, so that we can improve the
>> documentation on the page above.
>>
>> Please note that this does not guarantee your acceptance to GSoC.
>> However, doing well in open source community participation is usually a
>> great plus during the selection phase. So the advice to start participating
>> early goes for all prospective applicants.
>>
>> Cheers,
>> Pablo
>>
>>
>> On Sun, Apr 14, 2013 at 4:00 PM, Wang Wei <[email protected]> wrote:
>>>
>>> Hi all,
>>>
>>> I am Wei Wang. After trying some warm up tasks, I think it is time to
>>> make some noise here.
>>>
>>> A few months ago, I worked on a project in which I annotated 6 million
>>> tweets and Facebook posts with Wikipedia concepts (each Wikipedia article
>>> is regarded as a concept). Specifically, with the help of
>>> Wikipedia-Miner (http://wikipedia-miner.cms.waikato.ac.nz/), I processed
>>> the October 2012 English Wikipedia dump on a Hadoop cluster to extract
>>> some metadata. Then I built a concept dictionary (with 9 million entities)
>>> which maps phrases (concept mentions) to target concepts (Wikipedia
>>> articles). For each tweet and post, the concept mentions were recognized
>>> by looking them up in the dictionary. Disambiguation was then performed
>>> by context analysis.
>>>
>>> In fact, what I did is just one of the functions provided by DBpedia
>>> Spotlight. But through this project I realized the importance and
>>> challenges of DBpedia Spotlight. For example, by annotating text with
>>> concepts, computers are able to understand the semantics of the text.
>>> However, there are two challenges for this annotation work. Firstly, how to
>>> recognize the concept mentions? Phrases are often falsely recognized as
>>> concept mentions. E.g., given the sentence "It's late, I have
>>> to go now", since there is an article about a song named "It's late" in
>>> Wikipedia, it is likely the phrase in this sentence would be linked to that
>>> article. It is also possible that some true concept mentions are not
>>> recognized due to limited dictionary coverage and noise in the text.
>>> Secondly, it is well known that some mentions are ambiguous.
>>> Thus, how to disambiguate them accurately and efficiently is another
>>> challenge, especially for short text. This is still an ongoing research
>>> topic.
>>>
>>> Regarding idea 3.1 (Google mention corpus), I think the overlap between the
>>> Google mention corpus and the Wikipedia dump is a point that should be
>>> considered. Otherwise we may index some redundant data. (Please correct me,
>>> since I have little knowledge about this part and am still reading the
>>> related code.)
>>>
>>> As for 3.2, I think it's really important. From my experience, the
>>> disambiguation procedure is time-consuming because content analysis is
>>> usually involved.
>>>
>>> So far, I have done some warm up tasks like documentation. I am trying
>>> the software and learning Scala. I will share my thoughts regarding the two
>>> ideas later. Thanks.
>>>
>>> Best Regards,
>>> Wei Wang
>>>
>>>
>>>
>>> ------------------------------------------------------------------------------
>>> Precog is a next-generation analytics platform capable of advanced
>>> analytics on semi-structured data. The platform includes APIs for
>>> building
>>> apps and a phenomenal toolset for data science. Developers can use
>>> our toolset for easy data analysis & visualization. Get a free account!
>>> http://www2.precog.com/precogplatform/slashdotnewsletter
>>> _______________________________________________
>>> Dbpedia-gsoc mailing list
>>> [email protected]
>>> https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc
>>>
>>
>>
>>
>> --
>>
>> Pablo N. Mendes
>> http://pablomendes.com
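
The dictionary-based annotation described in Wei's first message (phrases mapped to candidate Wikipedia concepts, mentions found by lookup) can be sketched roughly as follows. This is a toy Python illustration, not DBpedia Spotlight or Wikipedia-Miner code: the function names, the greedy longest-match strategy, and the concept identifiers in the sample entries are all assumptions made for the example.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into word tokens, keeping apostrophes."""
    return re.findall(r"[\w']+", text.lower())


def build_dictionary(entries):
    """Map each phrase (as a token tuple) to the set of concepts it may denote."""
    dictionary = {}
    for phrase, concept in entries:
        dictionary.setdefault(tuple(tokenize(phrase)), set()).add(concept)
    return dictionary


def spot_mentions(text, dictionary, max_len=4):
    """Greedy longest-match lookup; returns (mention, candidate concepts) pairs."""
    tokens = tokenize(text)
    mentions = []
    i = 0
    while i < len(tokens):
        # Try the longest window first, then shrink it.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in dictionary:
                mentions.append((" ".join(key), sorted(dictionary[key])))
                i += n
                break
        else:
            i += 1  # no phrase starts at this token
    return mentions


# Hypothetical entries; the concept identifiers are made up for the example.
concept_dictionary = build_dictionary([
    ("It's late", "It's_Late_(song)"),
    ("Java", "Java_(programming_language)"),
    ("Java", "Java_(island)"),
])
```

Calling `spot_mentions("It's late, I have to go now", concept_dictionary)` links the phrase to the song, which is the false positive from the first challenge. Calling it on "I am learning Java" returns two candidates for one mention, which is the ambiguity that the disambiguation step would have to resolve from context.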
