On 15 April 2013 16:49, Andrea Di Menna <[email protected]> wrote:
> Hi Pablo,
>
> there is already a DBpedia Core module actually:
>
> https://github.com/dbpedia/extraction-framework/tree/master/core
>
> Is that correct JC?

Sure, but it's rather large. I guess what Pablo means is a module that
only contains a few classes that are the 'real core', and I totally agree
that it would be nice to have something like that, or even several smaller
modules where each one covers only a certain aspect that some kind of
Wikipedia extraction may need. The current 'core' module contains many
classes that are actually rather specific and not 'basic' at all.

Cheers,
JC

>
> Cheers,
> Andrea
>
>
> 2013/4/15 Pablo N. Mendes <[email protected]>
>
>>
>> Hi Jona,
>> thanks! IIRC, we decided against adding the DEF (DBpedia Extraction
>> Framework) as a dependency to pignlproc in order to reduce the size of the
>> jar that has to be shipped to each Hadoop node. So I think somebody just
>> snagged the code into our codebase.
>>
>> It would be very neat if these reusable classes were somehow separated
>> into a "DBpedia Core" module that could be imported by any project that
>> depends on DBpedia. We also use the similar Disambiguation class, and the
>> WikiUtil for encoding/decoding URIs.
>>
>> Cheers,
>> Pablo
>>
>>
>> On Mon, Apr 15, 2013 at 4:40 PM, Jona Christopher Sahnwaldt
>> <[email protected]> wrote:
>>>
>>> On 15 April 2013 16:31, Wang Wei <[email protected]> wrote:
>>> > Hi Pablo,
>>> > I have updated the Internationalization-(DB-backed-core) page. There
>>> > are some inconsistencies between the webpage and the index_db.sh
>>> > script, e.g., the paths. I think there are also some problems with
>>> > index_db.sh. I'll check it after I finish downloading the Wikipedia
>>> > dump. I have already set up the clusters and environment.
>>> >
>>> > https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Internationalization-(DB-backed-core)
>>> >
>>> > I know the Chinese language.
>>> > But the current getRedirectPatterns() in
>>> >
>>> > https://github.com/dbpedia-spotlight/pignlproc/blob/master/src/main/java/pignlproc/markup/AnnotatingMarkupParser.java
>>> > does not support Chinese. Anyway, I will try to add it.
>>>
>>> Hi everyone @Spotlight,
>>>
>>> if you want, you could use
>>>
>>> https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/wikiparser/impl/wikipedia/Redirect.scala
>>> for a list of redirect tags. That class is generated
>>> semi-automatically from downloaded Wikipedia settings. I have no idea
>>> how much effort it would be to integrate that class (or the generating
>>> process) into DBpedia Spotlight and if it would be worth the effort.
>>>
>>> Cheers,
>>> JC
>>>
>>> >
>>> > I also moved the user's manual page from the wiki to GitHub:
>>> >
>>> > https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/User's-manual.
>>> > But there is much overlap between this page and the web service page
>>> > (https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Web-service).
>>> > In fact, this page seems a little too simple for a user's manual.
>>> > Some details should be added, e.g., the programmatic usage part.
>>> >
>>> > I will report my progress later.
>>> > Thanks for your guidance! People from the open source community are
>>> > really nice. It would be my pleasure to contribute to this community.
>>> >
>>> > Best Regards,
>>> > Wei Wang
>>> >
>>> >
>>> > On Mon, Apr 15, 2013 at 6:28 PM, Pablo N. Mendes
>>> > <[email protected]>
>>> > wrote:
>>> >>
>>> >> Hi Wei,
>>> >> Thanks for your interest. Can you share with us (e.g. via links to
>>> >> GitHub) the results from your warm up tasks so far?
>>> >>
>>> >> I have another WarmUp task proposal.
>>> >> If you know anything about Chinese (I have Chinese friends with
>>> >> family name Wang, so sorry if I make incorrect assumptions), you
>>> >> could try to run the Indexing (DB core) for Chinese language, or
>>> >> identify the reasons why this process would not work.
>>> >>
>>> >>
>>> >> https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Internationalization-(DB-backed-core)
>>> >>
>>> >> You could send us questions that you may have, so that we can improve
>>> >> the documentation on the page above.
>>> >>
>>> >> Please note that this does not guarantee your acceptance to GSoC.
>>> >> However, doing well in open source community participation is usually
>>> >> a great plus during the selection phase. So the advice to start
>>> >> participating early goes for all prospective applicants.
>>> >>
>>> >> Cheers,
>>> >> Pablo
>>> >>
>>> >>
>>> >> On Sun, Apr 14, 2013 at 4:00 PM, Wang Wei <[email protected]>
>>> >> wrote:
>>> >>>
>>> >>> Hi all,
>>> >>>
>>> >>> I am Wei Wang. After trying some warm up tasks, I think it is time to
>>> >>> make some noise here.
>>> >>>
>>> >>> A few months ago, I worked on a project in which I annotated 6
>>> >>> million Tweets and Facebook posts with Wikipedia concepts (each
>>> >>> Wikipedia article is regarded as a concept). Specifically, with the
>>> >>> help of Wikipedia-Miner (http://wikipedia-miner.cms.waikato.ac.nz/),
>>> >>> I processed the October 2012 English Wikipedia dump on a Hadoop
>>> >>> cluster to extract some metadata. Then I built a concept dictionary
>>> >>> (with 9 million entities) which maps phrases (concept mentions) to
>>> >>> target concepts (Wikipedia articles). For each tweet and post, the
>>> >>> concept mentions were recognized by looking up the dictionary. Then
>>> >>> disambiguation was conducted by context analysis.
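
A rough sketch of the dictionary-lookup-then-disambiguate pipeline described in the paragraph above might look as follows. This is only an illustration under stated assumptions, not the code used in that project, in Wikipedia-Miner, or in DBpedia Spotlight: the toy dictionary, its context words, and the names CONCEPT_DICT, MAX_PHRASE_LEN, spot_mentions, disambiguate and annotate are all made up for this example, and "context analysis" is reduced to a plain word-overlap count.

```python
import re

# Hypothetical toy dictionary: lowercased phrase -> candidate concepts,
# each with a few made-up context words. Not real Wikipedia data.
CONCEPT_DICT = {
    "apple": {
        "Apple_Inc.": {"iphone", "mac", "company"},
        "Apple_(fruit)": {"fruit", "tree", "eat"},
    },
    "new york": {
        "New_York_City": {"city", "manhattan"},
    },
}

MAX_PHRASE_LEN = 3  # longest phrase to try, in tokens (assumption)


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def spot_mentions(tokens):
    """Greedy longest-match lookup of token n-grams in the dictionary.

    Returns a list of (start_index, phrase) pairs.
    """
    mentions = []
    i = 0
    while i < len(tokens):
        # Try the longest n-gram first, then shorter ones.
        for n in range(min(MAX_PHRASE_LEN, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in CONCEPT_DICT:
                mentions.append((i, phrase))
                i += n
                break
        else:
            # No dictionary phrase starts at this token.
            i += 1
    return mentions


def disambiguate(phrase, tokens):
    """Pick the candidate whose context words overlap most with the text."""
    context = set(tokens)
    candidates = CONCEPT_DICT[phrase]
    # sorted() makes tie-breaking deterministic (alphabetical order).
    return max(sorted(candidates), key=lambda c: len(candidates[c] & context))


def annotate(text):
    """Return (mention, chosen concept) pairs for the given text."""
    tokens = tokenize(text)
    return [(phrase, disambiguate(phrase, tokens))
            for _, phrase in spot_mentions(tokens)]
```

With this toy data, annotate("I bought an Apple iPhone in New York") links "apple" to Apple_Inc. and "new york" to New_York_City, while annotate("I eat an apple from the tree") links "apple" to Apple_(fruit). A real system would need a far larger dictionary and a stronger context model than word overlap.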
>>> >>>
>>> >>> In fact, what I did is just one of the functions provided by DBpedia
>>> >>> Spotlight. But through this project I realized the importance and
>>> >>> challenges of DBpedia Spotlight. For example, by annotating text with
>>> >>> concepts, computers are able to understand the semantics of the text.
>>> >>> However, there are two challenges for this annotation work. Firstly,
>>> >>> how to recognize the concept mentions? Phrases are often falsely
>>> >>> recognized as concept mentions. E.g., given the sentence "It's late,
>>> >>> I have to go now", since there is a Wikipedia article about a song
>>> >>> named "It's late", it is likely that the phrase in this sentence
>>> >>> would be linked to that article. It is also possible that some true
>>> >>> concept mentions are not recognized due to the dictionary coverage
>>> >>> and text noise. Secondly, it is well known that some mentions are
>>> >>> ambiguous. Thus, how to disambiguate them accurately and efficiently
>>> >>> is another challenge, especially for short text. This is still an
>>> >>> ongoing research topic.
>>> >>>
>>> >>> Regarding idea 3.1 (Google mention corpus), I think the overlap
>>> >>> between the Google mention corpus and the Wikipedia dump may be a
>>> >>> point that should be considered. Otherwise we may index some
>>> >>> redundant data. (Please correct me, since I have little knowledge
>>> >>> about this part and am still reading the related code.)
>>> >>>
>>> >>> For 3.2, I think it's really important. From my experience, the
>>> >>> disambiguation procedure is time consuming, because content
>>> >>> analysis is usually involved.
>>> >>>
>>> >>> So far, I have done some warm up tasks like documentation. I am
>>> >>> trying the software and learning Scala. I will share my thoughts
>>> >>> regarding the two ideas later. Thanks.
>>> >>>
>>> >>> Best Regards,
>>> >>> Wei Wang
>>> >>>
>>> >>> ------------------------------------------------------------------------------
>>> >>> Precog is a next-generation analytics platform capable of advanced
>>> >>> analytics on semi-structured data. The platform includes APIs for
>>> >>> building apps and a phenomenal toolset for data science. Developers
>>> >>> can use our toolset for easy data analysis & visualization. Get a
>>> >>> free account!
>>> >>> http://www2.precog.com/precogplatform/slashdotnewsletter
>>> >>> _______________________________________________
>>> >>> Dbpedia-gsoc mailing list
>>> >>> [email protected]
>>> >>> https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc
>>> >>
>>> >> --
>>> >>
>>> >> Pablo N. Mendes
>>> >> http://pablomendes.com
>>
>> --
>>
>> Pablo N. Mendes
>> http://pablomendes.com
