Re: [Dbpedia-gsoc] Experience and thoughts for DBpedia Spotlight Ideas

Andrea Di Menna Mon, 15 Apr 2013 07:49:41 -0700

Hi Pablo,

there is already a DBpedia Core module actually:


https://github.com/dbpedia/extraction-framework/tree/master/core

Is that correct, JC?

Cheers,
Andrea


2013/4/15 Pablo N. Mendes <[email protected]>

>
> Hi Jona,
> thanks! IIRC, we decided against adding the DEF (DBpedia Extraction
> Framework) as a dependency of pignlproc in order to reduce the size of the
> jar that has to be shipped to each Hadoop node. So I think somebody just
> copied the code into our codebase.
>
> It would be very neat if these reusable classes were somehow separated
> into a "DBpedia Core" module that could be imported by any project that
> depends on DBpedia. We also use the similar Disambiguation class, and
> WikiUtil for encoding/decoding URIs.
>
> Cheers,
> Pablo
>
>
> On Mon, Apr 15, 2013 at 4:40 PM, Jona Christopher Sahnwaldt <
> [email protected]> wrote:
>
>> On 15 April 2013 16:31, Wang Wei <[email protected]> wrote:
>> > Hi Pablo,
>> > I have updated the Internationalization-(DB-backed-core) page. There are
>> > some inconsistencies between the web page and the index_db.sh script,
>> > e.g., the paths. I think there are also some problems with index_db.sh
>> > itself. I'll check it after I finish downloading the Wikipedia dump. I
>> > have already set up the clusters and environment.
>> >
>> > https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Internationalization-(DB-backed-core)
>> >
>> > I know the Chinese language, but the current getRedirectPatterns() in
>> >
>> > https://github.com/dbpedia-spotlight/pignlproc/blob/master/src/main/java/pignlproc/markup/AnnotatingMarkupParser.java
>> > does not support Chinese. Anyway, I will try to add it.
>>
>> Hi everyone @Spotlight,
>>
>> if you want, you could use
>>
>> https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/wikiparser/impl/wikipedia/Redirect.scala
>> for a list of redirect tags. That class is generated
>> semi-automatically from downloaded Wikipedia settings. I have no idea
>> how much effort it would be to integrate that class (or the generating
>> process) into DBpedia Spotlight and if it would be worth the effort.
>>
>> Cheers,
>> JC
>>
>> >
>> > I also moved the user's manual page from the wiki to GitHub:
>> >
>> > https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/User's-manual
>> >
>> > However, there is much overlap between this page and the web service page
>> > (https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Web-service).
>> > In fact, this page seems a little thin for a user's manual. Some details
>> > should be added, e.g., the programmatic usage part.
>> >
>> > I will report my progress later.
>> > Thanks for your guidance! People from the open source community are really
>> > nice. It would be my pleasure to contribute to this community.
>> >
>> > Best Regards,
>> > Wei Wang
>> >
>> >
>> > On Mon, Apr 15, 2013 at 6:28 PM, Pablo N. Mendes <[email protected]> wrote:
>> >>
>> >> Hi Wei,
>> >> Thanks for your interest. Can you share with us (e.g. via links to GitHub)
>> >> the results from your warm up tasks so far?
>> >>
>> >> I have another warm-up task proposal. If you know anything about Chinese
>> >> (I have Chinese friends with family name Wang, so sorry if I make
>> >> incorrect assumptions), you could try to run the Indexing (DB core) for
>> >> the Chinese language, or identify the reasons why this process would not
>> >> work.
>> >>
>> >>
>> >> https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Internationalization-(DB-backed-core)
>> >>
>> >> You could send us any questions that you may have, so that we can improve
>> >> the documentation on the page above.
>> >>
>> >> Please note that this does not guarantee your acceptance to GSoC. However,
>> >> doing well in open source community participation is usually a great plus
>> >> during the selection phase. So the advice to start participating early
>> >> goes for all prospective applicants.
>> >>
>> >> Cheers,
>> >> Pablo
>> >>
>> >>
>> >> On Sun, Apr 14, 2013 at 4:00 PM, Wang Wei <[email protected]> wrote:
>> >>>
>> >>> Hi all,
>> >>>
>> >>> I am Wei Wang. After trying some warm up tasks, I think it is time to
>> >>> make some noise here.
>> >>>
>> >>> A few months ago, I worked on a project in which I annotated 6 million
>> >>> tweets and Facebook posts with Wikipedia concepts (each Wikipedia article
>> >>> is regarded as a concept). Specifically, with the help of
>> >>> Wikipedia-Miner (http://wikipedia-miner.cms.waikato.ac.nz/), I processed
>> >>> the October 2012 English Wikipedia dump on a Hadoop cluster to extract
>> >>> some metadata. Then I built a concept dictionary (with 9 million entities)
>> >>> that maps phrases (concept mentions) to target concepts (Wikipedia
>> >>> articles). For each tweet and post, the concept mentions were recognized
>> >>> by looking them up in the dictionary. Disambiguation was then performed
>> >>> by context analysis.
>> >>>
>> >>> In fact, what I did is just one of the functions provided by DBpedia
>> >>> Spotlight. But through this project I realized the importance and the
>> >>> challenges of DBpedia Spotlight. For example, by annotating text with
>> >>> concepts, computers are able to understand the semantics of the text.
>> >>> However, there are two challenges in this annotation work. Firstly, how
>> >>> do we recognize the concept mentions? Phrases are often falsely
>> >>> recognized as concept mentions. E.g., given the sentence "It's late, I
>> >>> have to go now", since there is a Wikipedia article about a song named
>> >>> "It's late", the phrase in this sentence is likely to be linked to that
>> >>> article. It is also possible that some true concept mentions are not
>> >>> recognized because of limited dictionary coverage and noisy text.
>> >>> Secondly, it is well known that some mentions are ambiguous. Thus, how to
>> >>> disambiguate them accurately and efficiently is another challenge,
>> >>> especially for short text. This is still an ongoing research topic.
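
The dictionary lookup and the "It's late" false positive described above
can be sketched as follows. This is only an illustration in Python under
stated assumptions: greedy longest-match lookup is one common way to do
dictionary spotting and is not necessarily what Wikipedia-Miner or
DBpedia Spotlight do, and the spot_mentions function and the toy
CONCEPTS dictionary (including its concept titles) are made up for this
example.

```python
def spot_mentions(tokens, dictionary, max_len=4):
    """Greedy longest-match lookup of token n-grams in a phrase dictionary.

    Returns a list of (start, end, phrase, candidate_concepts) tuples.
    """
    mentions = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest n-gram first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n]).lower()
            if phrase in dictionary:
                match = (i, i + n, phrase, dictionary[phrase])
                break
        if match:
            mentions.append(match)
            i = match[1]  # continue after the matched span
        else:
            i += 1
    return mentions


# Toy dictionary with made-up entries: surface phrase -> candidate concepts.
CONCEPTS = {
    "it's late": ["It's Late (song)"],
    "hadoop": ["Apache Hadoop"],
}

tokens = "It's late , I have to go now".split()
found = spot_mentions(tokens, CONCEPTS)
```

Here the lookup alone links "It's late" to the song, because nothing in
a pure dictionary match looks at the surrounding words. Filtering such
spots, and choosing among several candidates when a phrase is ambiguous,
is the job of the later context-based disambiguation step.
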
>> >>>
>> >>> Regarding idea 3.1 (Google mention corpus), I think the overlap between
>> >>> the Google mention corpus and the Wikipedia dump is a point that should
>> >>> be considered. Otherwise we may index some redundant data. (Please
>> >>> correct me, since I have little knowledge about this part and am still
>> >>> reading the related code.)
>> >>>
>> >>> For 3.2, I think it's really important. From my experience, the
>> >>> disambiguation procedure is time consuming, because content analysis is
>> >>> usually involved.
>> >>>
>> >>> So far, I have done some warm up tasks like documentation. I am trying
>> >>> the software and learning Scala. I will share my thoughts regarding the
>> >>> two ideas later. Thanks.
>> >>>
>> >>> Best Regards,
>> >>> Wei Wang
>> >>>
>> >>>
>> >>>
>> >>>
>> ------------------------------------------------------------------------------
>> >>> Precog is a next-generation analytics platform capable of advanced
>> >>> analytics on semi-structured data. The platform includes APIs for
>> >>> building
>> >>> apps and a phenomenal toolset for data science. Developers can use
>> >>> our toolset for easy data analysis & visualization. Get a free
>> account!
>> >>> http://www2.precog.com/precogplatform/slashdotnewsletter
>> >>> _______________________________________________
>> >>> Dbpedia-gsoc mailing list
>> >>> [email protected]
>> >>> https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc
>> >>>
>> >>
>> >>
>> >>
>> >> --
>> >>
>> >> Pablo N. Mendes
>> >> http://pablomendes.com
>> >
>> >
>> >
>> >
>> >
>>
>
>
>
> --
>
> Pablo N. Mendes
> http://pablomendes.com
>
>
>
>

