Re: [Dbpedia-gsoc] Experience and thoughts for DBpedia Spotlight Ideas

Pablo N. Mendes Mon, 15 Apr 2013 07:47:18 -0700

Hi Jona,
Thanks! IIRC, we decided against adding the DEF (DBpedia Extraction
Framework) as a dependency of pignlproc, in order to reduce the size of the
jar that has to be shipped to each Hadoop node. So I think somebody just
copied the code into our codebase.


It would be very neat if these reusable classes were separated into a
"DBpedia Core" module that any project depending on DBpedia could import.
We also use the similar Disambiguation class, and WikiUtil for
encoding/decoding URIs.
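
To illustrate the redirect question raised in the quoted thread below, here
is a minimal sketch in Python of a per-language redirect lookup. It is not
the code from Redirect.scala or pignlproc: REDIRECT_TAGS, redirect_pattern
and redirect_target are hypothetical names, and the tag lists are a small
hand-picked subset, whereas the real class is generated from downloaded
Wikipedia settings.

```python
import re

# Assumption: a tiny hand-written subset of redirect magic words per
# language code. The real list should come from Wikipedia's own settings.
REDIRECT_TAGS = {
    "en": ["#REDIRECT"],
    "de": ["#WEITERLEITUNG", "#REDIRECT"],
    "zh": ["#重定向", "#REDIRECT"],
}


def redirect_pattern(lang):
    """Build a regex matching a redirect line for the given language."""
    # Unknown languages fall back to the English tag only.
    tags = REDIRECT_TAGS.get(lang, ["#REDIRECT"])
    alternatives = "|".join(re.escape(tag) for tag in tags)
    # Tag, optional colon, then [[Target]]; stop at ']', '|' or '#'.
    return re.compile(
        r"^\s*(?:%s)\s*:?\s*\[\[([^\]|#]+)" % alternatives,
        re.IGNORECASE,
    )


def redirect_target(wikitext, lang):
    """Return the redirect target title, or None if not a redirect."""
    match = redirect_pattern(lang).match(wikitext)
    return match.group(1).strip() if match else None


if __name__ == "__main__":
    print(redirect_target("#重定向 [[北京市]]", "zh"))
    print(redirect_target("It's late, I have to go now", "en"))
```

With a table like this, supporting another language such as Chinese would
mean adding one entry rather than editing the parser itself.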

Cheers,
Pablo


On Mon, Apr 15, 2013 at 4:40 PM, Jona Christopher Sahnwaldt <[email protected]> wrote:

> On 15 April 2013 16:31, Wang Wei <[email protected]> wrote:
> > Hi Pablo,
> > I have updated the Internationalization-(DB-backed-core) page. There are
> > some inconsistencies between the web page and the index_db.sh script,
> > e.g., the paths. I think there are also some problems with index_db.sh
> > itself. I'll check it after I finish downloading the Wikipedia dump. I
> > have already set up the clusters and environment.
> >
> https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Internationalization-(DB-backed-core)
> >
> > I know Chinese. However, the current getRedirectPatterns() in
> >
> https://github.com/dbpedia-spotlight/pignlproc/blob/master/src/main/java/pignlproc/markup/AnnotatingMarkupParser.java
> > does not support Chinese. I will try to add it.
>
> Hi everyone @Spotlight,
>
> if you want, you could use
>
> https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/wikiparser/impl/wikipedia/Redirect.scala
> for a list of redirect tags. That class is generated
> semi-automatically from downloaded Wikipedia settings. I have no idea
> how much effort it would be to integrate that class (or the generating
> process) into DBpedia Spotlight and if it would be worth the effort.
>
> Cheers,
> JC
>
> >
> > I also moved the user's manual page from the wiki to GitHub:
> >
> https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/User's-manual.
> > However, there is much overlap between this page and the web service
> > page (
> https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Web-service).
> > In fact, this page seems a little thin for a user's manual. Some
> > details should be added, e.g., the programmatic usage part.
> >
> > I will report my progress later.
> > Thanks for your guidance! People in the open source community are really
> > nice. It would be my pleasure to contribute to this community.
> >
> > Best Regards,
> > Wei Wang
> >
> >
> > On Mon, Apr 15, 2013 at 6:28 PM, Pablo N. Mendes <[email protected]>
> > wrote:
> >>
> >> Hi Wei,
> >> Thanks for your interest. Can you share with us (e.g. via links to
> >> GitHub) the results from your warm-up tasks so far?
> >>
> >> I have another warm-up task proposal. If you know any Chinese (I
> >> have Chinese friends with the family name Wang, so sorry if I am making
> >> incorrect assumptions), you could try to run the Indexing (DB core) for
> >> the Chinese language, or identify the reasons why this process would not
> >> work.
> >>
> >>
> https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Internationalization-(DB-backed-core)
> >>
> >> You could send us questions that you may have, so that we can improve
> >> the documentation on the page above.
> >>
> >> Please note that this does not guarantee your acceptance to GSoC.
> >> However, doing well in open source community participation is usually a
> >> great plus during the selection phase. So the advice to start
> >> participating early goes for all prospective applicants.
> >>
> >> Cheers,
> >> Pablo
> >>
> >>
> >> On Sun, Apr 14, 2013 at 4:00 PM, Wang Wei <[email protected]> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> I am Wei Wang. After trying some warm-up tasks, I think it is time to
> >>> make some noise here.
> >>>
> >>> A few months ago, I worked on a project in which I annotated 6 million
> >>> tweets and Facebook posts with Wikipedia concepts (each Wikipedia
> >>> article is regarded as a concept). Specifically, with the help of
> >>> Wikipedia-Miner (http://wikipedia-miner.cms.waikato.ac.nz/), I
> >>> processed the October 2012 English Wikipedia dump on a Hadoop cluster
> >>> to extract some metadata. Then I built a concept dictionary (with 9
> >>> million entities) which maps phrases (concept mentions) to target
> >>> concepts (Wikipedia articles). For each tweet and post, the concept
> >>> mentions were recognized by looking them up in the dictionary. Then
> >>> disambiguation was conducted by context analysis.
> >>>
> >>> In fact, what I did is just one of the functions provided by DBpedia
> >>> Spotlight. But through this project, I realized the importance and
> >>> challenges of DBpedia Spotlight. For example, by annotating text with
> >>> concepts, computers are able to understand the semantics of the text.
> >>> However, there are two challenges in this annotation work. Firstly,
> >>> how do we recognize the concept mentions? Phrases are often falsely
> >>> recognized as concept mentions. E.g., given the sentence "It's late, I
> >>> have to go now", since there is a Wikipedia article about a song named
> >>> "It's late", the phrase in this sentence is likely to be linked to
> >>> that article. It is also possible that some true concept mentions are
> >>> not recognized, due to limited dictionary coverage and noisy text.
> >>> Secondly, it is well known that some mentions are ambiguous. Thus, how
> >>> to disambiguate them accurately and efficiently is another challenge,
> >>> especially for short text. This is still an ongoing research topic.
> >>>
> >>> Regarding idea 3.1 (Google mention corpus), I think the overlap between
> >>> the Google mention corpus and the Wikipedia dump is a point that should
> >>> be considered. Otherwise we may index some redundant data. (Please
> >>> correct me, since I have little knowledge about this part and am still
> >>> reading the related code.)
> >>>
> >>> For 3.2, I think it's really important. From my experience, the
> >>> disambiguation procedure is time-consuming, because content analysis is
> >>> usually involved.
> >>>
> >>> So far, I have done some warm-up tasks such as documentation. I am
> >>> trying out the software and learning Scala. I will share my thoughts
> >>> regarding the two ideas later. Thanks.
> >>>
> >>> Best Regards,
> >>> Wei Wang
> >>>
> >>>
> >>>
> >>>
> ------------------------------------------------------------------------------
> >>> Precog is a next-generation analytics platform capable of advanced
> >>> analytics on semi-structured data. The platform includes APIs for
> >>> building
> >>> apps and a phenomenal toolset for data science. Developers can use
> >>> our toolset for easy data analysis & visualization. Get a free account!
> >>> http://www2.precog.com/precogplatform/slashdotnewsletter
> >>> _______________________________________________
> >>> Dbpedia-gsoc mailing list
> >>> [email protected]
> >>> https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc
> >>>
> >>
> >>
> >>
> >> --
> >>
> >> Pablo N. Mendes
> >> http://pablomendes.com
> >
> >
> >
> >
> >
>



-- 

Pablo N. Mendes
http://pablomendes.com
