Re: [Dbpedia-gsoc] Working on Idea "Generalize input formats and add support for Google mention corpus"

Cai Zhiwei Tue, 16 Apr 2013 01:16:24 -0700

Hey,

I understand what I should do now. I am excited about my first task. Thanks
again for your guidance. People in DBpedia are really nice.



Best regards,
Zhiwei


2013/4/16 Joachim Daiber <[email protected]>

> Hey,
>
> I just updated the Wiki page for this task. Yes, have a look at
> CreateSpotlightModel. You would only have to implement the *Source objects
> for this corpus. While we are at this, making those interfaces more general
> is of course something we would like to see eventually (e.g. not having
> Scala objects but classes with different implementations).
>
> Best,
> Jo
>
>
> On Tue, Apr 16, 2013 at 9:54 AM, Pablo N. Mendes <[email protected]> wrote:
>
>>
>> Hi Zhiwei,
>> Sorry, I seem to have missed that the annotations already come with TOKEN
>> annotations. My suggestion was not to crawl the Wikipedia URLs (which are
>> entity mention annotations), but to crawl the original URLs where this
>> content came from (the URL field). But since they already provide the
>> tokens, we would not need to do that.
>>
>> So you would only have to parse that input and build a model from it (see
>> CreateSpotlightModel class).
>>
>> https://github.com/dbpedia-spotlight/dbpedia-spotlight/blob/master/index/src/main/scala/org/dbpedia/spotlight/db/CreateSpotlightModel.scala
>>
>> You would be extending DBpediaResourceSource, CandidateMapSource and
>> TokenSource implementations for this corpus. Right, Jo?
>>
>> Cheers,
>> Pablo
>>
>>
>>
>> On Tue, Apr 16, 2013 at 9:42 AM, Cai Zhiwei <[email protected]> wrote:
>>
>>> Hi Dimitris, Pablo,
>>> I have a few questions about the task:
>>> 1. We only need to crawl the original URLs, not the Wikipedia URLs of the
>>> mentions, since crawling a large number of Wikipedia HTML pages is not
>>> friendly to Wikipedia and the Wikipedia pages are already available as
>>> MediaWiki dumps. Am I correct?
>>> 2. Some of the URLs point to PDF, DOC or other file formats rather than
>>> HTML. Should I omit them or use some kind of tool to read them?
>>>
>>>
>>>
>>> Thanks for your guidance.
>>>
>>> Zhiwei
>>> South China University of Technology
>>>
>>>
>>> 2013/4/16 Dimitris Kontokostas <[email protected]>
>>>
>>>> Great!
>>>>
>>>> The "Google Corpus" idea description page [1] is updated, you can also
>>>> look at the related thread for more information.
>>>> For warm up tasks on DBpedia Spotlight you can take a look here [2]
>>>> first. Then Pablo, Max or Jo can give you a more specific task.
>>>>
>>>> Best,
>>>> Dimitris
>>>>
>>>> [1] http://wiki.dbpedia.org/gsoc2013/ideas/GoogleCorpus
>>>> [2]
>>>> https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Warm-up-tasks
>>>>
>>>>
>>>> On Sat, Apr 13, 2013 at 9:03 PM, 蔡志威 <[email protected]> wrote:
>>>>
>>>>> In the last 4 days, I've finished the following work:
>>>>> 1. Read most of the GitHub wiki pages of the DBpedia extraction
>>>>> framework and dbpedia-spotlight, added some notes, and corrected some
>>>>> basic mistakes.
>>>>> 2. Downloaded the code and tested some data on my computer.
>>>>> 3. Set up my dev environment with IntelliJ IDEA.
>>>>> 4. Learnt Maven and Scala, so I could get a basic idea of the
>>>>> structure of the whole project.
>>>>> 5. I found I might prefer the idea "Generalize input formats and add
>>>>> support for Google mention corpus", so I am trying to get familiar with
>>>>> the Wikipedia dump format and the Google mention corpus.
>>>>>
>>>>> I would be grateful if you could give me some suggestions for the
>>>>> following days: code and materials to read, issues to solve, or
>>>>> other things that can help me get a deeper understanding of this idea.
>>>>>
>>>>> Thanks for your time,
>>>>> Cai Zhiwei
>>>>>
>>>>>
>>>>> ------------------------------------------------------------------------------
>>>>> Precog is a next-generation analytics platform capable of advanced
>>>>> analytics on semi-structured data. The platform includes APIs for
>>>>> building
>>>>> apps and a phenomenal toolset for data science. Developers can use
>>>>> our toolset for easy data analysis & visualization. Get a free account!
>>>>> http://www2.precog.com/precogplatform/slashdotnewsletter
>>>>> _______________________________________________
>>>>> Dbpedia-gsoc mailing list
>>>>> [email protected]
>>>>> https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Kontokostas Dimitris
>>>>
>>>
>>>
>>
>>
>> --
>>
>> Pablo N. Mendes
>> http://pablomendes.com
>>
>>
>>
>>
>
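
As a rough illustration of the generalization Jo describes above (classes
with different implementations behind one interface, rather than fixed Scala
objects), here is a minimal sketch. Every name in it (Mention, MentionSource,
TabSeparatedMentionSource, candidateCounts) and the tab-separated line layout
are assumptions made for illustration. None of them is taken from the DBpedia
Spotlight code base or from the corpus documentation.

```scala
// Hypothetical sketch: these names are NOT the DBpedia Spotlight API.

// One annotated mention: the surface text and the Wikipedia URL it links to.
final case class Mention(surfaceForm: String, wikipediaUrl: String)

// A corpus-independent interface. Each input format gets its own class.
trait MentionSource {
  def mentions(lines: Iterator[String]): Iterator[Mention]
}

// Assumed line layout, for illustration only:
//   MENTION<TAB>surface form<TAB>offset<TAB>Wikipedia URL
// Lines that do not match this layout are skipped.
class TabSeparatedMentionSource extends MentionSource {
  def mentions(lines: Iterator[String]): Iterator[Mention] =
    lines.map(_.split("\t")).collect {
      case Array("MENTION", surface, _, url) => Mention(surface, url)
    }
}

object Example {
  // Count how often each surface form is linked to each URL.
  def candidateCounts(
      source: MentionSource,
      lines: Iterator[String]
  ): Map[(String, String), Int] =
    source.mentions(lines).toList
      .groupBy(m => (m.surfaceForm, m.wikipediaUrl))
      .map { case (key, ms) => key -> ms.size }
}
```

Because candidateCounts only depends on the MentionSource trait, supporting
another corpus would mean adding another class, not changing the caller.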

