Re: [Dbpedia-gsoc] Working on Idea "Generalize input formats and add support for Google mention corpus"

Joachim Daiber Tue, 16 Apr 2013 00:57:38 -0700

Hey,

I just updated the wiki page for this task. Yes, have a look at
CreateSpotlightModel. You would only have to implement the *Source objects
for this corpus. While we are at it, making those interfaces more general
is of course something we would like to see eventually (e.g. having classes
with different implementations instead of Scala objects).
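
To make the two points above concrete, here is a rough sketch, not the actual
Spotlight code. It assumes the corpus is tab-separated with URL, MENTION and
TOKEN records and a blank line between pages; that layout is an assumption, so
check it against the real files. Page, Mention, MentionCorpusParser,
CandidateCounts and MentionCorpusCounts are hypothetical names for
illustration only. The real DBpediaResourceSource, CandidateMapSource and
TokenSource have their own signatures, which are not reproduced here.

```scala
import scala.util.Try

// Hypothetical record types for illustration; not part of DBpedia Spotlight.
final case class Mention(surfaceForm: String, offset: Int, wikipediaUrl: String)
final case class Page(url: String, mentions: Vector[Mention], tokens: Vector[String])

// Assumed record layout (tab-separated, one page per block):
//   URL      <tab> page-url
//   MENTION  <tab> anchor-text <tab> offset <tab> wikipedia-url
//   TOKEN    <tab> token       <tab> offset
object MentionCorpusParser {
  def parse(lines: Iterator[String]): Vector[Page] = {
    val pages = Vector.newBuilder[Page]
    var current: Option[Page] = None

    // Emit the page collected so far, if any.
    def flush(): Unit = {
      current.foreach(p => pages += p)
      current = None
    }

    lines.foreach { line =>
      line.split("\t", -1).toList match {
        case "URL" :: url :: _ =>
          flush()
          current = Some(Page(url, Vector.empty, Vector.empty))
        case "MENTION" :: sf :: off :: wiki :: _ =>
          // A malformed offset is recorded as -1 instead of failing the run.
          val m = Mention(sf, Try(off.toInt).getOrElse(-1), wiki)
          current = current.map(p => p.copy(mentions = p.mentions :+ m))
        case "TOKEN" :: tok :: _ =>
          current = current.map(p => p.copy(tokens = p.tokens :+ tok))
        case _ => () // blank separator or unrecognised record
      }
    }
    flush()
    pages.result()
  }
}

// Hypothetical interface: a trait with interchangeable implementations,
// as opposed to one singleton object per input format.
trait CandidateCounts {
  // (surface form, Wikipedia URL) -> number of times that pair was annotated
  def counts: Map[(String, String), Int]
}

class MentionCorpusCounts(pages: Seq[Page]) extends CandidateCounts {
  def counts: Map[(String, String), Int] =
    pages
      .flatMap(_.mentions)
      .groupBy(m => (m.surfaceForm, m.wikipediaUrl))
      .map { case (key, ms) => key -> ms.size }
}
```

With something like this, a Wikipedia-dump implementation and a mention-corpus
implementation could both be passed to the same model-building code, which is
the kind of generalization meant above.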


Best,
Jo


On Tue, Apr 16, 2013 at 9:54 AM, Pablo N. Mendes <[email protected]> wrote:

>
> Hi Zhiwei,
> Sorry, I seem to have missed that the corpus already comes with TOKEN
> annotations. My suggestion was not to crawl the Wikipedia URLs (which are
> the entity mention annotations), but to crawl the original URLs where this
> content came from (the URL field). But since the tokens are already
> provided, we would not need to do that.
>
> So you would only have to parse that input and build a model from it (see
> CreateSpotlightModel class).
>
> https://github.com/dbpedia-spotlight/dbpedia-spotlight/blob/master/index/src/main/scala/org/dbpedia/spotlight/db/CreateSpotlightModel.scala
>
> You would be extending DBpediaResourceSource, CandidateMapSource and
> TokenSource implementations for this corpus. Right, Jo?
>
> Cheers,
> Pablo
>
>
>
> On Tue, Apr 16, 2013 at 9:42 AM, Cai Zhiwei <[email protected]> wrote:
>
>> Hi Dimitris, Pablo,
>> I have a few questions about the task:
>> 1. We only need to crawl the original URLs, not the Wikipedia URLs of the
>> mentions, since crawling a large number of Wikipedia HTML pages is not
>> friendly to Wikipedia and the Wikipedia pages are already provided by the
>> MediaWiki dumps. Am I correct?
>> 2. Some of the URLs point to PDF, DOC or other file formats rather than
>> HTML. Should I omit them or use some kind of tool to read them?
>>
>>
>>
>> Thanks for your guidance.
>>
>> Zhiwei
>> South China University of Technology
>>
>>
>> 2013/4/16 Dimitris Kontokostas <[email protected]>
>>
>>> Great!
>>>
>>> The "Google Corpus" idea description page [1] is updated, you can also
>>> look at the related thread for more information.
>>> For warm up tasks on DBpedia Spotlight you can take a look here [2]
>>> first. Then Pablo, Max or Jo can give you a more specific task.
>>>
>>> Best,
>>> Dimitris
>>>
>>> [1] http://wiki.dbpedia.org/gsoc2013/ideas/GoogleCorpus
>>> [2]
>>> https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Warm-up-tasks
>>>
>>>
>>> On Sat, Apr 13, 2013 at 9:03 PM, 蔡志威 <[email protected]> wrote:
>>>
>>>> In the last 4 days, I've finished the following work:
>>>> 1. Read most of the wiki pages on GitHub for the DBpedia extraction
>>>> framework and DBpedia Spotlight, added some notes and corrected some
>>>> basic mistakes.
>>>> 2. Downloaded the code and tested some data on my computer.
>>>> 3. Set up my dev environment with IntelliJ IDEA.
>>>> 4. Learnt Maven and Scala, so I could get a basic idea of the
>>>> structure of the whole project.
>>>> 5. I found I might prefer the idea "Generalize input formats and add
>>>> support for Google mention corpus", so I am trying to get familiar with
>>>> the Wikipedia dumps format and the Google mention corpus.
>>>>
>>>> I would be grateful if you could give me some suggestions for the
>>>> following days: code and materials to read, issues to solve, or
>>>> other things that can help me get a deeper understanding of this idea.
>>>>
>>>> Thanks for your time,
>>>> Cai Zhiwei
>>>>
>>>>
>>>> ------------------------------------------------------------------------------
>>>> Precog is a next-generation analytics platform capable of advanced
>>>> analytics on semi-structured data. The platform includes APIs for
>>>> building
>>>> apps and a phenomenal toolset for data science. Developers can use
>>>> our toolset for easy data analysis & visualization. Get a free account!
>>>> http://www2.precog.com/precogplatform/slashdotnewsletter
>>>> _______________________________________________
>>>> Dbpedia-gsoc mailing list
>>>> [email protected]
>>>> https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc
>>>>
>>>>
>>>
>>>
>>> --
>>> Kontokostas Dimitris
>>>
>>
>>
>
>
> --
>
> Pablo N. Mendes
> http://pablomendes.com
>
>

