Hi Dimitris, Pablo,
I have a few questions about the task:
1. We only need to crawl the original URLs, not the mentioned Wikipedia
URLs, since crawling a large number of Wikipedia HTML pages is unfriendly
to Wikipedia and the Wikipedia pages are already provided by the
MediaWiki dumps. Am I correct?
2. Some of the URLs point to PDF, DOC, or other file formats rather
than HTML. Should I omit them or use some kind of tool to read them?
Thanks for your guidance.
Zhiwei
South China University of Technology
2013/4/16 Dimitris Kontokostas <[email protected]>
> Great!
>
> The "Google Corpus" idea description page [1] is updated, you can also
> look at the related thread for more information.
> For warm up tasks on DBpedia Spotlight you can take a look here [2] first.
> Then Pablo, Max or Jo can give you a more specific task.
>
> Best,
> Dimitris
>
> [1] http://wiki.dbpedia.org/gsoc2013/ideas/GoogleCorpus
> [2] https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Warm-up-tasks
>
>
> On Sat, Apr 13, 2013 at 9:03 PM, 蔡志威 <[email protected]> wrote:
>
>> In the last 4 days, I've finished the following work:
>> 1. Read most of the GitHub wiki pages of the DBpedia extraction framework
>> and DBpedia Spotlight, added some notes, and corrected some basic mistakes.
>> 2. Downloaded the code and tested some data on my computer.
>> 3. Set up my dev environment with IntelliJ IDEA.
>> 4. Learnt Maven and Scala, so I could get a basic idea of the structure
>> of the whole project.
>> 5. I found I might prefer the idea "Generalize input formats and add
>> support for Google mention corpus", so I am trying to get familiar with the
>> Wikipedia dumps format and the Google mention corpus.
>>
>> I would be grateful if you could give me some suggestions for the
>> following days: code and materials to read, issues to solve, or
>> other things that can help me get a deeper understanding of this idea.
>>
>> Thanks for your time,
>> Cai Zhiwei
>>
>>
>> _______________________________________________
>> Dbpedia-gsoc mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc
>>
>>
>
>
> --
> Kontokostas Dimitris
>