Ah, good that you spotted this! Well, this might take a while to crawl :)
Maybe we could also extract the relevant pages from the Common Crawl corpus
if crawling ourselves takes too long.
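
On writing our own extractors: below is a minimal sketch of the "text around the mentions as local context" step, assuming the pages are already downloaded and stripped to plain text. The names `context_windows` and `width`, and the way punctuation is handled, are only illustrative assumptions and are not taken from the UMass tools.

```python
# Hypothetical helper: collect a fixed-size word window around each mention.
PUNCT = ".,;:!?()[]\"'"

def context_windows(text, mention, width=5):
    """Return (left, mention, right) tuples for every occurrence of mention.

    Matching is done on whitespace tokens with surrounding punctuation
    stripped, so "Berlin," still matches the mention "Berlin".
    """
    tokens = text.split()
    normalized = [t.strip(PUNCT) for t in tokens]
    target = mention.split()
    n = len(target)
    if n == 0:
        return []  # an empty mention would otherwise match everywhere

    windows = []
    for i in range(len(tokens) - n + 1):
        if normalized[i:i + n] == target:
            left = tokens[max(0, i - width):i]
            right = tokens[i + n:i + n + width]
            windows.append((" ".join(left), mention, " ".join(right)))
    return windows
```

On real pages we would probably want to locate mentions via the link anchors in the HTML rather than by string matching, so this only illustrates the windowing part.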
Best,
Jo
On Tue, Apr 16, 2013 at 11:16 AM, Pablo N. Mendes <[email protected]> wrote:
>
> Oh, right! Yes! Thanks, Max. I thought I was going crazy for suggesting
> using Nutch, but now I remember why. Now I think I am going crazy for going
> back on my suggestion of using Nutch. :) Damn, I'm doing too much stuff at
> the same time! Great to have awesome people around that pick up the ball
> (and score a goal) when I drop it! :)
>
> Cheers,
> Pablo
>
>
> On Tue, Apr 16, 2013 at 11:10 AM, Max Jakob <[email protected]> wrote:
>
>> Hi all,
>>
>> On Tue, Apr 16, 2013 at 9:54 AM, Pablo N. Mendes <[email protected]>
>> wrote:
>> > I seem to have missed that the annotations already come with TOKEN
>> > annotations.
>>
>> I'm afraid these TOKEN annotations are not usable for our context
>> models, because they are "The byte offset of the 10 least frequent
>> words on the page, to act as a signature to ensure that the underlying
>> text hasn’t changed -- think of this as a version, or fingerprint, of
>> the page." [1]
>>
>> The blog post goes on to say that there are "Software tools (on the
>> UMass site [2]) to: download the web pages; extract the mentions,
>> [...]; select the text around the mentions as local context; and
>> compute evaluation metrics over predicted entities." [1]
>>
>> But [2] says that "We are currently writing code to download the
>> webpages listed in the above dataset, to find the relevant links from
>> these webpages, and to extract the context around the links. The
>> resulting dataset will also be released when ready, and will be linked
>> here."
>> Only a bash command that downloads all required web pages is given at
>> this point in time.
>>
>> Maybe it is a good idea to write our own extractors for this?
>>
>> Cheers,
>> Max
>>
>>
>> [1]
>> http://googleresearch.blogspot.nl/2013/03/learning-from-big-data-40-million.html
>> [2] http://www.iesl.cs.umass.edu/data/wiki-links
>>
>
>
>
> --
>
> Pablo N. Mendes
> http://pablomendes.com
>
_______________________________________________
Dbpedia-gsoc mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc