Hi all,

On Tue, Apr 16, 2013 at 9:54 AM, Pablo N. Mendes <[email protected]> wrote:
> I seem to have missed that the annotations already come with TOKEN
> annotations.
I'm afraid these TOKEN annotations are not usable for our context models, because they are "The byte offset of the 10 least frequent words on the page, to act as a signature to ensure that the underlying text hasn’t changed -- think of this as a version, or fingerprint, of the page." [1]

The blog post goes on to say that there are "Software tools (on the UMass site [2]) to: download the web pages; extract the mentions, [...]; select the text around the mentions as local context; and compute evaluation metrics over predicted entities." [1]

But [2] says that "We are currently writing code to download the webpages listed in the above dataset, to find the relevant links from these webpages, and to extract the context around the links. The resulting dataset will also be released when ready, and will be linked here." At this point in time, only a bash command that downloads all the required web pages is given.

Maybe it is a good idea to write our own extractors for this?

Cheers,
Max

[1] http://googleresearch.blogspot.nl/2013/03/learning-from-big-data-40-million.html
[2] http://www.iesl.cs.umass.edu/data/wiki-links

_______________________________________________
Dbpedia-gsoc mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc
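P.S. In case it helps the discussion: a minimal sketch of what "our own extractor" could look like, using only the Python standard library. It parses a downloaded page, finds the anchor texts of its links, and returns a window of surrounding text as local context. All names here (`extract_contexts`, `window`, the parser class) are mine, not from the UMass tooling, and this ignores details like character encoding and matching links back to the dataset's mention strings.

```python
# Hypothetical context extractor sketch for downloaded Wikilinks pages.
# Collects the page's plain text, records where each link's anchor text
# occurs in it, and cuts out a fixed-size character window as context.
import re
from html.parser import HTMLParser


class _TextAndLinks(HTMLParser):
    """Collects plain text and the text offsets of anchor texts."""

    def __init__(self):
        super().__init__()
        self.text = []          # text fragments in document order
        self.links = []         # (start_offset, anchor_text, href)
        self._href = None
        self._anchor_start = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._anchor_start = sum(len(t) for t in self.text)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            start = self._anchor_start
            anchor = "".join(self.text)[start:]
            self.links.append((start, anchor, self._href))
            self._href = None

    def handle_data(self, data):
        self.text.append(data)


def extract_contexts(html, window=50):
    """Return (anchor_text, href, context) triples, where context is the
    anchor text plus up to `window` characters of text on each side."""
    parser = _TextAndLinks()
    parser.feed(html)
    full_text = "".join(parser.text)
    contexts = []
    for start, anchor, href in parser.links:
        end = start + len(anchor)
        snippet = full_text[max(0, start - window):end + window]
        contexts.append((anchor, href, re.sub(r"\s+", " ", snippet).strip()))
    return contexts
```

For example, on a page fragment like `<p>Barack Obama visited <a href="...">Berlin</a> last week.</p>` this would yield the anchor "Berlin", its href, and the surrounding text as context.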
