Hi all,

On Tue, Apr 16, 2013 at 9:54 AM, Pablo N. Mendes <[email protected]> wrote:
> I seem to have missed that the annotations already come with TOKEN
> annotations.
I'm afraid these TOKEN annotations are not usable for our context models, because they are "The byte offset of the 10 least frequent words on the page, to act as a signature to ensure that the underlying text hasn’t changed -- think of this as a version, or fingerprint, of the page." [1]

The blog post goes on to say that there are "Software tools (on the UMass site [2]) to: download the web pages; extract the mentions, [...]; select the text around the mentions as local context; and compute evaluation metrics over predicted entities." [1]

But [2] says that "We are currently writing code to download the webpages listed in the above dataset, to find the relevant links from these webpages, and to extract the context around the links. The resulting dataset will also be released when ready, and will be linked here." At this point in time, only a bash command that downloads all the required web pages is given.

Maybe it is a good idea to write our own extractors for this?

Cheers,
Max

[1] http://googleresearch.blogspot.nl/2013/03/learning-from-big-data-40-million.html
[2] http://www.iesl.cs.umass.edu/data/wiki-links

_______________________________________________
Dbpedia-gsoc mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc
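P.S. In case it helps the discussion: a minimal sketch of what "our own extractor" could look like, using only the Python standard library. It parses a downloaded page, finds the anchor texts of its links, and returns a window of surrounding text as local context. All names here (`extract_contexts`, `window`, the parser class) are mine, not from the UMass tooling, and this ignores details like character encoding and matching links back to the dataset's mention strings.

```python
# Hypothetical context extractor sketch for downloaded Wikilinks pages.
# Collects the page's plain text, records where each link's anchor text
# occurs in it, and cuts out a fixed-size character window as context.
import re
from html.parser import HTMLParser


class _TextAndLinks(HTMLParser):
    """Collects plain text and the text offsets of anchor texts."""

    def __init__(self):
        super().__init__()
        self.text = []          # text fragments in document order
        self.links = []         # (start_offset, anchor_text, href)
        self._href = None
        self._anchor_start = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._anchor_start = sum(len(t) for t in self.text)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            start = self._anchor_start
            anchor = "".join(self.text)[start:]
            self.links.append((start, anchor, self._href))
            self._href = None

    def handle_data(self, data):
        self.text.append(data)


def extract_contexts(html, window=50):
    """Return (anchor_text, href, context) triples, where context is the
    anchor text plus up to `window` characters of text on each side."""
    parser = _TextAndLinks()
    parser.feed(html)
    full_text = "".join(parser.text)
    contexts = []
    for start, anchor, href in parser.links:
        end = start + len(anchor)
        snippet = full_text[max(0, start - window):end + window]
        contexts.append((anchor, href, re.sub(r"\s+", " ", snippet).strip()))
    return contexts
```

For example, on a page fragment like `<p>Barack Obama visited <a href="...">Berlin</a> last week.</p>` this would yield the anchor "Berlin", its href, and the surrounding text as context.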
