Antonio David Pérez Morales created STANBOL-1141:
----------------------------------------------------
Summary: Wikilinks Parser and TDB Generator
Key: STANBOL-1141
URL: https://issues.apache.org/jira/browse/STANBOL-1141
Project: Stanbol
Issue Type: Sub-task
Reporter: Antonio David Pérez Morales
Cross-document coreference resolution is the task of grouping the entity
mentions in a collection of documents into sets that each represent a distinct
entity. It is central to knowledge base construction and also useful for joint
inference with other NLP components.
Wikilinks is one result of work on this task: a dataset comprising
40 million mentions over 3 million entities.
The method is based on finding hyperlinks to Wikipedia from a web crawl and
using anchor text as mentions. In addition to providing large-scale labeled
data without human effort, we are able to include many styles of text beyond
newswire and many entity types beyond people.
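
As a rough illustration of the method described above (not the code used to build Wikilinks), the following Python sketch collects the anchor text of every hyperlink that points to Wikipedia in one HTML page. The class name `WikiLinkExtractor` and the function `extract_mentions` are hypothetical, and the host check is an assumed simplification of whatever filtering the real crawl applied.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class WikiLinkExtractor(HTMLParser):
    """Collects (anchor text, Wikipedia URL) pairs from one HTML page."""

    def __init__(self):
        super().__init__()
        self.mentions = []   # list of (anchor_text, href) tuples
        self._href = None    # href of the Wikipedia link being read, if any
        self._text = []      # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        host = urlparse(href).netloc.lower()
        # Assumption: any wikipedia.org host counts as a Wikipedia link.
        if host == "wikipedia.org" or host.endswith(".wikipedia.org"):
            self._href = href
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace inside the anchor text.
            anchor = " ".join("".join(self._text).split())
            if anchor:
                self.mentions.append((anchor, self._href))
            self._href = None


def extract_mentions(html):
    """Return all (anchor text, Wikipedia URL) pairs found in html."""
    parser = WikiLinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.mentions
```

Each returned pair is one labeled mention: the anchor text is the mention string and the link target identifies the entity.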
UMass has created expanded versions of the dataset containing the following
extra features:
* Complete webpage content (with cleaned DOM structure)
* Extracted context for the mentions
* Alignment to Freebase entities
The expanded dataset can be downloaded from
http://iesl.cs.umass.edu/downloads/wiki-link/context-only/
A tool is needed to parse this information and store it in a storage
backend such as Jena TDB.
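
A minimal sketch of what such a tool could do, under several loudly stated assumptions: the input is taken to be already decoded into one tab-separated line per mention (document URL, anchor text, Wikipedia URL, Freebase id), which is not the distribution format of the expanded dataset; the `http://example.org/wikilinks#` vocabulary is invented for illustration; and the Freebase IRI form is assumed. The sketch emits N-Triples, a format that Jena's bulk loader should be able to import into a TDB store.

```python
from dataclasses import dataclass

# Hypothetical vocabulary; this issue does not define one.
NS = "http://example.org/wikilinks#"
# Assumed IRI prefix for Freebase entities.
FREEBASE_NS = "http://rdf.freebase.com/ns/"


@dataclass
class Mention:
    doc_url: str
    anchor_text: str
    wiki_url: str
    freebase_id: str  # e.g. "/m/02mjmr"; empty when no alignment exists


def parse_line(line):
    """Parse one tab-separated record (assumed format, see above)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        raise ValueError("expected 4 tab-separated fields, got %d" % len(fields))
    return Mention(*fields)


def escape_literal(text):
    """Escape a string for use inside an N-Triples literal."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def to_ntriples(mention, index):
    """Return the N-Triples lines describing one mention.

    The mention is modelled as a blank node; IRIs are not validated here.
    """
    node = "_:m%d" % index
    triples = [
        "%s <%sdocument> <%s> ." % (node, NS, mention.doc_url),
        '%s <%sanchorText> "%s" .' % (node, NS, escape_literal(mention.anchor_text)),
        "%s <%swikipediaPage> <%s> ." % (node, NS, mention.wiki_url),
    ]
    if mention.freebase_id:
        # "/m/02mjmr" becomes "m.02mjmr" under the assumed namespace.
        local = mention.freebase_id.strip("/").replace("/", ".")
        triples.append("%s <%sfreebaseEntity> <%s%s> ." % (node, NS, FREEBASE_NS, local))
    return triples
```

Writing the lines returned for every mention to a single `.nt` file gives an intermediate artifact that is independent of the final store, so the same output could be loaded into TDB or any other RDF backend.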
Wikilinks provides information about documents containing mentions of Freebase
entities. This information can be used both for disambiguation and for merging
with the Freebase data in order to obtain a large set of valuable data.
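
One simple way the mention data could support disambiguation is a most-frequent-entity baseline: count how often each anchor text links to each entity and pick the most common target. The sketch below is only that baseline, not a method proposed in this issue; the function names and the entity ids in the usage example are hypothetical.

```python
from collections import Counter, defaultdict


def build_index(mentions):
    """Map each lower-cased anchor text to a Counter of linked entities.

    mentions is an iterable of (anchor_text, entity_id) pairs.
    """
    index = defaultdict(Counter)
    for anchor, entity in mentions:
        index[anchor.lower()][entity] += 1
    return index


def disambiguate(index, anchor):
    """Return the entity most often linked from this anchor text, or None."""
    candidates = index.get(anchor.lower())
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]


# Usage with made-up entity ids:
example = [
    ("Jaguar", "/m/hypothetical_car"),
    ("Jaguar", "/m/hypothetical_car"),
    ("jaguar", "/m/hypothetical_animal"),
]
index = build_index(example)
best = disambiguate(index, "JAGUAR")  # the car id, seen twice versus once
```

A real component would also use the extracted context of each mention, which this baseline ignores.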