[
https://issues.apache.org/jira/browse/STANBOL-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Antonio David Pérez Morales resolved STANBOL-1141.
--------------------------------------------------
Resolution: Fixed
WikiLinks Parser and TDB Generator using the WikiLinks extended dataset [1],
serialized with Apache Thrift [2].
The aim of this tool is to create a Jena TDB database [3] from the dataset so
that the information can be used in other tasks such as entity tagging, entity
disambiguation and entity linking.
The tool also provides a service to query and retrieve this information.
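As a rough illustration of the kind of lookup such a service could offer, here
is a minimal in-memory sketch. It is not the tool's actual API and does not use
Jena TDB or the Thrift schema: the Mention fields, the MentionIndex class and
the "/m/example1" identifier are all assumptions made up for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """One WikiLinks-style mention; field names are illustrative only."""
    doc_url: str       # page where the mention was found
    anchor_text: str   # surface form used in the link
    wiki_url: str      # Wikipedia page the link points to
    freebase_id: str   # aligned Freebase entity (made-up value below)


class MentionIndex:
    """In-memory stand-in for the TDB store: entity id -> list of mentions."""

    def __init__(self):
        self._by_entity = defaultdict(list)

    def add(self, mention):
        self._by_entity[mention.freebase_id].append(mention)

    def mentions_of(self, freebase_id):
        """Return all stored mentions of one entity (empty list if unknown)."""
        return list(self._by_entity.get(freebase_id, []))

    def surface_forms(self, freebase_id):
        """Distinct anchor texts seen for an entity, sorted for stable output."""
        return sorted({m.anchor_text for m in self._by_entity.get(freebase_id, [])})


# Hypothetical data: two documents linking to the same Wikipedia page.
index = MentionIndex()
index.add(Mention("http://example.org/a", "Big Apple",
                  "http://en.wikipedia.org/wiki/New_York_City", "/m/example1"))
index.add(Mention("http://example.org/b", "NYC",
                  "http://en.wikipedia.org/wiki/New_York_City", "/m/example1"))
print(index.surface_forms("/m/example1"))
```

Keying the index by entity identifier is what makes the data usable for
disambiguation: every surface form observed for an entity can be retrieved in
a single lookup.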
To download the code and see an example of how to use the provided service, go
to https://github.com/adperezmorales/gsoc-wikilinks/tree/master/gsoc-wikilinks
[1]: http://www.iesl.cs.umass.edu/data/wiki-links
[2]: http://thrift.apache.org
[3]: http://jena.apache.org/documentation/tdb/
> Wikilinks Parser and TDB Generator
> ----------------------------------
>
> Key: STANBOL-1141
> URL: https://issues.apache.org/jira/browse/STANBOL-1141
> Project: Stanbol
> Issue Type: Sub-task
> Components: Enhancer, Entityhub
> Reporter: Antonio David Pérez Morales
> Labels: freebase, jenatdb, wikilinks
> Original Estimate: 336h
> Remaining Estimate: 336h
>
> Cross-document coreference resolution is the task of grouping the entity
> mentions in a collection of documents into sets that each represent a
> distinct entity. It is central to knowledge base construction and also useful
> for joint inference with other NLP components.
> Wikilinks is one of the results of this task: a dataset comprising 40
> million mentions over 3 million entities.
> The method is based on finding hyperlinks to Wikipedia from a web crawl and
> using anchor text as mentions. In addition to providing large-scale labeled
> data without human effort, we are able to include many styles of text beyond
> newswire and many entity types beyond people.
> UMass has created expanded versions of the dataset containing the following
> extra features:
> * Complete webpage content (with cleaned DOM structure)
> * Extracted context for the mentions
> * Alignment to Freebase entities
> The expanded dataset can be downloaded from
> http://iesl.cs.umass.edu/downloads/wiki-link/context-only/
> A tool is needed to parse this information and store it in some type of
> storage, such as Jena TDB.
> Wikilinks provides information about documents with mentions of Freebase
> entities, and this information can be used both to disambiguate and to merge
> with the Freebase information in order to have a large set of valuable data.
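The method described in the quoted issue (hyperlinks to Wikipedia act as
entity labels, anchor text acts as the mention) can be sketched with the
Python standard library alone. This is an illustration under assumptions, not
the code used to build Wikilinks: the class and function names, the sample
pages and the simple substring filter on "wikipedia.org/wiki/" are all
invented for this example.

```python
from collections import defaultdict
from html.parser import HTMLParser


class WikipediaAnchorCollector(HTMLParser):
    """Collects (target URL, anchor text) pairs for links pointing to Wikipedia."""

    def __init__(self):
        super().__init__()
        self.mentions = []    # list of (wiki_url, anchor_text)
        self._target = None   # Wikipedia URL of the <a> currently open, if any
        self._text = []       # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            # Simplistic filter; a real crawl would need proper URL normalisation.
            if "wikipedia.org/wiki/" in href:
                self._target = href
                self._text = []

    def handle_data(self, data):
        if self._target is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._target is not None:
            anchor = "".join(self._text).strip()
            if anchor:
                self.mentions.append((self._target, anchor))
            self._target = None


def group_mentions(pages):
    """Group anchor texts from several pages by the Wikipedia URL they link to."""
    entities = defaultdict(list)
    for html in pages:
        collector = WikipediaAnchorCollector()
        collector.feed(html)
        collector.close()
        for wiki_url, anchor in collector.mentions:
            entities[wiki_url].append(anchor)
    return dict(entities)


# Two made-up pages that refer to the same entity with different wording.
pages = [
    '<p>I visited <a href="http://en.wikipedia.org/wiki/Paris">the French capital</a>.</p>',
    '<p><a href="http://en.wikipedia.org/wiki/Paris">Paris</a> and '
    '<a href="http://example.org/">other</a></p>',
]
groups = group_mentions(pages)
print(groups)
```

Each key in the result stands for one entity and its list holds the mentions
found across documents, which is the grouping that cross-document coreference
resolution is meant to produce.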
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira