[GSoC] First Milestone: Freebase Disambiguation in Stanbol

Antonio Perez Mon, 01 Jul 2013 02:37:34 -0700

Hi all

According to the project schedule, last Friday was the first milestone of
the project: 'Complete the integration of Freebase as EntityHub
ReferencedSite in Stanbol'.
The steps to achieve this task are the following:


- Download the Freebase indexing tool (based on the Apache Stanbol Indexing
Tool) from
https://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/freebase/
- Build the jar with Maven; the resulting
org.apache.stanbol.entityhub.indexing.freebase-*.jar is placed in the
'target' directory.
- Download the Freebase dump from http://download.freebaseapps.com
- Rename the Freebase dump from *.gz to *.ttl.gz (necessary so that the
indexing tool treats the dump as Turtle)
- Initialize the configuration, which generates the directory structure,
using the command:

java -jar org.apache.stanbol.entityhub.indexing.freebase-*.jar init

- Generate the scoring file using the fbrankings.sh script and put it in
the 'indexing/resources' directory
- Apply the fixit tool (http://people.apache.org/~andy/Freebase20121223/)
to the dump using the command:

 gunzip -c ${FB_DUMP} | fixit | gzip > ${FB_DUMP_fixed}

- Move the fixed data dump (*.ttl.gz) to 'indexing/resources/rdfdata'

- Configure the mappings and the other settings found in the
'indexing/conf' directory

- Run the Freebase indexing tool:

java -Xmx32g -jar org.apache.stanbol.entityhub.indexing.freebase-*.jar index

(to keep 'Bad IRI...' warnings out of the output, append
'| grep -v "Bad IRI"' to the previous command)

- The indexing tool generates two files in the 'indexing/dist' directory:

  * freebase.solrindex.zip must be copied to stanbol/datafiles

  * org.apache.stanbol.data.site.freebase-*.jar must be copied to
stanbol/fileinstall

(If the Stanbol stable launcher is being used, also add the
'commons.solr.extras.kuromoji' and 'commons.solr.extras.smartcn' bundles to
the stanbol/fileinstall directory)
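
The file-handling parts of the steps above (renaming the dump, moving the
fixed dump into place, filtering the 'Bad IRI' lines and copying the
results into Stanbol) could be scripted roughly as follows. This is only a
sketch under assumptions: the function names and the directory arguments
are made up for illustration and are not part of the indexing tool.

```shell
# Sketch only: helper names and directory arguments are assumptions.

# Rename a *.gz dump to *.ttl.gz, the name the indexing tool expects.
# Prints the resulting file name.
rename_dump() {
    dump="$1"
    case "$dump" in
        *.ttl.gz) echo "$dump" ;;                      # already renamed
        *.gz)     new="${dump%.gz}.ttl.gz"
                  mv "$dump" "$new" && echo "$new" ;;
        *)        echo "not a .gz file: $dump" >&2; return 1 ;;
    esac
}

# Move the fixed dump into the directory the indexing tool reads from.
# $1 = fixed dump, $2 = directory in which 'init' was run.
stage_dump() {
    mkdir -p "$2/indexing/resources/rdfdata"
    mv "$1" "$2/indexing/resources/rdfdata/"
}

# Drop lines containing 'Bad IRI' from whatever arrives on stdin.
filter_bad_iri() {
    grep -v "Bad IRI"
}

# Copy the two generated files into a Stanbol installation.
# $1 = directory in which the indexing ran, $2 = Stanbol directory.
deploy_index() {
    mkdir -p "$2/datafiles" "$2/fileinstall"
    cp "$1/indexing/dist/freebase.solrindex.zip" "$2/datafiles/"
    cp "$1"/indexing/dist/org.apache.stanbol.data.site.freebase-*.jar \
       "$2/fileinstall/"
}
```

The filter would then be attached to the indexing run with something like
'java ... index 2>&1 | filter_bad_iri'. The '2>&1' is there because I have
not checked whether the warnings go to stdout or stderr; it merges both
streams so the filter sees them either way.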


The indexing tool takes too long on a standard computer. To run this
process you'll need either a computer with an SSD or a computer with
200GB of RAM, so that the whole Freebase data dump can be handled in
memory.


For the next milestone (midterm evaluation) the following tasks need to be
done:
1.  Convert the wiki-links data dump to RDF
    * Wiki-links contains a lot of disambiguation information that we want
to incorporate into the EntityHub Freebase site.
    * The wiki-links data dump will be converted to RDF so that it is
easier to process with the new Stanbol Freebase indexing tool (point 2).
    * The expanded wiki-links dataset [1] will be used because it contains
information such as the extracted context for the mentions, the alignment
to Freebase entities, etc.
2.  Develop a new Stanbol indexer to join the Freebase and wiki-links
information
3.  Generate a graph with the links in Freebase
    * To support graph-based disambiguation algorithms in Stanbol, a graph
will be generated using Blueprints Neo4j. Every node in the graph will be
associated with an entry in the EntityHub, so that an entity can later be
used to jump directly to its node in the graph.

Comments are more than welcome

Regards

[1] http://www.iesl.cs.umass.edu/data/wiki-links

