Hi all,

According to the project schedule, last Friday was the first milestone of the project: 'Complete the integration of Freebase as EntityHub ReferencedSite in Stanbol'. The steps to achieve this task are the following:

- Download the Freebase indexing tool (based on the Apache Stanbol Indexing Tool) from https://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/freebase/
- Build the jar with Maven; this produces org.apache.stanbol.entityhub.indexing.freebase-*.jar in the target directory.
- Download the Freebase dump from http://download.freebaseapps.com
- Rename the Freebase dump from *.gz to *.ttl.gz (necessary so that the indexing tool treats the dump as Turtle)
- Initialize the configuration, which generates the directory structure, using the command:
    java -jar org.apache.stanbol.entityhub.indexing.freebase-*.jar init
- Generate the scoring file using the fbrankings.sh script and put it in the 'indexing/resources' directory
- Apply the fixit tool (http://people.apache.org/~andy/Freebase20121223/) using the command:
    gunzip -c ${FB_DUMP} | fixit | gzip > ${FB_DUMP_fixed}
- Move the fixed data dump (*.ttl.gz) to 'indexing/resources/rdfdata'
- Configure the mappings and the other settings in the 'indexing/conf' directory
- Run the Freebase indexing tool:
    java -jar -Xmx32g org.apache.stanbol.entityhub.indexing.freebase-*.jar index
  (to avoid 'Bad IRI...' warnings in the log, append '| grep -v "Bad IRI"' to the previous command)
- The indexing tool generates two files in the 'indexing/dist' directory:
  * freebase.solrindex.zip must be copied to stanbol/datafiles
  * org.apache.stanbol.data.site.freebase-*.jar must be copied to stanbol/fileinstall
  (If the Stanbol stable launcher is being used, also add 'commons.solr.extras.kuromoji' and 'commons.solr.extras.smartcn' to the stanbol/fileinstall directory)

Indexing takes too long on a standard computer. To run this process you'll need either a computer with an SSD or one with 200GB of RAM (enough to deal with the whole Freebase data dump in memory).

For the next milestone (midterm evaluation) the following tasks need to be done:

1. Convert the wiki-links data dump to RDF
   * Wiki-links contains a lot of disambiguation information that we want to incorporate into the EntityHub Freebase site.
   * The wiki-links data dump will be converted to RDF so that it is easier to process with the new Stanbol Freebase indexing tool (point 2).
   * The expanded wiki-links dataset [1] will be used because it contains additional information such as the extracted context of the mentions and the alignment to Freebase entities.
2. Develop a new Stanbol indexer to join the Freebase and wiki-links information
3. Generate a graph with the links in Freebase
   * To support graph-based disambiguation algorithms in Stanbol, a graph will be generated using Blueprints Neo4j. Every node in the graph will be associated with an entry in the EntityHub, so that an EntityHub entry can later be used to position directly on its node in the graph.

Comments are more than welcome.

Regards

[1] http://www.iesl.cs.umass.edu/data/wiki-links
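
P.S. In case it helps anyone reproduce the first milestone, the command-line parts of the indexing steps above could be wrapped in a few shell functions. This is only a rough sketch: the function names and the "-fixed" suffix are my own assumptions, it assumes java and fixit are on the PATH, and it writes the fixed dump straight into 'indexing/resources/rdfdata' (so 'init' has to be run first) instead of moving it there afterwards.

```shell
#!/bin/sh
# Sketch only. Hypothetical names: fixed_dump_name, fix_dump, run_indexer,
# and the "-fixed" suffix. Assumes java and fixit are on the PATH.

# Derive the fixed dump name, e.g. freebase-rdf.gz -> freebase-rdf-fixed.ttl.gz
fixed_dump_name() {
    printf '%s-fixed.ttl.gz\n' "${1%.gz}"
}

# Run fixit over the raw dump and write the result into the rdfdata directory.
# $1 = raw Freebase dump (*.gz); the target directory is created by 'init'.
fix_dump() {
    fb_dump=$1
    fb_dump_fixed=indexing/resources/rdfdata/$(basename "$(fixed_dump_name "$fb_dump")")
    gunzip -c "$fb_dump" | fixit | gzip > "$fb_dump_fixed"
}

# Create the directory structure (init) or run the indexing (index).
# $1 = indexing tool jar, $2 = init | index
run_indexer() {
    case $2 in
        init)  java -jar "$1" init ;;
        # Filter the 'Bad IRI...' warnings out of the output.
        index) java -Xmx32g -jar "$1" index | grep -v "Bad IRI" ;;
        *)     echo "usage: run_indexer JAR init|index" >&2; return 2 ;;
    esac
}
```

The intended order would be: run_indexer with 'init', then fix_dump on the downloaded dump, then (after placing the fbrankings.sh scoring file and editing 'indexing/conf') run_indexer with 'index'.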