Hi Antonio, all

As Antonio's mentor I will try to give some additional information
about the progress and results of this GSoC project.

On Mon, Jul 29, 2013 at 10:15 AM, Antonio Perez <ape...@zaizi.com> wrote:
> Hi all
>
> Since this week is the midterm evaluation of the GSoC projects, I want to
> give you an update on the status of this project.
>
> I began my project trying to index Freebase data using the Freebase indexer
> in Stanbol, but this process was too expensive to run on a normal
> computer (with about 8 GB RAM and a non-SSD hard disk).
>
> I was able to create a Referenced Site with Freebase data using Rupert's
> index (generated using an SSD hard disk)
>

The very same issue also affected dileepaj (the other GSoC project) a
lot. The reason is that importing RDF triples into a triple store is
very expensive IO-wise. For every RDF triple the following steps need
to be performed:

* an existence check for every imported RDF resource (subject, predicate, object)
* an existence check for the RDF triple itself
* creation of the non-existing nodes and the triple, including updates
of the indexes

Jena TDB uses memory-mapped files for holding this information. That
means it scales well as long as those lookup tables fit into memory
(the files can be mapped to RAM). If you exceed that limit, a lot of
disk IO is generated. With an SSD you can still expect about 1-5k
triples/sec to be imported, but with a normal disk the import speed is
no longer within an acceptable range for importing large datasets.
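
To illustrate where the time goes, here is a minimal sketch (not the
actual indexer code; the store directory and dump file names are just
placeholders) of such an import with the Jena TDB API. Every triple
read from the dump triggers the node table lookups and index updates
listed above:

  // Minimal sketch of an import into a file based Jena TDB store. Directory
  // and dump file are placeholders. Internally every added triple causes the
  // node table lookups and index updates listed above, which is why the
  // import becomes IO bound once the memory mapped files exceed the RAM.
  import com.hp.hpl.jena.query.Dataset;
  import com.hp.hpl.jena.rdf.model.Model;
  import com.hp.hpl.jena.tdb.TDBFactory;

  import java.io.FileInputStream;
  import java.io.InputStream;

  public class TdbImportSketch {
      public static void main(String[] args) throws Exception {
          Dataset dataset = TDBFactory.createDataset("/data/freebase-tdb");
          try {
              Model model = dataset.getDefaultModel();
              InputStream in = new FileInputStream("/data/freebase-dump.nt");
              try {
                  model.read(in, null, "N-TRIPLE"); // parses and adds triple by triple
              } finally {
                  in.close();
              }
          } finally {
              dataset.close();
          }
      }
  }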

As Antonio had no access to a server with enough resources (even the
biggest server you can rent on Amazon is not sufficient), we decided
NOT to continue this part within the GSoC project. However, as this is
a problem that also affects other users, we would still like to have a
workaround for this limitation.

> Currently, Rafa Haro is working on the Jena TDB part of the indexer in
> order to speed up the process of indexing Freebase data.

Because of that, Rafa volunteered to create an IndexingSource for the
Entityhub Indexing Tool that directly "streams" triples without first
adding them to a triple store. This indexing source will have some
limitations:

* Triples need to be sorted by SPO - but this is true for all Turtle
and N3 formatted RDF dumps
* This indexing source cannot support LDPath, which limits to some
extent the processing that can be applied to entities

However, this implementation will have very low RAM requirements. Disk
IO will also be limited to what Solr requires for storing the indexed
entities.
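
To sketch the basic idea (this is just my own illustration, not Rafa's
actual implementation; class and file names are made up): because the
triples are sorted by subject, all statements of one entity arrive
consecutively in the stream, so they can be buffered in memory, handed
over to the indexer and discarded again - without ever needing a
triple store. RAM usage is bounded by the biggest single entity.

  // Illustration only (not the real IndexingSource): stream an SPO sorted
  // N-Triples dump with Jena RIOT and emit one entity at a time.
  import com.hp.hpl.jena.graph.Node;
  import com.hp.hpl.jena.graph.Triple;
  import org.apache.jena.riot.RDFDataMgr;
  import org.apache.jena.riot.system.StreamRDFBase;

  import java.util.ArrayList;
  import java.util.List;

  public class SortedTripleStream extends StreamRDFBase {

      private Node currentSubject;
      private final List<Triple> buffer = new ArrayList<Triple>();

      @Override
      public void triple(Triple t) {
          // a new subject means the previous entity is complete
          if (currentSubject != null && !currentSubject.equals(t.getSubject())) {
              indexEntity(currentSubject, buffer);
              buffer.clear();
          }
          currentSubject = t.getSubject();
          buffer.add(t);
      }

      @Override
      public void finish() {
          if (currentSubject != null) {
              indexEntity(currentSubject, buffer);
          }
      }

      private void indexEntity(Node subject, List<Triple> triples) {
          // placeholder: map the triples to a Representation and pass it on
          // to the Entityhub Indexing Tool / Solr
          System.out.println(subject + ": " + triples.size() + " triples");
      }

      public static void main(String[] args) {
          // file name is a placeholder; RIOT also handles gzipped dumps
          RDFDataMgr.parse(new SortedTripleStream(), "/data/freebase-dump.nt.gz");
      }
  }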

Excluding this task from the GSoC project and assigning it to Rafa
allowed us to continue without much delay. A big thx to Rafa
(currently on vacation) for taking this over!

>
> The next task was to parse the Wikilinks extended dataset [1] and store it
> in a Jena TDB database, so that the contained information can be used for
> tasks like disambiguation.
> Moreover, a service has been created (along with the parser tool) to
> query the data and retrieve information about Wikilink items. The code
> and more information about this library can be found at [2].
>

We were also discussing implementing a more generic TrainingSet
service based on this work. This service would allow users to

* publish manually annotated content
* consume annotations for training NLP components or for building
knowledge bases for disambiguation algorithms.

This service would make it possible to manage existing annotation sets
(such as Wikilinks) but also to publish user-made annotations (e.g.
created by confirming/rejecting suggestions provided by Stanbol).

It would also replace the TrainingSet service that is currently part
of the TopicEngine.

The implementation of such a service will not be part of this GSoC
project, as it is outside the original goals stated in the proposal.
Rafa and I, with some help from Antonio, will work on an initial
design when we are back from vacation (2nd half of August). As soon as
the Jira issue is created I will link it to this thread.
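
Just to make the idea a bit more concrete, here is a purely
hypothetical sketch of what such a service interface could look like.
All names and signatures are made up by me; the actual design will be
worked out in the mentioned Jira issue.

  // Purely hypothetical sketch of a TrainingSet service interface; all names
  // and signatures are assumptions, not an agreed design.
  import java.util.Iterator;

  public interface TrainingSetService {

      /** publish a manually created annotation, e.g. an item of an existing
       *  annotation set (Wikilinks) or a confirmed/rejected Stanbol suggestion */
      void addAnnotation(String contentUri, String mention, String entityUri,
              boolean confirmed);

      /** consume annotations for a given entity, e.g. to train NLP components
       *  or to build a knowledge base for disambiguation */
      Iterator<Annotation> getAnnotations(String entityUri);

      /** hypothetical value object for a single annotation */
      interface Annotation {
          String getContentUri();
          String getMention();
          String getEntityUri();
          boolean isConfirmed();
      }
  }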

> Ideally, when the new Freebase indexer is finished and tested, I would like
> to integrate Freebase data and Wikilinks data in the same referenced site,
> because the Wikilinks extended dataset contains references to Freebase
> entities, so it's relatively easy to link both sets of information. But for
> now, we can use the Wikilinks information to perform other tasks.

Here the idea is to build a dataset for Freebase that can be used with
the existing Solr MLT based enhancement engine (the result of last
year's GSoC project). However, for me this does not have a high
priority, as I would like Antonio to keep the focus on the graph based
disambiguation.

>
> In order to finish the work for the midterm, I have developed a tool to
> import Freebase data from the BaseKBLime data dump [3] into a graph
> database (Neo4j right now, using the Tinkerpop Blueprints interfaces [4]).
> Moreover, a simple algorithm to "weight" the graph is applied during the
> import process.
> The code and more information about this tool can be found at [5].
>
> With this information, I now have a knowledge base which can be used to
> develop new graph-based disambiguation algorithms.
>

FYI: BaseKBLime is a cleaned-up version of the Freebase RDF dump.
Tinkerpop Blueprints defines a generic Java API for property graphs.
Property graphs allow key/value properties on both the nodes and the
edges of a graph. In this case we use those properties to store
information about the origin and the weights of relations between
concepts.
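
A small example of how such a property graph looks with the Blueprints
API (using the in-memory TinkerGraph reference implementation here;
the actual tool uses Neo4j through the same interfaces, and the
property keys and IDs below are placeholders, not the exact ones used
by [5]):

  // Blueprints property graph example; property keys ("weight", "origin") and
  // the Freebase IDs are placeholders, not the exact values used by the tool.
  import com.tinkerpop.blueprints.Edge;
  import com.tinkerpop.blueprints.Graph;
  import com.tinkerpop.blueprints.Vertex;
  import com.tinkerpop.blueprints.impls.tg.TinkerGraph;

  public class PropertyGraphExample {
      public static void main(String[] args) {
          Graph graph = new TinkerGraph(); // in-memory reference implementation

          Vertex cityTopic = graph.addVertex(null);
          cityTopic.setProperty("mid", "/m/xxxxxx");     // some city topic
          Vertex countryTopic = graph.addVertex(null);
          countryTopic.setProperty("mid", "/m/yyyyyy");  // the country it belongs to

          // the relation carries key/value properties for its weight and origin
          Edge relation = graph.addEdge(null, cityTopic, countryTopic, "related");
          relation.setProperty("weight", 0.8d);
          relation.setProperty("origin", "/location/location/containedby");

          graph.shutdown();
      }
  }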

Note that while Blueprints is ASL compatible (BSD licensed), the
currently used database, Neo4j, is not. However, there are a lot of
other database options that are compatible with the Apache License.

@Antonio: can you please extend the documentation of [5] with a real
example for the "Algorithm used to create the relations", especially
for "mediated relations"? The current examples with aaa.bbb.ccc are
really hard to follow. A real example of a Freebase topic would make
the idea easier to understand.

> That is the work done so far for the midterm.
>
> The expected work for the second part is to develop a disambiguation
> algorithm using the generated graph. To do this, I am taking a look at two
> papers ([6] and [7]) to get some ideas for developing a new algorithm.
>

As mentioned, the current plan for the 2nd part of the GSoC project is
to focus on graph algorithm based disambiguation(s) and on an
implementation of such an algorithm based on the Tinkerpop API.

>
> This is all folks, so please feel free to comment. Comments are more than
> welcome.

Thx Antonio for all the work you have put into your GSoC project. Your
topic is very hard and you have had to overcome or work around a lot
of problems so far. I am sure that this work can be a major
contribution to the Stanbol community. To make this happen we need to
inform the community more frequently, especially about topics like

* the streaming RDF IndexingSource
* a link to the Wikilinks RDF file as created by [2]
* the TrainingSet service
* the graph as imported by [5] (similar to how the Entityhub Indexing
Tool creates an index that can be loaded into Stanbol)
* the disambiguation engine(s) to be implemented based on the graph
generated by [5]

Looking forward to an exciting 2nd part of GSoC 2013
best
Rupert

>
> Best regards
>



-- 
| Rupert Westenthaler             rupert.westentha...@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
