Hi all,

Today I committed the first version of the FST linking engine. This
engine implements entity linking based on Lucene FST (Finite State
Transducer) technology, which allows it to perform the label based
entity lookup fully in-memory. Only the entity specific information
(URI, labels, types and ranking) for tagged entities needs to be
loaded from disk (or retrieved from an in-memory cache).

This engine does not use Lucene's FST API directly, but reuses the
OpenSextant SolrTextTagger [1] module implemented by David Smiley (in
cc).

To give users some idea of how efficiently FSTs can hold this kind of
information, these are the statistics for the FST models required for
entity linking against Freebase (http://freebase.com):

* Number of Entities: ~40 million
* FST for English labels: < 200MByte
* FSTs for other major languages: each < 20MByte
* FSTs for all ~200 used language codes combined: about 500MByte

That means that multilingual in-memory entity linking against Freebase
can be done with about 500MByte of RAM!
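To illustrate why FSTs are that compact, here is a minimal sketch
using Lucene's low level FST API that maps a few entity labels to
numeric IDs fully in memory. The labels and IDs are made up, and the
exact helper/builder class names differ between Lucene versions; the
engine itself goes through the SolrTextTagger rather than using this
API directly:

    import org.apache.lucene.util.BytesRef;
    import org.apache.lucene.util.IntsRefBuilder;
    import org.apache.lucene.util.fst.Builder;
    import org.apache.lucene.util.fst.FST;
    import org.apache.lucene.util.fst.PositiveIntOutputs;
    import org.apache.lucene.util.fst.Util;

    public class FstLabelLookupSketch {
        public static void main(String[] args) throws Exception {
            // made up entity labels (sorted) with made up numeric IDs
            String[] labels = {"Berlin", "Paris", "Vienna"};
            long[] ids = {101, 102, 103};
            PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
            Builder<Long> builder = new Builder<>(FST.INPUT_TYPE.BYTE1, outputs);
            IntsRefBuilder scratch = new IntsRefBuilder();
            for (int i = 0; i < labels.length; i++) {
                // inputs must be added in sorted order
                builder.add(Util.toIntsRef(new BytesRef(labels[i]), scratch),
                        ids[i]);
            }
            FST<Long> fst = builder.finish();
            // lookup is a pure in-memory traversal of the compressed automaton
            System.out.println(Util.get(fst, new BytesRef("Paris"))); // -> 102
        }
    }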

The engine is currently not included in the default build because one
of its dependencies, version 1.2 of the SolrTextTagger, is not yet
released. So to test it you will need to go to
`enhancement-engines/lucenefstlinking` and follow the steps described
in the README.md [2]

The README.md [2] also provides details on how to configure both the
Solr index used by the engine and the engine itself.

Performance characteristics (compared to the current EntityLinking engine):
-----

Most importantly: with the FST linking engine, the matching of entity
labels against occurrences in the text is done fully in-memory; no
disk IO is needed for that part. The current EntityLinkingEngine
performs this step by issuing Solr queries.

However, the FST linking engine only gets the integer Lucene document
IDs as the result of the linking process. It therefore needs to load
the linking relevant information for those IDs (URI, labels, types and
rankings) from the Solr index, which does require disk IO. To reduce
the impact of this, the FST linking engine includes an LRU cache for
that information. The EntityLinking engine gets that information "for
free" in the result lists of its Solr queries.

So to sum up: while the EntityLinking engine spends about 95% of its
time executing Solr queries, the FST linking engine spends most of its
time loading the entity information from disk.

Initial Performance Tests:
----

I ran a test on my MacBook Pro (Core i7 2.6GHz, SSD), sending 5k
DBpedia long abstracts with 10 concurrent threads via the Enhancer
Stress Test Tool [3] to chains that included language detection and
OpenNLP token, sentence and POS tagging, plus

(A) the FST linking engine configured for Freebase with a document
cache size of 1 million, vs.
(B) the EntityLinking engine, also configured for Freebase.

The results:

(A) an average of 70ms for FST linking (at 100% CPU)
(B) an average of 390ms for EntityLinking

When running the test with ProperNoun linking deactivated (so that
common nouns are linked as well, simulating longer texts), the results
are:

(A) an average of 267ms for FST linking (at 100% CPU)
(B) an average of 1417ms for EntityLinking

In both cases the FST linking engine is about 5 times faster than the
currently used EntityLinking engine.
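
If you want to try a single request by hand rather than via the
stress test tool, a plain HTTP POST against the Enhancer RESTful API
is enough. The sketch below is only an illustration: the chain name
"fst-linking", the port and the example text are made up, so adjust
them to your local setup:

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.Scanner;

    public class EnhancerRequestSketch {
        public static void main(String[] args) throws Exception {
            // "fst-linking" is a made up chain name; use the chain you configured
            URL url = new URL("http://localhost:8080/enhancer/chain/fst-linking");
            HttpURLConnection con = (HttpURLConnection) url.openConnection();
            con.setRequestMethod("POST");
            con.setDoOutput(true);
            con.setRequestProperty("Content-Type", "text/plain; charset=UTF-8");
            con.setRequestProperty("Accept", "application/rdf+xml");
            try (OutputStream out = con.getOutputStream()) {
                out.write("Paris is the capital of France."
                        .getBytes(StandardCharsets.UTF_8));
            }
            // print the returned enhancement results (RDF/XML)
            try (Scanner in = new Scanner(con.getInputStream(), "UTF-8")) {
                while (in.hasNextLine()) {
                    System.out.println(in.nextLine());
                }
            }
        }
    }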

best
Rupert


[1] https://github.com/OpenSextant/SolrTextTagger/
[2] http://svn.apache.org/repos/asf/stanbol/trunk/enhancement-engines/lucenefstlinking/README.md
[3] http://stanbol.apache.org/docs/trunk/utils/enhancerstresstest

-- 
| Rupert Westenthaler             rupert.westentha...@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
