svn commit: r1528830 - /stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/

rwesten Thu, 03 Oct 2013 05:43:14 -0700

Author: rwesten
Date: Thu Oct  3 12:41:43 2013
New Revision: 1528830

URL: http://svn.apache.org/r1528830
Log:
STANBOL-1128: Documentation for the FST linking engine


Added:
    stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/fstengine-config-addfields.png   (with props)
    stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/fstengine-config-fstconfig.png   (with props)
    stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/fstengine-config-fstfolder.png   (with props)
    stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/fstengine-config-indexlayout.png   (with props)
    stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/fstengine-config-solrcore.png   (with props)
    stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/fstengine-config.png   (with props)
    stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/lucenefstlinking.mdtext
Modified:
    stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/list.mdtext

Added: 
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/fstengine-config-addfields.png
URL: 
http://svn.apache.org/viewvc/stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/fstengine-config-addfields.png?rev=1528830&view=auto
==============================================================================
Binary file - no diff available.

Propchange: 
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/fstengine-config-addfields.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: 
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/fstengine-config-fstconfig.png
URL: 
http://svn.apache.org/viewvc/stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/fstengine-config-fstconfig.png?rev=1528830&view=auto
==============================================================================
Binary file - no diff available.

Propchange: 
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/fstengine-config-fstconfig.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: 
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/fstengine-config-fstfolder.png
URL: 
http://svn.apache.org/viewvc/stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/fstengine-config-fstfolder.png?rev=1528830&view=auto
==============================================================================
Binary file - no diff available.

Propchange: 
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/fstengine-config-fstfolder.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: 
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/fstengine-config-indexlayout.png
URL: 
http://svn.apache.org/viewvc/stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/fstengine-config-indexlayout.png?rev=1528830&view=auto
==============================================================================
Binary file - no diff available.

Propchange: 
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/fstengine-config-indexlayout.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: 
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/fstengine-config-solrcore.png
URL: 
http://svn.apache.org/viewvc/stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/fstengine-config-solrcore.png?rev=1528830&view=auto
==============================================================================
Binary file - no diff available.

Propchange: 
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/fstengine-config-solrcore.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: 
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/fstengine-config.png
URL: 
http://svn.apache.org/viewvc/stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/fstengine-config.png?rev=1528830&view=auto
==============================================================================
Binary file - no diff available.

Propchange: 
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/fstengine-config.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Modified: 
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/list.mdtext
URL: 
http://svn.apache.org/viewvc/stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/list.mdtext?rev=1528830&r1=1528829&r2=1528830&view=diff
==============================================================================
--- 
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/list.mdtext 
(original)
+++ 
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/list.mdtext 
Thu Oct  3 12:41:43 2013
@@ -132,6 +132,13 @@ This category covers enhancement engines
        * Links Entities managed by the Entityhub, ReferencedSites or 
ManagedSites
        * Supports any language however quality/performance depends on NLP 
processing support
 
+* __[FST Linking Engine](lucenefstlinking):__
+       * Entity Linking Engine based on Lucene FST (Finite State Transducer) 
technology
+       * Links Entities indexed in a Solr index (e.g. an Entityhub Site backed 
by a SolrYard)
+       * Provides better linking performance than the [Entityhub Linking 
Engine](entityhublinking)
+       * Requires a lot of CPU after changes of the vocabulary to re-create 
the FST models.
+
+
 * __DBpedia Spotlight Annotation Engine:__ Integration of the DBpedia 
Spotlight with the Stanbol Enhancer (see 
[STANBOL-706](https://issues.apache.org/jira/browse/STANBOL-706))
        * includes NLP, Entity Linking and Disambiguation of Entities using 
[DBpedia](http://dbpedia.org) as knowledge base
        * accesses a remote service

Added: 
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/lucenefstlinking.mdtext
URL: 
http://svn.apache.org/viewvc/stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/lucenefstlinking.mdtext?rev=1528830&view=auto
==============================================================================
--- 
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/lucenefstlinking.mdtext
 (added)
+++ 
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/lucenefstlinking.mdtext
 Thu Oct  3 12:41:43 2013
@@ -0,0 +1,164 @@
+Title: The FST Linking Engine: Linking NLP processed Text with Vocabularies 
indexed in a Solr index
+
+The __Lucene FST Linking Engine__ is an Entity Linking Engine based on the 
[Lucene](http://lucene.apache.org) FST (Finite State Transducer) technology. 
FST provides a very efficient way to hold Entity labels in-memory. This avoids 
the disc IO required by the other entity linking engines.
+
+This engine is built on top of the OpenSextant 
[Solr-Text-Tagger](https://github.com/OpenSextant/SolrTextTagger/) that 
implements the building of the FST models as well as the tagging of the 
processed text.
+
+
+## Configuration
+
+The configuration of the FST linking engine consists of several parts 
explained in detail by the following sub-sections.
+Configurations can be created by using the [Configuration 
Dialog](fstengine-config.png) provided by the Apache Felix Webconsole (search 
for "FST Linking" in the configuration tab). However, NOTE that this dialog does 
not include all supported configuration options. Options not included in the 
dialog can be configured by directly using OSGi configuration (*.config) files.
+
+### Engine Name and Service Ranking
+
+Like all Stanbol Enhancement Engines, this engine supports the following two 
properties:
+
+* __Name__ _(stanbol.enhancer.engine.name)_: The name of the Enhancement 
Engine. This name is used to refer to an [EnhancementEngine](index.html) in 
[EnhancementChain](../chains)s
+* __ServiceRanking__ _(service.ranking)_: In case multiple enhancement 
engines use the same name, only the one with the highest ranking will 
be used.
+
+### Configuration of the Solr Index
+
+![SolrCore configuration](fstengine-config-solrcore.png "The configuration 
option used to configure the SolrCore")
+
+The Solr index is configured by using the 
`enhancer.engines.linking.lucenefst.solrcore` configuration property of the 
Engine. This property needs to point to a Solr index that runs embedded in the 
same JVM as Apache Stanbol. The Stanbol Commons Solr modules provide two 
Components that allow to configure embedded Solr Indexes:
+
+1. 
__[ReferencedSolrServer](/docs/trunk/utils/commons-solr#referencedsolrserver)__:
 This component allows users to configure a directory containing a SolrServer 
configuration (the directory with the solr.xml file). All Solr indexes defined 
by the solr.xml will be initialized and published as OSGi services to Apache 
Stanbol. Such indexes can be configured to the engine by using 
{server-name}:{index-name}. {server-name} is the name of the 
ReferencedSolrServer as provided in the configuration. {index-name} is the name 
of the Solr index as defined in the solr.xml.
+1. __[ManagedSolrServer](/docs/trunk/utils/commons-solr#managedsolrserver)__: 
This component provides a Solr server that is fully managed by Apache 
Stanbol. Indexes can be installed by copying '{index-name}.solrindex.zip' files 
to the 'stanbol/datafiles' directory. Solr indexes initialized like that will be 
available under '{index-name}' and 'default:{index-name}'.
+
+Solr indexes in use also need to conform to the requirements of the 
[SolrTextTagger](https://github.com/OpenSextant/SolrTextTagger/) module. That 
means that fields used for FST linking MUST use field analyzers that produce 
consecutive positions (i.e. the position increment of each term must always be 
1). This means that typical field analyzers as used for searches will not work.
+
+The SolrTextTagger README provides an example for a Field Analyzer 
configuration that does work. To make things easier this engine includes this 
[XML file](fst_field_types.xml) that includes a schema.xml fragment with FST 
tagging compatible configurations for most languages supported by Solr.
+
+
+### Solr Index Layout Configuration
+
+![Solr core index layout configuration](fstengine-config-indexlayout.png "The 
configuration option used to configure the Solr Index Layout")
+
+This part of the configuration is used to specify the layout of the Solr 
index in use. It specifies how Entity information is stored in the Solr index.
+
+#### Field Name Encoding 
+
+The Field Name Encoding configuration 
`enhancer.engines.linking.lucenefst.fieldEncoding` specifies how Solr fields 
for multiple languages are encoded. As an example a Vocabulary with labels in 
multiple languages might use "en_label" for the English language labels and 
"de_label" for the German language labels. In this case users should set this 
property to `UnderscorePrefix` and simply use "label" when configuring the FST 
field name. 
+
+The Field Name Encodings work well with Solr dynamic field configurations that 
allow mapping language specific FieldType specifications to prefixes and 
suffixes such as
+
+    <dynamicField name="en_*" type="text_en_fst" indexed="true" stored="true" 
multiValued="true" omitNorms="false"/>
+    <dynamicField name="de_*" type="text_de_fst" indexed="true" stored="true" 
multiValued="true" omitNorms="false"/>
+
+This is the full list of supported Field encodings:
+
+* SolrYard: This supports the encoding used by the Stanbol Entityhub SolrYard 
implementation to encode RDF data types and language literals. If you configure 
the FST Linking Engine for a Solr index built for the SolrYard you need to use 
this encoding.
+* MinusPrefix: {lang}-{field} (e.g. "en-name")
+* UnderscorePrefix: {lang}_{field} (e.g. "en_name")
+* AtPrefix: {lang}@{field} (e.g. "en@name")
+* MinusSuffix: {field}-{lang} (e.g. "name-en")
+* UnderscoreSuffix: {field}_{lang} (e.g. "name_en")
+* AtSuffix: {field}@{lang} (e.g. "name@en")
+* None: In this case no prefix/suffix rewriting of configured `field` and 
`store` values is done. This means that the FST Configuration MUST define the 
exact field names in the Solr index for every configured language.
+
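As an illustration of the encodings listed above, the following Python sketch maps an encoding name, a configured field name and a language code to the resulting Solr field name. It is not the engine's (Java) implementation: `ENCODING_TEMPLATES` and `encode_field` are hypothetical names chosen for this example, and the `SolrYard` encoding is left out because its scheme is not described in this section.

```python
# Hypothetical lookup table: encoding name -> field name template.
ENCODING_TEMPLATES = {
    "MinusPrefix": "{lang}-{field}",
    "UnderscorePrefix": "{lang}_{field}",
    "AtPrefix": "{lang}@{field}",
    "MinusSuffix": "{field}-{lang}",
    "UnderscoreSuffix": "{field}_{lang}",
    "AtSuffix": "{field}@{lang}",
    "None": "{field}",  # no rewriting: the configured name is used as-is
}


def encode_field(encoding: str, field: str, lang: str) -> str:
    """Return the Solr field name for a configured field and language."""
    if encoding not in ENCODING_TEMPLATES:
        raise ValueError(f"encoding not covered by this sketch: {encoding}")
    return ENCODING_TEMPLATES[encoding].format(lang=lang, field=field)


if __name__ == "__main__":
    print(encode_field("UnderscorePrefix", "label", "en"))
    print(encode_field("AtSuffix", "name", "de"))
```

With `UnderscorePrefix`, a configured field of "label" and the language "en" resolve to "en_label", which matches the dynamic field pattern `en_*` shown earlier.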
+#### FST Tagging Configuration
+
+![FST configuration](fstengine-config-fstconfig.png "The configuration used to 
configure the languages and fields FST models are built for")
+
+
+The FST Tagging Configuration `enhancer.engines.linking.lucenefst.fstconfig` 
defines several things:
+
+1. for what languages FST models should be built. This configuration is 
basically a list of language codes but also supports wildcards '*' and 
exclusions '!{lang}' (e.g. '!en')
+2. what fields in the Solr Index are used to build FST models. Two fields per 
language are required: a) an 'Indexed Field' (_field_ parameter) and b) a 
'Stored Field' (_stored_ parameter). Both the indexed and stored field might 
refer to the same field in the Solr index. In that case this field needs to use 
`indexed="true" stored="true"`.
+3. if FST models can be built by the Engine at runtime as well as the name of 
the serialized models.
+
+This configuration is line based (multi valued) and uses the following generic 
syntax:
+
+    {language};{param}={value};{param1}={value1};
+    !{language}
+
+`{language}` is either the name of the language (e.g. 'en'), '*' for all 
languages or '' (empty string) for defining default parameter values without 
including all languages. Lines that start with '!' explicitly exclude a 
language. Those lines do not allow parameters.
+
+The following parameters are supported by the Engine:
+
+* __field__: The indexed field in the configured Solr index. In multilingual 
scenarios this might be the 'base name' of the field that is extended by a 
prefix or suffix to get the actual field name in the Solr index (see also the 
field encoding configuration)
+* __stored__ (default: _field_ value) : The field in the Solr index with the 
stored label information. This parameter is optional. If not present `stored` 
is assumed to be equal to `field`.
+* __fst__ (default based on _field_ value): Optionally allows manually 
specifying the base file name of the FST models. Those files are expected within 
the data directory of the configured Solr index under `fst/{fst}.{lang}.fst`. 
By default the configured `field` name is used (with non alpha-numeric chars 
replaced by '_'). If runtime creation is enabled those files will be created if 
not present.
+* __generate__ (default: false): If enabled the Engine will generate missing 
FST models. If this is enabled the engine will also be able to update FST 
models after changes to the Solr Index. __NOTE__ that the creation of FST 
models is an expensive operation (both CPU and memory wise), which is why the 
default is `false`. The FST engine uses a pool of low priority threads to 
create FST models. The size of the pool can be configured by using the 
`enhancer.engines.linking.lucenefst.fstThreadPoolSize` parameter.
+
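The default naming rule for the `fst` parameter can be sketched as follows. This is an illustrative Python snippet, not the engine's code: `fst_model_path` is a hypothetical helper, and it assumes that "alpha-numeric" means ASCII letters and digits and that every other character is replaced by a single '_'.

```python
import re
from typing import Optional


def fst_model_path(field: str, lang: str, fst: Optional[str] = None) -> str:
    """Relative path of the serialized FST model for one language.

    If no explicit `fst` base name is given, it is derived from `field`
    by replacing non alpha-numeric characters with '_' (assumed: ASCII).
    """
    base = fst if fst is not None else re.sub(r"[^A-Za-z0-9]", "_", field)
    return f"fst/{base}.{lang}.fst"


if __name__ == "__main__":
    print(fst_model_path("fise:fstTagging", "en"))
    print(fst_model_path("fise:fstTagging", "de", fst="labels"))
```

Under these assumptions the field "fise:fstTagging" and the language "en" give `fst/fise_fstTagging.en.fst`.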
+A more advanced Configuration might look like:
+
+    ;field=fise:fstTagging;stored=rdfs:label;generate=true
+    en
+    de
+    es
+    fr
+    it
+
+This would set the index field to "fise:fstTagging", the stored field to 
"rdfs:label" and allow runtime generation. It would also enable the processing 
of English, German, Spanish, French and Italian texts. A similar configuration 
that would build FST models for all languages would look as follows:
+
+    *;field=fise:fstTagging;stored=rdfs:label;generate=true
+
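To make the line based syntax more concrete, here is a minimal Python sketch of how such a configuration could be read. It is not taken from the engine: `parse_fst_config` and `params_for` are hypothetical names, and the precedence used here (explicit language line over '*' line, both layered over the '' defaults, exclusions always winning) is an assumption made for the example.

```python
def parse_fst_config(lines):
    """Split config lines into default params, included languages, exclusions."""
    defaults, included, excluded = {}, {}, set()
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("!"):
            # exclusion lines carry no parameters
            excluded.add(line[1:])
            continue
        lang, *parts = line.split(";")
        params = dict(p.split("=", 1) for p in parts if p)
        if lang == "":
            defaults.update(params)  # '' only defines default values
        else:
            included[lang] = params  # a language code or the '*' wildcard
    return defaults, included, excluded


def params_for(lang, defaults, included, excluded):
    """Effective parameters for a language, or None if it is not configured."""
    if lang in excluded:
        return None
    if lang in included:
        own = included[lang]
    elif "*" in included:
        own = included["*"]
    else:
        return None
    merged = {**defaults, **own}
    # `stored` falls back to `field` when it is not given
    merged.setdefault("stored", merged.get("field"))
    return merged


if __name__ == "__main__":
    cfg = parse_fst_config([";field=fise:fstTagging;stored=rdfs:label;generate=true", "en", "de"])
    print(params_for("en", *cfg))
    print(params_for("fr", *cfg))
```

For the first example configuration above, "en" resolves to the default parameters while a language that is not listed, such as "fr" in the snippet, is not processed.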
+#### Additional Entity Information
+
+![Additional Fields config](fstengine-config-addfields.png "Fields the types 
and rankings of entities are read from")
+
+In addition to the URI and the labels of Entities the EntityLinking process 
also uses entity type and ranking information.
+
+* __Entity Type Field__ _(enhancer.engines.linking.lucenefst.typeField)_: This 
field specifies the Solr field name holding entity type information of 
Entities. In case 'SolrYard' is used as _Field Name Encoding_ one can use the 
QNAME of the property (typically 'rdf:type'). Otherwise the value must be 
the exact field name holding the type information. Values are expected to be 
URIs.
+* __Entity Ranking Field__ 
_(enhancer.engines.linking.lucenefst.rankingField)_: This is an __ADDITIONAL__ 
property used to configure the name of the Field storing the floating point 
value of the ranking for the Entity. Entities with higher ranking will get a 
slightly better `fise:confidence` value if labels of several Entities do match 
the text.
+
+NOTE that type and ranking information are optional.
+
+### Runtime FST generation Thread Pool
+
+The `enhancer.engines.linking.lucenefst.fstThreadPoolSize` parameter can be 
used to configure the size of the thread pool used for the runtime generation 
of FST models. The default size of the thread pool is `1`. Threads do use the 
lowest possible priority to reduce the performance impact on enhancements as 
much as possible.
+
+When configuring the size of the thread pool users need to be aware that the 
generation of FST models needs a lot more memory than the resulting model. So 
having too many parallel threads might require increasing the memory settings 
of the JVM. On typical machines FST creation threads will consume 100% CPU. 
That means that the number of threads should be configured to the number of CPU 
cores that can be spared for FST generation.
+
+_NOTE_ that the `generate` parameter of the FST Tagging Configuration needs to 
be set to `true` to enable runtime generation.
+
+### FST storage location
+
+![FST folder](fstengine-config-fstfolder.png "Configuration of the storage 
location for FST models")
+
+FST models are not only kept in memory but also serialized to disc. This 
avoids rebuilding the model after a restart of the Stanbol Server. By default 
the models are stored within the data folder of the SolrCore. However in some 
scenarios users might want to store FST models in a different location. This 
can be achieved by using the `enhancer.engines.linking.lucenefst.fstfolder` 
property.
+
+This configuration option supports property substitution with OSGi and 
System properties. In addition it supports the following additional properties 
(all relative to the configured SolrCore):
+
+* `solr-data-dir` : the data directory of the SolrCore
+* `solr-index-dir`: the index directory of the SolrCore
+* `solr-server-name`: the name of the 
[ReferencedSolrServer](/docs/trunk/utils/commons-solr#referencedsolrserver) or 
[ManagedSolrServer](/docs/trunk/utils/commons-solr#managedsolrserver) holding 
the SolrCore (see also [Configuration of the Solr Index])
+* `solr-core-name` : the name of the SolrCore
+
+The default value of this property is `${solr-data-dir}/fst`. To manage FST 
models within the Stanbol folder you can use e.g. 
`${sling.home}/fst/${solr-server-name}/${solr-core-name}`.
+
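The `${...}` substitution described above can be pictured with the following Python sketch. It only illustrates the idea and is not how the engine resolves properties: `resolve_fst_folder` is a hypothetical helper and all property values in the example are made up.

```python
import re


def resolve_fst_folder(template: str, props: dict) -> str:
    """Replace every ${name} in the template with its value from props."""
    def lookup(match):
        name = match.group(1)
        if name not in props:
            raise KeyError(f"unknown property: {name}")
        return props[name]

    return re.sub(r"\$\{([^}]+)\}", lookup, template)


if __name__ == "__main__":
    # Hypothetical values, for illustration only
    props = {
        "sling.home": "stanbol",
        "solr-server-name": "default",
        "solr-core-name": "dbpedia",
        "solr-data-dir": "/data/solr/dbpedia/data",
    }
    print(resolve_fst_folder("${solr-data-dir}/fst", props))
    print(resolve_fst_folder("${sling.home}/fst/${solr-server-name}/${solr-core-name}", props))
```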
+
+### Entity Cache Configuration
+
+While FST tagging is fully done in-memory the FST linking engine needs to read 
information of matching Entities from the Solr index. This requires disc IO and 
is typically the part of the process that consumes the most time. The Entity 
Cache tries to prevent such disc level IO by caching SolrDocuments containing 
only fields required for the linking process (labels, types and (if available) 
entity rankings).  To further reduce memory requirements only labels in 
languages requested by processed ContentItems are stored in the cache. The 
Cache uses LRU semantics and is based on the Solr cache implementation.
+
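For readers unfamiliar with LRU (least recently used) caching, the behaviour can be sketched in a few lines of Python. This is a generic illustration and not the Solr cache implementation the engine builds on; the `EntityCache` class, its methods and the sample URIs are hypothetical.

```python
from collections import OrderedDict


class EntityCache:
    """Minimal LRU cache: the least recently used entry is evicted first."""

    def __init__(self, max_size: int):
        self.max_size = max_size
        self._docs = OrderedDict()

    def get(self, uri):
        doc = self._docs.get(uri)
        if doc is not None:
            # mark as most recently used
            self._docs.move_to_end(uri)
        return doc

    def put(self, uri, doc):
        self._docs[uri] = doc
        self._docs.move_to_end(uri)
        if len(self._docs) > self.max_size:
            # drop the least recently used entry
            self._docs.popitem(last=False)


if __name__ == "__main__":
    cache = EntityCache(max_size=2)
    cache.put("urn:example:a", {"label": "A"})
    cache.put("urn:example:b", {"label": "B"})
    cache.get("urn:example:a")                  # 'a' is now the most recent
    cache.put("urn:example:c", {"label": "C"})  # evicts 'b'
    print(cache.get("urn:example:b"))
```

Entities that are looked up frequently stay in memory, while rarely used ones are read from the Solr index again when needed.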
+The size of the cache can be configured by using the 
`enhancer.engines.linking.lucenefst.entityCacheSize` parameter. The default 
size is ~65k entities. Increasing the maximum size of the cache will improve 
performance. For small and medium sized vocabularies the cache can be 
configured 
+
+
+### Text Processing Configuration
+
+With the extension of the SolrTextTagger with a 
[TaggingAttribute](https://github.com/OpenSextant/SolrTextTagger/pull/7) the 
FST linking engine can support the exact same text processing functionality as 
the other Entity Linking Engine.
+
+For the configuration please see the [Text Processing 
configuration](entitylinking#text-processing-configuration) section of the 
Entity Linking Engine.
+
+### Entity Linking Configuration
+
+The Entity Linking Configuration of this Engine is very similar to the one for 
the [EntityLinking 
engine](http://stanbol.apache.org/docs/trunk/components/enhancer/engines/entitylinking#entity-linker-configuration).
 The configuration uses the exact same keys, but it does not support all 
properties and some have a slightly different meaning. In the following only 
the differences are described. For everything else please refer to the 
linked section of the documentation of the EntityLinking engine.
+
+* <s>__Label Field__ _(enhancer.engines.linking.labelField)_</s>: The label 
field is __IGNORED__ as the field holding the labels is anyway provided by the 
[FST Tagging Configuration]. That means that the field defined by the _stored_ 
parameter is used. If the _stored_ parameter is not present it falls back to the 
_field_ parameter.
+* <s>__Type Field__ _(enhancer.engines.linking.typeField)_</s>: This 
configuration gets __IGNORED__ in favor of the 
`enhancer.engines.linking.lucenefst.typeField`. See the [Additional Entity 
Information] section for details. 
+* <s>__Redirect Field__ _(enhancer.engines.linking.redirectField)_</s>: Not 
implemented. __NOTE__ This might not be possible to implement efficiently, as 
redirects would already need to be considered when building the FST models.
+* <s>__Use EntityRankings__ _(enhancer.engines.linking.useEntityRankings)_</s>: 
This configuration gets __IGNORED__. EntityRanking based sorting is enabled as 
soon as the _Entity Ranking Field_ is configured.
+* <s>__Lemma based Matching__ _(enhancer.engines.linking.lemmaMatching)_</s>: 
Not Yet implemented
+* <s>__Min Match Score__ _(enhancer.engines.linking.minMatchScore)_</s>: Not 
Yet Implemented. Currently all linked Entities are added regardless of their 
score. However the way the Tagging is done makes it very unlikely to have 
suggestions with `fise:confidence` values less than 0.5.
+
+In addition the following properties are __IGNORED__ as they are not relevant 
for the FST Linking Engine:
+
+* <s>__Max Search Token Distance__ 
_(enhancer.engines.linking.maxSearchTokenDistance)_</s>
+* <s>__Max Search Tokens__ _(enhancer.engines.linking.maxSearchTokens)_</s>
+* <s>__Min Matched Tokens__ _(enhancer.engines.linking.minFoundTokens)_</s>
+* <s>__Min Text Score__ _(enhancer.engines.linking.minTextScore)_</s>
+
+
