Author: rwesten
Date: Tue Sep 3 05:55:17 2013
New Revision: 1519565
URL: http://svn.apache.org/r1519565
Log:
STANBOL-1128: Fixed an NPE if no default FST configuration was present;
corrected some errors in the README; changed the ordering of the config
properties.
Modified:
stanbol/trunk/enhancement-engines/lucenefstlinking/README.md
stanbol/trunk/enhancement-engines/lucenefstlinking/src/main/java/org/apache/stanbol/enhancer/engines/lucenefstlinking/FstLinkingEngineComponent.java
Modified: stanbol/trunk/enhancement-engines/lucenefstlinking/README.md
URL:
http://svn.apache.org/viewvc/stanbol/trunk/enhancement-engines/lucenefstlinking/README.md?rev=1519565&r1=1519564&r2=1519565&view=diff
==============================================================================
--- stanbol/trunk/enhancement-engines/lucenefstlinking/README.md (original)
+++ stanbol/trunk/enhancement-engines/lucenefstlinking/README.md Tue Sep 3
05:55:17 2013
@@ -40,7 +40,11 @@ Used Solr indexes need also confirm to t
The SolrTextTagger README provides an example for a Field Analyzer
configuration that does work. To make things easier this engine includes this
[XML file](fst_field_types.xml) that includes a schema.xml fragment with FST
tagging compatible configurations for most languages supported by Solr.
-### Field Name Encoding
+### Solr Index Layout Configuration
+
+This part of the configuration specifies the layout of the used Solr index, that is, how Entity information is stored in the Solr index.
+
+#### Field Name Encoding
The Field Name Encoding configuration
`enhancer.engines.linking.solrfst.fieldEncoding` specifies how Solr fields for
multiple languages are encoded. As an example a Vocabulary with labels in
multiple languages might use "en_label" for the English language labels and
"de_label" for the German language labels. In this case users should set this
property to `UnderscorePrefix` and simply use "label" when configuring the FST
field name.
@@ -60,7 +64,7 @@ This is the full list of supported Field
* AtSuffix: {field}@{lang} (e.g. "name@en")
* None: In this case no prefix/suffix rewriting of configured `field` and
`store` values is done. This means that the FST Configuration MUST define the
exact field names in the Solr index for every configured language.
-### FST Tagging Configuration
+#### FST Tagging Configuration
The FST Tagging Configuration `enhancer.engines.linking.solrfst.fstconfig`
defines several things:
@@ -95,7 +99,12 @@ This would set the index field to "fise:
*;field=fise:fstTagging;stored=rdfs:label;generate=true
-__Runtime FST generation Thread Pool__
+#### Additional Entity Information
+
+* __Entity Type Field__ _(enhancer.engines.linking.solrfst.typeField)_: This field specifies the name of the Solr field holding the type information of Entities. If 'SolrYard' is used as _Field Name Encoding_, one can use the QNAME of the property (typically 'rdf:type'). Otherwise the value must be the exact field name holding the type information. Values are expected to be URIs.
+* __Entity Ranking Field__ _(enhancer.engines.linking.solrfst.rankingField)_:
This is an __ADDITIONAL__ property used to configure the name of the Field
storing the floating point value of the ranking for the Entity. Entities with
higher ranking will get a slightly better `fise:confidence` value if labels of
several Entities do match the text.
+
+### Runtime FST generation Thread Pool
The `enhancer.engines.linking.solrfst.fstThreadPoolSize` parameter can be used
to configure the size of the thread pool used for the runtime generation of FST
models. The default size of the thread pool is `1`. Threads do use the lowest
possible priority to reduce the performance impact on enhancements as much as
possible.
@@ -103,6 +112,7 @@ When configuring the size of the thread
_NOTE_ that the `generate` parameter of the FST Tagging Configuration needs to
be set to `true` to enable runtime generation.
+
### Entity Cache Configuration
While FST tagging is fully done in-memory the FST linking engine needs to read
information of matching Entities from the Solr index. This requires disc IO and
is typically the part of the process that consumes the most time. The Entity
Cache tries to prevent such disc level IO by caching SolrDocuments containing
only fields required for the linking process (labels, types and (if available)
entity rankings). To further reduce memory requirements only labels in
languages requested by processed ContentItems are stored in the cache. The
Cache uses the LRU semantic and is based on the Solr cache implementation.
@@ -120,11 +130,9 @@ For now this engine uses the exact same
The Entity Linking Configuration of this Engine is very similar to the one for
the [EntityLinking
engine](http://stanbol.apache.org/docs/trunk/components/enhancer/engines/entitylinking#entity-linker-configuration).
The configuration uses the exact same keys, but it does not support all
properties and some have a slightly different meaning. In the following only
the differences are described. For everything else please refer to the
linked section of the documentation of the EntityLinking engine.
-
-* <s>__Label Field__ _(enhancer.engines.linking.labelField)_</s>: The label
field is __IGNORED__ as the field holding the labels is anyway provided by the
FST Tagging configuration. That means that the field defined by the _stored_
parameter is used. If the _stored_ parameter is not present it fallbacks to the
_field_ parameter.
-* __Type Field__ _(enhancer.engines.linking.typeField)_: This must be the name
of the Solr field holding the Entity type information. In case 'SolrYard' is
used as _Field Name Encoding_ one can use the the QNAME of the property
(typically 'rdf:type')
+* <s>__Label Field__ _(enhancer.engines.linking.labelField)_</s>: The label field is __IGNORED__, as the field holding the labels is already provided by the [FST Tagging Configuration]. That means that the field defined by the _stored_ parameter is used. If the _stored_ parameter is not present, it falls back to the _field_ parameter.
+* <s>__Type Field__ _(enhancer.engines.linking.typeField)_</s>: This
configuration gets __IGNORED__ in favor of the
`enhancer.engines.linking.solrfst.typeField`. See the [Additional Entity
Information] section for details.
* <s>__Redirect Field__ _(enhancer.engines.linking.redirectField)_</s>: Not
implemented. __NOTE__ This might not be possible to implement efficiently, as
redirects would already need to be considered when building the FST models.
-* __Entity Ranking Field__ _(enhancer.engines.linking.solrfst.rankingField)_:
This is an __ADDITIONAL__ property used to configure the name of the Field
storing the floating point value of the ranking for the Entity. Entities with
higher ranking will get a slightly better `fise:confidence` value if labels of
several Entities do match the text.
* <s>__Use EntityRankings__ _(enhancer.engines.linking.useEntityRankings)_</s>:
This configuration gets __IGNORED__. EntityRanking based sorting is enabled as
soon as the _Entity Ranking Field_ is configured.
* <s>__Lemma based Matching__ _(enhancer.engines.linking.lemmaMatching)_</s>:
Not Yet implemented
* <s>__Min Match Score__ _(enhancer.engines.linking.minMatchScore)_</s>: Not
Yet Implemented. Currently all linked Entities are added regardless of their
score. However, the way the Tagging is done makes it very unlikely to have
suggestions with `fise:confidence` values less than 0.5.
Modified:
stanbol/trunk/enhancement-engines/lucenefstlinking/src/main/java/org/apache/stanbol/enhancer/engines/lucenefstlinking/FstLinkingEngineComponent.java
URL:
http://svn.apache.org/viewvc/stanbol/trunk/enhancement-engines/lucenefstlinking/src/main/java/org/apache/stanbol/enhancer/engines/lucenefstlinking/FstLinkingEngineComponent.java?rev=1519565&r1=1519564&r2=1519565&view=diff
==============================================================================
---
stanbol/trunk/enhancement-engines/lucenefstlinking/src/main/java/org/apache/stanbol/enhancer/engines/lucenefstlinking/FstLinkingEngineComponent.java
(original)
+++
stanbol/trunk/enhancement-engines/lucenefstlinking/src/main/java/org/apache/stanbol/enhancer/engines/lucenefstlinking/FstLinkingEngineComponent.java
Tue Sep 3 05:55:17 2013
@@ -143,26 +143,24 @@ import com.google.common.util.concurrent
name="AtSuffix")
},value="SolrYard"),
@Property(name=FstLinkingEngineComponent.FST_CONFIG,
cardinality=Integer.MAX_VALUE),
+ @Property(name=FstLinkingEngineComponent.SOLR_TYPE_FIELD,
value="rdf:type"),
+ @Property(name=FstLinkingEngineComponent.SOLR_RANKING_FIELD,
value="entityhub:entityRank"),
+// @Property(name=REDIRECT_FIELD,value="rdfs:seeAlso"),
+// @Property(name=REDIRECT_MODE,options={
+// @PropertyOption(
+// value='%'+REDIRECT_MODE+".option.ignore",
+// name="IGNORE"),
+// @PropertyOption(
+// value='%'+REDIRECT_MODE+".option.addValues",
+// name="ADD_VALUES"),
+// @PropertyOption(
+// value='%'+REDIRECT_MODE+".option.follow",
+// name="FOLLOW")
+// },value="IGNORE"),
@Property(name=FstLinkingEngineComponent.FST_THREAD_POOL_SIZE,
intValue=FstLinkingEngineComponent.DEFAULT_FST_THREAD_POOL_SIZE),
@Property(name=FstLinkingEngineComponent.ENTITY_CACHE_SIZE,
intValue=FstLinkingEngineComponent.DEFAULT_ENTITY_CACHE_SIZE),
- @Property(name=FstLinkingEngineComponent.SOLR_TYPE_FIELD,
value="rdf:type"),
- @Property(name=FstLinkingEngineComponent.SOLR_RANKING_FIELD,
value="entityhub:entityRank"),
-// @Property(name=REDIRECT_FIELD,value="rdfs:seeAlso"),
-// @Property(name=REDIRECT_MODE,options={
-// @PropertyOption(
-// value='%'+REDIRECT_MODE+".option.ignore",
-// name="IGNORE"),
-// @PropertyOption(
-// value='%'+REDIRECT_MODE+".option.addValues",
-// name="ADD_VALUES"),
-// @PropertyOption(
-// value='%'+REDIRECT_MODE+".option.follow",
-// name="FOLLOW")
-// },value="IGNORE"),
- @Property(name=TYPE_FIELD,value="rdf:type"),
- @Property(name=ENTITY_TYPES,cardinality=Integer.MAX_VALUE),
@Property(name=SUGGESTIONS, intValue=DEFAULT_SUGGESTIONS),
@Property(name=CASE_SENSITIVE,boolValue=DEFAULT_CASE_SENSITIVE_MATCHING_STATE),
@Property(name=PROCESS_ONLY_PROPER_NOUNS_STATE,
boolValue=DEFAULT_PROCESS_ONLY_PROPER_NOUNS_STATE),
@@ -172,6 +170,7 @@ import com.google.common.util.concurrent
"es;lc=Noun", //the OpenNLP POS tagger for Spanish does not
// support ProperNouns
"nl;lc=Noun"}), //same for Dutch
@Property(name=DEFAULT_MATCHING_LANGUAGE,value=""),
+ @Property(name=ENTITY_TYPES,cardinality=Integer.MAX_VALUE),
@Property(name=TYPE_MAPPINGS,cardinality=Integer.MAX_VALUE, value={
"dbp-ont:Organisation; dbp-ont:Newspaper; schema:Organization > dbp-ont:Organisation",
"dbp-ont:Person; foaf:Person; schema:Person > dbp-ont:Person",
@@ -709,8 +708,14 @@ public class FstLinkingEngineComponent {
log.info(" - default config");
Map<String,String> defaultParams = fstConfig.getDefaultParameters();
String fstName = defaultParams.get(PARAM_FST);
- final String indexField = defaultParams.get(PARAM_FIELD);
- final String storeField = defaultParams.get(PARAM_STORE_FIELD);
+ String indexField = defaultParams.get(PARAM_FIELD);
+ if(indexField == null){ //apply the defaults if null
+ indexField = DEFAULT_FIELD;
+ }
+ String storeField = defaultParams.get(PARAM_STORE_FIELD);
+ if(storeField == null){ //apply the defaults if null
+ storeField = indexField;
+ }
if(fstName == null){ //use default
fstName = getDefaultFstFileName(indexField);
}
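
The fallback order introduced by the last hunk can be illustrated in isolation. The following is only a minimal sketch, not the actual Stanbol code: the class `FstDefaultsSketch`, the parameter key `fst`, the value chosen for `DEFAULT_FIELD` and the file naming in `defaultFstFileName` are assumptions made for illustration. Only the order of the null checks (index field first, then store field falling back to the index field, then the FST name derived from the index field) is taken from the diff above.

```java
import java.util.HashMap;
import java.util.Map;

public class FstDefaultsSketch {

    // Parameter keys: "field" and "stored" appear in the README example,
    // "fst" is an assumption for this sketch.
    static final String PARAM_FIELD = "field";
    static final String PARAM_STORE_FIELD = "stored";
    static final String PARAM_FST = "fst";
    // Assumed default; the real DEFAULT_FIELD value is not shown in the diff.
    static final String DEFAULT_FIELD = "rdfs:label";

    /** Returns {indexField, storeField, fstName} with defaults applied. */
    static String[] resolve(Map<String, String> defaultParams) {
        String indexField = defaultParams.get(PARAM_FIELD);
        if (indexField == null) { // no default FST configuration present
            indexField = DEFAULT_FIELD;
        }
        String storeField = defaultParams.get(PARAM_STORE_FIELD);
        if (storeField == null) { // fall back to the index field
            storeField = indexField;
        }
        String fstName = defaultParams.get(PARAM_FST);
        if (fstName == null) { // derive the name from the (non-null) index field
            fstName = defaultFstFileName(indexField);
        }
        return new String[] {indexField, storeField, fstName};
    }

    // Hypothetical stand-in for getDefaultFstFileName(..): replaces every
    // character that is not a letter or digit and appends ".fst".
    static String defaultFstFileName(String indexField) {
        return indexField.replaceAll("[^A-Za-z0-9]", "_") + ".fst";
    }

    public static void main(String[] args) {
        // Empty map: simulates a missing default FST configuration.
        String[] r = resolve(new HashMap<String, String>());
        System.out.println(r[0] + " | " + r[1] + " | " + r[2]);
    }
}
```

Before this change, a missing `field` parameter would have passed `null` on to the file name lookup; with the checks in place every later step sees a non-null index field.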