Author: rwesten
Date: Tue Aug 27 07:08:54 2013
New Revision: 1517757
URL: http://svn.apache.org/r1517757
Log:
STANBOL-1128: The FST models are now updated after changes to the SolrIndex.
The long indexVersion as reported by the DirectoryReader of the SolrIndex is
now used as version indicator (for both FST models and the EntityCache); Threads
of the ThreadPool are now created with the lowest priority; Improved logging
and the README
Modified:
stanbol/trunk/enhancement-engines/lucenefstlinking/README.md
stanbol/trunk/enhancement-engines/lucenefstlinking/src/main/java/org/apache/stanbol/enhancer/engines/lucenefstlinking/CorpusInfo.java
stanbol/trunk/enhancement-engines/lucenefstlinking/src/main/java/org/apache/stanbol/enhancer/engines/lucenefstlinking/FstLinkingEngine.java
stanbol/trunk/enhancement-engines/lucenefstlinking/src/main/java/org/apache/stanbol/enhancer/engines/lucenefstlinking/FstLinkingEngineComponent.java
stanbol/trunk/enhancement-engines/lucenefstlinking/src/main/java/org/apache/stanbol/enhancer/engines/lucenefstlinking/TaggingSession.java
Modified: stanbol/trunk/enhancement-engines/lucenefstlinking/README.md
URL:
http://svn.apache.org/viewvc/stanbol/trunk/enhancement-engines/lucenefstlinking/README.md?rev=1517757&r1=1517756&r2=1517757&view=diff
==============================================================================
--- stanbol/trunk/enhancement-engines/lucenefstlinking/README.md (original)
+++ stanbol/trunk/enhancement-engines/lucenefstlinking/README.md Tue Aug 27 07:08:54 2013
@@ -78,9 +78,9 @@ This configuration is line based (multi
The following parameters are supported by the Engine:
* __field__: The indexed field in the configured Solr index. In multilingual scenarios this might be the 'base name' of the field that is extended by a prefix or suffix to get the actual field name in the Solr index (see also the field encoding configuration)
-* __stored__: The field in the Solr index with the stored label information. This parameter is optional. If not present `stored` is assumed to be equals to `field`.
-* __fst__: Optionally allows to manually specify the base file name of the FST models. Those files are assumed within the data directory of the configured Solr index under `fst/{fst}.{lang}.fst`. By default the configured `field` name is used (with non alpha-numeric chars replaced by '_').If runtime creation is enabled those files will be created if not present.
-* __generate__: If enabled the Engine will generate missing FST models. NOTE that the creation of FST models is an expensive operation. Because of this the default is `false`.
+* __stored__ (default: _field_ value) : The field in the Solr index with the stored label information. This parameter is optional. If not present `stored` is assumed to be equals to `field`.
+* __fst__ (default based on _field_ value): Optionally allows to manually specify the base file name of the FST models. Those files are assumed within the data directory of the configured Solr index under `fst/{fst}.{lang}.fst`. By default the configured `field` name is used (with non alpha-numeric chars replaced by '_').If runtime creation is enabled those files will be created if not present.
+* __generate__ (default: false): If enabled the Engine will generate missing FST models. If this is enabled the engine will also be able to update FST models after changes to the Solr Index. __NOTE__ that the creation of FST models is an expensive operation (both CPU and memory wise). The FST engine uses a pool of low priority threads to create FST models. The size of the pool can be configured by using the `enhancer.engines.linking.solrfst.fstThreadPoolSize` parameter. Because of this the default is `false`.
A more advanced Configuration might look like:
@@ -95,6 +95,20 @@ This would set the index field to "fise:
*;field=fise:fstTagging;stored=rdfs:label;generate=true
+__Runtime FST generation Thread Pool__
+
+The `enhancer.engines.linking.solrfst.fstThreadPoolSize` parameter can be used to configure the size of the thread pool used for the runtime generation of FST models. The default size of the thread pool is `1`. Threads do use the lowest possible priority to reduce the performance impact on enhancements as much as possible.
+
+When configuring the size of the thread pool users need to be aware that the generation of FST models does need a lot more memory as the resulting model. So having to manny parallel threads might require to increase the memory settings of the JVM. On typical machines FST creation threads will consume 100% CPU. That means that the number of threads should be configured to the number of CPU cores that can be spared for FST generation.
+
+_NOTE_ that the `generate` parameter of the FST Tagging Configuration needs to be set to `true` to enable runtime generation.
+
+### Entity Cache Configuration
+
+While FST tagging is fully done in-memory the FST linking engine needs to read information of matching Entities from the Solr index. This requires disc IO and is typically the part of the process that consumes the most time. The Entity Cache tries to prevent such disc level IO by caching SolrDocuments containing only fields required for the linking process (labels, types and (if available) entity rankings). To further reduce memory requirements only labels in languages requested by processed ContentItems are stored in the cache. The Cache uses the LRU semantic and is based on the Solr cache implementation.
+
+The size of the cache can be configured by using the `enhancer.engines.linking.solrfst.entityCacheSize` parameter. The default size is ~65k entities. Increasing the maximum size of the cache will improve performance. For small and medium sized vocabularies the cache can be configured in a way that all entities are cached in memory.
+
### Text Processing Configuration
During the development of this Engine the SolrTextTagger was extended by a feature that allows to only lookup some tokens in the text (see this [Pull Request](https://github.com/OpenSextant/SolrTextTagger/pull/7) for details). This feature is used to integrate the [Stanbol NLP Processing API](http://stanbol.apache.org/docs/trunk/components/enhancer/nlp/) with the SolrTextTagger. Meaning that NLP processing results (such as POS tags, Chunks and Named Entities) can be used to tell the SOlrTextTagger what tokens to lookup in the Vocabulary.
@@ -122,6 +136,17 @@ In addition the following properties are
* <s>__Min Matched Tokens__ _(enhancer.engines.linking.minFoundTokens)_</s>
* <s>__Min Text Score__ _(enhancer.engines.linking.minTextScore)_</s>
+
+## Further Information
+
+### Runtime generation of FST models
+
+The `generate`
+
+### FST model updates
+
+The FST Model
+
## TODOs:
__Making existing Entityhub SolrYard indexes Compatible with FST linking:__
@@ -147,6 +172,7 @@ __Feature related__
__Other__
+* Not tested with enabled SecurityManager
* Implementation of an own Entity Dereferencing Engine: This is required as the FST Linking Engine can not dereference Entity data (as the EntityLinking and the EntityTagging engine).
@@ -154,8 +180,7 @@ __Other__
As the first version of the FST Linking Engine is still in active development their are some know issues:
-* Currently FST models are not updated if the Solr index is changed. This means that this Engine currently only works for read-only indexes. If a Index is changed users will need to delete the FST file and restart the Engine to trigger the recreation of the FST model
-* the Japanese FieldType as specified in the [fst_field_types.xml](fst_field_types.xml) file does produce position increments != 1
+* The Japanese FieldType as specified in the [fst_field_types.xml](fst_field_types.xml) file does produce position increments != 1. This is caused by Kuromoji's [JapaneseTokenizer](http://lucene.apache.org/core/3_6_0/api/contrib-kuromoji/org/apache/lucene/analysis/ja/JapaneseTokenizer.html) outputting several tokens for the same position (posInc=0). The implementation of [Issue10](https://github.com/OpenSextant/SolrTextTagger/issues/10) will solve this by adding support for such TokenStream configurations.
* the RefCounted EntityCache is not destroyed prior to finalise(). This means that at some point the reference count is not correctly dereferenced.
Modified:
stanbol/trunk/enhancement-engines/lucenefstlinking/src/main/java/org/apache/stanbol/enhancer/engines/lucenefstlinking/CorpusInfo.java
URL:
http://svn.apache.org/viewvc/stanbol/trunk/enhancement-engines/lucenefstlinking/src/main/java/org/apache/stanbol/enhancer/engines/lucenefstlinking/CorpusInfo.java?rev=1517757&r1=1517756&r2=1517757&view=diff
==============================================================================
--- stanbol/trunk/enhancement-engines/lucenefstlinking/src/main/java/org/apache/stanbol/enhancer/engines/lucenefstlinking/CorpusInfo.java (original)
+++ stanbol/trunk/enhancement-engines/lucenefstlinking/src/main/java/org/apache/stanbol/enhancer/engines/lucenefstlinking/CorpusInfo.java Tue Aug 27 07:08:54 2013
@@ -160,7 +160,7 @@ public class CorpusInfo {
}
//check if the set version is the most current one
if(enqueued == this.enqueued){ //if so
- enqueued = -1; //mark this one as up-to-date
+ this.enqueued = -1; //mark this one as up-to-date
}
}
Modified:
stanbol/trunk/enhancement-engines/lucenefstlinking/src/main/java/org/apache/stanbol/enhancer/engines/lucenefstlinking/FstLinkingEngine.java
URL:
http://svn.apache.org/viewvc/stanbol/trunk/enhancement-engines/lucenefstlinking/src/main/java/org/apache/stanbol/enhancer/engines/lucenefstlinking/FstLinkingEngine.java?rev=1517757&r1=1517756&r2=1517757&view=diff
==============================================================================
--- stanbol/trunk/enhancement-engines/lucenefstlinking/src/main/java/org/apache/stanbol/enhancer/engines/lucenefstlinking/FstLinkingEngine.java (original)
+++ stanbol/trunk/enhancement-engines/lucenefstlinking/src/main/java/org/apache/stanbol/enhancer/engines/lucenefstlinking/FstLinkingEngine.java Tue Aug 27 07:08:54 2013
@@ -196,20 +196,8 @@ public class FstLinkingEngine implements
log.info(" - sum fst: {} ms", taggingEnd - taggingStart);
}
}
- log.debug("Process Matches for {} extragted Tags:",tags.size());
int matches = match(at,tags.values());
- if(log.isTraceEnabled()){
- String text = at.getSpan();
- for(Tag tag : tags.values()){
- log.trace(" {}: '{}'", tag, text.subSequence(tag.getStart(), tag.getEnd()));
- int i=1;
- for(Match match : tag.getSuggestions()){
- log.trace(" {}. {} - {} ({})", new Object[]{
- i++, match.getScore(), match.getMatchLabel(), match.getUri()});
- }
- }
- }
- log.info(" - loaded {} ({} loaded, {} cached, {} appended) Matches in {} ms",
+ log.debug(" - loaded {} ({} loaded, {} cached, {} appended) Matches in {} ms",
new Object[]{matches, session.getSessionDocLoaded(),
session.getSessionDocCached(),
session.getSessionDocAppended(),
System.currentTimeMillis()-taggingEnd});
@@ -238,12 +226,14 @@ public class FstLinkingEngine implements
}
private int match(AnalysedText at, Collection<Tag> tags) {
+ log.trace(" ... process matches for {} extracted Tags:",tags.size());
int matchCount = 0;
String text = at.getSpan();
Iterator<Tag> tagIt = tags.iterator();
while(tagIt.hasNext()){
Tag tag = tagIt.next();
String anchor = text.substring(tag.getStart(), tag.getEnd());
+ log.trace(" {}: '{}'", tag, anchor);
tag.setAnchor(anchor);
if(!elConfig.isCaseSensitiveMatching()){
anchor = anchor.toLowerCase(Locale.ROOT);
@@ -251,7 +241,12 @@ public class FstLinkingEngine implements
int alength = anchor.length();
List<Match> suggestions = new ArrayList<Match>(tag.getMatches().size());
+ int i=1; //only for trace level debugging
for(Match match : tag.getMatches()){
+ if(log.isTraceEnabled()){
+ log.trace(" {}. {} - {} ({})", new Object[]{
+ i++, match.getScore(), match.getMatchLabel(), match.getUri()});
+ }
matchCount++;
if(!filterEntityByType(match.getTypes().iterator())){
int distance = Integer.MAX_VALUE;
@@ -275,8 +270,12 @@ public class FstLinkingEngine implements
double length = Math.max(alength, matchLabel.getLexicalForm().length());
match.setMatch(1d - ((double)distance/length),matchLabel);
}
+ log.trace(" ... add suggestion: label: '{}'; conf: {}",
+ matchLabel, match.getScore());
suggestions.add(match);
- } //else the type of the current Entity is blacklisted
+ } else { //the type of the current Entity is blacklisted
+ log.trace(" ... filtered because of entity types");
+ }
}
if(suggestions.isEmpty()){
tagIt.remove(); // remove this tag as no match is left
@@ -435,7 +434,7 @@ public class FstLinkingEngine implements
*/
private void adaptScoreForEntityRankings(List<Match> equalScoreList, double nextScore) {
double score = equalScoreList.get(0).getScore();
- log.debug(" > Adapt Score of multiple Suggestions "
+ log.trace(" > Adapt Score of multiple Suggestions "
+ "with '{}' based on EntityRanking",score);
//Adapt the score to reflect the entity ranking
//but do not change order with entities of different
@@ -443,17 +442,17 @@ public class FstLinkingEngine implements
//TODO: make the max change (0.1) configurable
double dif = (Math.min(0.1, score-nextScore))/equalScoreList.size();
Collections.sort(equalScoreList,Match.ENTITY_RANK_COMPARATOR);
- log.debug(" - keep socre of {} at {}", equalScoreList.get(0).getUri(), score);
+ log.trace(" - keep socre of {} at {}", equalScoreList.get(0).getUri(), score);
for(int i=1;i<equalScoreList.size();i++){
score = score-dif;
if(Match.ENTITY_RANK_COMPARATOR.compare(equalScoreList.get(i-1),
equalScoreList.get(i)) != 0){
equalScoreList.get(i).updateScore(score);
- log.debug(" - set score of {} to {}", equalScoreList.get(i).getUri(), score);
+ log.trace(" - set score of {} to {}", equalScoreList.get(i).getUri(), score);
} else {
double lastScore = equalScoreList.get(i-1).getScore();
equalScoreList.get(i).updateScore(lastScore);
- log.debug(" - set score of {} to {}", equalScoreList.get(i).getUri(), lastScore);
+ log.trace(" - set score of {} to {}", equalScoreList.get(i).getUri(), lastScore);
}
}
}
Modified:
stanbol/trunk/enhancement-engines/lucenefstlinking/src/main/java/org/apache/stanbol/enhancer/engines/lucenefstlinking/FstLinkingEngineComponent.java
URL:
http://svn.apache.org/viewvc/stanbol/trunk/enhancement-engines/lucenefstlinking/src/main/java/org/apache/stanbol/enhancer/engines/lucenefstlinking/FstLinkingEngineComponent.java?rev=1517757&r1=1517756&r2=1517757&view=diff
==============================================================================
--- stanbol/trunk/enhancement-engines/lucenefstlinking/src/main/java/org/apache/stanbol/enhancer/engines/lucenefstlinking/FstLinkingEngineComponent.java (original)
+++ stanbol/trunk/enhancement-engines/lucenefstlinking/src/main/java/org/apache/stanbol/enhancer/engines/lucenefstlinking/FstLinkingEngineComponent.java Tue Aug 27 07:08:54 2013
@@ -97,6 +97,8 @@ import org.osgi.service.component.Compon
import org.osgi.util.tracker.ServiceTracker;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
+
+import com.google.common.util.concurrent.ThreadFactoryBuilder;
/**
* This is the OSGI component for the {@link FstLinkingEngine}. It is used to
* manage the service configuration, tracks dependencies and handles the
@@ -458,11 +460,12 @@ public class FstLinkingEngineComponent {
if(tpSize <= 0){ //if configured value <= 0 we use the default
tpSize = DEFAULT_FST_THREAD_POOL_SIZE;
}
- //now initialise the ThreadPool (and shutdown the existing one if present)
- //we use the Lucene utils ThreadFactory to have nice names for created threads
- ThreadFactory tf = new NamedThreadFactory(engineName+"-FST-RuntimeCreation");
- //TODO: maybe use the more advanced
- // com.google.common.util.concurrent.ThreadFactoryBuilder
+ //build a ThreadFactoryBuilder for low priority daemon threads that
+ //do use a meaningful name
+ ThreadFactoryBuilder tfBuilder = new ThreadFactoryBuilder();
+ tfBuilder.setDaemon(true);//should be stopped if the VM closes
+ tfBuilder.setPriority(Thread.MIN_PRIORITY); //low priority
+ tfBuilder.setNameFormat(engineName+"-FstRuntimeCreation-thread-%d");
if(fstCreatorService != null && !fstCreatorService.isTerminated()){
//NOTE: We can not call terminateNow, because to interrupt threads
// here would also close FileChannels used by the SolrCore
@@ -475,7 +478,7 @@ public class FstLinkingEngineComponent {
log.warn("some items in a previouse FST Runtime Creation Threadpool have "
+ "still not finished!");
}
- fstCreatorService = Executors.newFixedThreadPool(tpSize,tf);
+ fstCreatorService = Executors.newFixedThreadPool(tpSize,tfBuilder.build());
//(6) Parse the EntityCache config
int ecSize;
Modified:
stanbol/trunk/enhancement-engines/lucenefstlinking/src/main/java/org/apache/stanbol/enhancer/engines/lucenefstlinking/TaggingSession.java
URL:
http://svn.apache.org/viewvc/stanbol/trunk/enhancement-engines/lucenefstlinking/src/main/java/org/apache/stanbol/enhancer/engines/lucenefstlinking/TaggingSession.java?rev=1517757&r1=1517756&r2=1517757&view=diff
==============================================================================
--- stanbol/trunk/enhancement-engines/lucenefstlinking/src/main/java/org/apache/stanbol/enhancer/engines/lucenefstlinking/TaggingSession.java (original)
+++ stanbol/trunk/enhancement-engines/lucenefstlinking/src/main/java/org/apache/stanbol/enhancer/engines/lucenefstlinking/TaggingSession.java Tue Aug 27 07:08:54 2013
@@ -37,6 +37,7 @@ import org.apache.lucene.document.Docume
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;
+import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexableField;
import org.apache.lucene.queries.function.valuesource.IfFunction;
@@ -54,6 +55,8 @@ import org.opensextant.solrtexttagger.Ta
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
+import com.google.common.eventbus.AllowConcurrentEvents;
+
/**
* Profile created based on the {@link IndexConfiguration} for processing a
* parsed ContentItem. <p>
@@ -108,11 +111,24 @@ public class TaggingSession implements C
//private final ValueSourceAccessor uniqueKeyCache;
//private final Map<Integer,Match> matchPool = new HashMap<Integer,Match>(2048);
private final FieldLoaderImpl fieldLoader;
+ /**
+ * The current version of the SolIndex (as reported by
+ * {@link DirectoryReader#getVersion()}) of the
+ * {@link IndexConfiguration#getIndex()}
+ */
+ private final Long indexVersion;
TaggingSession(String language, IndexConfiguration config) throws CorpusException {
this.language = language;
this.config = config;
+ //init the SolrIndexSearcher
+ searcherRef = config.getIndex().getSearcher();
+ SolrIndexSearcher searcher = searcherRef.get();
+ DirectoryReader indexReader = searcher.getIndexReader();
+ indexVersion = Long.valueOf(indexReader.getVersion());
+
+ //get the corpusInfo
CorpusInfo langCorpusInfo = config.getCorpus(language);
CorpusInfo defaultCorpusInfo = config.getDefaultCorpus();
@@ -130,7 +146,7 @@ public class TaggingSession implements C
}
if(langCorpusInfo != null){
this.langCorpus = new Corpus(langCorpusInfo,
- obtainFstCorpus(langCorpusInfo));
+ obtainFstCorpus(indexVersion,langCorpusInfo));
this.labelField = langCorpusInfo.storedField;
solrDocfields.add(labelField);
this.labelLang = langCorpusInfo.language == null ||
@@ -142,7 +158,7 @@ public class TaggingSession implements C
}
if(defaultCorpusInfo != null && !defaultCorpusInfo.equals(langCorpusInfo)){
this.defaultCorpus = new Corpus(defaultCorpusInfo,
- obtainFstCorpus(defaultCorpusInfo));
+ obtainFstCorpus(indexVersion,defaultCorpusInfo));
this.defaultLabelField = defaultCorpusInfo.storedField;
solrDocfields.add(defaultLabelField);
this.defaultLabelLang = defaultCorpusInfo.language == null ||
@@ -179,9 +195,7 @@ public class TaggingSession implements C
} else {
this.rankingField = null;
}
- searcherRef = config.getIndex().getSearcher();
- SolrIndexSearcher searcher = searcherRef.get();
- documentCacheRef = config.getEntityCacheManager().getCache(searcher);
+ documentCacheRef = config.getEntityCacheManager().getCache(indexVersion);
// uniqueKeyCache = null; //no longer used.
// uniqueKeyCache = new ValueSourceAccessor(searcher, idSchemaField.getType()
// .getValueSource(idSchemaField, null));
@@ -301,12 +315,12 @@ public class TaggingSession implements C
/**
* Obtains the FST corpus for the parsed CorpusInfo. The other parameters
* are just used for error messages in case this is not successful.
+ * @param indexVersion the current version of the index
* @param fstInfo the info about the corpus
- * @param ci the contentIteem (just used for logging and error messages)
- * @return
- * @throws CorpusException
+ * @return the TaggerFstCorpus
+ * @throws CorpusException if the requested corpus is currently not available
*/
- private TaggerFstCorpus obtainFstCorpus(CorpusInfo fstInfo) throws CorpusException {
+ private TaggerFstCorpus obtainFstCorpus(Long indexVersion, CorpusInfo fstInfo) throws CorpusException {
TaggerFstCorpus fstCorpus;
synchronized (fstInfo) { // one at a time
fstCorpus = fstInfo.getCorpus();
@@ -333,12 +347,42 @@ public class TaggingSession implements C
throw new CorpusException(fstInfo.getErrorMessage(), null);
}
}
+ } else { //fstCorpus != null
+ if(indexVersion != null && indexVersion.longValue() != fstCorpus.getIndexVersion()){
+ log.info("FST corpus for language '{}' is outdated ...", fstInfo.language);
+ if(fstInfo.isEnqueued()){
+ log.info(" ... already sheduled for recreation. "
+ + "Use outaded corpus for tagging");
+ } else if(fstInfo.allowCreation && config.getExecutorService() != null){
+ log.info(" ... initialise recreation");
+ config.getExecutorService().execute(
+ new CorpusCreationTask(config, fstInfo));
+ } else {
+ log.warn("Unable to update outdated FST corpus for language '{}' "
+ + "because runtimeCreation is {} and ExecutorServic "
+ + "is {} available!", new Object[]{fstInfo.language,
+ fstInfo.allowCreation ? "enabled" : "disabled" ,
+ config.getExecutorService() == null ? "not" : ""});
+ log.warn(" ... please adapt the Engine configuration for up "
+ + "to date FST corpora!");
+ }
+ } else { //FST corpus is up to date with the current Solr index version
+ log.debug("FST corpus for language '{}' is up to date", fstInfo.language);
+ }
}
-
}
return fstCorpus;
}
/**
+ * The current version of the SolrIndex as reported by the {@link IndexReader}
+ * used by this TaggingSession.
+ * @return the current version of the SolrIndex.
+ */
+ public Long getIndexVersion() {
+ return indexVersion;
+ }
+
+ /**
* {@link FieldLoader} implementation used to create {@link Match} instances
*/
private class FieldLoaderImpl implements FieldLoader {