Hi Phillip,

The KeywordTokenizer only ensures that 'CUS-729913', when it appears in the text, is not split up into ['CUS', '-', '729913']. 'CUS729913' appearing in the text will never match 'CUS-729913', because the KeywordLinkingEngine requires the first ~80% of the chars of a token to be identical before it considers the token a match.
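
To illustrate why the hyphen breaks the match, here is a rough Python sketch of a prefix-based token comparison. It is not the KeywordLinkingEngine's actual code: the function names, the use of the longer token as the reference length, and treating 0.8 as a hard threshold are all assumptions made for illustration only.

```python
def common_prefix_len(a: str, b: str) -> int:
    """Count how many leading characters two tokens share."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n


def tokens_match(mention: str, label: str, min_prefix_ratio: float = 0.8) -> bool:
    """Hypothetical check: the shared prefix must cover at least
    min_prefix_ratio of the longer token (an assumed simplification)."""
    longest = max(len(mention), len(label))
    if longest == 0:
        return False
    return common_prefix_len(mention, label) / longest >= min_prefix_ratio


if __name__ == "__main__":
    print(tokens_match("CUS-729913", "CUS-729913"))  # exact mention
    print(tokens_match("CUS729913", "CUS-729913"))   # hyphen missing in the text
```

In the second call only the first three characters ('CUS') are shared, which is far below the threshold, so under this sketch the two spellings are treated as different tokens.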
So while the KeywordLinkingEngine (with KeywordTokenizer enabled) can be used to detect exact mentions of keys, it cannot be used for fuzzy matching of those.

I would recommend switching to the FST Linking Engine [1]. This engine matches against the labels as processed by the configured Solr Analyzer. So if the configured Analyzer uses a correctly configured Solr WordDelimiterFilter, the engine will suggest 'CUS-729913' for 'CUS729913' and the other way around.

The default configuration of the SolrYard uses an Analyzer configuration that includes the WordDelimiterFilter, and AFAIK the default config should work for your use case. In any case, here is an example of a Solr Analyzer configuration (note that the 'catenate*' options are enabled at index time).

    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.WordDelimiterFilterFactory"
          splitOnCaseChange="0" splitOnNumerics="0"
          catenateWords="1" catenateNumbers="1" catenateAll="1"
          preserveOriginal="1"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.WordDelimiterFilterFactory"
          splitOnCaseChange="0" splitOnNumerics="0"/>
    </analyzer>

The "fise:confidence" of the suggestion will be calculated using the 'Levenshtein distance' between the mention 'CUS729913' and the label 'CUS-729913'. So confidence values of fuzzy matched keys should be pretty high.

If you want to keep using the KeywordLinkingEngine, the only option is to define all variants that could be mentioned in the text as labels. So, given your example, both 'CUS-729913' and 'CUS729913'.

best
Rupert

[1] http://stanbol.staging.apache.org/docs/trunk/components/enhancer/engines/lucenefstlinking

On Wed, Nov 27, 2013 at 1:16 AM, Phillip Rhodes <motley.crue....@gmail.com> wrote:
> Stanbol devs:
>
> I'm working on using the Stanbol EntityHub to allow use of "local"
> knowledge and site specific vocabularies, and have run into a sticking
> point.
> I can configure a new SolrYard and ManagedSite, and set up a
> KeywordLinkingEngine, enable KeywordTokenizer and get back results for
> alphanumeric strings like
>
> "CUS729913"
>
> but I haven't had any luck yet making it work with identifiers with
> embedded punctuation, like:
>
> "CUS-729913"
>
> Can anyone tell me if this is possible with Stanbol, and - if so -
> maybe give me a clue on what the missing incantation is? Or if
> Stanbol doesn't currently support that, maybe an idea of where in the
> code to start looking, with an eye towards implementing something like
> that?
>
>
> Thanks,
>
>
>
> Phil R.
>
> --
> This message optimized for indexing by NSA PRISM

--
| Rupert Westenthaler             rupert.westentha...@gmail.com
| Bodenlehenstraße 11             ++43-699-11108907
| A-5500 Bischofshofen