Apparently I had just overlooked something, or was using stale data, or
something silly. Matching identifiers with punctuation like '-' works now.
I also managed to get the additional dereference fields to come back,
although I found that in 0.12 that setting was not exposed as a property
that could be managed in the Felix console, so I had to tweak the source
and rebuild. But once I did that, it works perfectly.
Thanks again for all your assistance.

Phil
This message optimized for indexing by NSA PRISM


On Fri, Nov 29, 2013 at 1:14 AM, Rupert Westenthaler
<rupert.westentha...@gmail.com> wrote:
> Hi Phillip
>
> On Wed, Nov 27, 2013 at 8:08 PM, Phillip Rhodes
> <motley.crue....@gmail.com> wrote:
>> Thanks for the information, Rupert. I'll look into the FST linking
>> for sure. What I don't quite understand, though, unless I just
>> overlooked something simple while I was experimenting with this, is
>> why I don't get matches on "CUS-12345" in my text when I have an
>> entity in the referenced Site that has an rdfs:label property with
>> the value "CUS-12345" (IOW, there is an exact match).
>>
>> I assumed it was something to do with tokenization, and that by the
>> time the tokens made it down to the linking engine, "CUS-12345" had
>> been separated into "CUS" and "12345" or something. But if I
>> understand you correctly, this actually *should* work, as long as
>> the punctuated label matches exactly in both the input text and the
>> entity label?
>>
>
> This is correct. If the Tokenizer splits the tokens, it will not
> match. This is why the KeywordLinkingEngine had the option to
> activate a KeywordTokenizer.
>
> Copied from the documentation of the KeywordLinkingEngine:
>
> Keyword Tokenizer
> (org.apache.stanbol.enhancer.engines.keywordextraction.keywordTokenizer):
> This allows using a special Tokenizer for matching keywords and
> alphanumeric IDs. Typical language-specific Tokenizers tend to split
> such IDs into several tokens and therefore might prevent a correct
> match. This Tokenizer should only be activated if the
> KeywordLinkingEngine is configured to match against IDs like ISBN
> numbers, Product IDs ... It should not be used to match against
> natural-language labels.
>
> For the EntityLinkingEngine this option is not available, as
> tokenization is not done by this engine.
>
> best
> Rupert
>
>>
>> Phil
>> This message optimized for indexing by NSA PRISM
>>
>>
>> On Wed, Nov 27, 2013 at 3:24 AM, Rupert Westenthaler
>> <rupert.westentha...@gmail.com> wrote:
>>> Hi Phillip
>>>
>>> The KeywordTokenizer just ensures that 'CUS-729913', when appearing
>>> in the text, is not split up into ['CUS', '-', '729913'].
>>> 'CUS729913' appearing in the text will never match 'CUS-729913', as
>>> this engine requires the first ~80% of the chars of a token to be
>>> identical before it considers the token a match.
>>>
>>> So while - with the KeywordTokenizer enabled - the
>>> KeywordLinkingEngine can be used to detect exact mentions of keys,
>>> it cannot be used for fuzzy matching of them.
>>>
>>> I would recommend you switch to the FST Linking Engine [1]. This
>>> engine matches against the labels as processed by the configured
>>> Solr Analyzer. So if the configured Analyzer uses a correctly
>>> configured Solr WordDelimiterFilter, the engine will suggest
>>> 'CUS-729913' for 'CUS729913' and the other way around.
>>>
>>> The default configuration of the SolrYard does use an Analyzer
>>> configuration that includes the WordDelimiterFilter, and AFAIK the
>>> default config should work for your use case. But anyway, here is
>>> an example of a Solr Analyzer configuration (note that the
>>> 'catenate*' options are enabled at index time):
>>>
>>> <analyzer type="index">
>>>   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>   <filter class="solr.WordDelimiterFilterFactory"
>>>           splitOnCaseChange="0" splitOnNumerics="0"
>>>           catenateWords="1" catenateNumbers="1"
>>>           catenateAll="1" preserveOriginal="1"/>
>>> </analyzer>
>>> <analyzer type="query">
>>>   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>   <filter class="solr.WordDelimiterFilterFactory"
>>>           splitOnCaseChange="0" splitOnNumerics="0"/>
>>> </analyzer>
>>>
>>> The "fise:confidence" of the suggestion will be calculated using
>>> the Levenshtein distance between the mention 'CUS729913' and the
>>> label 'CUS-729913', so confidence values of fuzzy-matched keys
>>> should be pretty high.
>>>
>>> If you want to keep using the KeywordLinkingEngine, the only option
>>> is to define all variants that could be mentioned in the text as
>>> labels - so, given your example, both "CUS-729913" and "CUS729913".
>>>
>>> best
>>> Rupert
>>>
>>> [1] http://stanbol.staging.apache.org/docs/trunk/components/enhancer/engines/lucenefstlinking
>>>
>>>
>>> On Wed, Nov 27, 2013 at 1:16 AM, Phillip Rhodes
>>> <motley.crue....@gmail.com> wrote:
>>>> Stanbol devs:
>>>>
>>>> I'm working on using the Stanbol EntityHub to allow use of "local"
>>>> knowledge and site-specific vocabularies, and have run into a
>>>> sticking point. I can configure a new SolrYard and ManagedSite,
>>>> set up a KeywordLinkingEngine, enable the KeywordTokenizer, and
>>>> get back results for alphanumeric strings like
>>>>
>>>> "CUS729913"
>>>>
>>>> but I haven't had any luck yet making it work with identifiers
>>>> with embedded punctuation, like:
>>>>
>>>> "CUS-729913"
>>>>
>>>> Can anyone tell me if this is possible with Stanbol, and - if so -
>>>> maybe give me a clue on what the missing incantation is? Or, if
>>>> Stanbol doesn't currently support that, maybe an idea of where in
>>>> the code to start looking, with an eye toward implementing
>>>> something like that?
>>>>
>>>> Thanks,
>>>>
>>>> Phil R.
>>>>
>>>> --
>>>> This message optimized for indexing by NSA PRISM
>>>
>>>
>>> --
>>> | Rupert Westenthaler             rupert.westentha...@gmail.com
>>> | Bodenlehenstraße 11             ++43-699-11108907
>>> | A-5500 Bischofshofen
>
>
> --
> | Rupert Westenthaler             rupert.westentha...@gmail.com
> | Bodenlehenstraße 11             ++43-699-11108907
> | A-5500 Bischofshofen
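
For readers following along, here is a minimal, self-contained Java
sketch of the two matching behaviours discussed in the thread: the
KeywordLinkingEngine's ~80%-common-prefix token matching, and a
Levenshtein-based confidence like the one the FST Linking Engine
reports. The 0.8 threshold and the 1 - distance/maxLength formula are
illustrative assumptions, not Stanbol's actual implementation.

    public class MatchSketch {

        // Prefix rule (assumption: ~80% of the token's chars must be
        // identical, per Rupert's description above).
        static boolean prefixMatch(String token, String label) {
            int required = (int) Math.ceil(token.length() * 0.8);
            int common = 0;
            int max = Math.min(token.length(), label.length());
            while (common < max && token.charAt(common) == label.charAt(common)) {
                common++;
            }
            return common >= required;
        }

        // Classic dynamic-programming Levenshtein distance, two rows.
        static int levenshtein(String a, String b) {
            int[] prev = new int[b.length() + 1];
            int[] curr = new int[b.length() + 1];
            for (int j = 0; j <= b.length(); j++) prev[j] = j;
            for (int i = 1; i <= a.length(); i++) {
                curr[0] = i;
                for (int j = 1; j <= b.length(); j++) {
                    int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                    curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                       prev[j - 1] + cost);
                }
                int[] tmp = prev; prev = curr; curr = tmp;
            }
            return prev[b.length()];
        }

        // Confidence as 1 - (distance / longer length): an assumed
        // formula, chosen so near-identical labels score near 1.0.
        static double confidence(String mention, String label) {
            int maxLen = Math.max(mention.length(), label.length());
            return 1.0 - (double) levenshtein(mention, label) / maxLen;
        }

        public static void main(String[] args) {
            // The common prefix of 'CUS729913' and 'CUS-729913' is only
            // 'CUS' (3 of 9 chars), well below the ~80% requirement, so
            // a prefix-based matcher rejects the pair...
            System.out.println(prefixMatch("CUS729913", "CUS-729913")); // false
            // ...while the edit distance is just 1 (one inserted '-'),
            // so the Levenshtein-based confidence stays high: 1 - 1/10 = 0.90
            System.out.printf("%.2f%n", confidence("CUS729913", "CUS-729913"));
        }
    }

Running it prints "false" and "0.90", which matches the point made in
the thread: an exact, prefix-based matcher cannot bridge 'CUS729913'
and 'CUS-729913', while an edit-distance-based score for the pair
remains high.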