Re: Matching an identifier with punctuation?

Phillip Rhodes Wed, 27 Nov 2013 11:08:49 -0800

Thanks for the information, Rupert.  I'll look into the FST linking for
sure.  What I don't quite understand, though (unless I overlooked
something simple while experimenting with this), is why I don't get
matches on "CUS-12345" in my text when I have an entity in the
referenced Site that has an rdfs:label property with the value
"CUS-12345" (IOW, there is an exact match).


I assumed it was something to do with tokenization and that by the
time the tokens made it down to the linking engine, CUS-12345 had been
separated into "CUS" and "12345" or something.   But if I understand
you correctly, this actually *should* work, as long as the punctuated
label matches exactly in both the input text and the entity label?



Phil
This message optimized for indexing by NSA PRISM


On Wed, Nov 27, 2013 at 3:24 AM, Rupert Westenthaler
<rupert.westentha...@gmail.com> wrote:
> Hi Phillip
>
> The KeywordTokenizer just ensures that 'CUS-729913', when it appears in
> the text, is not split into ['CUS', '-', '729913']. 'CUS729913'
> appearing in the text will never match 'CUS-729913', because this engine
> requires the first ~80% of the chars of a token to be identical before
> it considers the token a match.
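
To make the prefix rule above concrete, here is a minimal Python sketch. It is
not Stanbol's code: the thread only says "the first ~80% of the chars", so the
function names, the 0.8 threshold, and the choice to measure against the longer
token are all assumptions made for illustration.

```python
def common_prefix_len(a, b):
    """Count how many leading characters two strings share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def tokens_match(text_token, label_token, min_ratio=0.8):
    """Hypothetical prefix rule: the shared prefix must cover at least
    min_ratio of the longer token (assumed interpretation of '~80%')."""
    longest = max(len(text_token), len(label_token))
    if longest == 0:
        return False
    return common_prefix_len(text_token, label_token) / longest >= min_ratio


# Exact mention: the whole token is a shared prefix.
print(tokens_match("CUS-729913", "CUS-729913"))
# Missing hyphen: the tokens diverge after 'CUS', far below the threshold.
print(tokens_match("CUS729913", "CUS-729913"))
```

Under this reading, a dropped hyphen near the start of a key breaks the match
immediately, which is consistent with the behaviour described above.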
>
> So while the KeywordLinkingEngine, with KeywordTokenizer enabled,
> can be used to detect exact mentions of keys, it cannot be used for
> fuzzy matching of those.
>
> I would recommend switching to the FST Linking Engine [1]. This
> engine matches against the labels as processed by the configured Solr
> Analyzer. So if the configured Analyzer uses a correctly configured
> Solr WordDelimiterFilter, the engine will suggest 'CUS-729913' for
> 'CUS729913' and the other way around.
>
> The default configuration of the SolrYard uses an Analyzer
> configuration that includes the WordDelimiterFilter, and AFAIK the
> default config should work for your use case. But anyway, here is an
> example of a Solr Analyzer configuration (note that 'catenate*' is
> enabled at index time).
>
>       <analyzer type="index">
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="0"
>             splitOnNumerics="0" catenateWords="1" catenateNumbers="1"
>             catenateAll="1" preserveOriginal="1"/>
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="0"
>             splitOnNumerics="0"/>
>       </analyzer>
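
As a rough illustration of why catenation at index time lets 'CUS729913' find
'CUS-729913', the following Python sketch imitates the effect of the
configuration above. It is a simplified stand-in, not the Lucene/Solr
WordDelimiterFilter implementation; the function names and the
"split on non-alphanumeric characters" rule are assumptions.

```python
import re

# Assumed simplification: any run of non-alphanumeric chars is a delimiter.
_DELIMS = re.compile(r"[^A-Za-z0-9]+")


def index_tokens(token):
    """Imitate the index-time analyzer: keep the original token
    (preserveOriginal), its parts, and the parts joined (catenateAll)."""
    parts = [p for p in _DELIMS.split(token) if p]
    tokens = {token}
    tokens.update(parts)
    if len(parts) > 1:
        tokens.add("".join(parts))
    return tokens


def query_tokens(token):
    """Imitate the query-time analyzer: split on delimiters only,
    no catenation and no splitting on case or numerics."""
    return {p for p in _DELIMS.split(token) if p}


indexed = index_tokens("CUS-729913")
# The catenated form 'CUS729913' is now part of the indexed label.
print(sorted(indexed))
# A mention without the hyphen shares a token with the indexed label.
print(bool(query_tokens("CUS729913") & indexed))
```

The point of the sketch is only that the label is indexed under several
forms, so a mention without the hyphen can still reach it.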
>
> The "fise:confidence" of the suggestion will be calculated using the
> 'Levenshtein distance' between the mention 'CUS729913' and the label
> 'CUS-729913'. So confidence values of fuzzy matched keys should be
> pretty high.
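
For reference, a small Python sketch of the Levenshtein distance between the
mention and the label. The distance function is the standard dynamic
programming algorithm; the `similarity` normalisation is only an assumption
for illustration, since the thread does not say how Stanbol turns the distance
into a "fise:confidence" value.

```python
def levenshtein(a, b):
    """Standard edit distance (insert, delete, substitute), row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Hypothetical normalisation: 1 minus distance over the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


# One inserted '-' separates the mention from the label, so the distance is 1.
print(levenshtein("CUS729913", "CUS-729913"))
print(similarity("CUS729913", "CUS-729913"))
```

A single-character difference on a ten-character key leaves the two strings
very close, which fits the expectation of high confidence values.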
>
> If you want to keep using the KeywordLinkingEngine, the only option is
> to define all variants that could be mentioned in the text as labels.
> So, for your example, both 'CUS-729913' and 'CUS729913'.
>
> best
> Rupert
>
>
>
>
>
> [1] http://stanbol.staging.apache.org/docs/trunk/components/enhancer/engines/lucenefstlinking
>
>
>
> On Wed, Nov 27, 2013 at 1:16 AM, Phillip Rhodes
> <motley.crue....@gmail.com> wrote:
>> Stanbol devs:
>>
>> I'm working on using the Stanbol EntityHub to allow use of "local"
>> knowledge and site-specific vocabularies, and have run into a sticking
>> point.  I can configure a new SolrYard and ManagedSite, set up a
>> KeywordLinkingEngine, enable KeywordTokenizer and get back results for
>> alphanumeric strings like
>>
>> "CUS729913"
>>
>> but I haven't had any luck yet making it work with identifiers with
>> embedded punctuation, like:
>>
>> "CUS-729913"
>>
>> Can anyone tell me if this is possible with Stanbol, and - if so -
>> maybe give me a clue on what the missing incantation is?  Or if
>> Stanbol doesn't currently support that, maybe an idea of where in the
>> code to start looking, with an eye towards implementing something like
>> that?
>>
>>
>> Thanks,
>>
>>
>>
>> Phil R.
>>
>> --
>> This message optimized for indexing by NSA PRISM
>
>
>
> --
> | Rupert Westenthaler             rupert.westentha...@gmail.com
> | Bodenlehenstraße 11                             ++43-699-11108907
> | A-5500 Bischofshofen
