Re: Matching an identifier with punctuation?

Rupert Westenthaler Wed, 27 Nov 2013 00:25:43 -0800

Hi Phillip

The KeywordTokenizer only ensures that 'CUS-729913', when it appears in
the text, is not split up into ['CUS', '-', '729913']. 'CUS729913'
appearing in the text will never match 'CUS-729913', because this engine
requires the first ~80% of the chars of a token to be identical before
it considers the token a match.

So while the KeywordLinkingEngine (with KeywordTokenizer enabled) can
be used to detect exact mentions of keys, it cannot be used for fuzzy
matching of those.
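
To illustrate why the two spellings do not match, here is a minimal
Python sketch of a prefix rule like the one described above. This is
not the KeywordLinkingEngine code; the function names, the exact 0.8
threshold and the choice to measure against the longer string are
assumptions made only for illustration.

```python
def common_prefix_len(a: str, b: str) -> int:
    """Count the leading characters that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_match(token: str, label: str, min_ratio: float = 0.8) -> bool:
    """Hypothetical prefix rule (not Stanbol's actual implementation):
    the shared prefix must cover at least min_ratio of the longer string."""
    longest = max(len(token), len(label))
    if longest == 0:
        return False
    return common_prefix_len(token, label) / longest >= min_ratio


# Exact mention of the key: the whole token is a shared prefix.
print(prefix_match("CUS-729913", "CUS-729913"))  # True
# Mention without the hyphen: only "CUS" (3 of 10 chars) is shared.
print(prefix_match("CUS729913", "CUS-729913"))   # False
```

The strings diverge at the fourth character, so no prefix-based rule
with a threshold anywhere near 80% can accept this pair.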

I would recommend switching to the FST Linking Engine [1]. This
engine matches against the labels as processed by the configured Solr
Analyzer. So if that Analyzer uses a correctly configured Solr
WordDelimiterFilter, the engine will suggest 'CUS-729913' for
'CUS729913' and the other way around.

The default configuration of the SolrYard uses an Analyzer
configuration that includes the WordDelimiterFilter, and AFAIK the
default config should work for your use case. Anyway, here is an
example of a Solr Analyzer configuration (note that the 'catenate*'
options are enabled at index time only).

      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="0"
            splitOnNumerics="0" catenateWords="1" catenateNumbers="1"
            catenateAll="1" preserveOriginal="1"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="0"
            splitOnNumerics="0"/>
      </analyzer>
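
To show what this configuration is meant to achieve, here is a rough
Python approximation of the two analyzer chains for the label
'CUS-729913'. It is not Solr's WordDelimiterFilter: the function
word_delimiter and its parameters are hypothetical, and it only models
splitting on non-alphanumeric characters (i.e. splitOnCaseChange="0"
and splitOnNumerics="0") plus catenateAll and preserveOriginal.

```python
import re


def word_delimiter(token, catenate_all=False, preserve_original=False):
    """Simplified stand-in for a word-delimiter filter (an assumption,
    not Solr's code): split on non-alphanumeric characters only."""
    parts = [p for p in re.split(r"[^0-9A-Za-z]+", token) if p]
    out = []
    if preserve_original:
        out.append(token)               # keep e.g. 'CUS-729913' as-is
    out.extend(parts)                   # the individual sub-parts
    if catenate_all and len(parts) > 1:
        out.append("".join(parts))      # all sub-parts glued together
    return list(dict.fromkeys(out))     # drop duplicates, keep order


# Index side: the label is expanded into several terms.
index_terms = word_delimiter("CUS-729913",
                             catenate_all=True, preserve_original=True)
print(index_terms)  # ['CUS-729913', 'CUS', '729913', 'CUS729913']

# Query side: no catenation, so each mention yields fewer terms.
for mention in ("CUS729913", "CUS-729913"):
    query_terms = word_delimiter(mention)
    print(mention, query_terms, all(t in index_terms for t in query_terms))
```

In this simplified model both spellings of the mention produce only
terms that were indexed for the label 'CUS-729913', which is the effect
the catenate options at index time are there for.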

The "fise:confidence" of the suggestion will be calculated using the
'Levenshtein distance' between the mention 'CUS729913' and the label
'CUS-729913'. So confidence values of fuzzy matched keys should be
pretty high.
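
For reference, the Levenshtein distance between the two spellings is 1
(a single inserted '-'). The sketch below computes it in Python. The
similarity function is only an assumed normalization to show why the
score ends up high; it is not the formula Stanbol uses for
"fise:confidence".

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance
    (insertions, deletions and substitutions each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete ca
                           cur[j - 1] + 1,       # insert cb
                           prev[j - 1] + cost))  # substitute / keep
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical normalization: 1 - distance / length of longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


print(levenshtein("CUS729913", "CUS-729913"))  # 1
print(similarity("CUS729913", "CUS-729913"))   # 0.9
```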

If you want to keep using the KeywordLinkingEngine, the only option is
to define all variants that could be mentioned in the text as labels.
So, given your example, both 'CUS-729913' and 'CUS729913'.
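
If you go that way, the variants could be generated before importing
the data. A minimal Python sketch, assuming the only variation you need
to cover is the key with and without its punctuation (label_variants is
a hypothetical helper, not part of Stanbol):

```python
import re


def label_variants(key: str) -> list:
    """Return the key itself plus, if different, the key with all
    non-alphanumeric characters removed."""
    stripped = re.sub(r"[^0-9A-Za-z]+", "", key)
    variants = [key]
    if stripped and stripped != key:
        variants.append(stripped)
    return variants


print(label_variants("CUS-729913"))  # ['CUS-729913', 'CUS729913']
```

Each returned string would then be added as a separate label of the
entity.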

best
Rupert

[1] http://stanbol.staging.apache.org/docs/trunk/components/enhancer/engines/lucenefstlinking

On Wed, Nov 27, 2013 at 1:16 AM, Phillip Rhodes
<[email protected]> wrote:
> Stanbol devs:
>
> I'm working on using the Stanbol EntityHub to allow use of "local"
> knowledge and site specific vocabularies, and have run into a sticking
> point.  I can configure a new SolrYard and ManagedSite, set up a
> KeywordLinkingEngine, enable KeywordTokenizer and get back results for
> alphanumeric strings like
>
> "CUS729913"
>
> but I haven't had any luck yet making it work with identifiers with
> embedded punctuation, like:
>
> "CUS-729913"
>
> Can anyone tell me if this is possible with Stanbol, and - if so -
> maybe give me a clue on what the missing incantation is?  Or if
> Stanbol doesn't currently support that, maybe an idea of where in the
> code to start looking, with an eye towards implementing something like
> that?
>
>
> Thanks,
>
>
>
> Phil R.
>
> --
> This message optimized for indexing by NSA PRISM

-- 
| Rupert Westenthaler             [email protected]
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
