Apparently I had just overlooked something, or was using stale data, or
something silly. Matching identifiers with punctuation like '-' works now.
I also managed to get the additional dereference fields to come back,
although I found that in 0.12 that setting was not exposed as a property
that could be managed in the Felix console, so I had to tweak the source
and rebuild. But once I did that, it works perfectly.
Thanks again for all your assistance.

Phil
This message optimized for indexing by NSA PRISM


On Fri, Nov 29, 2013 at 1:14 AM, Rupert Westenthaler
<rupert.westentha...@gmail.com> wrote:
> Hi Phillip
>
> On Wed, Nov 27, 2013 at 8:08 PM, Phillip Rhodes
> <motley.crue....@gmail.com> wrote:
>> Thanks for the information, Rupert. I'll look into the FST linking
>> for sure. What I don't quite understand, though, unless I just
>> overlooked something simple while I was experimenting with this, is
>> why I don't get matches on "CUS-12345" in my text when I have an
>> entity in the referenced Site that has an rdfs:label property with
>> the value "CUS-12345" (IOW, there is an exact match).
>>
>> I assumed it was something to do with tokenization, and that by the
>> time the tokens made it down to the linking engine, "CUS-12345" had
>> been separated into "CUS" and "12345" or something. But if I
>> understand you correctly, this actually *should* work, as long as
>> the punctuated label matches exactly in both the input text and the
>> entity label?
>>
>
> This is correct. If the Tokenizer splits the tokens, it will not
> match. This is why the KeywordLinkingEngine had the option to
> activate a KeywordTokenizer.
>
> Copied from the documentation of the KeywordLinkingEngine:
>
> Keyword Tokenizer
> (org.apache.stanbol.enhancer.engines.keywordextraction.keywordTokenizer):
> This allows using a special Tokenizer for matching keywords and
> alphanumeric IDs. Typical language-specific Tokenizers tend to split
> such IDs into several tokens and therefore might prevent a correct
> match. This Tokenizer should only be activated if the
> KeywordLinkingEngine is configured to match against IDs like ISBN
> numbers, Product IDs ... It should not be used to match against
> natural-language labels.
>
> For the EntityLinkingEngine this option is not available, as
> tokenization is not done by this engine.
>
> best
> Rupert
>
>>
>> Phil
>> This message optimized for indexing by NSA PRISM
>>
>>
>> On Wed, Nov 27, 2013 at 3:24 AM, Rupert Westenthaler
>> <rupert.westentha...@gmail.com> wrote:
>>> Hi Phillip
>>>
>>> The KeywordTokenizer just ensures that 'CUS-729913', when appearing
>>> in the text, is not split up into ['CUS', '-', '729913'].
>>> 'CUS729913' appearing in the text will never match 'CUS-729913', as
>>> this engine requires the first ~80% of the chars of a token to be
>>> identical before it considers the token a match.
>>>
>>> So while - with the KeywordTokenizer enabled - the
>>> KeywordLinkingEngine can be used to detect exact mentions of keys,
>>> it cannot be used for fuzzy matching of them.
>>>
>>> I would recommend you switch to the FST Linking Engine [1]. This
>>> engine matches against the labels as processed by the configured
>>> Solr Analyzer. So if the configured Analyzer uses a correctly
>>> configured Solr WordDelimiterFilter, the engine will suggest
>>> 'CUS-729913' for 'CUS729913' and the other way around.
>>>
>>> The default configuration of the SolrYard does use an Analyzer
>>> configuration that includes the WordDelimiterFilter, and AFAIK the
>>> default config should work for your use case. But anyway, here is
>>> an example of a Solr Analyzer configuration (note that the
>>> 'catenate*' options are enabled at index time):
>>>
>>> <analyzer type="index">
>>>   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>   <filter class="solr.WordDelimiterFilterFactory"
>>>           splitOnCaseChange="0" splitOnNumerics="0"
>>>           catenateWords="1" catenateNumbers="1"
>>>           catenateAll="1" preserveOriginal="1"/>
>>> </analyzer>
>>> <analyzer type="query">
>>>   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>   <filter class="solr.WordDelimiterFilterFactory"
>>>           splitOnCaseChange="0" splitOnNumerics="0"/>
>>> </analyzer>
>>>
>>> The "fise:confidence" of the suggestion will be calculated using
>>> the Levenshtein distance between the mention 'CUS729913' and the
>>> label 'CUS-729913', so confidence values of fuzzy-matched keys
>>> should be pretty high.
>>>
>>> If you want to keep using the KeywordLinkingEngine, the only option
>>> is to define all variants that could be mentioned in the text as
>>> labels - so, given your example, both "CUS-729913" and "CUS729913".
>>>
>>> best
>>> Rupert
>>>
>>> [1] http://stanbol.staging.apache.org/docs/trunk/components/enhancer/engines/lucenefstlinking
>>>
>>>
>>> On Wed, Nov 27, 2013 at 1:16 AM, Phillip Rhodes
>>> <motley.crue....@gmail.com> wrote:
>>>> Stanbol devs:
>>>>
>>>> I'm working on using the Stanbol EntityHub to allow use of "local"
>>>> knowledge and site-specific vocabularies, and have run into a
>>>> sticking point. I can configure a new SolrYard and ManagedSite,
>>>> set up a KeywordLinkingEngine, enable the KeywordTokenizer, and
>>>> get back results for alphanumeric strings like
>>>>
>>>> "CUS729913"
>>>>
>>>> but I haven't had any luck yet making it work with identifiers
>>>> with embedded punctuation, like:
>>>>
>>>> "CUS-729913"
>>>>
>>>> Can anyone tell me if this is possible with Stanbol, and - if so -
>>>> maybe give me a clue on what the missing incantation is? Or, if
>>>> Stanbol doesn't currently support that, maybe an idea of where in
>>>> the code to start looking, with an eye toward implementing
>>>> something like that?
>>>>
>>>> Thanks,
>>>>
>>>> Phil R.
>>>>
>>>> --
>>>> This message optimized for indexing by NSA PRISM
>>>
>>>
>>> --
>>> | Rupert Westenthaler             rupert.westentha...@gmail.com
>>> | Bodenlehenstraße 11             ++43-699-11108907
>>> | A-5500 Bischofshofen
>
>
> --
> | Rupert Westenthaler             rupert.westentha...@gmail.com
> | Bodenlehenstraße 11             ++43-699-11108907
> | A-5500 Bischofshofen
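
For readers following along, here is a minimal, self-contained Java
sketch of the two matching behaviours discussed in the thread: the
KeywordLinkingEngine's ~80%-common-prefix token matching, and a
Levenshtein-based confidence like the one the FST Linking Engine
reports. The 0.8 threshold and the 1 - distance/maxLength formula are
illustrative assumptions, not Stanbol's actual implementation.

    public class MatchSketch {

        // Prefix rule (assumption: ~80% of the token's chars must be
        // identical, per Rupert's description above).
        static boolean prefixMatch(String token, String label) {
            int required = (int) Math.ceil(token.length() * 0.8);
            int common = 0;
            int max = Math.min(token.length(), label.length());
            while (common < max && token.charAt(common) == label.charAt(common)) {
                common++;
            }
            return common >= required;
        }

        // Classic dynamic-programming Levenshtein distance, two rows.
        static int levenshtein(String a, String b) {
            int[] prev = new int[b.length() + 1];
            int[] curr = new int[b.length() + 1];
            for (int j = 0; j <= b.length(); j++) prev[j] = j;
            for (int i = 1; i <= a.length(); i++) {
                curr[0] = i;
                for (int j = 1; j <= b.length(); j++) {
                    int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                    curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                       prev[j - 1] + cost);
                }
                int[] tmp = prev; prev = curr; curr = tmp;
            }
            return prev[b.length()];
        }

        // Confidence as 1 - (distance / longer length): an assumed
        // formula, chosen so near-identical labels score near 1.0.
        static double confidence(String mention, String label) {
            int maxLen = Math.max(mention.length(), label.length());
            return 1.0 - (double) levenshtein(mention, label) / maxLen;
        }

        public static void main(String[] args) {
            // The common prefix of 'CUS729913' and 'CUS-729913' is only
            // 'CUS' (3 of 9 chars), well below the ~80% requirement, so
            // a prefix-based matcher rejects the pair...
            System.out.println(prefixMatch("CUS729913", "CUS-729913")); // false
            // ...while the edit distance is just 1 (one inserted '-'),
            // so the Levenshtein-based confidence stays high: 1 - 1/10 = 0.90
            System.out.printf("%.2f%n", confidence("CUS729913", "CUS-729913"));
        }
    }

Running it prints "false" and "0.90", which matches the point made in
the thread: an exact, prefix-based matcher cannot bridge 'CUS729913'
and 'CUS-729913', while an edit-distance-based score for the pair
remains high.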