Re: Current state of work (2013-11-25)

Adam Estrada Tue, 26 Nov 2013 16:38:50 -0800

Understood! Thanks, Martin. I think isHeuristicMatchForName() sounds great!


Adam

On Tue, Nov 26, 2013 at 7:18 PM, Martin Desruisseaux
<[email protected]> wrote:
> Hello Adam
>
> Thanks for the links, I was not aware of them. There is currently no
> probability value for matching string(s). The current heuristic rules are
> based on known practices, like ESRI adding the "D_" prefix for datum, spaces
> replaced by '_' and non-alphanumeric characters ignored. I have not yet
> found a need to match strings that are only similar. So far I have seen
> either an exact match under the above rules, or completely different names
> (e.g. "International 1924" and "Hayford 1909" are the same ellipsoid).
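
For illustration, the rules described above could be sketched in Java roughly as follows. This is only a sketch under stated assumptions, not the SIS implementation of isHeuristicMatchForName(): the class name HeuristicNameSketch and its methods are hypothetical, and comparing case-insensitively is an assumption that the message does not state. Spaces and '_' end up equivalent simply because every non-alphanumeric character is ignored.

```java
/** Hypothetical sketch of the heuristic rules; not the SIS API. */
final class HeuristicNameSketch {
    private HeuristicNameSketch() {
    }

    /** Keeps only letters and digits, so ' ', '_' and punctuation are all ignored. */
    static String alphanumericsOnly(String name) {
        StringBuilder buffer = new StringBuilder(name.length());
        for (int i = 0; i < name.length();) {
            int c = name.codePointAt(i);
            if (Character.isLetterOrDigit(c)) {
                buffer.appendCodePoint(c);
            }
            i += Character.charCount(c);
        }
        return buffer.toString();
    }

    /** Exact match after normalization. Ignoring case is an assumption of this sketch. */
    static boolean matches(String a, String b) {
        return alphanumericsOnly(a).equalsIgnoreCase(alphanumericsOnly(b));
    }

    /** Same as matches(...), but also accepts an ESRI-style "D_" prefix on the candidate. */
    static boolean datumMatches(String candidate, String datumName) {
        if (matches(candidate, datumName)) {
            return true;
        }
        // regionMatches returns false when the candidate is shorter than the prefix.
        if (candidate.regionMatches(true, 0, "D_", 0, 2)) {
            return matches(candidate.substring(2), datumName);
        }
        return false;
    }
}
```

With this sketch, datumMatches("D_WGS_1984", "WGS 1984") is true, while matches("International 1924", "Hayford 1909") is false. The latter is the second case described above: two names for the same ellipsoid that no string rule can reconcile.
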
>
> Lucene of course has a role, and actually we do use it, but rather in some
> layers on top of metadata. I think it will come to SIS later, presumably in
> a separate module...
>
>     Martin
>
>
>
> On 26/11/13 18:49, Adam Estrada wrote:
>
>> Martin,
>>
>> Is there a probability value that is returned for the matching
>> string(s)? I actually just came across a blog post[1] that does
>> something similar to what you are working towards. They use the
>> term "best partial" for matching strings of noticeably
>> different lengths. This appears to be similar to using a Jaccard
>> index[2] for string comparison but on smaller bodies of text like the
>> titles of said aliases. Would this be an application for using a
>> Lucene index that already has all the info retrieval goodness built
>> into it?
>>
>> Adam
>>
>> [1]
>> http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
>> [2] http://en.wikipedia.org/wiki/Jaccard_index
>
>
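
As a rough illustration of the Jaccard index mentioned in the quoted question, the score for two names can be computed on their sets of word tokens as |A ∩ B| / |A ∪ B|. The sketch below is hypothetical: the class name JaccardSketch, the tokenization (lower-casing and splitting on anything that is not a letter or digit) and the choice to return 1.0 for two empty names are all assumptions, and none of it comes from SIS or Lucene.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of a token-based Jaccard index; not part of SIS or Lucene. */
final class JaccardSketch {
    private JaccardSketch() {
    }

    /** Splits a name into lower-case tokens on any run of non-alphanumeric characters. */
    static Set<String> tokens(String text) {
        Set<String> result = new HashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    /** Returns |A ∩ B| / |A ∪ B| for the token sets of the two names. */
    static double jaccard(String a, String b) {
        Set<String> first = tokens(a);
        Set<String> second = tokens(b);
        if (first.isEmpty() && second.isEmpty()) {
            return 1.0; // Assumption of this sketch: two empty names count as identical.
        }
        Set<String> intersection = new HashSet<>(first);
        intersection.retainAll(second);
        Set<String> union = new HashSet<>(first);
        union.addAll(second);
        return (double) intersection.size() / union.size();
    }
}
```

Here jaccard("D_WGS_1984", "WGS 1984") shares two of three distinct tokens and so gives 2/3, while jaccard("International 1924", "Hayford 1909") gives 0.0 even though both names designate the same ellipsoid. That is consistent with the point made in the thread: such aliases need a lookup of known names rather than a similarity score.
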
