Understood! Thanks, Martin. I think isHeuristicMatchForName() sounds great!
Adam

On Tue, Nov 26, 2013 at 7:18 PM, Martin Desruisseaux <[email protected]> wrote:
> Hello Adam
>
> Thanks for the links, I was not aware of them. There is currently no
> probability value for matching string(s). The current heuristic rules are
> based on known practices, like ESRI adding the "D_" prefix for datum, spaces
> replaced by '_' and non-alphanumeric characters ignored. I have not yet
> found a need to match strings that are only similar. So far I have seen
> either an exact match under the above rules, or completely different names
> (e.g. "International 1924" and "Hayford 1909" are the same ellipsoid).
>
> Lucene of course has a role, and actually we do use it, but rather in some
> layers on top of metadata. I think it will come to SIS later, presumably in
> a separate module...
>
> Martin
>
> On 26/11/13 18:49, Adam Estrada wrote:
>
>> Martin,
>>
>> Is there a probability value that is returned for the matching
>> string(s)? I actually just came across a blog post[1] that does
>> something similar to what you are working towards. They use the
>> verbiage "best partial" for comparing strings of noticeably
>> different lengths. This appears to be similar to using a Jaccard
>> index[2] for string comparison but on smaller bodies of text like the
>> titles of said aliases. Would this be an application for using a
>> Lucene index that already has all the info retrieval goodness built
>> into it?
>>
>> Adam
>>
>> [1] http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
>> [2] http://en.wikipedia.org/wiki/Jaccard_index
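
For illustration only, the heuristic rules described in the quoted message (an optional "D_" datum prefix, spaces written as '_', non-alphanumeric characters ignored) could be sketched roughly as follows. This is not the Apache SIS implementation: the class NameHeuristics and the methods normalize() and looselyEquals() are hypothetical names, and the case-insensitive comparison is an extra assumption that the thread does not state.

```java
public class NameHeuristics {
    /**
     * Hypothetical helper: drops an optional leading "D_" (the ESRI datum
     * prefix mentioned in the thread), then keeps only letters and digits.
     * Since spaces and '_' are both discarded, they compare as equal.
     * Lower-casing is an assumption, not something the thread specifies.
     */
    static String normalize(String name) {
        String s = name.trim();
        if (s.regionMatches(true, 0, "D_", 0, 2)) {
            s = s.substring(2);
        }
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                sb.append(Character.toLowerCase(c));
            }
        }
        return sb.toString();
    }

    /** Returns true when both names are identical after normalization. */
    static boolean looselyEquals(String a, String b) {
        return normalize(a).equals(normalize(b));
    }

    public static void main(String[] args) {
        // Same name written with the ESRI conventions: expected to match.
        System.out.println(looselyEquals("D_North_American_1983", "North American 1983"));
        // Genuinely different names for the same ellipsoid: no match,
        // such aliases have to be declared explicitly.
        System.out.println(looselyEquals("International 1924", "Hayford 1909"));
    }
}
```

A rule-based comparison like this gives a yes/no answer rather than a probability, which is consistent with the point that names in practice either match exactly under the rules or are entirely different.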
