[
https://issues.apache.org/jira/browse/HIVE-9556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14317524#comment-14317524
]
Alexander Pivovarov commented on HIVE-9556:
-------------------------------------------
String similarity functions can be used to detect fraudulent activity, e.g. a person
registering under slightly different names: "Alexander" vs "Alexandre".
They can also be used to match the same address written differently: "110 Rock Harbor ln" vs "110
Rock harbour Lane".
Oracle has the SOUNDEX function to find strings that sound similar.
Postgres has
- soundex
- difference
- levenshtein // returns int instead of double
- metaphone
- dmetaphone
http://www.postgresql.org/docs/9.1/static/fuzzystrmatch.html
String similarity functions might be useful for people migrating from Oracle or
from Postgres to Hive.
People who work with accounts, names, addresses, medical records, etc. may
find string similarity functions extremely useful.
String similarity functions can be used by data scientists as well.
Levenshtein distance is included in Apache Commons Lang as
StringUtils.getLevenshteinDistance(),
which is a standard library found in most Java projects.
It would be nice to have Levenshtein distance in Hive as well.
> create UDF to measure strings similarity using Levenshtein Distance algo
> ------------------------------------------------------------------------
>
> Key: HIVE-9556
> URL: https://issues.apache.org/jira/browse/HIVE-9556
> Project: Hive
> Issue Type: Improvement
> Components: UDF
> Reporter: Alexander Pivovarov
> Assignee: Alexander Pivovarov
> Attachments: HIVE-9556.1.patch, HIVE-9556.2.patch
>
>
> algorithm description http://en.wikipedia.org/wiki/Levenshtein_distance
> {code}
> --one edit operation, greatest str len = 12
> str_sim_levenshtein('Test String1', 'Test String2') = 1 - 1 / 12 = 0.91666667
> {code}
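The arithmetic in the quoted example (1 - distance / length of the longer string) can be sketched in plain Java as below. This is only an illustration, not the code from the attached patches: the class and method names are hypothetical, the edit distance is a textbook two-row implementation rather than the Commons Lang call, and returning 1.0 for two empty strings is an assumption.

```java
// Hypothetical helper for illustration; not taken from HIVE-9556 patches.
final class LevenshteinSimilarity {

    // Edit distance via dynamic programming, keeping only two rows of the table.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // building the first j chars of b from "" takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // reducing the first i chars of a to "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insertion, deletion
                        prev[j - 1] + cost);                    // substitution or match
            }
            int[] tmp = prev; // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Similarity in [0, 1]: 1 - distance / length of the longer string.
    static double similarity(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) {
            return 1.0; // assumption: two empty strings count as identical
        }
        return 1.0 - (double) distance(a, b) / maxLen;
    }
}
```

For 'Test String1' and 'Test String2' the distance is 1 and the longer length is 12, so similarity() gives 11/12, matching the value in the issue description. Note that "Alexander" vs "Alexandre" has distance 2, since plain Levenshtein distance counts a swap of adjacent characters as two substitutions.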