Impala Public Jenkins has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/13870 )

Change subject: IMPALA-8752: Added Jaro-Winkler edit distance and similarity 
built-in function
......................................................................

IMPALA-8752: Added Jaro-Winkler edit distance and similarity built-in function

The added functions return the Jaro/Jaro-Winkler similarity/distance
of two strings. The algorithm calcuates the Jaro-Similarity of the
strings, then adds more weight to the result if there are
common prefixes. (Jaro-Winkler)
For more detail, see:
https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance

Extended the algorithm with another optional parameter: boost threshold
The prefix weight will only be applied if the Jaro-similarity
exceeds the given threshold. By default, its value is 0.7.

The new built-in functions are:
 * jaro_distance, jaro_dst
 * jaro_similarity, jaro_sim
 * jaro_winkler_distance, jw_dst
 * jaro_winkler_similarity, jw_sim

Testing:
 * Added unit tests to expr-test.cc
 * Manual testing over 1400 word pairs from
   http://marvin.cs.uidaho.edu/misspell.html
   Results match Apache commons

Change-Id: I64d7f461516c5e66cc27d62612bc8cc0e8f0178c
Reviewed-on: http://gerrit.cloudera.org:8080/13870
Reviewed-by: Zoltan Borok-Nagy <borokna...@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
---
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/exprs/string-functions.h
M common/function-registry/impala_functions.py
4 files changed, 323 insertions(+), 0 deletions(-)

Approvals:
  Zoltan Borok-Nagy: Looks good to me, approved
  Impala Public Jenkins: Verified

--
To view, visit http://gerrit.cloudera.org:8080/13870
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I64d7f461516c5e66cc27d62612bc8cc0e8f0178c
Gerrit-Change-Number: 13870
Gerrit-PatchSet: 11
Gerrit-Owner: Norbert Luksa <norbert.lu...@cloudera.com>
Gerrit-Reviewer: Greg Rahn <gr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <norbert.lu...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>

Reply via email to