This is an automated email from the ASF dual-hosted git repository.
dzamo pushed a commit to branch gh-pages
in repository https://gitbox.apache.org/repos/asf/drill.git
The following commit(s) were added to refs/heads/gh-pages by this push:
new 8619175 Reformat string-distance-functions.md.
8619175 is described below
commit 8619175ee4fccf3c116b2fb87f5b59d49da078fc
Author: James Turton <[email protected]>
AuthorDate: Thu Jul 8 10:16:26 2021 +0200
Reformat string-distance-functions.md.
---
.../050-aggregate-and-aggregate-statistical.md | 2 -
.../sql-functions/062-string-distance-functions.md | 63 ++++++----------------
2 files changed, 15 insertions(+), 50 deletions(-)
diff --git
a/_docs/en/sql-reference/sql-functions/050-aggregate-and-aggregate-statistical.md
b/_docs/en/sql-reference/sql-functions/050-aggregate-and-aggregate-statistical.md
index 6a199a5..7690aab 100644
---
a/_docs/en/sql-reference/sql-functions/050-aggregate-and-aggregate-statistical.md
+++
b/_docs/en/sql-reference/sql-functions/050-aggregate-and-aggregate-statistical.md
@@ -9,7 +9,6 @@ parent: "SQL Functions"
The following table lists the aggregate functions that you can use in Drill
queries.
-|\-------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------|
| **Function** | **Argument
Type**
| **Return Type**
|
| ------------------------------------------------------------ |
-----------------------------------------------------------------------------------------------------------------------------------------
|
------------------------------------------------------------------------------------------------------------------------------------
|
@@ -21,7 +20,6 @@ queries.
| COUNT(\[DISTINCT\] expression) | any
| BIGINT
|
| MAX(expression), MIN(expression) | BINARY,
DECIMAL, VARCHAR, DATE, TIME, or TIMESTAMP
| Same as argument type
|
| SUM(expression) | SMALLINT,
INTEGER, BIGINT, FLOAT, DOUBLE, DECIMAL, INTERVAL
| DECIMAL for DECIMAL
argument, BIGINT for any integer-type argument (including BIGINT), DOUBLE
for floating-point arguments |
-| ------------------------------------------- |
-----------------------------------------------------------------------------------------------------------------------------------------
|
------------------------------------------------------------------------------------------------------------------------------------
|
- Drill 1.14 and later supports the ANY_VALUE function.
- Starting in Drill 1.14, the DECIMAL data type is enabled by default.
diff --git
a/_docs/en/sql-reference/sql-functions/062-string-distance-functions.md
b/_docs/en/sql-reference/sql-functions/062-string-distance-functions.md
index c0d0c3f..09f069c 100644
--- a/_docs/en/sql-reference/sql-functions/062-string-distance-functions.md
+++ b/_docs/en/sql-reference/sql-functions/062-string-distance-functions.md
@@ -4,7 +4,9 @@ slug: "String Distance Functions"
parent: "SQL Functions"
---
-Starting in version 1.14, Drill supports string distance functions. Typically,
you use string distance functions in the WHERE clause of a query to measure the
difference between two strings. For example, if you want to match a street
address, but do not know how to spell a street name, you could issue a query on
the data source with the street addresses:
+**Introduced in release**: 1.14.
+
+Drill provides functions for calculating a variety of well-known string
distance metrics. Typically, you use string distance functions in the WHERE
clause of a query to measure the difference between two strings. For example,
if you want to match a street address, but do not know how to spell a street
name, you could execute a query on the data source with the street addresses:
SELECT street_address
FROM address-data
@@ -15,57 +17,22 @@ The search would return addresses from rows with street
addresses similar to 123
1234 N. Quail Lane
1234 N Quaile Lan
-Drill supports the following string distance functions:
-
--
[`cosine_distance(string1,string2)`]({{site.baseurl}}/docs/string-distance-functions/#cosine_distance(string1,string2))
--
[`fuzzy_score(string1,string2)`]({{site.baseurl}}/docs/string-distance-functions/#fuzzy_score(string1,string2))
--
[`hamming_distance(string1,string2)`]({{site.baseurl}}/docs/string-distance-functions/#hamming_distance-(string1,string2))
--
[`jaccard_distance(string1,string2)`]({{site.baseurl}}/docs/string-distance-functions/#jaccard_distance-(string1,string2))
--
[`jaro_distance(string1,string2)`]({{site.baseurl}}/docs/string-distance-functions/#jaro_distance-(string1,string2))
--
[`levenshtein_distance(string1,string2)`]({{site.baseurl}}/docs/string-distance-functions/#levenshtein_distance-(string1,string2))
--
[`longest_common_substring_distance(string1,string2)`]({{site.baseurl}}/docs/string-distance-functions/#longest_common_substring_distance(string1,string2))
-
-
-## Function Descriptions
-The following sections describe each of the string distance functions that
Drill supports.
-
-### cosine_distance(string1,string2)
-
-Calculates the cosine distance between two strings.
-
-
-### fuzzy_score(string1,string2)
-
-Calculates the cosine distance between two strings. A matching algorithm that
is similar to the searching algorithms implemented in editors such as Sublime
Text, TextMate, Atom, and others. One point is given for every matched
character. Subsequent matches yield two bonus points. A higher score indicates
a higher similarity.
-
-
-### hamming_distance (string1,string2)
-
-The hamming distance between two strings of equal length is the number of
positions at which the corresponding symbols are different. For further
explanation about the Hamming Distance, refer to
http://en.wikipedia.org/wiki/Hamming_distance.
-
-
-### jaccard_distance (string1,string2)
-
-Measures the Jaccard distance of two sets of character sequence. [Jaccard
distance](https://en.wikipedia.org/wiki/Jaccard_index) is the dissimilarity
between two sets. It is the complementary of Jaccard similarity.
-
-
-### jaro_distance (string1,string2)
-
-A similarity algorithm indicating the percentage of matched characters between
two character sequences. The Jaro measure is the weighted sum of percentage of
matched characters from each file and transposed characters. Winkler increased
this measure for matching initial characters. This implementation is based on
the [Jaro Winkler similarity
algorithm](https://en.wikipedia.org/wiki/Jaro–Winkler_distance).
-
-
-### levenshtein_distance (string1,string2)
-An algorithm for measuring the difference between two character sequences.
This is the number of changes needed to change one sequence into another, where
each change is a single character modification (deletion, insertion, or
substitution).
+Drill supports the following string distance functions.
+|Function|Return type|Description|
+|-|-|-|
+|COSINE_DISTANCE(string1, string2)|FLOAT8|Returns the cosine distance, a
measurement of the angular distance between two strings regarded as
word vectors.|
+|FUZZY_SCORE(string1, string2)|FLOAT8|Returns the score from a fuzzy string
matching algorithm[^1]. Higher scores indicate greater similarity.|
+|HAMMING_DISTANCE(string1, string2)|FLOAT8|Returns the [Hamming
distance](http://en.wikipedia.org/wiki/Hamming_distance) between two strings of
equal length, a measurement of the number of positions at which corresponding
characters differ.|
+|JACCARD_DISTANCE(string1, string2)|FLOAT8|Returns the [Jaccard
distance](https://en.wikipedia.org/wiki/Jaccard_index) between two strings
regarded as unordered sets of characters, a measurement of the dissimilarity
between two sets.|
+|JARO_DISTANCE(string1, string2)|FLOAT8|Returns the [Jaro-Winkler
distance](https://en.wikipedia.org/wiki/Jaro–Winkler_distance), a measurement
of the fraction of matching characters between two strings.|
+|LEVENSHTEIN_DISTANCE(string1, string2)|FLOAT8|Returns the [Levenshtein
distance](https://en.wikipedia.org/wiki/Levenshtein_distance) between two
strings, a measurement of the number of single-character modifications needed
to change one string into another.|
+|LONGEST\_COMMON\_SUBSTRING_DISTANCE(string1, string2)|FLOAT8|Returns the
length of the [longest common
substring](https://en.wikipedia.org/wiki/Longest_common_subsequence_problem)
across two strings[^2].|
-### longest\_common\_substring_distance(string1,string2)
-Returns the length of the longest sub-sequence that two strings have in common.
-Two strings that are entirely different, return a value of 0, and two strings
that return a value of the commonly shared length implies that the strings are
completely the same in value and position. This implementation is based on the
[Longest Commons Substring
algorithm](https://en.wikipedia.org/wiki/Longest_common_subsequence_problem).
-
+[^1]: Calculates the score from a matching algorithm similar to the searching
algorithms implemented in editors such as Sublime Text, TextMate, Atom, and
others. One point is given for every matched character. Subsequent matches
yield two bonus points.
-**Note:** Generally this algorithm is fairly inefficient, as for length m, n
of the input
-CharSequence's left and right respectively, the runtime of the algorithm is
O(m*n).
+[^2]: Generally this algorithm is fairly inefficient: for input CharSequences
left and right of lengths m and n respectively, the runtime of the
algorithm is O(m*n).
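
For readers of this change, the Hamming and Levenshtein definitions in the new table can be illustrated with a short Python sketch. This is not Drill's implementation: the function names `hamming_distance` and `levenshtein_distance` below are hypothetical helpers chosen to mirror the documented SQL functions, and they return plain integers rather than the FLOAT8 values listed in the table.

```python
def hamming_distance(left: str, right: str) -> int:
    """Count the positions at which corresponding characters differ.

    Illustrative helper only; defined for strings of equal length.
    """
    if len(left) != len(right):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(a != b for a, b in zip(left, right))


def levenshtein_distance(left: str, right: str) -> int:
    """Count the single-character insertions, deletions, and substitutions
    needed to change `left` into `right`.

    Illustrative helper only; uses the standard row-by-row dynamic program.
    """
    # previous[j] holds the distance between the processed prefix of `left`
    # and the first j characters of `right`.
    previous = list(range(len(right) + 1))
    for i, a in enumerate(left, start=1):
        current = [i]  # turning i characters into the empty string: i deletions
        for j, b in enumerate(right, start=1):
            substitution_cost = 0 if a == b else 1
            current.append(min(
                previous[j] + 1,                      # delete a
                current[j - 1] + 1,                   # insert b
                previous[j - 1] + substitution_cost,  # substitute a with b
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(hamming_distance("karolin", "kathrin"))
    print(levenshtein_distance("kitten", "sitting"))
```

The Levenshtein sketch keeps only two rows of the table at a time, so it uses memory proportional to the length of the second string while still taking O(m*n) time.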