[ https://issues.apache.org/jira/browse/SPARK-20028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chenzhao Guo updated SPARK-20028:
---------------------------------
Description:
This is the implementation of the `ngrams` aggregate expression, which Hive
also implements. It applies the n-gram concept from natural language processing
to analyze text.
Currently, Spark does not support using the Hive UDAF GenericUDAFnGrams, so
this feature is missing.
An n-gram is a contiguous subsequence of n item(s) drawn from a given sequence.
This expression finds the k most frequent n-grams from one or more sequences.
The expression has the signature ngrams(children: Array[Array[String]] (or
Array[String]), n: Int, k: Int, accuracy: Int). It can be used in conjunction
with `sentences`, which splits a String column into arrays. The parameters are:
children: the 'given sequence' that n-grams are collected from;
n: the number of elements in each n-gram (size 1 is a "unigram", size 2 is a
"bigram", size 3 is a "trigram", and so on);
k: the number of most frequent n-grams to return (top k);
accuracy: related to the memory used for frequency estimation; more memory
gives more accurate frequency counts.
A simple example:
`SELECT ngrams(array("abc", "abc", "bcd", "abc", "bcd"), 2, 4);` returns
`[{["abc","bcd"]:2.0},
{["abc","abc"]:1.0},
{["bcd","abc"]:1.0}]`. The input contains four 2-grams:
`["abc", "abc"], ["abc", "bcd"], ["bcd", "abc"], ["abc", "bcd"]`. `["abc",
"bcd"]` occurs 2 times and the other two 2-grams occur 1 time each. Only three
distinct 2-grams exist, so three results are returned even though k is 4.
Because `["abc","abc"]` is alphabetically before `["bcd","abc"]`, it is listed
first among the two tied 2-grams.
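The counting and ordering in this example can be sketched in plain Python. This
is only an illustration of the expected result, not the proposed Spark
implementation: it counts exactly instead of using the memory-bounded frequency
estimation that the accuracy parameter controls, and the names `extract_ngrams`
and `top_k_ngrams` are hypothetical helpers made up for this sketch.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def extract_ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Return every contiguous run of n tokens, in order of appearance."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence shorter than n yields an empty range, hence no n-grams.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_k_ngrams(sequences: Sequence[Sequence[str]], n: int, k: int):
    """Exact top-k n-grams over one or more sequences (hypothetical helper)."""
    counts = Counter()
    for seq in sequences:
        # n-grams never span two sequences; each sequence is windowed alone.
        counts.update(extract_ngrams(seq, n))
    # Highest count first; ties are broken by comparing the n-grams
    # themselves, which orders them alphabetically element by element.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(list(gram), float(count)) for gram, count in ranked[:k]]


# Same input as the SQL example: one sequence, n = 2, k = 4.
result = top_k_ngrams([["abc", "abc", "bcd", "abc", "bcd"]], n=2, k=4)
for gram, count in result:
    print(gram, count)
# ['abc', 'bcd'] 2.0
# ['abc', 'abc'] 1.0
# ['bcd', 'abc'] 1.0
```

The sketch takes a list of sequences so that the Array[Array[String]] case is
covered directly; a single Array[String] is passed as a one-element list.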
was:
N-grams are subsequences of length N drawn from a longer sequence. The purpose
of the ngrams() is to find the k most frequent n-grams from one or more
sequences.
The aggregation function has the pattern of :
ngrams(array<array<string>>(or array<string>), int N, int K, int accuracy),
where
the first parameter indicates the 'longer sequence' we collect n-grams from,
N & K indicates the top k n-grams,
Accuracy indicates the frequency counting accuracy.
> Implement NGrams aggregate function
> -----------------------------------
>
> Key: SPARK-20028
> URL: https://issues.apache.org/jira/browse/SPARK-20028
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Chenzhao Guo
>