Takuya Kitazawa created HIVEMALL-146:
----------------------------------------

             Summary: Implement yet another UDF to generate n-grams from a list of words
                 Key: HIVEMALL-146
                 URL: https://issues.apache.org/jira/browse/HIVEMALL-146
             Project: Hivemall
          Issue Type: New Feature
            Reporter: Takuya Kitazawa
            Assignee: Takuya Kitazawa


Hive has an {{ngrams()}} function to obtain n-grams from a list of words:
https://cwiki.apache.org/confluence/display/Hive/StatisticsAndDataMining#StatisticsAndDataMining-ngrams()andcontext_ngrams():N-gramfrequencyestimation

While the existing function returns an "estimated" top-k list of frequent 
n-grams, NLP applications sometimes need an "exact" list of n-grams that 
includes all of the 1-, 2-, ..., n-grams. For example, for the input 
\["machine", "learning"\], we might expect the following result: \["machine", 
"learning", "machine learning"\].

Hence, this ticket proposes implementing another UDF similar to 
{{ngrams()}}. The implementation could be similar to {{getNgrams()}} in the 
Stanford CoreNLP library: 
https://github.com/stanfordnlp/CoreNLP/blob/d6318a0cb06dba635550477bc843952cc5a5f868/src/edu/stanford/nlp/util/StringUtils.java#L2132-L2142
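
To make the expected behavior concrete, here is a minimal sketch of the core logic in plain Java. It is not the Hivemall UDF itself and not a copy of the CoreNLP code. The class name {{NGramSketch}}, the method signature, and the output order (all 1-grams first, then 2-grams, and so on) are assumptions chosen to match the example above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; names are not from Hivemall or CoreNLP.
final class NGramSketch {

    /**
     * Returns every n-gram of the given words for minSize <= n <= maxSize,
     * joined by a single space. Shorter n-grams are emitted first.
     */
    static List<String> ngrams(List<String> words, int minSize, int maxSize) {
        if (minSize < 1 || maxSize < minSize) {
            throw new IllegalArgumentException("require 1 <= minSize <= maxSize");
        }
        List<String> result = new ArrayList<>();
        for (int size = minSize; size <= maxSize; size++) {
            // Slide a window of `size` words over the input.
            for (int start = 0; start + size <= words.size(); start++) {
                result.add(String.join(" ", words.subList(start, start + size)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints: [machine, learning, machine learning]
        System.out.println(ngrams(Arrays.asList("machine", "learning"), 1, 2));
    }
}
```

If the input has fewer words than a requested size, that size simply contributes no n-grams. An actual UDF would still need to wrap this logic in Hive's UDF interface and decide how to handle NULL input, which this sketch leaves open.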



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
