Added usage of tokenize_cn
Project: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/commit/1f819536
Tree: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/tree/1f819536
Diff: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/diff/1f819536

Branch: refs/heads/master
Commit: 1f819536b294920d5629d59b4524f2a6d6a0d014
Parents: 5eb8037
Author: partyyoung <[email protected]>
Authored: Fri Jun 30 17:49:59 2017 +0800
Committer: partyyoung <[email protected]>
Committed: Fri Jun 30 17:49:59 2017 +0800

----------------------------------------------------------------------
 docs/gitbook/misc/tokenizer.md       | 23 ++++++++++++++++++++++-
 resources/ddl/define-additional.hive |  3 +++
 2 files changed, 25 insertions(+), 1 deletion(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/1f819536/docs/gitbook/misc/tokenizer.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/misc/tokenizer.md b/docs/gitbook/misc/tokenizer.md
index 47f07e0..a2d3820 100644
--- a/docs/gitbook/misc/tokenizer.md
+++ b/docs/gitbook/misc/tokenizer.md
@@ -46,4 +46,25 @@ select tokenize_ja("kuromojiを使った分かち書きのテストです。第
 ```
 > ["kuromoji","使う","分かち書き","テスト","第","二","引数","normal","search","extended","指定","デフォルト","normal","モード"]
 
-For detailed APIs, please refer Javadoc of [JapaneseAnalyzer](https://lucene.apache.org/core/5_3_1/analyzers-kuromoji/org/apache/lucene/analysis/ja/JapaneseAnalyzer.html) as well.
\ No newline at end of file
+For detailed APIs, please refer Javadoc of [JapaneseAnalyzer](https://lucene.apache.org/core/5_3_1/analyzers-kuromoji/org/apache/lucene/analysis/ja/JapaneseAnalyzer.html) as well.
+
+# Tokenizer for Chinese Texts
+
+Hivemall-NLP module provides a Chinese text tokenizer UDF using [SmartChineseAnalyzer](http://lucene.apache.org/core/5_3_1/analyzers-smartcn/org/apache/lucene/analysis/cn/smart/SmartChineseAnalyzer.html).
+
+> add jar /tmp/[hivemall-nlp-xxx-with-dependencies.jar](https://github.com/myui/hivemall/releases);
+
+> source /tmp/[define-additional.hive](https://github.com/myui/hivemall/releases);
+
+The signature of the UDF is as follows:
+```sql
+tokenize_cn(string line, optional const array<string> stopWords)
+```
+
+It's basic usage is as follows:
+```sql
+select tokenize_cn("Smartcn为Apache2.0协议的开源中文分词系统，Java语言编写，修改的中科院计算所ICTCLAS分词系统。");
+```
+> [smartcn, 为, apach, 2, 0, 协议, 的, 开源, 中文, 分词, 系统, java, 语言, 编写, 修改, 的, 中科院, 计算, 所, ictcla, 分词, 系统]
+
+For detailed APIs, please refer Javadoc of [SmartChineseAnalyzer](http://lucene.apache.org/core/5_3_1/analyzers-smartcn/org/apache/lucene/analysis/cn/smart/SmartChineseAnalyzer.html) as well.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/1f819536/resources/ddl/define-additional.hive
----------------------------------------------------------------------
diff --git a/resources/ddl/define-additional.hive b/resources/ddl/define-additional.hive
index 7bbfcf4..af5cf82 100644
--- a/resources/ddl/define-additional.hive
+++ b/resources/ddl/define-additional.hive
@@ -9,6 +9,9 @@
 drop temporary function if exists tokenize_ja;
 create temporary function tokenize_ja as 'hivemall.nlp.tokenizer.KuromojiUDF';
 
+drop temporary function if exists tokenize_cn;
+create temporary function tokenize_cn as 'hivemall.nlp.tokenizer.SmartcnUDF';
+
 ------------------------------
 -- XGBoost related features --
 ------------------------------
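
Note for readers of this commit: the documentation added here shows only the one-argument form of `tokenize_cn`. The optional `stopWords` argument in the documented signature could be exercised roughly as follows. This is a minimal sketch, assuming the hivemall-nlp jar has been added and `define-additional.hive` has been sourced in the current session; the stop-word list is purely illustrative and is not taken from the commit.

```sql
-- Assumption: tokenize_cn is already registered via define-additional.hive.
-- The second argument is a constant array of stop words to exclude;
-- the two words chosen here are hypothetical examples.
select tokenize_cn(
  "Smartcn为Apache2.0协议的开源中文分词系统，Java语言编写，修改的中科院计算所ICTCLAS分词系统。",
  array("的", "为")
);
```

Because the signature declares `stopWords` as `const`, the array is written here as a literal rather than read from a table column.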
