Added usage of tokenize_cn

Project: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/commit/1f819536
Tree: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/tree/1f819536
Diff: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/diff/1f819536

Branch: refs/heads/master
Commit: 1f819536b294920d5629d59b4524f2a6d6a0d014
Parents: 5eb8037
Author: partyyoung <[email protected]>
Authored: Fri Jun 30 17:49:59 2017 +0800
Committer: partyyoung <[email protected]>
Committed: Fri Jun 30 17:49:59 2017 +0800

----------------------------------------------------------------------
 docs/gitbook/misc/tokenizer.md       | 23 ++++++++++++++++++++++-
 resources/ddl/define-additional.hive |  3 +++
 2 files changed, 25 insertions(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/1f819536/docs/gitbook/misc/tokenizer.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/misc/tokenizer.md b/docs/gitbook/misc/tokenizer.md
index 47f07e0..a2d3820 100644
--- a/docs/gitbook/misc/tokenizer.md
+++ b/docs/gitbook/misc/tokenizer.md
@@ -46,4 +46,25 @@ select tokenize_ja("kuromojiを使った分かち書きのテストです。第
 ```
 > ["kuromoji","使う","分かち書き","テスト","第","二","引数","normal","search","extended","指定","デフォルト","normal","モード"]
 
-For detailed APIs, please refer Javadoc of [JapaneseAnalyzer](https://lucene.apache.org/core/5_3_1/analyzers-kuromoji/org/apache/lucene/analysis/ja/JapaneseAnalyzer.html) as well.
\ No newline at end of file
+For detailed APIs, please refer to the Javadoc of [JapaneseAnalyzer](https://lucene.apache.org/core/5_3_1/analyzers-kuromoji/org/apache/lucene/analysis/ja/JapaneseAnalyzer.html) as well.
+
+# Tokenizer for Chinese Texts
+
+The Hivemall-NLP module provides a Chinese text tokenizer UDF using [SmartChineseAnalyzer](http://lucene.apache.org/core/5_3_1/analyzers-smartcn/org/apache/lucene/analysis/cn/smart/SmartChineseAnalyzer.html).
 
+
+> add jar /tmp/[hivemall-nlp-xxx-with-dependencies.jar](https://github.com/myui/hivemall/releases);
+
+> source /tmp/[define-additional.hive](https://github.com/myui/hivemall/releases);
+
+The signature of the UDF is as follows:
+```sql
+tokenize_cn(string line, optional const array<string> stopWords)
+```
+
+Its basic usage is as follows:
+```sql
+select tokenize_cn("Smartcn为Apache2.0协议的开源中文分词系统,Java语言编写,修改的中科院计算所ICTCLAS分词系统。");
+```
+> [smartcn, 为, apach, 2, 0, 协议, 的, 开源, 中文, 分词, 系统, java, 语言, 编写, 修改, 的, 中科院, 计算, 所, ictcla, 分词, 系统]
+
+For detailed APIs, please refer to the Javadoc of [SmartChineseAnalyzer](http://lucene.apache.org/core/5_3_1/analyzers-smartcn/org/apache/lucene/analysis/cn/smart/SmartChineseAnalyzer.html) as well.
\ No newline at end of file
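
The patch documents only the single-argument form of tokenize_cn. Going by the signature `tokenize_cn(string line, optional const array<string> stopWords)` added above, the optional second argument should accept an explicit stop-word list; a hedged sketch (the stop words chosen here are illustrative, not taken from the commit, and the filtering behavior is assumed from the signature rather than verified against SmartcnUDF):

```sql
-- Hypothetical invocation: supply a custom stop-word list as the optional
-- second argument. Tokens matching an entry are assumed to be dropped
-- from the result array; this is an inference from the UDF signature.
select tokenize_cn(
  "Smartcn为Apache2.0协议的开源中文分词系统",
  array("的", "为")
);
```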

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/1f819536/resources/ddl/define-additional.hive
----------------------------------------------------------------------
diff --git a/resources/ddl/define-additional.hive b/resources/ddl/define-additional.hive
index 7bbfcf4..af5cf82 100644
--- a/resources/ddl/define-additional.hive
+++ b/resources/ddl/define-additional.hive
@@ -9,6 +9,9 @@
 drop temporary function if exists tokenize_ja;
 create temporary function tokenize_ja as 'hivemall.nlp.tokenizer.KuromojiUDF';
 
+drop temporary function if exists tokenize_cn;
+create temporary function tokenize_cn as 'hivemall.nlp.tokenizer.SmartcnUDF';
+
 ------------------------------
 -- XGBoost related features --
 ------------------------------
