[2/5] incubator-hivemall git commit: Added usage of tokenize_cn

takuti Sat, 01 Jul 2017 06:14:44 -0700

Added usage of tokenize_cn


Project: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/commit/1f819536
Tree: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/tree/1f819536
Diff: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/diff/1f819536

Branch: refs/heads/master
Commit: 1f819536b294920d5629d59b4524f2a6d6a0d014
Parents: 5eb8037
Author: partyyoung <[email protected]>
Authored: Fri Jun 30 17:49:59 2017 +0800
Committer: partyyoung <[email protected]>
Committed: Fri Jun 30 17:49:59 2017 +0800

----------------------------------------------------------------------
 docs/gitbook/misc/tokenizer.md       | 23 ++++++++++++++++++++++-
 resources/ddl/define-additional.hive |  3 +++
 2 files changed, 25 insertions(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/1f819536/docs/gitbook/misc/tokenizer.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/misc/tokenizer.md b/docs/gitbook/misc/tokenizer.md
index 47f07e0..a2d3820 100644
--- a/docs/gitbook/misc/tokenizer.md
+++ b/docs/gitbook/misc/tokenizer.md
@@ -46,4 +46,25 @@ select 
tokenize_ja("kuromojiãä½¿ã£ãåãã¡æ¸ãã®ãã¹ãã§ããç¬¬
 ```
 > ["kuromoji","ä½¿ã","åãã¡æ¸ã","ãã¹ã","ç¬¬","äº","å¼æ°","normal","search","extended","æå®","ããã©ã«ã","normal","ã¢ã¼ã"]
 
-For detailed APIs, please refer Javadoc of [JapaneseAnalyzer](https://lucene.apache.org/core/5_3_1/analyzers-kuromoji/org/apache/lucene/analysis/ja/JapaneseAnalyzer.html) as well.
\ No newline at end of file
+For detailed APIs, please refer Javadoc of [JapaneseAnalyzer](https://lucene.apache.org/core/5_3_1/analyzers-kuromoji/org/apache/lucene/analysis/ja/JapaneseAnalyzer.html) as well.
+
+# Tokenizer for Chinese Texts
+
+Hivemall-NLP module provides a Chinese text tokenizer UDF using [SmartChineseAnalyzer](http://lucene.apache.org/core/5_3_1/analyzers-smartcn/org/apache/lucene/analysis/cn/smart/SmartChineseAnalyzer.html).
+
+> add jar /tmp/[hivemall-nlp-xxx-with-dependencies.jar](https://github.com/myui/hivemall/releases);
+
+> source /tmp/[define-additional.hive](https://github.com/myui/hivemall/releases);
+
+The signature of the UDF is as follows:
+```sql
+tokenize_cn(string line, optional const array<string> stopWords)
+```
+
+It's basic usage is as follows:
+```sql
+select 
tokenize_cn("Smartcnä¸ºApache2.0åè®®çå¼æºä¸æåè¯ç³»ç»ï¼Javaè¯è¨ç¼åï¼ä¿®æ¹çä¸ç§é¢è®¡ç®æICTCLASåè¯ç³»ç»ã");
+```
+> [smartcn, ä¸º, apach, 2, 0, åè®®, ç, å¼æº, ä¸æ, åè¯, ç³»ç», 
java, è¯è¨, ç¼å, ä¿®æ¹, ç, ä¸ç§é¢, è®¡ç®, æ, ictcla, åè¯, 
ç³»ç»]
+
+For detailed APIs, please refer Javadoc of [SmartChineseAnalyzer](http://lucene.apache.org/core/5_3_1/analyzers-smartcn/org/apache/lucene/analysis/cn/smart/SmartChineseAnalyzer.html) as well.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/1f819536/resources/ddl/define-additional.hive
----------------------------------------------------------------------
diff --git a/resources/ddl/define-additional.hive b/resources/ddl/define-additional.hive
index 7bbfcf4..af5cf82 100644
--- a/resources/ddl/define-additional.hive
+++ b/resources/ddl/define-additional.hive
@@ -9,6 +9,9 @@
 drop temporary function if exists tokenize_ja;
 create temporary function tokenize_ja as 'hivemall.nlp.tokenizer.KuromojiUDF';
 
+drop temporary function if exists tokenize_cn;
+create temporary function tokenize_cn as 'hivemall.nlp.tokenizer.SmartcnUDF';
+
 ------------------------------
 -- XGBoost related features --
 ------------------------------
