GitHub user takuti opened a pull request:
https://github.com/apache/incubator-hivemall/pull/97
[HIVEMALL-130] Support user-defined dictionary for `tokenize_ja`
## What changes were proposed in this pull request?
- Add a new argument to `tokenize_ja` that lets users register a
user-defined dictionary
- The value can be either `const array<string>` (an array of custom
definitions) or `const string` (a URL pointing to an external dictionary file)
- Users need to follow the [Kuromoji official user dictionary
format](https://github.com/atilika/kuromoji/blob/909fd6b32bf4e9dc86b7599de5c9b50ca8f004a1/kuromoji-core/src/test/resources/userdict.txt)
as `<word>,<result>,<read>,<class>`
- Update the documentation and refactor accordingly
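
To make the `<word>,<result>,<read>,<class>` format concrete, here is a minimal Java sketch that splits such entries into their four fields. It is not the code in this PR: `UserDictSketch`, `Entry`, and `parseLine` are hypothetical names, and skipping `#` comment lines and requiring the same number of segments and readings are assumptions about how Kuromoji treats the file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only; not part of Hivemall or Kuromoji.
final class UserDictSketch {

    /** One parsed entry: <word>,<result>,<read>,<class>. */
    static final class Entry {
        final String word;            // surface form to match in the input text
        final List<String> segments;  // <result>: space-separated segmentation
        final List<String> readings;  // <read>: one reading per segment
        final String wordClass;       // <class>: part-of-speech label

        Entry(String word, List<String> segments, List<String> readings, String wordClass) {
            this.word = word;
            this.segments = segments;
            this.readings = readings;
            this.wordClass = wordClass;
        }
    }

    /** Parses a single CSV line into an Entry, rejecting malformed lines. */
    static Entry parseLine(String line) {
        String[] cols = line.split(",", -1);
        if (cols.length != 4) {
            throw new IllegalArgumentException("expected 4 comma-separated fields: " + line);
        }
        List<String> segments = Arrays.asList(cols[1].trim().split("\\s+"));
        List<String> readings = Arrays.asList(cols[2].trim().split("\\s+"));
        // Assumption: each segment needs exactly one reading.
        if (segments.size() != readings.size()) {
            throw new IllegalArgumentException("segments and readings differ in count: " + line);
        }
        return new Entry(cols[0].trim(), segments, readings, cols[3].trim());
    }

    /** Parses many lines, skipping blank lines and '#' comments (assumed convention). */
    static List<Entry> parse(List<String> lines) {
        List<Entry> entries = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            entries.add(parseLine(line));
        }
        return entries;
    }

    public static void main(String[] args) {
        Entry e = parseLine("日本経済新聞,日本 経済 新聞,ニホン ケイザイ シンブン,カスタム名詞");
        System.out.println(e.word + " -> " + e.segments + " / " + e.readings + " / " + e.wordClass);
    }
}
```

The same four-field strings are what the `const array<string>` form of the argument would carry, one entry per array element.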
## What type of PR is it?
Improvement
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-130
## How was this patch tested?
Manually tested both locally and on EMR with the Kuromoji official sample
file:
```
hive> select
> tokenize_ja("日本経済新聞", "normal"),
> tokenize_ja("日本経済新聞", "normal", array(), array(),
"https://raw.githubusercontent.com/atilika/kuromoji/909fd6b32bf4e9dc86b7599de5c9b50ca8f004a1/kuromoji-core/src/test/res/userdict.txt"),
> tokenize_ja("日本経済新聞", "normal", array(), array(),
array("日本経済新聞,日本 経済 新聞,ニホン ケイザイ シンブン,カスタム名詞", "関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,テスト名詞"))
> ;
OK
_c0 _c1 _c2
["日本経済新聞"] ["日本","経済","新聞"] ["日本","経済","新聞"]
Time taken: 2.094 seconds, Fetched: 1 row(s)
```
## How to use this feature?
As shown above.
## Checklist
- [x] Did you apply source code formatter, i.e., `mvn formatter:format`,
for your commit?
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/takuti/incubator-hivemall kuromoji-user-dict
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/incubator-hivemall/pull/97.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #97
----
commit 9f6eecb3ccb53e5a6ed8e4772b3be23c310e9f10
Author: Takuya Kitazawa <[email protected]>
Date: 2017-07-07T05:50:50Z
stopWords takes nullable array
commit 44cb2bf7675c190e6d86e3df5d99e23755ae9bcd
Author: Takuya Kitazawa <[email protected]>
Date: 2017-07-07T08:08:56Z
Support user dictionary specified via URL of plain text (CSV)
Unit test updated accordingly
commit d05be59c868d78c6418bd251b65293eac71c0642
Author: Takuya Kitazawa <[email protected]>
Date: 2017-07-07T08:36:00Z
Support gzip compressed connection
commit be9574fc3ab0d28b142e0cdc62f882221c948e06
Author: Takuya Kitazawa <[email protected]>
Date: 2017-07-10T01:50:27Z
Refactor InputStream/GZIPInputStream getter
commit a339a257846bca21c5d041a78c6d82ba3760e9a6
Author: Takuya Kitazawa <[email protected]>
Date: 2017-07-10T02:33:54Z
Set timeout
commit 1d0d17adc86ae3250b05d3262989834c45e0835c
Author: Takuya Kitazawa <[email protected]>
Date: 2017-07-10T02:43:09Z
Use BoundedInputStream to ignore huge file
commit bd1c7515837ed8afd03f4fe9c0ef2b2e5505dbe1
Author: Takuya Kitazawa <[email protected]>
Date: 2017-07-10T03:42:07Z
Create HttpUtils for http-connection-related code
commit f4459037afc57ec60b00b57926945e653b1a4f36
Author: Takuya Kitazawa <[email protected]>
Date: 2017-07-10T05:28:52Z
Support gzipped file url
commit 9b91ebbe3f90674b859c3cc645ababe75a21904d
Author: Takuya Kitazawa <[email protected]>
Date: 2017-07-10T09:03:00Z
Refactor
commit 1e32e65eb3f30520f98ae1915d3d5569962c84a4
Author: Takuya Kitazawa <[email protected]>
Date: 2017-07-11T02:06:13Z
Support creating user dictionary via `const array<string>`
commit e39016866bee9868404325a19e8775fb64a836c9
Author: Takuya Kitazawa <[email protected]>
Date: 2017-07-11T03:25:10Z
Refactor
Esp. argOIs handling in initialize()
commit 6d5f9c9ca81d4388fc0a67b80ade13f3732ebfc3
Author: Takuya Kitazawa <[email protected]>
Date: 2017-07-11T04:16:08Z
Check response code
commit 895d3a0a524c3362bdf8bd0a1b114a75179cb4b1
Author: Takuya Kitazawa <[email protected]>
Date: 2017-07-11T04:48:21Z
Update tokenizer doc
commit d92697e682a6e897ec26b101566a1ac0005ae124
Author: Takuya Kitazawa <[email protected]>
Date: 2017-07-11T04:49:12Z
Format
----
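
The commits above describe loading the dictionary over HTTP with a timeout, a response-code check, gzip support, and a bound on the input size. The following is a minimal standard-library sketch of that combination, not the `HttpUtils` code in this PR. `DictFetchSketch`, the timeout, and the size limit are hypothetical, and unlike a `BoundedInputStream`, which silently stops at the limit, this sketch throws when the limit is exceeded.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, for illustration only; not the HttpUtils class from the PR.
final class DictFetchSketch {
    static final int TIMEOUT_MS = 10_000;             // assumed timeout
    static final long MAX_BYTES = 32L * 1024 * 1024;  // assumed size limit

    /** Reads the stream as UTF-8 text, optionally gunzipping, up to maxBytes of decoded data. */
    static String readBounded(InputStream in, boolean gzipped, long maxBytes) throws IOException {
        InputStream src = gzipped ? new GZIPInputStream(in) : in;
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        long total = 0;
        int n;
        while ((n = src.read(buf)) != -1) {
            total += n;
            if (total > maxBytes) {
                throw new IOException("dictionary exceeds " + maxBytes + " bytes");
            }
            out.write(buf, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Fetches a plain-text or gzipped dictionary file from the given URL. */
    static String fetch(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setConnectTimeout(TIMEOUT_MS);
        conn.setReadTimeout(TIMEOUT_MS);
        conn.setRequestProperty("Accept-Encoding", "gzip");
        try {
            int code = conn.getResponseCode();
            if (code != HttpURLConnection.HTTP_OK) {
                throw new IOException("unexpected HTTP status " + code);
            }
            // Treat both a gzip-encoded response and a ".gz" file URL as compressed
            // (a simplification; a server could in principle do both at once).
            boolean gzipped = "gzip".equalsIgnoreCase(conn.getContentEncoding())
                    || url.endsWith(".gz");
            try (InputStream in = conn.getInputStream()) {
                return readBounded(in, gzipped, MAX_BYTES);
            }
        } finally {
            conn.disconnect();
        }
    }
}
```

Keeping the bounded read separate from the connection handling lets the size limit and gzip path be exercised with in-memory streams, without network access.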