GitHub user takuti opened a pull request: https://github.com/apache/incubator-hivemall/pull/97
[HIVEMALL-130] Support user-defined dictionary for `tokenize_ja`

## What changes were proposed in this pull request?

- Add a new argument to `tokenize_ja` that lets users register a user-defined dictionary
  - The value can be either `const array<string>` (an array of custom definitions) or `const string` (a URL pointing to an external dictionary file)
  - Users need to follow the [Kuromoji official user dictionary format](https://github.com/atilika/kuromoji/blob/909fd6b32bf4e9dc86b7599de5c9b50ca8f004a1/kuromoji-core/src/test/resources/userdict.txt), i.e. `<word>,<result>,<read>,<class>`
- Update the documentation and refactor accordingly

## What type of PR is it?

Improvement

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-130

## How was this patch tested?

Manually tested both locally and on EMR with the Kuromoji official sample file:

```
hive> select
    >   tokenize_ja("日本経済新聞", "normal"),
    >   tokenize_ja("日本経済新聞", "normal", array(), array(), "https://raw.githubusercontent.com/atilika/kuromoji/909fd6b32bf4e9dc86b7599de5c9b50ca8f004a1/kuromoji-core/src/test/res/userdict.txt"),
    >   tokenize_ja("日本経済新聞", "normal", array(), array(), array("日本経済新聞,日本 経済 新聞,ニホン ケイザイ シンブン,カスタム名詞", "関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,テスト名詞"))
    > ;
OK
_c0	_c1	_c2
["日本経済新聞"]	["日本","経済","新聞"]	["日本","経済","新聞"]
Time taken: 2.094 seconds, Fetched: 1 row(s)
```

## How to use this feature?

As shown above.

## Checklist

- [x] Did you apply source code formatter, i.e., `mvn formatter:format`, for your commit?
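
To make the `<word>,<result>,<read>,<class>` entry format above concrete, here is a minimal Java sketch that splits such entries into their four fields. It is not the code from this pull request and does not use Kuromoji itself: the class `UserDictSketch`, its `Entry` type, and the methods `parseLine` and `parseAll` are hypothetical names introduced only for illustration. Skipping blank lines and `#` comments, and requiring as many readings as segments, are assumptions of this sketch rather than documented behavior of `tokenize_ja`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the user dictionary entry format; not Hivemall code.
public class UserDictSketch {

    /** One parsed entry: surface form, its segmentation, the readings, and a part-of-speech label. */
    static final class Entry {
        final String surface;
        final List<String> segments;
        final List<String> readings;
        final String partOfSpeech;

        Entry(String surface, List<String> segments, List<String> readings, String partOfSpeech) {
            this.surface = surface;
            this.segments = segments;
            this.readings = readings;
            this.partOfSpeech = partOfSpeech;
        }
    }

    /** Parses a single "<word>,<result>,<read>,<class>" line. */
    static Entry parseLine(String line) {
        String[] cols = line.split(",", -1);
        if (cols.length != 4) {
            throw new IllegalArgumentException("Expected 4 comma-separated fields: " + line);
        }
        // <result> and <read> are whitespace-separated lists.
        List<String> segments = Arrays.asList(cols[1].trim().split("\\s+"));
        List<String> readings = Arrays.asList(cols[2].trim().split("\\s+"));
        // Assumption of this sketch: one reading per segment.
        if (segments.size() != readings.size()) {
            throw new IllegalArgumentException("Segment/reading count mismatch: " + line);
        }
        return new Entry(cols[0].trim(), segments, readings, cols[3].trim());
    }

    /** Parses many lines, skipping blank lines and '#' comments (an assumption of this sketch). */
    static List<Entry> parseAll(List<String> lines) {
        List<Entry> entries = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            entries.add(parseLine(line));
        }
        return entries;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "# sample entries taken from the query above",
            "日本経済新聞,日本 経済 新聞,ニホン ケイザイ シンブン,カスタム名詞",
            "関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,テスト名詞");
        for (Entry e : parseAll(lines)) {
            System.out.println(e.surface + " -> " + e.segments);
        }
    }
}
```

The same two sample entries are what the third `tokenize_ja` call passes as `array(...)`; each array element corresponds to one line of a dictionary file.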
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/takuti/incubator-hivemall kuromoji-user-dict

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-hivemall/pull/97.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #97

----
commit 9f6eecb3ccb53e5a6ed8e4772b3be23c310e9f10
Author: Takuya Kitazawa <k.tak...@gmail.com>
Date:   2017-07-07T05:50:50Z

    stopWords takes nullable array

commit 44cb2bf7675c190e6d86e3df5d99e23755ae9bcd
Author: Takuya Kitazawa <k.tak...@gmail.com>
Date:   2017-07-07T08:08:56Z

    Support user dictionary specified via URL of plain text (CSV)

    Unit test updated accordingly

commit d05be59c868d78c6418bd251b65293eac71c0642
Author: Takuya Kitazawa <k.tak...@gmail.com>
Date:   2017-07-07T08:36:00Z

    Support gzip compressed connection

commit be9574fc3ab0d28b142e0cdc62f882221c948e06
Author: Takuya Kitazawa <k.tak...@gmail.com>
Date:   2017-07-10T01:50:27Z

    Refactor InputStream/GZIPInputStream getter

commit a339a257846bca21c5d041a78c6d82ba3760e9a6
Author: Takuya Kitazawa <k.tak...@gmail.com>
Date:   2017-07-10T02:33:54Z

    Set timeout

commit 1d0d17adc86ae3250b05d3262989834c45e0835c
Author: Takuya Kitazawa <k.tak...@gmail.com>
Date:   2017-07-10T02:43:09Z

    Use BoundedInputStream to ignore huge file

commit bd1c7515837ed8afd03f4fe9c0ef2b2e5505dbe1
Author: Takuya Kitazawa <k.tak...@gmail.com>
Date:   2017-07-10T03:42:07Z

    Create HttpUtils for http-connection-related code

commit f4459037afc57ec60b00b57926945e653b1a4f36
Author: Takuya Kitazawa <k.tak...@gmail.com>
Date:   2017-07-10T05:28:52Z

    Support gzipped file url

commit 9b91ebbe3f90674b859c3cc645ababe75a21904d
Author: Takuya Kitazawa <k.tak...@gmail.com>
Date:   2017-07-10T09:03:00Z

    Refactor

commit 1e32e65eb3f30520f98ae1915d3d5569962c84a4
Author: Takuya Kitazawa <k.tak...@gmail.com>
Date:   2017-07-11T02:06:13Z

    Support creating user dictionary via `const array<string>`

commit e39016866bee9868404325a19e8775fb64a836c9
Author: Takuya Kitazawa <k.tak...@gmail.com>
Date:   2017-07-11T03:25:10Z

    Refactor

    Esp. argOIs handling in initialize()

commit 6d5f9c9ca81d4388fc0a67b80ade13f3732ebfc3
Author: Takuya Kitazawa <k.tak...@gmail.com>
Date:   2017-07-11T04:16:08Z

    Check response code

commit 895d3a0a524c3362bdf8bd0a1b114a75179cb4b1
Author: Takuya Kitazawa <k.tak...@gmail.com>
Date:   2017-07-11T04:48:21Z

    Update tokenizer doc

commit d92697e682a6e897ec26b101566a1ac0005ae124
Author: Takuya Kitazawa <k.tak...@gmail.com>
Date:   2017-07-11T04:49:12Z

    Format

----

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---