GitHub user takuti opened a pull request:

    https://github.com/apache/incubator-hivemall/pull/97

    [HIVEMALL-130] Support user-defined dictionary for `tokenize_ja`

    ## What changes were proposed in this pull request?
    
    - Add a new argument to `tokenize_ja` that lets users register a user-defined dictionary
      - The value can be either `const array<string>` (an array of custom definitions) or `const string` (a URL pointing to an external dictionary file)
      - Entries must follow the [Kuromoji official user dictionary format](https://github.com/atilika/kuromoji/blob/909fd6b32bf4e9dc86b7599de5c9b50ca8f004a1/kuromoji-core/src/test/resources/userdict.txt): `<word>,<result>,<read>,<class>`
    - Update the documentation and refactor accordingly
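
For illustration, the `<word>,<result>,<read>,<class>` format can be read as in the sketch below. This is a minimal Python sketch, not the Hivemall or Kuromoji implementation: `UserDictEntry` and `parse_user_dict` are hypothetical names, and both the skipping of blank lines and `#` comment lines and the requirement of one reading per segment are assumptions made here.

```python
from typing import Iterable, List, NamedTuple


class UserDictEntry(NamedTuple):
    surface: str         # <word>: the text to match in the input
    segments: List[str]  # <result>: space-separated segmentation of the word
    readings: List[str]  # <read>: space-separated readings (assumed one per segment)
    pos: str             # <class>: part-of-speech label for the segments


def parse_user_dict(lines: Iterable[str]) -> List[UserDictEntry]:
    """Parse lines of the form <word>,<result>,<read>,<class>."""
    entries = []
    for raw in lines:
        line = raw.strip()
        # Assumption: blank lines and lines starting with '#' carry no entry.
        if not line or line.startswith("#"):
            continue
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 4:
            raise ValueError(f"expected 4 comma-separated fields: {line!r}")
        word, result, read, pos = fields
        segments = result.split()
        readings = read.split()
        if len(segments) != len(readings):
            raise ValueError(f"segment/reading count mismatch: {line!r}")
        entries.append(UserDictEntry(word, segments, readings, pos))
    return entries


if __name__ == "__main__":
    sample = ["日本経済新聞,日本 経済 新聞,ニホン ケイザイ シンブン,カスタム名詞"]
    for entry in parse_user_dict(sample):
        print(entry.surface, entry.segments, entry.readings, entry.pos)
```

Each entry maps a surface form to its space-separated segmentation, which is why `日本経済新聞` is expected to come back as three tokens once such a definition is registered.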
    
    ## What type of PR is it?
    
    Improvement
    
    ## What is the Jira issue?
    
    https://issues.apache.org/jira/browse/HIVEMALL-130
    
    ## How was this patch tested?
    
    Manually tested both locally and on EMR with the Kuromoji official sample file:
    
    ```
    hive> select
        >   tokenize_ja("日本経済新聞", "normal"),
        >   tokenize_ja("日本経済新聞", "normal", array(), array(), "https://raw.githubusercontent.com/atilika/kuromoji/909fd6b32bf4e9dc86b7599de5c9b50ca8f004a1/kuromoji-core/src/test/res/userdict.txt"),
        >   tokenize_ja("日本経済新聞", "normal", array(), array(), array("日本経済新聞,日本 経済 新聞,ニホン ケイザイ シンブン,カスタム名詞", "関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,テスト名詞"))
        > ;
    OK
    _c0     _c1     _c2
    ["日本経済新聞"]        ["日本","経済","新聞"]  ["日本","経済","新聞"]
    Time taken: 2.094 seconds, Fetched: 1 row(s)
    ```
    
    ## How to use this feature?
    
    As shown above.
    
    ## Checklist
    
    - [x] Did you apply source code formatter, i.e., `mvn formatter:format`, for your commit?

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/takuti/incubator-hivemall kuromoji-user-dict

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-hivemall/pull/97.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #97
    
----
commit 9f6eecb3ccb53e5a6ed8e4772b3be23c310e9f10
Author: Takuya Kitazawa <k.tak...@gmail.com>
Date:   2017-07-07T05:50:50Z

    stopWords takes nullable array

commit 44cb2bf7675c190e6d86e3df5d99e23755ae9bcd
Author: Takuya Kitazawa <k.tak...@gmail.com>
Date:   2017-07-07T08:08:56Z

    Support user dictionary specified via URL of plain text (CSV)
    
    Unit test updated accordingly

commit d05be59c868d78c6418bd251b65293eac71c0642
Author: Takuya Kitazawa <k.tak...@gmail.com>
Date:   2017-07-07T08:36:00Z

    Support gzip compressed connection

commit be9574fc3ab0d28b142e0cdc62f882221c948e06
Author: Takuya Kitazawa <k.tak...@gmail.com>
Date:   2017-07-10T01:50:27Z

    Refactor InputStream/GZIPInputStream getter

commit a339a257846bca21c5d041a78c6d82ba3760e9a6
Author: Takuya Kitazawa <k.tak...@gmail.com>
Date:   2017-07-10T02:33:54Z

    Set timeout

commit 1d0d17adc86ae3250b05d3262989834c45e0835c
Author: Takuya Kitazawa <k.tak...@gmail.com>
Date:   2017-07-10T02:43:09Z

    Use BoundedInputStream to ignore huge file

commit bd1c7515837ed8afd03f4fe9c0ef2b2e5505dbe1
Author: Takuya Kitazawa <k.tak...@gmail.com>
Date:   2017-07-10T03:42:07Z

    Create HttpUtils for http-connection-related code

commit f4459037afc57ec60b00b57926945e653b1a4f36
Author: Takuya Kitazawa <k.tak...@gmail.com>
Date:   2017-07-10T05:28:52Z

    Support gzipped file url

commit 9b91ebbe3f90674b859c3cc645ababe75a21904d
Author: Takuya Kitazawa <k.tak...@gmail.com>
Date:   2017-07-10T09:03:00Z

    Refactor

commit 1e32e65eb3f30520f98ae1915d3d5569962c84a4
Author: Takuya Kitazawa <k.tak...@gmail.com>
Date:   2017-07-11T02:06:13Z

    Support creating user dictionary via `const array<string>`

commit e39016866bee9868404325a19e8775fb64a836c9
Author: Takuya Kitazawa <k.tak...@gmail.com>
Date:   2017-07-11T03:25:10Z

    Refactor
    
    Esp. argOIs handling in initialize()

commit 6d5f9c9ca81d4388fc0a67b80ade13f3732ebfc3
Author: Takuya Kitazawa <k.tak...@gmail.com>
Date:   2017-07-11T04:16:08Z

    Check response code

commit 895d3a0a524c3362bdf8bd0a1b114a75179cb4b1
Author: Takuya Kitazawa <k.tak...@gmail.com>
Date:   2017-07-11T04:48:21Z

    Update tokenizer doc

commit d92697e682a6e897ec26b101566a1ac0005ae124
Author: Takuya Kitazawa <k.tak...@gmail.com>
Date:   2017-07-11T04:49:12Z

    Format

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---
