GitHub user takuti opened a pull request:
https://github.com/apache/incubator-hivemall/pull/83
[HIVEMALL-109][HIVEMALL-112] Fix topic model and tokenize UDFs
## What changes were proposed in this pull request?
#82
- Topic model: `train_plsa` and `train_lda`
  - Fix bugs caused by multi-byte input
  - Fix the wrong `recordBytes` calculation used when iterating over training samples via file I/O
  - Refactor and update unit tests accordingly
- `tokenize()`
  - Support NULL input; the UDF simply returns NULL
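To illustrate why multi-byte input can break a record size calculation, here is a minimal sketch in Java. It is not the code from this patch: the class `RecordSizeSketch`, its methods, and the record layout (an `int` count, then for each string an `int` length followed by 2 bytes per `char`) are assumptions made for illustration only. The point is that the buffer size has to be derived from the same unit that is actually written, since `String.length()` and the UTF-8 byte length disagree for non-ASCII text.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; names and record layout are assumptions, not Hivemall's code.
final class RecordSizeSketch {

    /** Size of one record: int count, then per string an int length plus 2 bytes per char. */
    static int recordBytes(String[] features) {
        int bytes = Integer.BYTES; // number of strings in the record
        for (String f : features) {
            bytes += Integer.BYTES + Character.BYTES * f.length();
        }
        return bytes;
    }

    /** Writes the record as chars, so the size above matches what is written. */
    static ByteBuffer writeRecord(String[] features) {
        ByteBuffer buf = ByteBuffer.allocate(recordBytes(features));
        buf.putInt(features.length);
        for (String f : features) {
            buf.putInt(f.length());
            for (int i = 0; i < f.length(); i++) {
                buf.putChar(f.charAt(i));
            }
        }
        buf.flip(); // switch the buffer from writing to reading
        return buf;
    }

    /** Reads a record back using the same layout. */
    static String[] readRecord(ByteBuffer buf) {
        int n = buf.getInt();
        String[] out = new String[n];
        for (int i = 0; i < n; i++) {
            char[] chars = new char[buf.getInt()];
            for (int j = 0; j < chars.length; j++) {
                chars[j] = buf.getChar();
            }
            out[i] = new String(chars);
        }
        return out;
    }

    public static void main(String[] args) {
        // Second feature starts with three hiragana characters (written as escapes).
        String[] features = {"apple:1", "\u308A\u3093\u3054:2"};
        String multiByte = features[1];
        System.out.println("chars:       " + multiByte.length());
        System.out.println("UTF-8 bytes: " + multiByte.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("recordBytes: " + recordBytes(features));
        System.out.println("round trip:  " + String.join(", ", readRecord(writeRecord(features))));
    }
}
```

Sizing the buffer by character count while writing encoded bytes (or the reverse) would make the computed size differ from the written size as soon as the input contains non-ASCII text, which is the general class of bug the list above describes.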
## What type of PR is it?
Bug Fix
## What is the Jira issue?
- https://issues.apache.org/jira/browse/HIVEMALL-109
- https://issues.apache.org/jira/browse/HIVEMALL-112
## How was this patch tested?
- Unit tests
- Manual tests on EMR
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/takuti/incubator-hivemall fix-topicmodel
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/incubator-hivemall/pull/83.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #83
----
commit 988666a58801e1cf62b0c91c5815e973084ba972
Author: Takuya Kitazawa <[email protected]>
Date: 2017-06-02T07:40:12Z
Fix multi-byte-related issue in topic model UDFs
and validate it as unit test
commit b08f73aed98064059773ba8c2342814d03b991ff
Author: Takuya Kitazawa <[email protected]>
Date: 2017-06-02T08:07:19Z
Use `char`s instead of `byte`s
commit c1239fe7938a724147554d0c1c769ec7c3025013
Author: Takuya Kitazawa <[email protected]>
Date: 2017-06-02T08:24:20Z
Fix record bytes calculation
commit accee7a938c8034bd3c2a250bbdd27d57871092d
Author: Takuya Kitazawa <[email protected]>
Date: 2017-06-02T09:15:53Z
Use NIOUtils for writing strings to a byte buffer
commit ceff765de725cddc5e9f556433ab76272e4d9720
Author: Takuya Kitazawa <[email protected]>
Date: 2017-06-02T09:52:25Z
Fix record size related to iteration using temporary file
Since now iteration works correctly, manual for-loops are removed from
unit tests.
commit e9ec0f31ea2a6b5b67c89a141be197a734f66567
Author: Takuya Kitazawa <[email protected]>
Date: 2017-06-02T10:06:45Z
Fix `tokenize` for null input
commit dda972405c893277edb13add5fc2b4e7a5a96d83
Author: Takuya Kitazawa <[email protected]>
Date: 2017-06-02T11:35:20Z
Refactor on `recordTrainSampleToTempFile`
----