GitHub user helenahm opened a pull request:
https://github.com/apache/incubator-hivemall/pull/93
Maximum Entropy Model
## What changes were proposed in this pull request?
A Distributed Max Entropy Model
## What type of PR is it?
Feature
## What is the Jira issue?
?
## How was this patch tested?
There are two tests at the moment,
hivemall.smile.classification.MaxEntUDTFTest.java
and hivemall.smile.tools.TreePredictUDFTest.java
plus I have tested the code on EMR:
add jar hivemall-core-0.4.2-rc.2-maxent-with-dependencies.jar;
add jar opennlp-maxent-3.0.0.jar;
source define-all.hive;
create temporary function train_maxent_classifier as
'hivemall.smile.classification.MaxEntUDTF';
create temporary function predict_maxent_classifier as
'hivemall.smile.tools.MaxEntPredictUDF';
drop table tmodel_maxent;
CREATE TABLE tmodel_maxent
STORED AS SEQUENCEFILE
AS
select
train_maxent_classifier(features, klass, "-attrs Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,Q,Q,Q,Q,Q,Q,Q,Q")
from
t_test_maxent;
create table tmodel_combined as
select model, attributes, features, klass from t_test_maxent join
tmodel_maxent;
create table tmodel_predicted as
select
predict_maxent_classifier(model, attributes, features) result, klass from
tmodel_combined;
Source table:
drop table t_test_maxent;
create table t_test_maxent as select
array(
x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16,x17,x18,x19,x20,x21,x22,x23,x24,x25,x26,x27,x28,x29,x30,x31,x32,x33,x34,x35,x36,
cast(tWord(x37) as double),
cast(tWord(x38) as double),
cast(tWord(x39) as double),
cast(tWord(x40) as double),
cast(tWord(x41) as double),
cast(tWord(x42) as double),
cast(tWord(x43) as double),
cast(tWord(x44) as double),
cast(contentWord(x45) as double),
cast(contentWord(x46) as double),
cast(contentWord(x47) as double),
cast(contentWord(x48) as double),
cast(contentWord(x49) as double),
cast(contentWord(x50) as double),
cast(contentWord(x51) as double),
cast(contentWord(x52) as double),
cast(contentWord(x53) as double),
cast(presentationWord(x54) as double),
cast(presentationWord(x55) as double),
cast(presentationWord(x56) as double),
cast(presentationWord(x57) as double),
cast(presentationWord(x58) as double),
cast(presentationWord(x59) as double),
cast(presentationWord(x60) as double),
cast(presentationWord(x61) as double),
cast(presentationWord(x62) as double),
x63,x64,x65,x66,x67,x68,x69,x70) features
, klass from pdfs_and_tiffs_instances_combined_instances where
regexp_replace(tp, 'T', '') == '76_698_855_347';
## How to use this feature?
The Maximum Entropy classifier is, from my point of view, the most useful
classification technique for many NLP tasks and for many tasks unrelated to
NLP. It is used for part-of-speech tagging, named entity recognition (NER),
and other tasks.
I have been searching for a distributed version of it and found only one
article that discusses it: "Efficient Large Scale Distributed Training of
Conditional Maximum Entropy Models" by Mehryar Mohri (quite well known) and his
colleagues at Google. (Please let me know how I can send you the article if
you cannot find it by googling.) So I think it is time to implement it. I
plan to use the Mixture Weight Method they describe.
The final UDAF, the one that collects all the models and averages their
weights, is still to be implemented; I plan to commit it next week.
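
To illustrate the averaging step, here is a minimal sketch of what combining the per-partition models could look like. It is not the code in this pull request: the class name `MixtureWeights` and the methods `average` and `mix` are hypothetical, and it assumes each partition's model has already been flattened into a dense `double[]` with the same parameter layout. With uniform coefficients, the combination reduces to a plain average of the weight vectors.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Hivemall or OpenNLP maxent.
public class MixtureWeights {

    /** Uniform mixture: each of the p partition models gets coefficient 1/p. */
    public static double[] average(List<double[]> partitionWeights) {
        int p = partitionWeights.size();
        double[] mu = new double[p];
        Arrays.fill(mu, 1.0 / p);
        return mix(partitionWeights, mu);
    }

    /** General mixture: result[j] = sum over k of mu[k] * partitionWeights[k][j]. */
    public static double[] mix(List<double[]> partitionWeights, double[] mu) {
        if (partitionWeights.isEmpty()) {
            throw new IllegalArgumentException("no models to combine");
        }
        if (mu.length != partitionWeights.size()) {
            throw new IllegalArgumentException("one coefficient per model is required");
        }
        int dim = partitionWeights.get(0).length;
        double[] mixed = new double[dim];
        for (int k = 0; k < partitionWeights.size(); k++) {
            double[] w = partitionWeights.get(k);
            // Assumption: every partition uses the same parameter layout.
            if (w.length != dim) {
                throw new IllegalArgumentException("weight vectors differ in length");
            }
            for (int j = 0; j < dim; j++) {
                mixed[j] += mu[k] * w[j];
            }
        }
        return mixed;
    }

    public static void main(String[] args) {
        // Two toy "models" trained on two partitions.
        List<double[]> models = Arrays.asList(
                new double[] {1.0, 2.0},
                new double[] {3.0, 6.0});
        System.out.println(Arrays.toString(average(models))); // prints [2.0, 4.0]
    }
}
```

In an actual UDAF the same accumulation would presumably be done incrementally, keeping a running sum and a model count in the aggregation buffer rather than holding every model in a list.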
Please see whether you like the idea and would accept the code. It is based on
Apache OpenNLP maxent, which is open source and written in a simple way.
Regards,
Elena.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/helenahm/incubator-hivemall master
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/incubator-hivemall/pull/93.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #93
----
commit 45a656aa7278066ce3fc36fcd81fb1eca11f1079
Author: helenahm <[email protected]>
Date: 2017-06-02T05:10:13Z
Update LDAUDTFTest.java
commit fef9c1ce719d3924a28cc90d71d40728dc5c7563
Author: helenahm <[email protected]>
Date: 2017-06-02T05:22:54Z
Merge pull request #1 from helenahm/helenahm-patch-1
Update LDAUDTFTest.java
commit e92b13aa3cb4fc193ea3da3fadd8a8fe8a6a073b
Author: AKHMATOVA, Elena <[email protected]>
Date: 2017-07-02T03:41:14Z
maxent
commit d4031550f80007045353f1e24e58c99244ab3db3
Author: AKHMATOVA, Elena <[email protected]>
Date: 2017-07-02T03:49:16Z
maxent cont.
commit f921d91fe8a1958cfd198236219c129355ef2fea
Author: AKHMATOVA, Elena <[email protected]>
Date: 2017-07-02T03:54:38Z
maxent cont.
commit 2a712edfd9bbe765bb2781f84b519e283fe6bd56
Author: helenahm <[email protected]>
Date: 2017-07-02T03:59:59Z
Update LDAUDTFTest.java
----