GitHub user helenahm opened a pull request:

    https://github.com/apache/incubator-hivemall/pull/93

    Maximum Entropy Model

    ## What changes were proposed in this pull request?
    
    A Distributed Max Entropy Model
    
    ## What type of PR is it?
    
    Feature
    
    ## What is the Jira issue?
    
    ?
    
    ## How was this patch tested?
    
    There are two tests at the moment,
    hivemall.smile.classification.MaxEntUDTFTest.java
    and hivemall.smile.tools.TreePredictUDFTest.java
    
    plus I have tested the code on EMR:
    
    add jar hivemall-core-0.4.2-rc.2-maxent-with-dependencies.jar;
    add jar opennlp-maxent-3.0.0.jar;
    source define-all.hive;
    create temporary function train_maxent_classifier as 'hivemall.smile.classification.MaxEntUDTF';
    create temporary function predict_maxent_classifier as 'hivemall.smile.tools.MaxEntPredictUDF';
    drop table tmodel_maxent;
    CREATE TABLE tmodel_maxent 
    STORED AS SEQUENCEFILE 
    AS
    select 
      train_maxent_classifier(features, klass, "-attrs Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,Q,Q,Q,Q,Q,Q,Q,Q")
    from
      t_test_maxent;
    
    create table tmodel_combined as
    select model, attributes, features, klass from t_test_maxent join tmodel_maxent;
    
    create table tmodel_predicted as
    select
    predict_maxent_classifier(model, attributes, features) result, klass from tmodel_combined;
    
    Source table:
    drop table t_test_maxent;
    create table t_test_maxent as select
    array(
    x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16,x17,x18,x19,x20,x21,x22,x23,x24,x25,x26,x27,x28,x29,x30,x31,x32,x33,x34,x35,x36,
    cast(tWord(x37) as double),
    cast(tWord(x38) as double),
    cast(tWord(x39) as double),
    cast(tWord(x40) as double),
    cast(tWord(x41) as double),
    cast(tWord(x42) as double),
    cast(tWord(x43) as double),
    cast(tWord(x44) as double),
    cast(contentWord(x45) as double),
    cast(contentWord(x46) as double),
    cast(contentWord(x47) as double),
    cast(contentWord(x48) as double),
    cast(contentWord(x49) as double),
    cast(contentWord(x50) as double),
    cast(contentWord(x51) as double),
    cast(contentWord(x52) as double),
    cast(contentWord(x53) as double),
    cast(presentationWord(x54) as double),
    cast(presentationWord(x55) as double),
    cast(presentationWord(x56) as double),
    cast(presentationWord(x57) as double),
    cast(presentationWord(x58) as double),
    cast(presentationWord(x59) as double),
    cast(presentationWord(x60) as double),
    cast(presentationWord(x61) as double),
    cast(presentationWord(x62) as double),
    x63,x64,x65,x66,x67,x68,x69,x70) features
    , klass from pdfs_and_tiffs_instances_combined_instances
    where regexp_replace(tp, 'T', '') == '76_698_855_347';
    
    
    ## How to use this feature?
    
    The Maximum Entropy classifier is, from my point of view, the most useful
    classification technique for many NLP tasks and for many tasks unrelated to
    NLP. It is used for part-of-speech tagging, NER, and other tasks.
    
    I have been searching for a distributed version of it and found only one
    article that discusses it: "Efficient Large Scale Distributed Training of
    Conditional Maximum Entropy Models" by Mehryar Mohri [quite well-known] and
    his colleagues at Google. (Please let me know how I can send you the article
    if you cannot find it by googling.) Thus, I think it is time to implement
    it. I plan to use the Mixture Weight Method they describe.
    
    A final UDAF (the one that collects all the models and averages their
    weights) is still to be implemented; I plan to commit it next week.
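
The averaging step mentioned above could look roughly like the following. This is only a minimal sketch, not the UDAF planned for this pull request: the class name `MixtureWeightSketch` and the methods `mix` and `average` are hypothetical, and it assumes every partition's model is a dense weight vector over the same feature index. With uniform mixture weights, each of the p partition models contributes 1/p of the final weights.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; not the UDAF from this pull request. */
final class MixtureWeightSketch {

    /** Combines per-partition weight vectors as w = sum_k mu[k] * w_k. */
    static double[] mix(List<double[]> models, double[] mu) {
        if (models.isEmpty() || models.size() != mu.length) {
            throw new IllegalArgumentException("need one mixture weight per model");
        }
        int dim = models.get(0).length;
        double[] mixed = new double[dim];
        for (int k = 0; k < models.size(); k++) {
            double[] w = models.get(k);
            // Assumption: all models index their features identically.
            if (w.length != dim) {
                throw new IllegalArgumentException("models must share one feature index");
            }
            for (int i = 0; i < dim; i++) {
                mixed[i] += mu[k] * w[i];
            }
        }
        return mixed;
    }

    /** Uniform mixture: every partition's model gets weight 1/p. */
    static double[] average(List<double[]> models) {
        double[] mu = new double[models.size()];
        Arrays.fill(mu, 1.0 / models.size());
        return mix(models, mu);
    }

    public static void main(String[] args) {
        List<double[]> models = Arrays.asList(
            new double[] {1.0, 2.0},
            new double[] {3.0, 4.0});
        System.out.println(Arrays.toString(average(models))); // prints [2.0, 3.0]
    }
}
```

A real UDAF would additionally have to deserialize each partition's model and align feature names before mixing; that part is left out here.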
    
    See if you like the idea and will accept the code. It is based on Apache
    OpenNLP maxent, which is open source and written in a simple way.
    
    Regards,
    Elena.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/helenahm/incubator-hivemall master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-hivemall/pull/93.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #93
    
----
commit 45a656aa7278066ce3fc36fcd81fb1eca11f1079
Author: helenahm <[email protected]>
Date:   2017-06-02T05:10:13Z

    Update LDAUDTFTest.java

commit fef9c1ce719d3924a28cc90d71d40728dc5c7563
Author: helenahm <[email protected]>
Date:   2017-06-02T05:22:54Z

    Merge pull request #1 from helenahm/helenahm-patch-1
    
    Update LDAUDTFTest.java

commit e92b13aa3cb4fc193ea3da3fadd8a8fe8a6a073b
Author: AKHMATOVA, Elena <[email protected]>
Date:   2017-07-02T03:41:14Z

    maxent

commit d4031550f80007045353f1e24e58c99244ab3db3
Author: AKHMATOVA, Elena <[email protected]>
Date:   2017-07-02T03:49:16Z

    maxent cont.

commit f921d91fe8a1958cfd198236219c129355ef2fea
Author: AKHMATOVA, Elena <[email protected]>
Date:   2017-07-02T03:54:38Z

    maxent cont.

commit 2a712edfd9bbe765bb2781f84b519e283fe6bd56
Author: helenahm <[email protected]>
Date:   2017-07-02T03:59:59Z

    Update LDAUDTFTest.java

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
