[ https://issues.apache.org/jira/browse/HIVE-672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zheng Shao updated HIVE-672:
----------------------------
Attachment: weka.jar
HIVE-672.2.not.to.be.included.patch
HIVE-672.2.not.to.be.included.patch is the patch.
weka.jar should be put into contrib/lib.
There are test cases in the patch to show how to use the new functions.
> Integrate weka with Hive
> ------------------------
>
> Key: HIVE-672
> URL: https://issues.apache.org/jira/browse/HIVE-672
> Project: Hadoop Hive
> Issue Type: New Feature
> Reporter: Zheng Shao
> Assignee: Zheng Shao
> Attachments: HIVE-672.1.not.to.be.included.patch,
> HIVE-672.2.not.to.be.included.patch, weka.jar
>
>
> Weka is one of the most popular data mining packages on the planet. It's used
> by numerous people around the world. Since Weka is written in Java, it should
> be pretty straightforward to integrate Weka with Hive.
> We just need to create some GenericUDAF functions that map to the Weka
> classifier training process. The output of each GenericUDAF can simply be the
> serialized form of the trained classifier.
> We should also add a GenericUDF that loads a serialized classifier and uses
> it to classify new instances.
> The Hive syntax can be as simple as this: (Note: In the example below, most
> of the "table." prefixes can be omitted. I put them there just to make the
> query semantics easier to understand.)
> The first query builds a model (logistic regression) for predicting the CTR
> of each link on each page, based on user information. The second query
> evaluates the model on some data.
> {code}
> -- Query 1: train one logistic regression model per (pageid, linkid).
> SELECT logdata.pageid, logdata.linkid,
>   LogisticRegression(logdata.clicked, userinfo.age, userinfo.gender,
>     userinfo.country, userinfo.interests) AS model
> FROM logdata JOIN userinfo
> ON logdata.userid = userinfo.userid
> GROUP BY logdata.pageid, logdata.linkid;
> -- Query 2: apply each model to the matching rows. This assumes the output
> -- of query 1 is available as a table named classifiers.
> SELECT logdata.pageid, logdata.linkid, logdata.clicked,
>   LogisticRegressionEvaluate(classifiers.model, userinfo.age, userinfo.gender,
>     userinfo.country, userinfo.interests) AS predicted
> FROM logdata JOIN userinfo
> ON logdata.userid = userinfo.userid
> JOIN classifiers
> ON logdata.pageid = classifiers.pageid AND logdata.linkid = classifiers.linkid;
> {code}
> References:
> Use Weka in your Java Code:
> http://weka.wiki.sourceforge.net/Use+Weka+in+your+Java+code
> Note:
> Weka is licensed under the GPL. We won't be able to include the code directly
> in Hive, but we can keep the discussion here.
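
The train-then-load flow described in the issue (an aggregate function emits a serialized classifier, and a second function loads it to classify new rows) can be sketched in plain Java. This is only an illustration under stated assumptions: ToyLogisticModel is a hypothetical stand-in and not a Weka class, the class and method names are made up for this example, and it assumes the trained classifier implements java.io.Serializable. It uses no Hive or Weka APIs, so it does not show how the patch itself is implemented.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class ModelRoundTrip {

    // Hypothetical stand-in for a trained classifier; NOT a Weka class.
    static class ToyLogisticModel implements Serializable {
        private static final long serialVersionUID = 1L;
        final double bias;
        final double[] weights;

        ToyLogisticModel(double bias, double[] weights) {
            this.bias = bias;
            this.weights = weights.clone();
        }

        // Logistic function applied to the linear score of the features.
        double predict(double[] features) {
            double score = bias;
            for (int i = 0; i < weights.length; i++) {
                score += weights[i] * features[i];
            }
            return 1.0 / (1.0 + Math.exp(-score));
        }
    }

    // Roughly what the aggregate function would emit as its result.
    static byte[] serialize(Serializable model) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(model);
        }
        return bytes.toByteArray();
    }

    // Roughly what the evaluation function would do before classifying a row.
    static Object deserialize(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        ToyLogisticModel trained = new ToyLogisticModel(-1.0, new double[] {0.5, 2.0});
        byte[] stored = serialize(trained);            // the "model" column value
        ToyLogisticModel loaded = (ToyLogisticModel) deserialize(stored);
        double[] row = {1.0, 0.25};                    // made-up feature values
        // The loaded model should give the same prediction as the original.
        System.out.println(trained.predict(row) == loaded.predict(row));
    }
}
```

In a real integration the byte array would be the value of the model column produced by the first query, and the evaluation function would deserialize it (ideally once per distinct model rather than once per row) before scoring.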