[
https://issues.apache.org/jira/browse/FLINK-1727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597355#comment-14597355
]
ASF GitHub Bot commented on FLINK-1727:
---------------------------------------
GitHub user sachingoel0101 opened a pull request:
https://github.com/apache/flink/pull/861
[Flink-2030][ml]Online Histogram: Discrete and Categorical
This implements online histograms for both categorical and continuous
data. For continuous data, we emulate a continuous probability distribution
that supports finding the cumulative sum up to a particular value and finding
the value at a given cumulative probability (quantiles).
For categorical fields, we emulate a probability mass function which
supports finding the probability associated with every class.
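
As an illustration only (not the code in this pull request), a categorical
histogram of this kind can be sketched as a table of per-class counts that is
normalized on demand. The class name `CategoricalHistogram` and its method
names are hypothetical, chosen for this sketch.

```python
from collections import Counter


class CategoricalHistogram:
    """Hypothetical sketch: an empirical probability mass function over classes."""

    def __init__(self):
        # Maps each class label to the number of times it has been seen.
        self.counts = Counter()

    def add(self, label, count=1):
        """Record `count` more observations of `label`."""
        self.counts[label] += count

    def total(self):
        """Total number of observations seen so far."""
        return sum(self.counts.values())

    def probability(self, label):
        """Relative frequency of `label`; 0.0 if nothing has been added yet."""
        total = self.total()
        return self.counts[label] / total if total else 0.0

    def pmf(self):
        """Probability associated with every class seen so far."""
        total = self.total()
        return {label: c / total for label, c in self.counts.items()}
```

Because only counts are stored, each update is a single dictionary increment,
and the probabilities always reflect everything seen so far.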
The continuous histogram follows this paper:
http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf
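
A minimal sketch of the idea in that paper, again not the code in this pull
request: the histogram keeps at most a fixed number of (centroid, count) bins,
merges the two closest centroids when a new point would exceed that limit, and
estimates cumulative sums by trapezoidal interpolation between neighbouring
bins. The names `OnlineHistogram`, `cumulative_count` and `quantile` are
hypothetical. As simplifications, the quantile is found by bisection on the
cumulative sum rather than in closed form, and values outside the range of
centroids are clamped to 0 or to the total count.

```python
import bisect


class OnlineHistogram:
    """Hypothetical sketch of a fixed-size streaming histogram."""

    def __init__(self, max_bins):
        if max_bins < 2:
            raise ValueError("max_bins must be at least 2")
        self.max_bins = max_bins
        self.bins = []  # sorted list of [centroid, count]

    def add(self, x):
        """Insert one point, merging the two closest bins if over capacity."""
        centroids = [p for p, _ in self.bins]
        i = bisect.bisect_left(centroids, x)
        if i < len(self.bins) and self.bins[i][0] == x:
            self.bins[i][1] += 1
        else:
            self.bins.insert(i, [float(x), 1])
        if len(self.bins) > self.max_bins:
            self._merge_closest()

    def _merge_closest(self):
        # Replace the adjacent pair with the smallest gap by its weighted mean.
        gaps = [self.bins[i + 1][0] - self.bins[i][0]
                for i in range(len(self.bins) - 1)]
        i = gaps.index(min(gaps))
        (p1, m1), (p2, m2) = self.bins[i], self.bins[i + 1]
        self.bins[i:i + 2] = [[(p1 * m1 + p2 * m2) / (m1 + m2), m1 + m2]]

    def total(self):
        return sum(m for _, m in self.bins)

    def cumulative_count(self, b):
        """Estimated number of points less than or equal to b."""
        if not self.bins or b < self.bins[0][0]:
            return 0.0  # simplification: clamp below the first centroid
        if b >= self.bins[-1][0]:
            return float(self.total())  # simplification: clamp above the last
        centroids = [p for p, _ in self.bins]
        i = bisect.bisect_right(centroids, b) - 1  # p_i <= b < p_{i+1}
        (p_i, m_i), (p_j, m_j) = self.bins[i], self.bins[i + 1]
        frac = (b - p_i) / (p_j - p_i)
        m_b = m_i + (m_j - m_i) * frac       # interpolated bin height at b
        inside = (m_i + m_b) / 2 * frac      # trapezoid area between p_i and b
        return sum(m for _, m in self.bins[:i]) + m_i / 2 + inside

    def quantile(self, q):
        """Approximate value below which a fraction q of the points fall."""
        if not self.bins:
            raise ValueError("histogram is empty")
        lo, hi = self.bins[0][0], self.bins[-1][0]
        target = q * self.total()
        for _ in range(60):  # bisection on the monotone cumulative sum
            mid = (lo + hi) / 2
            if self.cumulative_count(mid) < target:
                lo = mid
            else:
                hi = mid
        return hi
```

The bin limit bounds memory regardless of how many points are added; a smaller
limit trades accuracy of the cumulative sums and quantiles for space.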
Note: This is a sub-task of
https://issues.apache.org/jira/browse/FLINK-1727 which already has a PR pending
review at https://github.com/apache/flink/pull/710.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/sachingoel0101/flink online_histogram
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/861.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #861
----
commit ec50b4bb4faf91570724b4aa79783936d0a9487f
Author: Sachin Goel <[email protected]>
Date: 2015-06-23T08:40:57Z
Online Histogram: Discrete and Categorical, Test Suites included
----
> Add decision tree to machine learning library
> ---------------------------------------------
>
> Key: FLINK-1727
> URL: https://issues.apache.org/jira/browse/FLINK-1727
> Project: Flink
> Issue Type: New Feature
> Components: Machine Learning Library
> Reporter: Till Rohrmann
> Assignee: Sachin Goel
> Labels: ML
>
> Decision trees are widely used for classification and regression tasks. Thus,
> it would be worthwhile to add support for them to Flink's machine learning
> library.
> A streaming parallel decision tree learning algorithm has been proposed by
> Ben-Haim and Tom-Tov [1]. This could perhaps be adapted to a batch use case
> as well. [2] contains an overview of different techniques for scaling up
> inductive learning algorithms. A presentation of Spark's MLlib decision tree
> implementation can be found in [3].
> Resources:
> [1] [http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf]
> [2]
> [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.46.8226&rep=rep1&type=pdf]
> [3]
> [http://spark-summit.org/wp-content/uploads/2014/07/Scalable-Distributed-Decision-Trees-in-Spark-Made-Das-Sparks-Talwalkar.pdf]
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)