[
https://issues.apache.org/jira/browse/DATAFU-2?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matthew Hayes updated DATAFU-2:
-------------------------------
Description:
Jian Wang has suggested that we add UDFs for entropy and weighted random
sampling and has implementations for each of these ready.
In Jian's words:
"In the real world, there are occasions when we need to calculate the entropy
of discrete random variables, for instance, to calculate the mutual information
between variables X and Y using its entropy-based formula (the mutual
information calculation can be found at
http://en.wikipedia.org/wiki/Mutual_information#Relation_to_other_quantities).
I would suggest implementing a UDF to calculate the entropy of given input
samples, following the definition at
http://en.wikipedia.org/wiki/Entropy_%28information_theory%29
This is the reference paper I used to learn about the weighted sampling
algorithm: http://utopia.duth.gr/~pefraimi/research/data/2007EncOfAlg.pdf
The present WeightedSample.java implements Algorithm D.
We may try Algorithm A, A-res and A-expJ since they could be used in a data
stream and distributed environment. These algorithms could be implemented based
on ReservoirSample.java (inherit from this class?) since they also need a
reservoir to store the selected items."
was:
Jian Wang has suggested that we add UDFs for entropy and weighted random
sampling and has implementations for each of these ready.
In his words:
"In the real world, there are occasions when we need to calculate the entropy
of discrete random variables, for instance, to calculate the mutual information
between variables X and Y using its entropy-based formula (the mutual
information calculation can be found at
http://en.wikipedia.org/wiki/Mutual_information#Relation_to_other_quantities).
I would suggest implementing a UDF to calculate the entropy of given input
samples, following the definition at
http://en.wikipedia.org/wiki/Entropy_%28information_theory%29
This is the reference paper I used to learn about the weighted sampling
algorithm: http://utopia.duth.gr/~pefraimi/research/data/2007EncOfAlg.pdf
The present WeightedSample.java implements Algorithm D.
We may try Algorithm A, A-res and A-expJ since they could be used in a data
stream and distributed environment. These algorithms could be implemented based
on ReservoirSample.java (inherit from this class?) since they also need a
reservoir to store the selected items."
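
For reference, the entropy calculation mentioned in the description can be
sketched in plain Java as below. This is only an illustration of the empirical
Shannon entropy H = -sum(p_i * log2(p_i)) over the observed value frequencies.
It is not Jian's implementation and not an existing DataFu UDF; the class name
EntropySketch and the method signature are hypothetical, and a real UDF would
have to be wrapped in Pig's UDF interfaces.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of DataFu.
public class EntropySketch {

    /** Empirical Shannon entropy, in bits, of a list of discrete samples. */
    public static <T> double entropy(List<T> samples) {
        if (samples.isEmpty()) {
            return 0.0; // convention chosen here: no samples means zero entropy
        }
        // Count how often each distinct value occurs.
        Map<T, Long> counts = new HashMap<>();
        for (T s : samples) {
            counts.merge(s, 1L, Long::sum);
        }
        double n = samples.size();
        double h = 0.0;
        for (long c : counts.values()) {
            double p = c / n;                        // empirical probability
            h -= p * (Math.log(p) / Math.log(2.0));  // log base 2 gives bits
        }
        return h;
    }
}
```

With this sketch, a sample in which two values occur equally often has an
entropy of about 1 bit, and a sample with a single repeated value has an
entropy of 0.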
> UDFs for entropy and weighted sampling algorithms
> -------------------------------------------------
>
> Key: DATAFU-2
> URL: https://issues.apache.org/jira/browse/DATAFU-2
> Project: DataFu
> Issue Type: Task
> Reporter: Matthew Hayes
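
As a rough illustration of the reservoir-based weighted sampling idea in the
description, the sketch below follows Algorithm A-Res as I understand it from
the referenced paper: each item with weight w gets a random key u^(1/w), with
u drawn uniformly from [0, 1), and the m items with the largest keys are kept.
This is an assumption-laden sketch, not the existing WeightedSample.java or
ReservoirSample.java; the class WeightedReservoirSketch and its method are
hypothetical names, and skipping non-positive weights is a choice made here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Random;

// Hypothetical illustration only; not part of DataFu.
public class WeightedReservoirSketch {

    /** An item paired with its random key. */
    private static final class Keyed<T> {
        final T item;
        final double key;

        Keyed(T item, double key) {
            this.item = item;
            this.key = key;
        }
    }

    /** Draws up to m items without replacement in a single pass. */
    public static <T> List<T> sample(List<T> items, double[] weights, int m, Random rng) {
        if (items.size() != weights.length) {
            throw new IllegalArgumentException("items and weights must have the same length");
        }
        // Min-heap on the key, so the smallest key in the reservoir is at the head.
        PriorityQueue<Keyed<T>> reservoir =
            new PriorityQueue<>(Comparator.comparingDouble((Keyed<T> k) -> k.key));
        for (int i = 0; i < items.size(); i++) {
            double w = weights[i];
            if (w <= 0) {
                continue; // assumption: items with non-positive weight are never selected
            }
            double key = Math.pow(rng.nextDouble(), 1.0 / w);
            if (reservoir.size() < m) {
                reservoir.add(new Keyed<>(items.get(i), key));
            } else if (m > 0 && key > reservoir.peek().key) {
                reservoir.poll(); // evict the current smallest key
                reservoir.add(new Keyed<>(items.get(i), key));
            }
        }
        List<T> out = new ArrayList<>();
        for (Keyed<T> k : reservoir) {
            out.add(k.item);
        }
        return out; // order of the returned items is not meaningful
    }
}
```

Because each item is seen once and only the m best keys are retained, the same
loop works on a stream of unknown length. Merging the per-partition reservoirs
by key is one plausible way to distribute it, though that is not shown here.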