Matthew Hayes created DATAFU-2:
----------------------------------

             Summary: UDFs for entropy and weighted sampling algorithms
                 Key: DATAFU-2
                 URL: https://issues.apache.org/jira/browse/DATAFU-2
             Project: DataFu
          Issue Type: Task
            Reporter: Matthew Hayes


Jian Wang has suggested that we add UDFs for entropy and weighted random 
sampling and has implementations for each of these ready.

In his words:

"In the real world, there are occasions when we need to calculate the entropy of 
discrete random variables, for instance, to calculate the mutual information 
between variables X and Y using its entropy-based formula (the mutual 
information calculation can be found at 
http://en.wikipedia.org/wiki/Mutual_information#Relation_to_other_quantities). 
I would suggest implementing a UDF to calculate the entropy of given input 
samples, following the definition at 
http://en.wikipedia.org/wiki/Entropy_%28information_theory%29

This is the reference paper I used to learn about the weighted sampling 
algorithm: http://utopia.duth.gr/~pefraimi/research/data/2007EncOfAlg.pdf

The present WeightedSample.java implements Algorithm D.

We may try Algorithm A, A-res and A-expJ since they can be used on data 
streams and in a distributed environment. These algorithms could be implemented 
based on ReservoirSample.java (inherit from this class?) since they also need a 
reservoir to store the selected items."
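
For illustration only, the entropy calculation proposed above could be sketched 
as below. This is not Jian's implementation: the class name EntropySketch and 
its method are hypothetical, it uses the plain empirical (plug-in) estimate 
H = -sum(p_i * ln p_i) over observed frequencies, and it is ordinary Java rather 
than a Pig UDF.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not part of DataFu: empirical Shannon entropy of samples.
final class EntropySketch {

    private EntropySketch() {}

    /**
     * Returns the empirical entropy, in nats, of the discrete samples.
     * Each distinct value's probability is estimated as count / total.
     * An empty input is treated as having zero entropy (an assumption).
     */
    static <T> double entropy(List<T> samples) {
        if (samples.isEmpty()) {
            return 0.0;
        }
        // Count occurrences of each distinct value.
        Map<T, Integer> counts = new HashMap<>();
        for (T s : samples) {
            counts.merge(s, 1, Integer::sum);
        }
        double n = samples.size();
        double h = 0.0;
        for (int c : counts.values()) {
            double p = c / n;
            h -= p * Math.log(p); // natural log, so the result is in nats
        }
        return h;
    }
}
```

Dividing the result by Math.log(2) converts it to bits. A real UDF would likely 
accumulate the counts incrementally instead of taking a full in-memory list.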

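As a rough sketch of the reservoir-based approach mentioned above, Algorithm 
A-Res from the referenced paper gives each item with weight w the key u^(1/w), 
where u is uniform on (0,1), and keeps the items with the largest keys. The 
class below is hypothetical and standalone; it does not extend 
ReservoirSample.java, whose API is not shown in this issue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Random;

// Hypothetical sketch of A-Res weighted reservoir sampling without replacement.
final class WeightedReservoirSketch<T> {

    // An item paired with its random key.
    private static final class Entry<T> {
        final T item;
        final double key;

        Entry(T item, double key) {
            this.item = item;
            this.key = key;
        }
    }

    private final int capacity;
    private final Random rng;
    // Min-heap on key, so the smallest key in the reservoir is at the head.
    private final PriorityQueue<Entry<T>> heap;

    WeightedReservoirSketch(int capacity, Random rng) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
        this.rng = rng;
        this.heap = new PriorityQueue<>(capacity, (a, b) -> Double.compare(a.key, b.key));
    }

    /** Offers one stream item; items with non-positive weight are skipped. */
    void offer(T item, double weight) {
        if (weight <= 0.0) {
            return;
        }
        double key = Math.pow(rng.nextDouble(), 1.0 / weight);
        if (heap.size() < capacity) {
            heap.add(new Entry<>(item, key));
        } else if (key > heap.peek().key) {
            // The new key beats the current minimum, so replace it.
            heap.poll();
            heap.add(new Entry<>(item, key));
        }
    }

    /** Returns the currently selected items, in no particular order. */
    List<T> sample() {
        List<T> out = new ArrayList<>(heap.size());
        for (Entry<T> e : heap) {
            out.add(e.item);
        }
        return out;
    }
}
```

Because each key depends only on the item's own weight and random draw, 
reservoirs built on separate partitions could in principle be merged by keeping 
the largest keys overall, which is what makes this family attractive for 
distributed use. That merge step would need the keys to be exposed and is not 
shown here.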


--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
