Entropy implementation in Map/Reduce
------------------------------------

                 Key: MAHOUT-747
                 URL: https://issues.apache.org/jira/browse/MAHOUT-747
             Project: Mahout
          Issue Type: New Feature
          Components: Math
    Affects Versions: 0.6
            Reporter: Christoph Nagel


Hi again,

Since I work a lot with entropy and information gain ratio, I would like to 
implement the following distributed algorithms:
* Entropy 
(https://secure.wikimedia.org/wikipedia/en/wiki/Entropy_%28information_theory%29)
* Conditional Entropy 
(https://secure.wikimedia.org/wikipedia/en/wiki/Conditional_entropy)
* Information Gain
* Information Gain Ratio 
(https://secure.wikimedia.org/wikipedia/en/wiki/Information_gain_ratio)

For now, this issue covers only entropy.

Some questions:
* Which package do the classes belong in? I put them in 
'org.apache.mahout.math.stats' for now, but I don't know if this is right, 
because they are components of information retrieval.
* Entropy only reads a set of elements. As input I took a sequence file with 
keys of type Text and values of any type, because I only work with the keys. 
Is this the best practice?
* Is there a generic solution, so that the keys can be of any type that 
implements Writable?

Hadoop has a TokenCounterMapper, which emits each token of the input value 
with an IntWritable(1). I added a KeyCounterMapper to 
'org.apache.mahout.common.mapreduce', which does the same with the keys.
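
To make the idea concrete, here is a minimal local sketch of the two steps: 
counting how often each key occurs (the job of the mapper emitting 1 per key 
plus a summing reducer), and computing the entropy from those counts. It uses 
plain Java collections instead of Hadoop, and the names EntropySketch, 
countKeys and entropy are hypothetical; they are not taken from the patch. 
The key type is a generic parameter, which is one possible answer to the 
question about keys of arbitrary type.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the classes from the patch.
public class EntropySketch {

    // Emulates the map step (emit 1 per key) and the reduce step
    // (sum the 1s per distinct key) in a single local pass.
    public static <K> Map<K, Long> countKeys(Iterable<K> keys) {
        Map<K, Long> counts = new HashMap<>();
        for (K key : keys) {
            counts.merge(key, 1L, Long::sum);
        }
        return counts;
    }

    // Shannon entropy in bits: H = -sum(p_i * log2(p_i)),
    // where p_i = count_i / total number of elements.
    public static double entropy(Map<?, Long> counts) {
        long total = 0;
        for (long c : counts.values()) {
            total += c;
        }
        if (total == 0) {
            return 0.0; // an empty input is treated as having zero entropy
        }
        double h = 0.0;
        for (long c : counts.values()) {
            if (c == 0) {
                continue; // a zero count contributes nothing to the sum
            }
            double p = (double) c / total;
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    public static void main(String[] args) {
        // Two keys, each seen twice: a uniform distribution over two outcomes.
        List<String> keys = Arrays.asList("a", "b", "a", "b");
        System.out.println(entropy(countKeys(keys)));
    }
}
```

In a distributed version the total is not known while counting, so it would 
need a separate pass or a counter; the sketch sidesteps this by summing the 
counts locally.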

I will attach my patch soon.

Regards, Christoph.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
