[
https://issues.apache.org/jira/browse/MAHOUT-747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13059496#comment-13059496
]
Christoph Nagel commented on MAHOUT-747:
----------------------------------------
Added a new implementation.
@Sean Owen:
* Why do you think that the entropy calculation is not distributed? There are 2
m/r jobs. The first groups and counts the keys and can use at most |key| reducers.
The second does the calculation and sums all values in the reducer. OK, only one
reducer can do this, but I don't see another way. I added some more complexity to
it, sorry, for the sake of a better calculation. Thanks to @Ted Dunning for his hint.
* Changed the conditionalEntropy calculation so that no join is needed.
* Sorry, I don't understand the point "Aren't CalculateSimilarityCombiner and
DoubleSumReducer virtually the same?". I can't find CalculateSimilarityCombiner.
* Changed all IntWritable to VarIntWritable.
* Thanks for StringTuple, it saves many lines.
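To make the two passes concrete, here is a minimal sketch in plain Python rather than Hadoop code. It is not the code from the patch. The function names are hypothetical, and the identity H = log N - (1/N) * sum(c * log c) is only an assumption about what a single-reducer summation could look like. The conditional entropy function shows just one possible join-free approach (group by x, then count y inside each group).

```python
import math
from collections import Counter, defaultdict


def count_keys(keys):
    """Pass 1 (sketch): emit (key, 1) for every record and sum per key."""
    return Counter(keys)


def entropy_from_counts(counts):
    """Pass 2 (sketch): a single accumulator, like a single reducer.

    Accumulates N = sum(c) and S = sum(c * log c), then returns
    H = log N - S / N, which equals -sum(p * log p) with p = c / N.
    The result is in nats (natural logarithm).
    """
    n = 0
    s = 0.0
    for c in counts:
        n += c
        s += c * math.log(c)
    if n == 0:
        return 0.0
    return math.log(n) - s / n


def entropy(keys):
    """Entropy of the empirical key distribution."""
    return entropy_from_counts(count_keys(keys).values())


def conditional_entropy(pairs):
    """H(Y | X) from (x, y) pairs without joining two count tables.

    Assumption: all pairs for one x can be grouped together, so that
    H(Y | X) = sum over x of (n_x / N) * H(Y | X = x).
    """
    by_x = defaultdict(Counter)
    for x, y in pairs:
        by_x[x][y] += 1
    total = sum(sum(c.values()) for c in by_x.values())
    if total == 0:
        return 0.0
    h = 0.0
    for y_counts in by_x.values():
        n_x = sum(y_counts.values())
        h += (n_x / total) * entropy_from_counts(y_counts.values())
    return h
```

In this sketch the second pass only needs two running sums, so a combiner could pre-aggregate them before the single reducer, although the final division still happens in one place.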
Regards, Christoph.
> Entropy implementation in Map/Reduce
> ------------------------------------
>
> Key: MAHOUT-747
> URL: https://issues.apache.org/jira/browse/MAHOUT-747
> Project: Mahout
> Issue Type: New Feature
> Components: Math
> Affects Versions: 0.6
> Reporter: Christoph Nagel
> Attachments: MAHOUT-747.patch
>
>
> Hi again,
> because I work a lot with entropy and information gain ratio, I want to
> implement the following distributed algorithms:
> * Entropy
> (https://secure.wikimedia.org/wikipedia/en/wiki/Entropy_%28information_theory%29)
> * Conditional Entropy
> (https://secure.wikimedia.org/wikipedia/en/wiki/Conditional_entropy)
> * Information Gain
> * Information Gain Ratio
> (https://secure.wikimedia.org/wikipedia/en/wiki/Information_gain_ratio)
> For now, this issue covers only entropy.
> Some questions:
> * Which package do the classes belong in? I put them in
> 'org.apache.mahout.math.stats' for now, but I don't know if this is right,
> because they are components of information retrieval.
> * Entropy only reads a set of elements. As input I took a sequence file with
> keys of type Text and values of any type, because I only work with the keys. Is
> this the best practice?
> * Is there a generic solution, so that the type of keys can be anything
> inherited from Writable?
> Hadoop has a TokenCounterMapper, which emits each token of the value with an
> IntWritable(1). I added a KeyCounterMapper to
> 'org.apache.mahout.common.mapreduce' which does the same with the keys.
> I will attach my patch soon.
> Regards, Christoph.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira