[ 
https://issues.apache.org/jira/browse/MAHOUT-747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13058661#comment-13058661
 ] 

Sean Owen commented on MAHOUT-747:
----------------------------------

I have a number of comments from large to small.

- I am still not sure how the entropy calculation is distributed. All map keys 
are NullWritable so they all go to one reducer.
- CalculateSpecificConditionalEntropyMapper seems to store too much in memory 
-- a mapping for all keys. Does this not blow up at scale?

- Aren't CalculateSimilarityCombiner and DoubleSumReducer virtually the same?
- Use VarIntWritable instead of IntWritable for much better I/O efficiency
- There's already a StringTuple class that would let you write a pair of Strings

- I prefer to avoid inner classes here myself, but that's more a matter of 
preference
- I think it's more standard to use one capital letter for generic types ("K") 
rather than words ("Key")
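
To illustrate the first point about distribution: one possible way to avoid 
pushing every record through a single reducer is to rewrite the entropy as 
H = ln(N) - (1/N) * sum(c_i * ln(c_i)), where c_i is the count of key i and N 
is the total count. The per-key counts can then be computed in parallel, and 
only small (count, c*ln(c)) partial sums need to be combined at the end. The 
following is a minimal plain-Java sketch of that idea, not the code in the 
patch; the names EntropySketch, Partial, ofKeyCount and entropyBits are 
hypothetical and no Hadoop or Mahout classes are used.

```java
/** Sketch: entropy from per-key counts via associative partial sums. */
final class EntropySketch {

    /** A partial aggregate: sum of counts and sum of c * ln(c). */
    static final class Partial {
        final long count;
        final double cLogC;

        Partial(long count, double cLogC) {
            this.count = count;
            this.cLogC = cLogC;
        }

        /** Partial for one key whose complete count is c (assumes c >= 1). */
        static Partial ofKeyCount(long c) {
            return new Partial(c, c * Math.log(c));
        }

        /** Associative and commutative, so it could run in a combiner. */
        Partial plus(Partial other) {
            return new Partial(count + other.count, cLogC + other.cLogC);
        }
    }

    /** H = ln(N) - (1/N) * sum(c * ln(c)), converted from nats to bits. */
    static double entropyBits(Partial total) {
        double n = total.count;
        return (Math.log(n) - total.cLogC / n) / Math.log(2.0);
    }

    public static void main(String[] args) {
        // Key counts a=2, b=1 aggregated in one place, c=1 in another.
        Partial first = Partial.ofKeyCount(2).plus(Partial.ofKeyCount(1));
        Partial second = Partial.ofKeyCount(1);
        System.out.println(entropyBits(first.plus(second)));
    }
}
```

In a Map/Reduce setting this would presumably mean one job that counts keys 
(keyed by the element itself, so the work is spread across reducers) and a 
second, very cheap step that sums the partials. Only the final sum needs a 
single reducer, and a combiner can shrink its input considerably.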

> Entropy implementation in Map/Reduce
> ------------------------------------
>
>                 Key: MAHOUT-747
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-747
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Math
>    Affects Versions: 0.6
>            Reporter: Christoph Nagel
>         Attachments: MAHOUT-747.patch
>
>
> Hi again,
> Since I work a lot with entropy and information gain ratio, I want to 
> implement the following distributed algorithms:
> * Entropy 
> (https://secure.wikimedia.org/wikipedia/en/wiki/Entropy_%28information_theory%29)
> * Conditional Entropy 
> (https://secure.wikimedia.org/wikipedia/en/wiki/Conditional_entropy)
> * Information Gain
> * Information Gain Ratio 
> (https://secure.wikimedia.org/wikipedia/en/wiki/Information_gain_ratio)
> This issue initially covers only entropy.
> Some questions:
> * In which package do the classes belong? I first put them in 
> 'org.apache.mahout.math.stats', but I don't know if this is right, because 
> they are components of information retrieval.
> * Entropy only reads a set of elements. As input I took a sequence file with 
> keys of type Text and values of any type, because I only work with the keys. 
> Is this the best practice?
> * Is there a generic solution, so that the type of keys can be anything 
> inherited from Writable?
> Hadoop has a TokenCounterMapper, which emits each value with an 
> IntWritable(1). I added a KeyCounterMapper to 
> 'org.apache.mahout.common.mapreduce' which does the same with the keys.
> I will attach my patch soon.
> Regards, Christoph.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
