[
https://issues.apache.org/jira/browse/MAHOUT-747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13057138#comment-13057138
]
Sean Owen commented on MAHOUT-747:
----------------------------------
I somehow feel this is not a great fit for Hadoop. This is a really simple job,
to be sure, and that's no bad thing per se. However, it does raise the question
of whether distributing the work buys you enough to justify the overhead.
For example, using a job just to count the number of records in the input is a
lot of overhead for that task, even if you have done it well with a combiner.
And the second job uses 1 reducer anyway; the "meat" of the computation is not
distributed.
I can surely imagine this computation existing as part of another computation,
but I wonder if you wouldn't already have info at hand like the number of
items? Does this actually form a reusable component?
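
To make the point concrete, here is a minimal sketch of the same computation
in plain Java, with no Hadoop involved. It is not the code from the attached
patch; the class and method names (EntropySketch, countKeys, entropy) are
made up for illustration. It assumes the per-key counts fit in memory and uses
the identity H = log2(N) - (1/N) * sum(c_i * log2(c_i)), where c_i is the count
of key i and N is the total number of records. If N and the counts are already
at hand, the rest is a single pass:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Mahout or the MAHOUT-747 patch.
public class EntropySketch {

    /** Counts how often each key occurs. Generic over the key type. */
    static <K> Map<K, Long> countKeys(Iterable<K> keys) {
        Map<K, Long> counts = new HashMap<>();
        for (K key : keys) {
            counts.merge(key, 1L, Long::sum);
        }
        return counts;
    }

    /** Shannon entropy in bits: log2(N) - (1/N) * sum(c * log2(c)). */
    static double entropy(Map<?, Long> counts) {
        long n = 0;
        double sumCLogC = 0.0;
        for (long c : counts.values()) {
            n += c;
            sumCLogC += c * log2(c);
        }
        if (n == 0) {
            return 0.0; // empty input: define the entropy as 0
        }
        return log2(n) - sumCLogC / n;
    }

    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    public static void main(String[] args) {
        // Counts are a=2, b=1, c=1, so N=4 and the entropy is about 1.5 bits.
        List<String> keys = Arrays.asList("a", "b", "a", "c");
        System.out.println(entropy(countKeys(keys)));
    }
}
```

In the distributed version, the counting step would correspond to the first
job and the final sum to the single reducer of the second one.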
> Entropy implementation in Map/Reduce
> ------------------------------------
>
> Key: MAHOUT-747
> URL: https://issues.apache.org/jira/browse/MAHOUT-747
> Project: Mahout
> Issue Type: New Feature
> Components: Math
> Affects Versions: 0.6
> Reporter: Christoph Nagel
> Attachments: MAHOUT-747.patch
>
>
> Hi again,
> because I work a lot with entropy and information gain ratio, I want to
> implement the following distributed algorithms:
> * Entropy
> (https://secure.wikimedia.org/wikipedia/en/wiki/Entropy_%28information_theory%29)
> * Conditional Entropy
> (https://secure.wikimedia.org/wikipedia/en/wiki/Conditional_entropy)
> * Information Gain
> * Information Gain Ratio
> (https://secure.wikimedia.org/wikipedia/en/wiki/Information_gain_ratio)
> This issue covers only entropy for now.
> Some questions:
> * In which package do the classes belong? I put them in
> 'org.apache.mahout.math.stats' for now, but I don't know if this is right,
> because they are components of information retrieval.
> * Entropy only reads a set of elements. As input I took a sequence file with
> keys of type Text and values of any type, because I only work with the keys.
> Is this the best practice?
> * Is there a generic solution, so that the type of keys can be any class
> implementing Writable?
> Hadoop has a TokenCounterMapper, which emits each token of the value with an
> IntWritable(1). I added a KeyCounterMapper to
> 'org.apache.mahout.common.mapreduce' which does the same with the keys.
> Will append my patch soon.
> Regards, Christoph.