Hi all-- I'm new to using Hadoop, so I'm hoping to get a little guidance on the best way to solve a particular class of problems.
The general use case is this: from a very small set of data, I will generate a massive set of pairs of values, i.e., <A,B>. I would like to compute the maximum likelihood estimate (MLE) of the conditional probability P(A|B). It is obvious to me how to compute the counts C(<A,B>), and equally obvious how to compute C(<A,*>) or C(<*,B>), but what I need is the ratio: C(<A,B>)/C(<*,B>).

My approach: my initial decomposition of this problem is to use a mapper to go from my input data to <A,B> pairs, and then a reducer to go from <A,B> pairs to counts C(A,B). However, at that point I'd like a second reducer-like step (call it Normalize) to run, which aggregates the C(A,B) counts sharing a given B and returns a value P(A|B) for each A that occurs with B.

This is where things get fuzzy for me. How do I do this? A reducer can only return a single value (for example, if I made B the key for Normalize, it could return C(B) very easily). What I need is a value type that reduce can return that is essentially a list of (key,value) pairs. Does such a thing exist? Am I approaching this the wrong way?

Thanks for any assistance!
Chris

------------------------------------------
Chris Dyer
Dept. of Linguistics
University of Maryland
1401 Marie Mount Hall
College Park, MD 20742
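
P.S. To make the first step concrete, here is a rough sketch of the counting job as I currently picture it, written against the Context-style mapreduce API (apologies if any signatures are slightly off). generatePairs() is just a placeholder standing in for my real pair-generation logic:

import java.io.IOException;
import java.util.Collections;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PairCount {

  // Placeholder: pretend each input line is already an "A B" pair.
  static List<String[]> generatePairs(String record) {
    return Collections.singletonList(record.trim().split("\\s+"));
  }

  // Step 1: map each input record to <A,B> pairs, each with count 1.
  public static class PairMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text pair = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String[] ab : generatePairs(value.toString())) {
        pair.set(ab[0] + "\t" + ab[1]);  // composite key "A<TAB>B"
        context.write(pair, ONE);
      }
    }
  }

  // Step 2: sum the 1s for each distinct <A,B>, yielding C(A,B).
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable sum = new IntWritable();

    @Override
    protected void reduce(Text pair, Iterable<IntWritable> counts,
        Context context) throws IOException, InterruptedException {
      int total = 0;
      for (IntWritable c : counts) total += c.get();
      sum.set(total);
      context.write(pair, sum);
    }
  }
}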
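
And here is the rough behavior I'm imagining for the Normalize step: read job 1's "A<TAB>B<TAB>count" output, rekey by B, and emit one P(A|B) per A. The part I don't know is whether a reducer is actually allowed to emit many (key,value) pairs the way the inner loop below does:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class Normalize {

  // Rekey job 1's output by B so that all the counts for one B
  // arrive at the same reduce() call.
  public static class RekeyMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    private final Text b = new Text();
    private final Text aCount = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] f = value.toString().split("\t");  // {A, B, count}
      b.set(f[1]);
      aCount.set(f[0] + "\t" + f[2]);
      context.write(b, aCount);
    }
  }

  // For one B: buffer the (A, C(A,B)) pairs, total them into C(*,B),
  // then emit P(A|B) = C(A,B)/C(*,B) once per A -- many outputs from
  // a single reduce() call, which is the part I'm unsure is legal.
  public static class NormalizeReducer
      extends Reducer<Text, Text, Text, DoubleWritable> {
    @Override
    protected void reduce(Text b, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      List<String[]> pairs = new ArrayList<String[]>();
      long total = 0;  // C(*,B)
      for (Text v : values) {
        String[] f = v.toString().split("\t");  // {A, count}
        pairs.add(f);
        total += Long.parseLong(f[1]);
      }
      for (String[] f : pairs) {
        double p = Double.parseDouble(f[1]) / total;  // C(A,B)/C(*,B)
        context.write(new Text(f[0] + "\t" + b.toString()),
            new DoubleWritable(p));
      }
    }
  }
}

I also realize that buffering every (A, count) pair for one B in memory may not scale if some B occurs with a huge number of distinct A's, so pointers to a more idiomatic pattern would be very welcome.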
