Hi all-- I'm new to using Hadoop, so I'm hoping to get a little guidance on the best way to solve a particular class of problems.
The general use case is this: from a very small set of data, I will generate a massive set of pairs of values, i.e., <A,B>. I would like to compute the maximum likelihood estimate (MLE) of the conditional probability P(A|B). It is obvious to me how to compute the counts C(<A,B>), and equally obvious how to compute C(<A,*>) or C(<*,B>), but what I need is the ratio: C(<A,B>)/C(<*,B>).

My approach: my initial decomposition of this problem is to use a mapper to go from my input data to <A,B> pairs, and then a reducer to go from <A,B> pairs to counts C(A,B). However, at that point I'd like a second reducer-like step (call it Normalize) to run, which aggregates the C(A,B) counts sharing a given B and returns a value P(A|B) for each A that occurs with B.

This is where things get fuzzy for me. How do I do this? A reducer can only return a single value (for example, if I made B the key for Normalize, it could return C(B) very easily). What I need is a value type that reduce can return that is essentially a list of (key,value) pairs. Does such a thing exist? Am I approaching this the wrong way?

Thanks for any assistance!
Chris

------------------------------------------
Chris Dyer
Dept. of Linguistics
University of Maryland
1401 Marie Mount Hall
College Park, MD 20742
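
P.S. To make the first step concrete, here is a rough sketch of the counting job as I currently picture it, written against the Context-style mapreduce API (apologies if any signatures are slightly off). generatePairs() is just a placeholder standing in for my real pair-generation logic:

import java.io.IOException;
import java.util.Collections;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PairCount {

  // Placeholder: pretend each input line is already an "A B" pair.
  static List<String[]> generatePairs(String record) {
    return Collections.singletonList(record.trim().split("\\s+"));
  }

  // Step 1: map each input record to <A,B> pairs, each with count 1.
  public static class PairMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text pair = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String[] ab : generatePairs(value.toString())) {
        pair.set(ab[0] + "\t" + ab[1]);  // composite key "A<TAB>B"
        context.write(pair, ONE);
      }
    }
  }

  // Step 2: sum the 1s for each distinct <A,B>, yielding C(A,B).
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable sum = new IntWritable();

    @Override
    protected void reduce(Text pair, Iterable<IntWritable> counts,
        Context context) throws IOException, InterruptedException {
      int total = 0;
      for (IntWritable c : counts) total += c.get();
      sum.set(total);
      context.write(pair, sum);
    }
  }
}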
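
And here is the rough behavior I'm imagining for the Normalize step: read job 1's "A<TAB>B<TAB>count" output, rekey by B, and emit one P(A|B) per A. The part I don't know is whether a reducer is actually allowed to emit many (key,value) pairs the way the inner loop below does:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class Normalize {

  // Rekey job 1's output by B so that all the counts for one B
  // arrive at the same reduce() call.
  public static class RekeyMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    private final Text b = new Text();
    private final Text aCount = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] f = value.toString().split("\t");  // {A, B, count}
      b.set(f[1]);
      aCount.set(f[0] + "\t" + f[2]);
      context.write(b, aCount);
    }
  }

  // For one B: buffer the (A, C(A,B)) pairs, total them into C(*,B),
  // then emit P(A|B) = C(A,B)/C(*,B) once per A -- many outputs from
  // a single reduce() call, which is the part I'm unsure is legal.
  public static class NormalizeReducer
      extends Reducer<Text, Text, Text, DoubleWritable> {
    @Override
    protected void reduce(Text b, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      List<String[]> pairs = new ArrayList<String[]>();
      long total = 0;  // C(*,B)
      for (Text v : values) {
        String[] f = v.toString().split("\t");  // {A, count}
        pairs.add(f);
        total += Long.parseLong(f[1]);
      }
      for (String[] f : pairs) {
        double p = Double.parseDouble(f[1]) / total;  // C(A,B)/C(*,B)
        context.write(new Text(f[0] + "\t" + b.toString()),
            new DoubleWritable(p));
      }
    }
  }
}

I also realize that buffering every (A, count) pair for one B in memory may not scale if some B occurs with a huge number of distinct A's, so pointers to a more idiomatic pattern would be very welcome.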
