As I may have mentioned, my main goal at the moment is processing physiologic 
data with Hadoop and MapReduce.  The steps are:

1. Convert ADC units to physical units (input is <sample num, raw value>, 
   output is <sample num, physical value>).  A rough sketch of this kind of 
   map-only conversion follows the list.
2. Perform peak detection to find the systolic blood pressure (input is 
   <sample num, physical value>, output is <sample num, physical value>, but 
   the output is only a subset of the input).
3. Calculate a central tendency measure over a sliding window (mapper input is 
   <sample num, physical value>, mapper output is <window ID, (sample num, 
   physical value)>, reducer output is <window ID, central tendency 
   measurement at different radii>).
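
To make the first step concrete, here is a minimal sketch of the kind of 
map-only conversion job I mean.  The calibration constants, the tab-separated 
text format, and the class name are placeholders, not my actual code:

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only conversion: with 0 reduce tasks, the map output is the job output.
public class AdcToPhysicalMapper
        extends Mapper<LongWritable, Text, LongWritable, DoubleWritable> {

    private static final double GAIN = 0.0625;   // placeholder ADC gain
    private static final double OFFSET = -400.0; // placeholder ADC offset

    @Override
    protected void map(LongWritable byteOffset, Text line, Context context)
            throws IOException, InterruptedException {
        // Each input line: "<sample num>\t<raw ADC value>" (placeholder format)
        String[] fields = line.toString().split("\t");
        long sampleNum = Long.parseLong(fields[0]);
        double raw = Double.parseDouble(fields[1]);
        double physical = raw * GAIN + OFFSET;  // linear calibration
        context.write(new LongWritable(sampleNum), new DoubleWritable(physical));
    }
}

The driver just calls job.setNumReduceTasks(0), so the converted samples are 
written straight out; step 2 is structured the same way.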

Each of the above steps builds on the result of the previous one.  For the 
first two steps, I have been doing everything in the mapper and specifying 0 
reduce tasks.  For the last step, I perform the calculations on a sliding 
window of N points, skipping forward M points for the next window, where 
N >> M.  To implement this, I have a mapper that outputs all of the (x, y) 
points (the value) for a particular key (the window ID), and the reducer then 
performs the calculations on each window's data.  Everything works pretty 
well, except that I noticed that how the input is split across different 
mappers affects the final output.  Due to the nature of the calculations, this 
doesn't change the end result very much.
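
To show roughly what I mean by the windowing, here is a simplified sketch of 
such a mapper.  The values of N and M, the text format, and the bookkeeping 
details are made up for illustration; the point is that new windows are keyed 
by the first sample num the mapper happens to see, which depends on where its 
input split begins:

import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SlidingWindowMapper
        extends Mapper<LongWritable, Text, LongWritable, Text> {

    private static final int N = 1000;  // window length in points (placeholder)
    private static final int M = 50;    // hop between windows (placeholder), N >> M

    // Open windows: {window ID = first sample num seen for it, points emitted so far}
    private final Deque<long[]> openWindows = new ArrayDeque<long[]>();
    private long pointsSeen = 0;

    @Override
    protected void map(LongWritable byteOffset, Text line, Context context)
            throws IOException, InterruptedException {
        // Each input line: "<sample num>\t<physical value>" (placeholder format)
        long sampleNum = Long.parseLong(line.toString().split("\t")[0]);

        // Every M points, open a new window keyed by this (split-dependent) sample num.
        if (pointsSeen % M == 0) {
            openWindows.addLast(new long[] { sampleNum, 0 });
        }
        pointsSeen++;

        // Emit this point once for every window that is still filling up.
        for (long[] w : openWindows) {
            context.write(new LongWritable(w[0]), line);
            w[1]++;
        }
        // Drop the oldest window(s) once they have received N points.
        while (!openWindows.isEmpty() && openWindows.peekFirst()[1] >= N) {
            openWindows.removeFirst();
        }
    }
}

The reducer then just collects the points for each window ID and computes the 
central tendency measure on them.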

However, I'm trying to make sure I understand everything properly, and I want 
to see whether there is a better/proper way of implementing something like 
this.  I'm guessing the problem comes from the fact that I'm trying to use 
contiguous data points to build each window of N points, and the window ID is 
just the first sample num encountered for that window.  Since every map task 
other than the first starts counting from the beginning of its own input 
split, the first sample num it encounters (and therefore every window boundary 
in that split) differs from what a serial execution would produce.  For 
example, if a split happened to begin 37 samples past where a serial run would 
have opened a window, every window that mapper creates would be shifted by 
those 37 samples.

Thanks!

--Andrew
