Hi, I have a spreadsheet in which each column contains the values of one variable, and I need to calculate the sum, variance, etc. for each column. As I understand it, the mapper and reducer operate on <key, value> pairs. Can anyone kindly explain how to express this problem in those terms?
Maybe the mapper could read each line and, for every cell, emit the variable name/number as the "key" and the corresponding value as the "value". Then all pairs with the same "key" (i.e. those belonging to the same variable) would be passed to one reducer, which does the calculation and writes the result to a file. Is this idea correct? Any comments would be appreciated.
Besides, with this method the number of reducers that can do useful work is limited by the number of variables I have, since all values for one key go to a single reducer. What happens if there are only a few variables but each one has far more entries than there are variables? The execution time of each reducer could then be comparatively long. Is there any way to make use of more hardware resources and have more reducers run in parallel?
Best regards, Xin
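P.S. To make the question concrete, here is a minimal sketch of the idea in plain Python (it only simulates the map, combine, shuffle and reduce steps in memory; it is not real Hadoop code). The function names `mapper`, `merge`, `finalize` and `run_job`, and the comma-separated input format, are my own assumptions for illustration. Instead of raw values, the mapper emits partial aggregates (count, sum, sum of squares), so a combiner-style local merge can shrink the data before it reaches the reducer. This is one possible way to keep each reducer's work small, not necessarily the best one.

```python
from collections import defaultdict
from functools import reduce


def mapper(line):
    """Emit (column_index, (count, sum, sum_of_squares)) for each cell in a row."""
    for col, cell in enumerate(line.split(",")):
        x = float(cell)
        yield col, (1, x, x * x)


def merge(a, b):
    """Add two partial aggregates; associative, so it can serve as combiner and reducer."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def finalize(partial):
    """Turn a merged (count, sum, sum_of_squares) triple into the final statistics."""
    n, s, ss = partial
    mean = s / n
    # Population variance via E[x^2] - E[x]^2; simple, but can lose precision
    # when values are large and close together.
    variance = ss / n - mean * mean
    return {"count": n, "sum": s, "mean": mean, "variance": variance}


def run_job(splits):
    """Simulate one job: each split stands in for the input of one map task."""
    combined = []
    for split in splits:
        # Map + combine: collapse each split to one partial aggregate per column.
        local = {}
        for line in split:
            for col, part in mapper(line):
                local[col] = merge(local[col], part) if col in local else part
        combined.extend(local.items())

    # Shuffle: group the partial aggregates by column index.
    grouped = defaultdict(list)
    for col, part in combined:
        grouped[col].append(part)

    # Reduce: each column only has one small triple per split left to merge.
    return {col: finalize(reduce(merge, parts)) for col, parts in grouped.items()}


# Hypothetical input: two splits of a two-column sheet.
splits = [["1,10", "2,20"], ["3,30", "4,40"]]
stats = run_job(splits)
print(stats[0])
print(stats[1])
```

With this layout each reducer receives only one small triple per map task for its column, so the heavy work is spread across the mappers even when there are only a few variables.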