Hi,

You can refer to the following code to calculate sigma(x), i.e. the sum.

Mappers
Extracting a specific column - https://github.com/zinnia-phatak-dev/Nectar/blob/master/Nectar-common/src/main/java/com/zinnia/nectar/util/hadoop/FieldSeperator.java
Sum Mapper - https://github.com/zinnia-phatak-dev/Nectar/blob/master/Nectar-regression/src/main/java/com/zinnia/nectar/regression/hadoop/primitive/mapreduce/SigmaMapper.java
Sum Reducer - https://github.com/zinnia-phatak-dev/Nectar/blob/master/Nectar-regression/src/main/java/com/zinnia/nectar/regression/hadoop/primitive/mapreduce/DoubleSumReducer.java
Driver or Main class - https://github.com/zinnia-phatak-dev/Nectar/blob/master/Nectar-regression/src/main/java/com/zinnia/nectar/regression/hadoop/primitive/jobs/SigmaJob.java

By default it works for a tab-separated file, but you can easily change that by changing the FieldSeperator code.

On Tue, Apr 3, 2012 at 10:25 AM, Fang Xin <nusfang...@gmail.com> wrote:
> Hi Rohit, thank you for your reply.
> As for the second assumption, could you kindly further enlighten me a
> bit, please?
>
> Thank you.
>
> On Tue, Apr 3, 2012 at 12:50 PM, Rohit Kelkar <rohitkel...@gmail.com>
> wrote:
> > Your idea in first paragraph is correct. To speed up things you can
> > also explore the possibility of using a Combiner. For ex. for
> > computing the sum set the combiner to be the same class as your
> > reducer. For calculating variance write a combiner class that would
> > output (xi - mu)^2 and in the reducer code you could take the sqrt.
> >
> > Your second assumption that number of reducers = number of variables
> > is not right.
> >
> > - Rohit Kelkar
> >
> > On Tue, Apr 3, 2012 at 10:10 AM, Fang Xin <nusfang...@gmail.com> wrote:
> >> Hi,
> >>
> >> I have a spreadsheet where each column contains values for one
> >> variable. and I need to calculate sum, variance, etc for each column.
> >> For my understanding, mapper and reducer work for <key, value> pair,
> >> can anyone kindly enlighten me how to abstract this problem?
> >>
> >> Maybe for the mapper, let it read each line, set variable name/number
> >> as "key", and corresponding value as "value".
> >> Then when all pairs with the same "key" (i.e. they belong to same
> >> variable) be passed to a reducer, reducer can do the calculation, and
> >> output to file.
> >> is this idea correct? can anyone kindly give some comment?
> >>
> >> Besides, in this method, the number of reducers will be determined by
> >> the number of variables I have.
> >> What happen if variable number is limited, and for each variable, the
> >> number of entries is far much bigger than the total number of
> >> variables, then execution time for each reducer can be comparatively
> >> long.
> >> Any way to make use of more hardware resource, and create more
> >> reducers to run in parallel?
> >>
> >> Best regards,
> >> Xin
>

--
https://github.com/zinnia-phatak-dev/Nectar
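
For anyone who wants to try the key/value idea from this thread without a cluster, here is a minimal plain-Java sketch. It is not the Nectar code linked above and uses no Hadoop classes. The names ColumnStats, Partial, mapAndCombine and reduce are made up for illustration. It assumes every field is numeric, uses the column index as the key, and lets the combine step emit a (count, sum, sum of squares) partial per column, which is one common way to get the variance in a single pass and differs from the (xi - mu)^2 approach suggested in the quoted reply.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java simulation of per-column sum and variance (hypothetical names, no Hadoop). */
public class ColumnStats {

    /** Partial aggregate for one column: count, sum and sum of squares. */
    static final class Partial {
        long count;
        double sum;
        double sumSq;

        void add(double x) {
            count++;
            sum += x;
            sumSq += x * x;
        }

        /** Merging is associative, so it can run in a combiner and again in a reducer. */
        void merge(Partial other) {
            count += other.count;
            sum += other.sum;
            sumSq += other.sumSq;
        }

        double mean() {
            return sum / count;
        }

        /** Population variance computed as E[x^2] - (E[x])^2. */
        double variance() {
            double m = mean();
            return sumSq / count - m * m;
        }
    }

    /** "Map + combine" for one input split: key = column index, value = partial aggregate. */
    static Map<Integer, Partial> mapAndCombine(List<String> lines, String separatorRegex) {
        Map<Integer, Partial> partials = new TreeMap<>();
        for (String line : lines) {
            // String.split treats its argument as a regular expression.
            String[] fields = line.split(separatorRegex);
            for (int col = 0; col < fields.length; col++) {
                double value = Double.parseDouble(fields[col].trim());
                partials.computeIfAbsent(col, k -> new Partial()).add(value);
            }
        }
        return partials;
    }

    /** "Reduce": merge the partials that share a key across all splits. */
    static Map<Integer, Partial> reduce(List<Map<Integer, Partial>> perSplit) {
        Map<Integer, Partial> totals = new TreeMap<>();
        for (Map<Integer, Partial> split : perSplit) {
            for (Map.Entry<Integer, Partial> e : split.entrySet()) {
                totals.computeIfAbsent(e.getKey(), k -> new Partial()).merge(e.getValue());
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        // Two pretend input splits of a tab-separated file with two columns.
        List<Map<Integer, Partial>> perSplit = new ArrayList<>();
        perSplit.add(mapAndCombine(Arrays.asList("1\t10", "2\t20"), "\t"));
        perSplit.add(mapAndCombine(Arrays.asList("3\t30", "4\t40"), "\t"));

        Map<Integer, Partial> totals = reduce(perSplit);
        totals.forEach((col, p) -> System.out.println(
                "column " + col + ": sum=" + p.sum + " variance=" + p.variance()));
    }
}
```

Because each split hands over only one small partial per column, the final merge stays cheap even when a column has far more entries than there are columns. Note that the E[x^2] - (E[x])^2 form can lose precision when values are large relative to their spread, so treat it as a starting point rather than a numerically robust implementation.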