Hi,

This relates to a bug we had a while back.

When running a reducer, if you want to buffer the values, you normally need to take a copy of each value as you iterate through them. This is because the iterator always returns the same object, and the contents of that object are overwritten with each new value as the iterator steps through.

However, *this behaviour is not reproduced by the reducer drivers in MRUnit*. Even if you give the reduce driver a List designed to behave this way (why do we have to give a List when the reducer specifies merely an Iterable?), MRUnit copies the values into a normal List before presenting them to the reducer. At least this is the case with the 0.20.1 install we have.

To test our bug fix, we extended the ReduceDriver class to copy the values into an Iterable that does reproduce the behaviour, so that we can test for bugs caused by failing to copy the values.

In more recent versions of Hadoop (we use 0.20.1), has the behaviour of the reduce drivers been altered to match that of actual running reducers in this respect? If not, are there any plans to do this? Alternatively, I'd be willing to fix this in the Hadoop codebase myself if necessary.

Regards,

James

--
James Hammerton | Senior Data Mining Engineer
www.mendeley.com/profiles/james-hammerton

Mendeley Limited | London, UK | www.mendeley.com
Registered in England and Wales | Company Number 6419015
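
P.S. To illustrate the object-reuse behaviour described above, here is a minimal sketch in plain Java with no Hadoop dependency. The class names (MutableValue, ReusingIterable, ReuseDemo) are hypothetical stand-ins invented for this example; they are not Hadoop or MRUnit classes, and this is not the actual ReduceDriver extension mentioned above. It only shows why buffering without copying goes wrong when the iterator hands back the same object every time.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-in for a mutable Writable-style value.
final class MutableValue {
    private int value;

    void set(int v) { value = v; }

    int get() { return value; }

    // Returns an independent copy holding the current contents.
    MutableValue copy() {
        MutableValue c = new MutableValue();
        c.set(value);
        return c;
    }
}

// Hypothetical Iterable that mimics a running reducer: every call to
// next() returns the SAME object, refilled with the next value.
final class ReusingIterable implements Iterable<MutableValue> {
    private final int[] source;
    private final MutableValue shared = new MutableValue();

    ReusingIterable(int... source) { this.source = source.clone(); }

    @Override
    public Iterator<MutableValue> iterator() {
        return new Iterator<MutableValue>() {
            private int next = 0;

            @Override
            public boolean hasNext() { return next < source.length; }

            @Override
            public MutableValue next() {
                if (!hasNext()) throw new NoSuchElementException();
                shared.set(source[next++]); // overwrite the shared object
                return shared;
            }
        };
    }
}

final class ReuseDemo {
    // Buggy: buffers references, so every entry is the same object.
    static List<Integer> bufferWithoutCopy(Iterable<MutableValue> values) {
        List<MutableValue> buffer = new ArrayList<>();
        for (MutableValue v : values) buffer.add(v);
        return read(buffer);
    }

    // Correct: takes a copy of each value before buffering it.
    static List<Integer> bufferWithCopy(Iterable<MutableValue> values) {
        List<MutableValue> buffer = new ArrayList<>();
        for (MutableValue v : values) buffer.add(v.copy());
        return read(buffer);
    }

    // Reads the buffered values only after iteration has finished.
    private static List<Integer> read(List<MutableValue> buffer) {
        List<Integer> out = new ArrayList<>();
        for (MutableValue v : buffer) out.add(v.get());
        return out;
    }

    public static void main(String[] args) {
        System.out.println(bufferWithoutCopy(new ReusingIterable(1, 2, 3))); // [3, 3, 3]
        System.out.println(bufferWithCopy(new ReusingIterable(1, 2, 3)));    // [1, 2, 3]
    }
}
```

A test harness that first copies the input into an ordinary List of distinct objects would make both methods return [1, 2, 3], which is why the missing copy goes unnoticed under test.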
