Hi Avro Devs,

I am currently working with a customer whose long-running MapReduce job is very slow (hours where I would expect minutes). I traced the issue back to mapreduce.AvroMultipleOutputs.
My customer was using the write method that does not take a namedOutput. The problem there is that a Job and a TaskContext are instantiated for every record, and that is a very slow operation (fixing it turned a 4h job into a 45min one).

So we switched to namedOutputs instead, but our problem is that we don't know which outputs we'll have before we start the job. Unfortunately, the class takes a copy of all named outputs at instantiation time (from the job configuration as it is at that moment), so anything added after the start is discarded.

It has been a long time since I worked with the MR-related classes, but as I remember it: the Job & Configuration are instantiated on the ApplicationMaster and then serialized as Tasks to the Mappers & Reducers. So putting the named outputs in some other structure in the class probably won't work, I guess?

But does anything speak against making the named outputs changeable for each instance of the AvroMultipleOutputs class? I'd like to add a non-static version of addNamedOutput. I'm working on a patch for this (it'll bring larger changes to the class) and I'll obviously keep the current API. I can't think of anything that would prevent this addition, but I'd like to know if anyone else can.

Cheers,
Lars
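
P.S. To make the question more concrete, here is a rough toy model of what I have in mind. It deliberately does not use the real Hadoop or Avro classes: FakeConf and SketchMultipleOutputs are made-up stand-ins, the exception on an unknown name is my assumption about how an unregistered output would be treated, and the instance-level addNamedOutput is the hypothetical addition, not an existing API.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model only: none of these types are the real Avro or Hadoop classes.
final class NamedOutputsSketch {

    /** Stand-in for the job configuration that is shipped to each task. */
    static final class FakeConf {
        private final Set<String> namedOutputs = new HashSet<>();

        void addNamedOutput(String name) {
            namedOutputs.add(name);
        }

        Set<String> getNamedOutputs() {
            return namedOutputs;
        }
    }

    /** Stand-in for a multiple-outputs helper that keeps a per-instance registry. */
    static final class SketchMultipleOutputs {
        private final Set<String> namedOutputs;
        private final Map<String, Integer> recordCounts = new HashMap<>();

        SketchMultipleOutputs(FakeConf conf) {
            // Copy the names once at construction, as described in the mail:
            // later changes to the configuration are not seen by this instance.
            this.namedOutputs = new HashSet<>(conf.getNamedOutputs());
        }

        /** Hypothetical non-static addition: register an output on this instance only. */
        void addNamedOutput(String name) {
            namedOutputs.add(name);
        }

        /** Counts records per output instead of writing them anywhere. */
        void write(String namedOutput, Object record) {
            if (!namedOutputs.contains(namedOutput)) {
                // Assumption: an unregistered name is rejected.
                throw new IllegalArgumentException(
                        "Undefined named output '" + namedOutput + "'");
            }
            recordCounts.merge(namedOutput, 1, Integer::sum);
        }

        int count(String namedOutput) {
            return recordCounts.getOrDefault(namedOutput, 0);
        }
    }

    public static void main(String[] args) {
        FakeConf conf = new FakeConf();
        conf.addNamedOutput("known");

        SketchMultipleOutputs out = new SketchMultipleOutputs(conf);
        conf.addNamedOutput("late");      // too late: the copy was already taken

        out.addNamedOutput("discovered"); // registered on this instance at runtime
        out.write("known", "record-1");
        out.write("discovered", "record-2");
        System.out.println("discovered records: " + out.count("discovered"));
    }
}
```

In this model a name added to the configuration after construction stays invisible to the instance, while a name registered through the instance method is usable straight away. The existing static registration path is untouched, which is what I mean by keeping the current API.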
