Hi, it may be a stupid question, but in my application I could do without the sort by key. If reducers could be told to start working on the first map outputs they see, my processing would begin to show results much earlier, before all the mappers are done. Eventually all mappers still have to finish, so I would not gain on total job duration, only on how soon the first results appear.
Then, of course, I could obtain some intermediate statistics with counters or with an additional NoSQL database. I am also concerned about the millions of map output records that my mappers are emitting: is that OK? Am I putting too much of a burden on the shuffle stage? Thank you, Mark