Oh, certainly. I was thinking only in the realm of distributed systems. Surely serialization across a network is a necessary step in anything like that. Serializing to local disk first, or to a distributed file system, may not be; the local writes may not matter. But wouldn't YARN-type setups still be writing to distributed storage?
My broad hunch is that communicating the same amount of data faster probably doesn't buy you an order of magnitude, but a different paradigm that lets you transmit less data does. I was musing about whether M/R forced you into a hopelessly huge amount of I/O in an RF implementation, and I'm not sure it does, not yet. These days I keep wanting a better sense not just of whether entire paradigms are more or less suitable, and how and when and why, but of when two different concepts in the same paradigm are qualitatively different versus just different points on a tradeoff curve, each optimizing for a different type of problem.

On Fri, Mar 8, 2013 at 10:35 PM, Ted Dunning <[email protected]> wrote:

> The big cost in map-reduce iteration isn't just startup. It is that the
> input has to be read from disk and the output written to same. Were it to
> stay in memory, things would be vastly faster.
>
> Also, startup costs are still pretty significant. Even on MapR, one of the
> major problems in setting the recent minute-sort record was getting things
> to start quickly. Just setting the heartbeat faster doesn't work on
> ordinary Hadoop because there is a global lock that begins to starve the
> system. We (our guys, not me) had to seriously hack the job tracker to
> move things out of that critical section. At that point, we were able to
> shorten the heartbeat interval to 800 ms (which on 2000 nodes means >2400
> heartbeats per second). The startup and cleanup tasks are also single
> threaded.
>
> It might be plausible to shorten to this degree and further on a small
> cluster. But iteration is still very painful.
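To make Ted's point concrete, here is a minimal sketch (mine, not from this thread) of a typical iterative driver against the plain Hadoop Java API; the class name, iteration count, and paths are all hypothetical, and the mapper/reducer pair is omitted. Every pass pays a full job startup plus an HDFS write, and the next pass re-reads that output from disk:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver that runs the same (omitted) mapper/reducer ten times,
// chaining each iteration's output into the next iteration's input.
public class IterativeDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path input = new Path(args[0]);
    for (int i = 0; i < 10; i++) {
      Job job = Job.getInstance(conf, "iteration-" + i);
      job.setJarByClass(IterativeDriver.class);
      // job.setMapperClass(...); job.setReducerClass(...);  // omitted
      FileInputFormat.addInputPath(job, input);
      Path output = new Path(args[1] + "/iter-" + i);
      FileOutputFormat.setOutputPath(job, output);
      job.waitForCompletion(true); // full task startup + HDFS write, every pass
      input = output;              // next pass re-reads it all from disk
    }
  }
}

The pain lives in those last two lines: the loop itself is trivial, but state only survives between iterations by round-tripping through the file system, which is exactly what an in-memory paradigm would avoid.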
