tgravescs commented on issue #19788: [SPARK-9853][Core] Optimize shuffle fetch of contiguous partition IDs
URL: https://github.com/apache/spark/pull/19788#issuecomment-456833428

>> If remote server supports merge, it will merge blocks and the returned StreamHandle.numChunks < OpenBlocks.blockIds.length. The client will check and know merge happens, so it will work accordingly.

Just going by the description: this implementation has the server read from the separate map output files and send them out in one stream when the reducer actually reads, correct? That means you still incur disk seeks on the server side, but the client sees one stream that contains the multiple map outputs, correct?

I'm curious what specific performance benefits you were seeing from this. Is the gain only on the client side, or is there something on the server side that I might not be thinking about?
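
To make the question concrete, here is a minimal sketch of the client-side check as I read it from the quoted description. It is not the code in this PR: `MergeCheckSketch`, its nested `StreamHandle` stand-in, and `serverMerged` are hypothetical names used only for illustration. Only the comparison of `numChunks` against the number of requested block IDs comes from the description.

```java
public class MergeCheckSketch {

    // Hypothetical stand-in for the handle the server returns after an
    // open-blocks request; only the fields needed for the check are modeled.
    public static final class StreamHandle {
        public final long streamId;
        public final int numChunks;

        public StreamHandle(long streamId, int numChunks) {
            this.streamId = streamId;
            this.numChunks = numChunks;
        }
    }

    // Assumption taken from the quoted description: the server merged blocks
    // if it returned fewer chunks than the number of block IDs requested.
    public static boolean serverMerged(String[] requestedBlockIds, StreamHandle handle) {
        return handle.numChunks < requestedBlockIds.length;
    }

    public static void main(String[] args) {
        // Three contiguous reduce partitions of the same map output.
        String[] blockIds = {"shuffle_0_5_0", "shuffle_0_5_1", "shuffle_0_5_2"};

        StreamHandle mergedHandle = new StreamHandle(1L, 1);   // one merged chunk
        StreamHandle unmergedHandle = new StreamHandle(2L, 3); // one chunk per block

        System.out.println(serverMerged(blockIds, mergedHandle));
        System.out.println(serverMerged(blockIds, unmergedHandle));
    }
}
```

If that reading is right, the check only tells the client how to interpret the chunks it gets back; it says nothing about how many files the server had to open to produce them, which is what the question about disk seeks is getting at.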
