Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21440#discussion_r203773818

    --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
    @@ -659,6 +659,11 @@ private[spark] class BlockManager(
        * Get block from remote block managers as serialized bytes.
        */
       def getRemoteBytes(blockId: BlockId): Option[ChunkedByteBuffer] = {
    +    // TODO if we change this method to return the ManagedBuffer, then getRemoteValues
    +    // could just use the inputStream on the temp file, rather than memory-mapping the file.
    +    // Until then, replication can cause the process to use too much memory and get killed
    +    // by the OS / cluster manager (not a java OOM, since it's a memory-mapped file) even though
    +    // we've read the data to disk.
    --- End diff --

    To be honest, I don't have a perfect understanding of this, but my impression is that it is not exactly _lazy_ loading: the OS has a lot of leeway in deciding how much of the mapped file to keep in memory, but it should always release that memory under pressure.

    This is problematic under YARN, where the container's memory use is monitored independently of the OS. The OS thinks it's fine to keep large amounts of the data in physical memory, but then the YARN NodeManager looks at the memory use of that specific process tree, decides it's over the limits it has configured, and kills the container. At least, I've seen cases of YARN killing things for exceeding memory limits where I *thought* that was the cause, though I did not directly confirm it.
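    To make the distinction concrete, here's a small standalone sketch (plain JDK I/O, nothing Spark-specific; the object and method names are just for illustration). The mapped read can drive the process RSS up by roughly the file size if the OS decides to keep the faulted pages resident, while the streamed read only ever keeps a small transfer buffer in memory:

    ```scala
    import java.io.{File, FileInputStream}
    import java.nio.channels.FileChannel
    import java.nio.file.StandardOpenOption

    object MmapVsStream {
      // Memory-mapping: the whole file is mapped into the address space, and any
      // pages the OS keeps resident count toward the process RSS that the NM
      // monitors, even though none of this lives on the JVM heap.
      def readMapped(file: File): Long = {
        val channel = FileChannel.open(file.toPath, StandardOpenOption.READ)
        try {
          val buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, file.length())
          var sum = 0L
          while (buf.hasRemaining) sum += buf.get()  // touching the buffer faults pages in
          sum
        } finally channel.close()
      }

      // Streaming: only the 64k transfer buffer is resident at any time, so RSS
      // stays roughly constant no matter how big the block file is.
      def readStreamed(file: File): Long = {
        val in = new FileInputStream(file)
        try {
          val buf = new Array[Byte](64 * 1024)
          var sum = 0L
          var n = in.read(buf)
          while (n != -1) {
            var i = 0
            while (i < n) { sum += buf(i); i += 1 }
            n = in.read(buf)
          }
          sum
        } finally in.close()
      }
    }
    ```

    The second pattern is essentially what the TODO above is suggesting getRemoteValues could do once this method returns the ManagedBuffer: consume the temp file through its input stream rather than mapping it.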