attilapiros commented on a change in pull request #24554: [SPARK-27622][Core]
Avoiding the network when block manager fetches disk persisted RDD blocks from
the same host
URL: https://github.com/apache/spark/pull/24554#discussion_r289459257
##########
File path: core/src/main/scala/org/apache/spark/storage/BlockManager.scala
##########
@@ -827,10 +832,57 @@ private[spark] class BlockManager(
*/
private[spark] def getRemoteValues[T: ClassTag](blockId: BlockId): Option[BlockResult] = {
val ct = implicitly[ClassTag[T]]
- getRemoteManagedBuffer(blockId).map { data =>
+ getRemoteBlock(blockId, (data: ManagedBuffer) => {
val values =
serializerManager.dataDeserializeStream(blockId, data.createInputStream())(ct)
new BlockResult(values, DataReadMethod.Network, data.size)
+ })
+ }
+
+ /**
+ * Get the remote block and transform it to the provided data type.
+ *
+ * If the block is persisted to the disk and stored at an executor running on the same host then
+ * first it is tried to be accessed using the local directories of the other executor directly.
+ * If the file is successfully identified then tried to be transformed by the provided
+ * transformation function which expected to open the file. If there is any exception during this
+ * transformation then block access falls back to fetching it from the remote executor via the
+ * network.
+ *
+ * @param blockId identifies the block to get
+ * @param bufferTransformer this transformer expected to open the file if the block is backed by a
+ * file by this it is guaranteed the whole content can be loaded
+ * @tparam T result type
+ * @return
+ */
+ private[spark] def getRemoteBlock[T](
+ blockId: BlockId,
+ bufferTransformer: ManagedBuffer => T): Option[T] = {
Review comment:
I have seen cases that violate this, but they seemed to me to be more generic
methods (loan patterns).
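
For reference, the fallback described in the doc comment of getRemoteBlock could
be sketched roughly as below. This is only an illustration under assumptions,
not the code in this PR: Buffer is a hypothetical stand-in for ManagedBuffer,
and readLocal and fetchRemote are made-up placeholders for the local-directory
lookup and the network fetch.

```scala
import scala.util.control.NonFatal

// Hypothetical stand-in for Spark's ManagedBuffer; not the real class.
final case class Buffer(source: String, bytes: Array[Byte])

object RemoteBlockSketch {
  // Hypothetical helper: try the same-host file first, then the network.
  // readLocal and fetchRemote are assumed placeholders, not Spark APIs.
  def getBlock[T](
      readLocal: () => Option[Buffer],
      fetchRemote: () => Option[Buffer],
      bufferTransformer: Buffer => T): Option[T] = {
    val fromLocal: Option[T] =
      try {
        // The transformer is expected to open the file, so a failure
        // while reading it surfaces here and not later.
        readLocal().map(bufferTransformer)
      } catch {
        // Any non-fatal failure on the local path triggers the fallback.
        case NonFatal(_) => None
      }
    // Use the network only if the local attempt produced nothing.
    fromLocal.orElse(fetchRemote().map(bufferTransformer))
  }
}
```

Because the transformer runs inside the try block, an exception raised while
opening or reading the local file leads to the network fetch instead of
propagating to the caller.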