xkrogen commented on code in PR #40147:
URL: https://github.com/apache/spark/pull/40147#discussion_r1116278139


##########
connector/connect/common/src/main/protobuf/spark/connect/base.proto:
##########
@@ -183,6 +183,60 @@ message ExecutePlanResponse {
   }
 }
 
+// Request to transfer client-local artifacts.
+message AddArtifactsRequest {
+
+  // Definition of an Artifact.
+  message Artifact {
+    // The name of the artifact is expected in the form of a "Relative Path" 
that is made up of a
+    // sequence of directories and the final file element.
+    // Examples of "Relative Path"s: "jars/test.jar", "classes/xyz.class", 
"abc.xyz", "a/b/X.jar".
+    // The server is expected to maintain the hierarchy of files as defined by 
their name. (i.e.
+    // The relative path of the file on the server's filesystem will be the 
same as the name of
+    // the provided artifact)
+    string name = 1;
+    // Raw data.
+    bytes data = 2;
+  }
+
+  // A number of small artifacts batched into a single RPC.
+  message Batch {
+    repeated Artifact artifacts = 1;
+  }
+
+  // The client_id is set by the client to be able to collate streaming 
responses from
+  // different queries.
+  string client_id = 1;
+
+  // User context
+  UserContext user_context = 2;
+
+  // The payload is either a batch of artifacts or a partial chunk of a large 
artifact.
+  oneof payload {
+    Batch batch = 3;
+    // A large artifact chunked into multiple requests. The server side should 
assume that the
+    // artifact has been completely uploaded either when it encounters a new 
artifact name, or
+    // when the stream is completed.
+    Artifact chunk = 4;
+  }
+}
+
+// Response to adding an artifact. Contains relevant metadata to verify 
successful transfer of
+// artifact(s).
+message AddArtifactsResponse {
+  // Metadata of an artifact.
+  message ArtifactSummary {
+    string name = 1;
+    // Size in bytes.
+    int64 size = 2;
+    // CRC to verify integrity of the transferred artifact.
+    int64 crc = 3;

Review Comment:
   Usually the sender of the artifact (the client) would provide the crc, so 
that the receiver (the server) can validate it after receiving. Why do we 
return the crc from the server back to the client? How is the client supposed 
to respond if the crc is incorrect?



##########
connector/connect/common/src/main/protobuf/spark/connect/base.proto:
##########
@@ -183,6 +183,60 @@ message ExecutePlanResponse {
   }
 }
 
+// Request to transfer client-local artifacts.
+message AddArtifactsRequest {
+
+  // Definition of an Artifact.
+  message Artifact {
+    // The name of the artifact is expected in the form of a "Relative Path" 
that is made up of a
+    // sequence of directories and the final file element.
+    // Examples of "Relative Path"s: "jars/test.jar", "classes/xyz.class", 
"abc.xyz", "a/b/X.jar".
+    // The server is expected to maintain the hierarchy of files as defined by 
their name. (i.e.
+    // The relative path of the file on the server's filesystem will be the 
same as the name of
+    // the provided artifact)
+    string name = 1;
+    // Raw data.
+    bytes data = 2;
+  }
+
+  // A number of small artifacts batched into a single RPC.
+  message Batch {
+    repeated Artifact artifacts = 1;
+  }
+
+  // The client_id is set by the client to be able to collate streaming 
responses from
+  // different queries.
+  string client_id = 1;
+
+  // User context
+  UserContext user_context = 2;
+
+  // The payload is either a batch of artifacts or a partial chunk of a large 
artifact.
+  oneof payload {
+    Batch batch = 3;
+    // A large artifact chunked into multiple requests. The server side should 
assume that the
+    // artifact has been completely uploaded either when it encounters a new 
artifact name, or
+    // when the stream is completed.
+    Artifact chunk = 4;

Review Comment:
   This protocol basically defines the end of an artifact by the start of a new 
artifact. It feels awkward. Why not include some metadata, such as total 
expected length for an artifact, to make it easy for the server to determine up 
front whether it is going to receive a full artifact or just a chunk of it? 
This could also make it easier for the server-side to, for example, allocate 
the entire required space up front.



##########
connector/connect/common/src/main/protobuf/spark/connect/base.proto:
##########
@@ -193,5 +247,9 @@ service SparkConnectService {
 
   // Analyzes a query and returns a [[AnalyzeResponse]] containing metadata 
about the query.
   rpc AnalyzePlan(AnalyzePlanRequest) returns (AnalyzePlanResponse) {}
+
+  // Add artifacts to the session and returns a [[AddArtifactsResponse]] 
containing metadata about
+  // the added artifacts.
+  rpc AddArtifacts(stream AddArtifactsRequest) returns (AddArtifactsResponse) 
{}

Review Comment:
   There's some somewhat concerning language about stream performance in Python 
on gRPC's [performance page](https://grpc.io/docs/guides/performance/), can you 
provide any benchmarks about how well this performs for JVM/Python clients? 
Individual JAR files, and the number of JAR files needed, can become quite 
large.



##########
connector/connect/common/src/main/protobuf/spark/connect/base.proto:
##########
@@ -183,6 +183,60 @@ message ExecutePlanResponse {
   }
 }
 
+// Request to transfer client-local artifacts.
+message AddArtifactsRequest {
+
+  // Definition of an Artifact.
+  message Artifact {
+    // The name of the artifact is expected in the form of a "Relative Path" 
that is made up of a
+    // sequence of directories and the final file element.
+    // Examples of "Relative Path"s: "jars/test.jar", "classes/xyz.class", 
"abc.xyz", "a/b/X.jar".
+    // The server is expected to maintain the hierarchy of files as defined by 
their name. (i.e.
+    // The relative path of the file on the server's filesystem will be the 
same as the name of
+    // the provided artifact)
+    string name = 1;
+    // Raw data.
+    bytes data = 2;
+  }
+
+  // A number of small artifacts batched into a single RPC.
+  message Batch {
+    repeated Artifact artifacts = 1;
+  }
+
+  // The client_id is set by the client to be able to collate streaming 
responses from
+  // different queries.
+  string client_id = 1;
+
+  // User context
+  UserContext user_context = 2;
+
+  // The payload is either a batch of artifacts or a partial chunk of a large 
artifact.
+  oneof payload {
+    Batch batch = 3;

Review Comment:
   This seems unnecessarily complex. Couldn't we just have `repeated Artifact 
artifacts` here directly? The client can choose to put multiple, or just one. 
What value is the `Batch` abstraction providing? Why is there an important 
distinction between a `Batch` and a single artifact?



##########
connector/connect/common/src/main/protobuf/spark/connect/base.proto:
##########
@@ -183,6 +183,60 @@ message ExecutePlanResponse {
   }
 }
 
+// Request to transfer client-local artifacts.
+message AddArtifactsRequest {
+
+  // Definition of an Artifact.
+  message Artifact {
+    // The name of the artifact is expected in the form of a "Relative Path" 
that is made up of a
+    // sequence of directories and the final file element.
+    // Examples of "Relative Path"s: "jars/test.jar", "classes/xyz.class", 
"abc.xyz", "a/b/X.jar".
+    // The server is expected to maintain the hierarchy of files as defined by 
their name. (i.e.
+    // The relative path of the file on the server's filesystem will be the 
same as the name of
+    // the provided artifact)
+    string name = 1;
+    // Raw data.
+    bytes data = 2;
+  }

Review Comment:
   Should consider allowing clients to specify artifacts from Ivy coordinates, 
either in addition or instead of directly providing the contents of a JAR? This 
seems to be a cleaner separation of client and server; the client shouldn't 
need to have a copy of the JARs it will use to execute some UDF. Spark already 
has support for localizing artifacts from an external repository.



##########
connector/connect/common/src/main/protobuf/spark/connect/base.proto:
##########
@@ -183,6 +183,60 @@ message ExecutePlanResponse {
   }
 }
 
+// Request to transfer client-local artifacts.
+message AddArtifactsRequest {
+
+  // Definition of an Artifact.
+  message Artifact {
+    // The name of the artifact is expected in the form of a "Relative Path" 
that is made up of a
+    // sequence of directories and the final file element.
+    // Examples of "Relative Path"s: "jars/test.jar", "classes/xyz.class", 
"abc.xyz", "a/b/X.jar".
+    // The server is expected to maintain the hierarchy of files as defined by 
their name. (i.e.
+    // The relative path of the file on the server's filesystem will be the 
same as the name of
+    // the provided artifact)
+    string name = 1;
+    // Raw data.
+    bytes data = 2;
+  }
+
+  // A number of small artifacts batched into a single RPC.
+  message Batch {
+    repeated Artifact artifacts = 1;
+  }
+
+  // The client_id is set by the client to be able to collate streaming 
responses from
+  // different queries.
+  string client_id = 1;
+
+  // User context
+  UserContext user_context = 2;
+
+  // The payload is either a batch of artifacts or a partial chunk of a large 
artifact.
+  oneof payload {
+    Batch batch = 3;
+    // A large artifact chunked into multiple requests. The server side should 
assume that the
+    // artifact has been completely uploaded either when it encounters a new 
artifact name, or
+    // when the stream is completed.
+    Artifact chunk = 4;
+  }
+}
+
+// Response to adding an artifact. Contains relevant metadata to verify 
successful transfer of
+// artifact(s).
+message AddArtifactsResponse {
+  // Metadata of an artifact.
+  message ArtifactSummary {
+    string name = 1;
+    // Size in bytes.
+    int64 size = 2;

Review Comment:
   Doesn't the client already know both of these? Why do we need to send them 
back?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to