SCHJonathan commented on code in PR #52154:
URL: https://github.com/apache/spark/pull/52154#discussion_r2371717085


##########
sql/connect/common/src/main/protobuf/spark/connect/pipelines.proto:
##########
@@ -90,6 +92,24 @@ message PipelineCommand {
     optional string format = 8;
   }
 
+  // Metadata about why a query function failed to be executed successfully.
+  message QueryFunctionFailure {
+  // Identifier for a dataset within the graph that the query function needed to know the
+  // schema of, but which had not yet been analyzed itself.
+    optional string missing_dependency = 1;

Review Comment:
   This is a very good point. This proto essentially describes the case where a `Flow`'s query function cannot be lazily analyzed by the Spark Connect client because it triggers eager analysis (e.g., `df.schema` / `df.isStreaming`) while the dependencies represented by the `df` have not yet been resolved by the dataflow graph.
   
   It's also quite possible for the `df` to contain multiple unresolved dependencies (e.g., a multi-table join). Therefore, instead of storing a string identifier for a single missing dependency, we directly pass the entire logical plan to the server and let the server filter out which leaf nodes in the plan have not yet been resolved by the dataflow graph.
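   To make the idea concrete, here is a toy sketch of the server-side filtering described above. All names are hypothetical and this is not the Spark Connect API; it only illustrates walking a plan's leaf nodes and collecting every dataset the dataflow graph has not yet resolved.

```python
# Hypothetical sketch: a minimal plan tree and a leaf-filtering pass.
# In real Spark this would operate on Catalyst logical plan nodes; here we
# use a tiny stand-in tree to show the traversal.
from dataclasses import dataclass, field


@dataclass
class Plan:
    name: str                          # e.g. an unresolved relation identifier
    children: list = field(default_factory=list)


def unresolved_leaves(plan: Plan, resolved: set) -> list:
    """Return leaf dataset names not yet analyzed by the dataflow graph."""
    if not plan.children:
        return [] if plan.name in resolved else [plan.name]
    out = []
    for child in plan.children:
        out.extend(unresolved_leaves(child, resolved))
    return out


# A multi-table join has several leaves, so a single string identifier
# could not report all missing dependencies at once.
join = Plan("join", [Plan("orders"), Plan("customers")])
print(unresolved_leaves(join, resolved={"orders"}))  # ['customers']
```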



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

