WweiL commented on code in PR #46996:
URL: https://github.com/apache/spark/pull/46996#discussion_r1649448417
##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -206,7 +208,10 @@ def _repr_html_(self) -> Optional[str]:
@property
def write(self) -> "DataFrameWriter":
- return DataFrameWriter(self._plan, self._session)
+ def cb(qe: "ExecutionInfo") -> None:
+ self._execution_info = qe
+
+ return DataFrameWriter(self._plan, self._session, cb)
Review Comment:
Looks like writeStream is not overridden here, so I imagine streaming queries are
not supported yet.
In streaming, a query can have multiple DataFrames; what we do in Scala is to
access it with query.explain(), which uses lastExecution:
https://github.com/apache/spark/blob/94763438943e21ee8156a4f1a3facddbf8d45797/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L192
That is, as its name suggests, the QueryExecution (`IncrementalExecution`) of the
last execution.
We could also add a similar mechanism to the StreamingQuery object. This sounds
like an interesting follow-up that I'm interested in.
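For illustration, a writeStream override could thread the same callback through the streaming writer, mirroring the `write` property in the diff above. The sketch below is hypothetical: `ExecutionInfo`, `DataStreamWriter`, and `DataFrame` are stand-in classes demonstrating only the callback wiring, not the real pyspark.sql.connect types.

```python
# Hypothetical sketch of the callback wiring suggested above.
# The classes here are stand-ins, NOT the real pyspark.sql.connect API.
from typing import Callable, Optional


class ExecutionInfo:
    """Stand-in for the query-execution metadata object."""

    def __init__(self, metrics: str) -> None:
        self.metrics = metrics


class DataStreamWriter:
    """Stand-in writer that reports ExecutionInfo via a callback."""

    def __init__(
        self,
        plan: object,
        session: object,
        callback: Optional[Callable[[ExecutionInfo], None]] = None,
    ) -> None:
        self._plan = plan
        self._session = session
        self._callback = callback

    def start(self) -> None:
        # A real implementation would execute the streaming query;
        # here we only invoke the callback with placeholder info.
        if self._callback is not None:
            self._callback(ExecutionInfo(metrics="placeholder"))


class DataFrame:
    def __init__(self) -> None:
        self._plan = object()
        self._session = object()
        self._execution_info: Optional[ExecutionInfo] = None

    @property
    def writeStream(self) -> DataStreamWriter:
        # Same pattern as the `write` property in the diff:
        # the writer calls back into the DataFrame after execution.
        def cb(qe: ExecutionInfo) -> None:
            self._execution_info = qe

        return DataStreamWriter(self._plan, self._session, cb)


df = DataFrame()
df.writeStream.start()
# After start(), the callback has populated df._execution_info.
```

Exposing the last execution's info on a StreamingQuery object (the follow-up suggested above) would just mean storing the most recent `ExecutionInfo` the callback receives, analogous to `lastExecution` in StreamExecution.scala.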
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]