Re: [PR] [SPARK-48330][SS][PYTHON] Fix the python streaming data source timeout issue for large trigger interval [spark]

via GitHub Sun, 19 May 2024 22:02:21 -0700


chaoqin-li1123 commented on code in PR #46651:
URL: https://github.com/apache/spark/pull/46651#discussion_r1606250225



##########
python/pyspark/sql/worker/python_streaming_sink_runner.py:
##########
@@ -82,36 +91,36 @@ def main(infile: IO, outfile: IO) -> None:
         overwrite = read_bool(infile)
         # Instantiate data source reader.
         try:
+            # Create the data source writer instance.
             writer = data_source.streamWriter(schema=schema, overwrite=overwrite)
-            # Initialization succeed.
+
+            # Receive the commit messages.
+            num_messages = read_int(infile)
+            commit_messages = []
+            for _ in range(num_messages):
+                message = pickleSer._read_with_length(infile)
+                if message is not None and not isinstance(message, WriterCommitMessage):
+                    raise PySparkAssertionError(
+                        error_class="PYTHON_DATA_SOURCE_TYPE_MISMATCH",
+                        message_parameters={
+                            "expected": "an instance of WriterCommitMessage",
+                            "actual": f"'{type(message).__name__}'",
+                        },
+                    )
+                commit_messages.append(message)
+
+            batch_id = read_long(infile)
+            abort = read_bool(infile)
+
+            # Commit or abort the Python data source write.
+            # Note the commit messages can be None if there are failed tasks.
+            if abort:
+                writer.abort(commit_messages, batch_id)  # type: ignore[arg-type]
+            else:
+                writer.commit(commit_messages, batch_id)  # type: ignore[arg-type]
+                # Send a status code back to JVM.

Review Comment:
   Fixed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

