allisonwang-db commented on code in PR #46833:
URL: https://github.com/apache/spark/pull/46833#discussion_r1643390347


##########
python/docs/source/user_guide/sql/python_data_source.rst:
##########
@@ -109,6 +112,42 @@ Define the reader logic to generate synthetic data. Use the `faker` library to p
                     row.append(value)
                 yield tuple(row)
 
+**Implement the Writer**
+
+Create a fake data source writer that processes each partition of data, counts the rows, and
+prints either the total row count after a successful write or the number of failed tasks if the write fails.
+
+.. code-block:: python
+
+    from dataclasses import dataclass
+    from typing import Iterator, List
+    from pyspark.sql.types import Row
+    from pyspark.sql.datasource import DataSource, DataSourceWriter, WriterCommitMessage
+
+    @dataclass
+    class SimpleCommitMessage(WriterCommitMessage):
+        partition_id: int
+        count: int
+
+    class FakeDataSourceWriter(DataSourceWriter):
+
+        def write(self, rows: Iterator[Row]) -> SimpleCommitMessage:
+            from pyspark import TaskContext

Review Comment:
   This import actually needs to be inside the `write` method; otherwise it will throw a serialization error.
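   The reason is that Spark pickles the writer instance on the driver and deserializes it on the executors; an import resolved inside the method body is re-evaluated at call time on the worker rather than captured during serialization. A minimal plain-Python sketch of that pattern (no pyspark required; `PartitionCounter` and the `collections.Counter` stand-in are hypothetical illustrations, not the Spark API):

   ```python
   import pickle
   from typing import Iterator, Tuple

   class PartitionCounter:
       """Hypothetical stand-in for FakeDataSourceWriter (not the Spark API)."""

       def write(self, rows: Iterator[Tuple]) -> int:
           # Deferred import: resolved when the deserialized copy runs,
           # so nothing extra is captured when the object is pickled.
           # In the real snippet this slot holds `from pyspark import TaskContext`.
           from collections import Counter
           counts = Counter()
           for _ in rows:
               counts["rows"] += 1
           return counts["rows"]

   # Spark ships the writer from driver to executors via pickle;
   # a local pickle round-trip approximates that path.
   clone = pickle.loads(pickle.dumps(PartitionCounter()))
   print(clone.write(iter([("a",), ("b",), ("c",)])))  # 3
   ```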



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

