HyukjinKwon commented on a change in pull request #30835:
URL: https://github.com/apache/spark/pull/30835#discussion_r545849671
##########
File path: python/pyspark/sql/streaming.py
##########
@@ -1464,11 +1491,82 @@ def start(self, path=None, format=None,
outputMode=None, partitionBy=None, query
else:
return self._sq(self._jwrite.start(path))
+ def toTable(self, tableName, format=None, outputMode=None, partitionBy=None, queryName=None,
+     **options):
+ """
+ Streams the contents of the :class:`DataFrame` to the output table.
+
+ A new table will be created if the table does not exist. The returned [[StreamingQuery]]
+ object can be used to interact with the stream.
+
+ .. versionadded:: 3.2.0
+
+ Parameters
+ ----------
+ tableName : str
+ string, for the name of the table.
+ format : str, optional
+ the format used to save.
+ outputMode : str, optional
+ specifies how data of a streaming DataFrame/Dataset is written to a
+ streaming sink.
+
+ * `append`: Only the new rows in the streaming DataFrame/Dataset will be written to the
+ sink
+ * `complete`: All the rows in the streaming DataFrame/Dataset will be written to the
+ sink every time there are some updates
+ * `update`: only the rows that were updated in the streaming DataFrame/Dataset will be
+ written to the sink every time there are some updates. If the query doesn't contain
+ aggregations, it will be equivalent to `append` mode.
+ partitionBy : str or list, optional
+ names of partitioning columns
+ queryName : str, optional
+ unique name for the query
+ **options : dict
+ All other string options. You may want to provide a `checkpointLocation`.
+
+ Notes
+ -----
+ This API is evolving.
+
+ Examples
+ --------
+ >>> sq = sdf.writeStream.format('parquet').queryName('this_query').option(
+ ...     'checkpointLocation', '/tmp/checkpoint').toTable(output_table_name)
+ >>> sq.isActive
+ True
+ >>> sq.name
+ 'this_query'
+ >>> sq.stop()
+ >>> sq.isActive
+ False
+ >>> sq = sdf.writeStream.trigger(processingTime='5 seconds').toTable(
+ ...     output_table_name, queryName='that_query', outputMode="append", format='parquet',
+ ...     checkpointLocation='/tmp/checkpoint')
+ >>> sq.name
+ 'that_query'
+ >>> sq.isActive
+ True
+ >>> sq.stop()
+ """
Review comment:
I was thinking something like ... :
```python
"""
Starts the execution of the streaming query, which will continually output results to the given
table as new data arrives. A new table will be created if the table does not exist. The returned
:class:`StreamingQuery` object can be used to interact with the stream.

.. versionadded:: 3.2.0

Parameters
----------
tableName : str
    string, for the name of the table.
format : str, optional
    the format used to save.
outputMode : str, optional
    specifies how data of a streaming DataFrame/Dataset is written to a
    streaming sink.

    * `append`: Only the new rows in the streaming DataFrame/Dataset will be written to the
      sink
    * `complete`: All the rows in the streaming DataFrame/Dataset will be written to the
      sink every time there are some updates
    * `update`: only the rows that were updated in the streaming DataFrame/Dataset will be
      written to the sink every time there are some updates. If the query doesn't contain
      aggregations, it will be equivalent to `append` mode.
partitionBy : str or list, optional
    names of partitioning columns
queryName : str, optional
    unique name for the query
**options : dict
    All other string options. You may want to provide a `checkpointLocation`.

Notes
-----
This API is evolving.

Examples
--------
>>> sq = sdf.writeStream.format('parquet').queryName('this_query').option(
...     'checkpointLocation', '/tmp/checkpoint').toTable(output_table_name)
>>> sq.isActive
True
>>> sq.name
'this_query'
>>> sq.stop()
>>> sq.isActive
False
>>> sq = sdf.writeStream.trigger(processingTime='5 seconds').toTable(
...     output_table_name, queryName='that_query', outputMode="append", format='parquet',
...     checkpointLocation='/tmp/checkpoint')
>>> sq.name
'that_query'
>>> sq.isActive
True
>>> sq.stop()
"""
```
or ..
```python
"""
Starts the execution of the streaming query, which will continually output results to the given
table as new data arrives.

A new table will be created if the table does not exist. The returned
:class:`StreamingQuery` object can be used to interact with the stream.

.. versionadded:: 3.2.0

Parameters
----------
tableName : str
    string, for the name of the table.
format : str, optional
    the format used to save.
outputMode : str, optional
    specifies how data of a streaming DataFrame/Dataset is written to a
    streaming sink.

    * `append`: Only the new rows in the streaming DataFrame/Dataset will be written to the
      sink
    * `complete`: All the rows in the streaming DataFrame/Dataset will be written to the
      sink every time there are some updates
    * `update`: only the rows that were updated in the streaming DataFrame/Dataset will be
      written to the sink every time there are some updates. If the query doesn't contain
      aggregations, it will be equivalent to `append` mode.
partitionBy : str or list, optional
    names of partitioning columns
queryName : str, optional
    unique name for the query
**options : dict
    All other string options. You may want to provide a `checkpointLocation`.

Notes
-----
This API is evolving.

Examples
--------
>>> sq = sdf.writeStream.format('parquet').queryName('this_query').option(
...     'checkpointLocation', '/tmp/checkpoint').toTable(output_table_name)
>>> sq.isActive
True
>>> sq.name
'this_query'
>>> sq.stop()
>>> sq.isActive
False
>>> sq = sdf.writeStream.trigger(processingTime='5 seconds').toTable(
...     output_table_name, queryName='that_query', outputMode="append", format='parquet',
...     checkpointLocation='/tmp/checkpoint')
>>> sq.name
'that_query'
>>> sq.isActive
True
>>> sq.stop()
"""
```
Either way should be fine. There's no strict rule on the docstrings
currently (except https://numpydoc.readthedocs.io/en/latest/format.html). It's
just my suggestion :-).
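As a side note on the output-mode wording both versions share, here is a toy, non-Spark sketch of how the three modes differ in what reaches the sink. This is only a loose analogy under simplifying assumptions (it models a running word count per micro-batch; real Spark `append` mode with aggregations additionally involves watermarks), not Spark's actual implementation:

```python
from collections import Counter


def run_batches(batches, mode):
    """Toy illustration of append/complete/update output modes."""
    counts = Counter()  # the "result table": word -> running count
    emitted = []        # what the sink would receive after each micro-batch
    for batch in batches:
        before = dict(counts)
        counts.update(batch)
        if mode == "complete":
            # all rows of the result table, every batch
            emitted.append(dict(counts))
        elif mode == "update":
            # only rows whose value changed in this batch
            emitted.append({w: c for w, c in counts.items() if before.get(w) != c})
        elif mode == "append":
            # only rows that are new to the result table
            emitted.append({w: c for w, c in counts.items() if w not in before})
    return emitted


batches = [["a", "b"], ["b", "c"]]
print(run_batches(batches, "complete"))  # [{'a': 1, 'b': 1}, {'a': 1, 'b': 2, 'c': 1}]
print(run_batches(batches, "update"))    # [{'a': 1, 'b': 1}, {'b': 2, 'c': 1}]
print(run_batches(batches, "append"))    # [{'a': 1, 'b': 1}, {'c': 1}]
```

Note how `update` re-emits `b` in the second batch while `append` does not, which is the distinction the bullet list in the docstring is trying to convey.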