HyukjinKwon commented on a change in pull request #30835:
URL: https://github.com/apache/spark/pull/30835#discussion_r545849671
##########
File path: python/pyspark/sql/streaming.py
##########
@@ -1464,11 +1491,82 @@ def start(self, path=None, format=None,
outputMode=None, partitionBy=None, query
else:
return self._sq(self._jwrite.start(path))
+ def toTable(self, tableName, format=None, outputMode=None, partitionBy=None, queryName=None,
+     **options):
+ """
+ Streams the contents of the :class:`DataFrame` to the output table.
+
+ A new table will be created if the table does not exist. The returned [[StreamingQuery]]
+ object can be used to interact with the stream.
+
+ .. versionadded:: 3.2.0
+
+ Parameters
+ ----------
+ tableName : str
+ string, for the name of the table.
+ format : str, optional
+ the format used to save.
+ outputMode : str, optional
+ specifies how data of a streaming DataFrame/Dataset is written to a
+ streaming sink.
+
+ * `append`: Only the new rows in the streaming DataFrame/Dataset will be written to the
+ sink
+ * `complete`: All the rows in the streaming DataFrame/Dataset will be written to the
+ sink every time there are some updates
+ * `update`: only the rows that were updated in the streaming DataFrame/Dataset will be
+ written to the sink every time there are some updates. If the query doesn't contain
+ aggregations, it will be equivalent to `append` mode.
+ partitionBy : str or list, optional
+ names of partitioning columns
+ queryName : str, optional
+ unique name for the query
+ **options : dict
+ All other string options. You may want to provide a `checkpointLocation`.
+
+ Notes
+ -----
+ This API is evolving.
+
+ Examples
+ --------
+ >>> sq = sdf.writeStream.format('parquet').queryName('this_query').option(
+ ...     'checkpointLocation', '/tmp/checkpoint').toTable(output_table_name)
+ >>> sq.isActive
+ True
+ >>> sq.name
+ 'this_query'
+ >>> sq.stop()
+ >>> sq.isActive
+ False
+ >>> sq = sdf.writeStream.trigger(processingTime='5 seconds').toTable(
+ ...     output_table_name, queryName='that_query', outputMode="append", format='parquet',
+ ...     checkpointLocation='/tmp/checkpoint')
+ >>> sq.name
+ 'that_query'
+ >>> sq.isActive
+ True
+ >>> sq.stop()
+ """
Review comment:
I was thinking something like ... :
```python
"""
Starts the execution of the streaming query, which will continually output results to the given
table as new data arrives. A new table will be created if the table does not exist. The returned
:class:`StreamingQuery` object can be used to interact with the stream.

.. versionadded:: 3.2.0

Parameters
----------
tableName : str
    string, for the name of the table.
format : str, optional
    the format used to save.
outputMode : str, optional
    specifies how data of a streaming DataFrame/Dataset is written to a
    streaming sink.

    * `append`: Only the new rows in the streaming DataFrame/Dataset will be written to the
      sink
    * `complete`: All the rows in the streaming DataFrame/Dataset will be written to the
      sink every time there are some updates
    * `update`: only the rows that were updated in the streaming DataFrame/Dataset will be
      written to the sink every time there are some updates. If the query doesn't contain
      aggregations, it will be equivalent to `append` mode.
partitionBy : str or list, optional
    names of partitioning columns
queryName : str, optional
    unique name for the query
**options : dict
    All other string options. You may want to provide a `checkpointLocation`.

Notes
-----
This API is evolving.

Examples
--------
>>> sq = sdf.writeStream.format('parquet').queryName('this_query').option(
...     'checkpointLocation', '/tmp/checkpoint').toTable(output_table_name)
>>> sq.isActive
True
>>> sq.name
'this_query'
>>> sq.stop()
>>> sq.isActive
False
>>> sq = sdf.writeStream.trigger(processingTime='5 seconds').toTable(
...     output_table_name, queryName='that_query', outputMode="append", format='parquet',
...     checkpointLocation='/tmp/checkpoint')
>>> sq.name
'that_query'
>>> sq.isActive
True
>>> sq.stop()
"""
```
or ..
```python
"""
Starts the execution of the streaming query, which will continually output results to the given
table as new data arrives.

A new table will be created if the table does not exist. The returned
:class:`StreamingQuery` object can be used to interact with the stream.

.. versionadded:: 3.2.0

Parameters
----------
tableName : str
    string, for the name of the table.
format : str, optional
    the format used to save.
outputMode : str, optional
    specifies how data of a streaming DataFrame/Dataset is written to a
    streaming sink.

    * `append`: Only the new rows in the streaming DataFrame/Dataset will be written to the
      sink
    * `complete`: All the rows in the streaming DataFrame/Dataset will be written to the
      sink every time there are some updates
    * `update`: only the rows that were updated in the streaming DataFrame/Dataset will be
      written to the sink every time there are some updates. If the query doesn't contain
      aggregations, it will be equivalent to `append` mode.
partitionBy : str or list, optional
    names of partitioning columns
queryName : str, optional
    unique name for the query
**options : dict
    All other string options. You may want to provide a `checkpointLocation`.

Notes
-----
This API is evolving.

Examples
--------
>>> sq = sdf.writeStream.format('parquet').queryName('this_query').option(
...     'checkpointLocation', '/tmp/checkpoint').toTable(output_table_name)
>>> sq.isActive
True
>>> sq.name
'this_query'
>>> sq.stop()
>>> sq.isActive
False
>>> sq = sdf.writeStream.trigger(processingTime='5 seconds').toTable(
...     output_table_name, queryName='that_query', outputMode="append", format='parquet',
...     checkpointLocation='/tmp/checkpoint')
>>> sq.name
'that_query'
>>> sq.isActive
True
>>> sq.stop()
"""
```
Either way should be fine. There's no strict rule on the docstrings
currently (except https://numpydoc.readthedocs.io/en/latest/format.html). It's
just my suggestion :-).
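As a side note on the output-mode wording both versions share, here is a toy, non-Spark sketch of how the three modes differ in what reaches the sink. This is only a loose analogy under simplifying assumptions (it models a running word count per micro-batch; real Spark `append` mode with aggregations additionally involves watermarks), not Spark's actual implementation:

```python
from collections import Counter


def run_batches(batches, mode):
    """Toy illustration of append/complete/update output modes."""
    counts = Counter()  # the "result table": word -> running count
    emitted = []        # what the sink would receive after each micro-batch
    for batch in batches:
        before = dict(counts)
        counts.update(batch)
        if mode == "complete":
            # all rows of the result table, every batch
            emitted.append(dict(counts))
        elif mode == "update":
            # only rows whose value changed in this batch
            emitted.append({w: c for w, c in counts.items() if before.get(w) != c})
        elif mode == "append":
            # only rows that are new to the result table
            emitted.append({w: c for w, c in counts.items() if w not in before})
    return emitted


batches = [["a", "b"], ["b", "c"]]
print(run_batches(batches, "complete"))  # [{'a': 1, 'b': 1}, {'a': 1, 'b': 2, 'c': 1}]
print(run_batches(batches, "update"))    # [{'a': 1, 'b': 1}, {'b': 2, 'c': 1}]
print(run_batches(batches, "append"))    # [{'a': 1, 'b': 1}, {'c': 1}]
```

Note how `update` re-emits `b` in the second batch while `append` does not, which is the distinction the bullet list in the docstring is trying to convey.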