Github user tdas commented on a diff in the pull request:
https://github.com/apache/spark/pull/21477#discussion_r193631209
--- Diff: python/pyspark/sql/streaming.py ---
@@ -843,6 +844,169 @@ def trigger(self, processingTime=None, once=None, continuous=None):
         self._jwrite = self._jwrite.trigger(jTrigger)
         return self

+    def foreach(self, f):
+        """
+        Sets the output of the streaming query to be processed using the provided writer ``f``.
+        This is often used to write the output of a streaming query to arbitrary storage systems.
+        The processing logic can be specified in two ways.
+
+        #. A **function** that takes a row as input.
+            This is a simple way to express your processing logic. Note that this does
+            not allow you to deduplicate generated data when failures cause reprocessing of
+            some input data. That would require you to specify the processing logic in the next
+            way.
+
+        #. An **object** with a ``process`` method and optional ``open`` and ``close`` methods.
--- End diff ---
I discussed this with @marmbrus. If there is a ForeachWriter class in
Python, then users will have to additionally import it. That's just extra
overhead that can be avoided by simply allowing any class with the
appropriate methods. One less step for Python users.
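
To make the proposal concrete, here is a minimal sketch of the two usages
under discussion, assuming the duck-typed approach from this thread (no
ForeachWriter import needed). The `rate` source and all names here are just
for illustration; the `open(partition_id, epoch_id)` / `process(row)` /
`close(error)` signatures follow what the docstring above describes.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("foreach-sketch").getOrCreate()
# Hypothetical toy streaming source, just to have something to write out.
df = spark.readStream.format("rate").load()

# 1. A plain function that takes a row as input.
def process_row(row):
    print(row)

query1 = df.writeStream.foreach(process_row).start()

# 2. Any object with a process() method and optional open()/close()
#    methods; no base class import required under this proposal.
class RowPrinter:
    def open(self, partition_id, epoch_id):
        # Return True to process the rows of this partition/epoch.
        return True

    def process(self, row):
        print(row)

    def close(self, error):
        # Clean up resources; `error` is the exception, if any.
        pass

query2 = df.writeStream.foreach(RowPrinter()).start()
```

The design trade-off: with duck typing, the contract is enforced only at
runtime rather than via a shared base class, but the common case (a bare
function, or a small class defined inline) needs no extra import.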
---