Github user tdas commented on a diff in the pull request:
https://github.com/apache/spark/pull/21477#discussion_r193886200
--- Diff: python/pyspark/sql/streaming.py ---
@@ -843,6 +844,169 @@ def trigger(self, processingTime=None, once=None, continuous=None):
         self._jwrite = self._jwrite.trigger(jTrigger)
         return self
+
+    def foreach(self, f):
+        """
+        Sets the output of the streaming query to be processed using the provided writer ``f``.
+        This is often used to write the output of a streaming query to arbitrary storage systems.
+        The processing logic can be specified in two ways.
+
+        #. A **function** that takes a row as input.
--- End diff --
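For context, a minimal sketch of how the two variants could be used from PySpark once this change lands. The `rate` source and the `print` calls are placeholders for illustration only, and `RowWriter` is a hypothetical name:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("foreach-sketch").getOrCreate()

# Hypothetical streaming source, used only to make the sketch runnable.
stream_df = spark.readStream.format("rate").load()

# Variant 1: a plain function that takes each row as input.
def process_row(row):
    print(row)  # stand-in for writing the row to an external store

query = stream_df.writeStream.foreach(process_row).start()

# Variant 2: an object with a process() method; open()/close() are optional.
class RowWriter:
    def open(self, partition_id, epoch_id):
        return True  # True means this partition's rows should be processed

    def process(self, row):
        print(row)

    def close(self, error):
        pass

query2 = stream_df.writeStream.foreach(RowWriter()).start()
```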
Python APIs anyway have slight divergences from the Scala/Java APIs in order to provide a better experience for Python users. For example, `StreamingQuery.lastProgress` returns an object of type `StreamingQueryProgress` in Java/Scala but returns a dict in Python, because Python users are more used to dealing with dicts, while Java/Scala users (typed languages) are more comfortable with structured types. Similarly, in `DataFrame.select` you can refer to columns in Scala as `$"columnName"`, but in Python you can refer to them as `df.columnName`. If we strictly adhered to pure consistency, we could not make things convenient for users in each language, and ultimately convenience is what matters for the user experience. So it's okay for Python to support a superset of the types supported in Java/Scala.
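To illustrate those conveniences concretely, with a hypothetical running query `query` and a DataFrame `df` that has a column named `columnName`:

```python
# Python: lastProgress is a plain dict, so its fields are indexed directly.
progress = query.lastProgress
if progress is not None:
    print(progress["batchId"], progress["numInputRows"])

# Python: columns can be referenced as attributes of the DataFrame,
# where Scala would write df.select($"columnName").
df.select(df.columnName)
```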
Personally, I think we should add the lambda variant to Scala as well. But that decision for Scala is independent of this PR, as there is enough justification for adding the lambda variant for Python on its own.
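In Python, the lambda variant then reads naturally as a one-liner (again with a hypothetical streaming DataFrame `stream_df`):

```python
query = stream_df.writeStream.foreach(lambda row: print(row)).start()
```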