Jan Bols created SPARK-25756: -------------------------------- Summary: pyspark pandas_udf does not respect append outputMode in structured streaming Key: SPARK-25756 URL: https://issues.apache.org/jira/browse/SPARK-25756 Project: Spark Issue Type: Bug Components: PySpark, Structured Streaming Affects Versions: 2.3.2 Reporter: Jan Bols
When using the following setup: * structured streaming * a watermark and groupBy followed by an apply using a pandas grouped map udf * a sink using an append outputMode I would expect the following: * udf to be called for each group --> OK * when new data arrives, the udf will be called again –> OK * when new data arrives for the same group, the udf will be called with the complete pandas dataframe of all received data for that group (up till the watermark) --> NOK: within the same group, the size of the pandas dataframe can decrease between invocations * the results are only written to the sink once the processing time is passed the watermark --> NOK: every time the udf is called, new results are being sent to the output It looks like pandas udf is unusable for structured streaming this way. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org