Jan Bols created SPARK-25756:
--------------------------------

             Summary: pyspark pandas_udf does not respect append outputMode in 
structured streaming
                 Key: SPARK-25756
                 URL: https://issues.apache.org/jira/browse/SPARK-25756
             Project: Spark
          Issue Type: Bug
          Components: PySpark, Structured Streaming
    Affects Versions: 2.3.2
            Reporter: Jan Bols


When using the following setup:
 * structured streaming
 * a watermark and groupBy followed by an apply using a pandas grouped map udf
 * a sink using an append outputMode

I would expect the following:
 * udf to be called for each group --> OK
 * when new data arrives, the udf will be called again –> OK
 * when new data arrives for the same group, the udf will be called with the 
complete pandas dataframe of all received data for that group (up till the 
watermark) --> NOK: within the same group, the size of the pandas dataframe can 
decrease between invocations
 * the results are only written to the sink once the processing time is passed 
the watermark --> NOK: every time the udf is called, new results are being sent 
to the output

It looks like pandas udf is unusable for structured streaming this way.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to