On another note, when it comes to checkpointing in structured streaming: I noticed that if I have a stream running off S3 and I kill the process, then the next time the process starts running it duplicates the last record inserted. Is that normal?
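For context, a minimal sketch of the kind of query described here, assuming a JSON file source on S3 with a checkpoint location; the bucket paths, schema, and sink are placeholders for illustration, not the original job:

    // Sketch only: hypothetical paths and schema, not the original pipeline.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder.appName("s3-stream-sketch").getOrCreate()

    // File sources require an explicit schema when streaming.
    val schema = new StructType()
      .add("id", StringType)
      .add("value", StringType)

    // Stream off the "test" folder containing update1 / update2.
    val updates = spark.readStream
      .schema(schema)
      .json("s3a://my-bucket/test/")

    // Checkpointing is what should let a restarted query resume where it left off.
    val query = updates.writeStream
      .format("parquet")
      .option("path", "s3a://my-bucket/out/")
      .option("checkpointLocation", "s3a://my-bucket/checkpoints/test/")
      .start()

    query.awaitTermination()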
So say I have streaming enabled on one folder "test" which only has two files, "update1" and "update2", and then I kill the Spark job using Ctrl+C. When I rerun the stream it picks up "update2" again. Is this normal? Isn't Ctrl+C a failure? I would expect checkpointing to know that update2 was already processed.

Regards
Sam

On Tue, Feb 7, 2017 at 4:58 PM, Sam Elamin <hussam.ela...@gmail.com> wrote:

> Thanks Michael!
>
> On Tue, Feb 7, 2017 at 4:49 PM, Michael Armbrust <mich...@databricks.com> wrote:
>
>> Here's a JIRA: https://issues.apache.org/jira/browse/SPARK-19497
>>
>> We should add this soon.
>>
>> On Tue, Feb 7, 2017 at 8:35 AM, Sam Elamin <hussam.ela...@gmail.com> wrote:
>>
>>> Hi All
>>>
>>> When trying to read a stream off S3 and I try to drop duplicates I get
>>> the following error:
>>>
>>> Exception in thread "main" org.apache.spark.sql.AnalysisException:
>>> Append output mode not supported when there are streaming aggregations on
>>> streaming DataFrames/DataSets;;
>>>
>>> What's strange is that if I use the batch "spark.read.json", it works.
>>>
>>> Can I assume you can't drop duplicates in structured streaming?
>>>
>>> Regards
>>> Sam
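For anyone hitting the same error, a minimal sketch of the two cases discussed above. Column names and paths are hypothetical; the streaming variant assumes the deduplication support tracked under SPARK-19497 (available from Spark 2.2), where dropDuplicates combined with a watermark is allowed in append mode:

    // Hypothetical schema and paths, for illustration only.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder.appName("dedup-sketch").getOrCreate()

    val schema = new StructType()
      .add("eventId", StringType)
      .add("eventTime", TimestampType)

    // Batch read: dropDuplicates works as usual.
    val batchDeduped = spark.read.schema(schema)
      .json("s3a://my-bucket/test/")
      .dropDuplicates("eventId")

    // Streaming read: in Spark 2.1 this raised the AnalysisException quoted above.
    // With SPARK-19497 (Spark 2.2+), dropDuplicates plus a watermark is supported
    // in append mode, because old deduplication state can eventually be dropped.
    val streamDeduped = spark.readStream.schema(schema)
      .json("s3a://my-bucket/test/")
      .withWatermark("eventTime", "10 minutes")
      .dropDuplicates("eventId", "eventTime")

    val query = streamDeduped.writeStream
      .format("parquet")
      .option("path", "s3a://my-bucket/deduped/")
      .option("checkpointLocation", "s3a://my-bucket/checkpoints/dedup/")
      .outputMode("append")
      .start()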