Performing Complete output without an aggregation would require building
up a table of the entire input and writing it out at each micro-batch,
which would quickly become prohibitively expensive. With an aggregation,
Spark only needs to keep the running aggregates and update them every
batch, so the memory requirement is much more reasonable.

(Note: I don't do a lot of work in streaming, so there may be additional
reasons, but these are the ones I remember from when I was looking at
integrating ML with Structured Streaming.)

On Fri, Aug 18, 2017 at 5:25 AM Jacek Laskowski <ja...@japila.pl> wrote:

> Hi,
>
> Why is there a requirement for a streaming aggregation in a streaming
> query? What would happen if Spark allowed Complete without a single
> aggregation? This is on the latest master.
>
> scala> val q = ids.
>      |   writeStream.
>      |   format("memory").
>      |   queryName("dups").
>      |   outputMode(OutputMode.Complete).  // <-- memory sink supports checkpointing for Complete output mode only
>      |   trigger(Trigger.ProcessingTime(30.seconds)).
>      |   option("checkpointLocation", "checkpoint-dir"). // <-- use checkpointing to save state between restarts
>      |   start
> org.apache.spark.sql.AnalysisException: Complete output mode not supported when there are no streaming aggregations on streaming DataFrames/Datasets;;
> Project [cast(time#10 as bigint) AS time#15L, id#6]
> +- Deduplicate [id#6], true
>    +- Project [cast(time#5 as timestamp) AS time#10, id#6]
>       +- Project [_1#2 AS time#5, _2#3 AS id#6]
>          +- StreamingExecutionRelation MemoryStream[_1#2,_2#3], [_1#2, _2#3]
>
>   at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:297)
>   at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForStreaming(UnsupportedOperationChecker.scala:115)
>   at org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:232)
>   at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:278)
>   at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:249)
>   ... 57 elided
>
> Regards,
> Jacek Laskowski
> ----
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
--
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau
