Yeah, turning it into an RDD should preserve the incremental planning.

On Tue, Jun 28, 2016 at 6:30 PM, Holden Karau <hol...@pigscanfly.ca> wrote:
> Ok, that makes sense (the JIRA where the restriction note was added didn't
> have a lot of details). So for now, would converting to an RDD inside of a
> custom Sink and then doing your operations on that be a reasonable
> workaround?
>
> On Tuesday, June 28, 2016, Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> This is not too broadly worded, and in general I would caution that any
>> interface in org.apache.spark.sql.catalyst or
>> org.apache.spark.sql.execution is considered internal and likely to change
>> between releases. We do plan to open a stable source/sink API in a
>> future release.
>>
>> The problem here is that the DataFrame is constructed using an
>> incrementalized physical query plan. If you call any operations on the
>> DataFrame that change the logical plan, you will lose prior state and the
>> DataFrame will return an incorrect result. Since this was discovered late
>> in the release process, we decided it was better to document the current
>> behavior rather than do a large refactoring.
>>
>> On Tue, Jun 28, 2016 at 12:59 PM, Holden Karau <hol...@pigscanfly.ca>
>> wrote:
>>
>>> Looking at the Sink in 2.0, there is a warning (added in SPARK-16020
>>> without a lot of details) that says "Note: You cannot apply any operators
>>> on `data` except consuming it (e.g., `collect/foreach`)." but I'm wondering
>>> if this restriction is perhaps too broadly worded? Provided that we consume
>>> the data in a blocking fashion, could we apply some other transformation
>>> beforehand? Or is there a better way to get equivalent foreachRDD
>>> functionality with the structured streaming API?
>>>
>>> On somewhat of a tangent - would it maybe make sense to mark
>>> transformations on Datasets which are not supported for Streaming use (e.g.
>>> toJson etc.)?
>>>
>>> Cheers,
>>>
>>> Holden :)
>>> --
>>> Cell : 425-233-8271
>>> Twitter: https://twitter.com/holdenkarau
>>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau