Re: eager? in dataframe's checkpoint

2017-02-02 Thread Jean Georges Perrin
I wrote this piece based on all that, hopefully it will help: http://jgp.net/2017/02/02/what-are-spark-checkpoints-on-dataframes/

Re: eager? in dataframe's checkpoint

2017-01-31 Thread Koert Kuipers
I thought RDD.checkpoint is async? checkpointData is indeed updated synchronously, but checkpointData.isCheckpointed is false until the actual checkpoint operation has completed, and until then any operation runs on the original RDD. For example notice how

Re: eager? in dataframe's checkpoint

2017-01-31 Thread Burak Yavuz
Hi Koert, When eager is true, we return you a new DataFrame that depends on the files written out to the checkpoint directory. All previous operations on the checkpointed DataFrame are gone forever. You basically start fresh. AFAIK, when eager is true, the method will not return until the
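
The eager/lazy distinction described in this reply can be sketched with a toy model in plain Python. This is not Spark code: `Checkpointed`, `checkpoint`, `expensive` and the `calls` list are hypothetical names invented for the illustration, and the in-memory `saved` list stands in for files written to the checkpoint directory. The sketch assumes the behaviour described in the thread, namely that an eager checkpoint does the work before returning, while a lazy one defers it to the first action.

```python
calls = []  # records every time the upstream computation actually runs


def expensive():
    # Stand-in for the full chain of operations before the checkpoint.
    calls.append("ran")
    return [1, 2, 3]


class Checkpointed:
    """Toy result of a checkpoint: computes once, then serves saved rows."""

    def __init__(self, compute):
        self.compute = compute  # zero-argument function producing the rows
        self.saved = None       # rows "written out", or None if not yet done

    def collect(self):
        # The first action performs the write; later actions reuse it.
        if self.saved is None:
            self.saved = list(self.compute())
        return self.saved


def checkpoint(compute, eager=True):
    cp = Checkpointed(compute)
    if eager:
        cp.collect()  # do the work now, before returning to the caller
    return cp


eager_cp = checkpoint(expensive, eager=True)
runs_after_eager = len(calls)    # the computation has already run

lazy_cp = checkpoint(expensive, eager=False)
runs_after_lazy = len(calls)     # nothing new has run yet

lazy_cp.collect()
runs_after_action = len(calls)   # the first action triggered the work

lazy_cp.collect()                # served from the saved rows, no rerun
```

In this model the only difference between the two modes is when the cost is paid: at the `checkpoint` call for eager, at the first `collect` for lazy.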

Re: eager? in dataframe's checkpoint

2017-01-31 Thread Koert Kuipers
I understand that checkpoint cuts the lineage, but I am not fully sure I understand the role of eager. eager simply seems to materialize the RDD early with a count, right after the RDD has been checkpointed. But why is that useful? rdd.checkpoint is asynchronous, so when the rdd.count happens

Re: eager? in dataframe's checkpoint

2017-01-26 Thread Burak Yavuz
Hi, One of the goals of checkpointing is to cut the RDD lineage. Otherwise, once the lineage grows long enough, you run into a StackOverflowError. If you eagerly checkpoint, you basically cut the lineage there, and the next operations all depend on the checkpointed DataFrame. If you don't checkpoint, you continue to build the
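
To make the lineage-cutting idea concrete, here is a toy model in plain Python. It is not Spark code: the `Node` class, its `depth` method and the `checkpoint` function are hypothetical names made up for this sketch, and the in-memory snapshot stands in for data written to a checkpoint directory. The toy replays its lineage with a loop, so it never overflows the stack itself; it only shows how the chain of pending steps grows and how a checkpoint resets it.

```python
class Node:
    """Toy stand-in for an RDD/DataFrame: a transformation plus a parent."""

    def __init__(self, parent=None, fn=None, data=None):
        self.parent = parent  # previous step in the lineage, or None
        self.fn = fn          # transformation applied to each parent row
        self.data = data      # materialized rows (only set on a source)

    def map(self, fn):
        # Lazy: only records one more step in the lineage.
        return Node(parent=self, fn=fn)

    def depth(self):
        # Number of steps that must be replayed to compute this node.
        d, node = 0, self
        while node.parent is not None:
            d, node = d + 1, node.parent
        return d

    def collect(self):
        # Walk back to the source, then replay every recorded step in order.
        steps, node = [], self
        while node.parent is not None:
            steps.append(node.fn)
            node = node.parent
        rows = list(node.data)
        for fn in reversed(steps):
            rows = [fn(r) for r in rows]
        return rows


def checkpoint(node):
    # Materialize once and return a fresh source with no parent.
    return Node(data=node.collect())


df = Node(data=[1, 2, 3])
for _ in range(1000):
    df = df.map(lambda x: x + 1)   # lineage grows by one step per iteration

cut = checkpoint(df)               # lineage is cut here
after = cut.map(lambda x: x * 2)   # depends only on the checkpointed data
```

Here `df.depth()` is 1000, while `cut.depth()` is 0 and `after.depth()` is 1: everything built after the checkpoint starts from the saved rows rather than from the original chain.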

eager? in dataframe's checkpoint

2017-01-26 Thread Jean Georges Perrin
Hey Sparkers, Trying to understand the Dataframe's checkpoint (not in the context of streaming) https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Dataset.html#checkpoint(boolean)