Actually, I realized keeping the info would not be enough as I need to find
back the checkpoint files to delete them :/
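(For reference, a rough sketch of how one could list the checkpoint files in
order to delete them, assuming the directory was set earlier with
sc.setCheckpointDir; going through the Hadoop FileSystem API makes it work on
HDFS as well as locally:)

import org.apache.hadoop.fs.Path

// assumes spark.sparkContext.setCheckpointDir(...) was called earlier
val checkpointDir = new Path(spark.sparkContext.getCheckpointDir.get)
val fs = checkpointDir.getFileSystem(spark.sparkContext.hadoopConfiguration)
// each checkpointed RDD gets its own subdirectory; print them here,
// or remove one with fs.delete(status.getPath, true)
fs.listStatus(checkpointDir).foreach(status => println(status.getPath))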
2017-10-25 19:07 GMT+02:00 Bernard Jesop :
> As far as I understand, Dataset.rdd is not the same as the internal RDD.
> It is just another RDD representation of the same Dataset.
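(A small sketch to illustrate the point above, assuming Spark 2.1 with a
checkpoint directory already set: the RDD returned by Dataset.rdd is not the
RDD that checkpoint() acts on, so its isCheckpointed flag never flips.)

val df = spark.range(4).toDF("id")
val df2 = df.checkpoint()        // checkpoints a copy of the internal RDD

df.rdd.isCheckpointed            // false: df.rdd was never checkpointed
df2.rdd.isCheckpointed           // false too: df2.rdd is yet another RDD
                                 // built on top of the checkpointed one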
So how can I know whether a Dataset has been checkpointed?
Should I manually keep track of that info?
2017-10-25 15:51 GMT+02:00 Bernard Jesop :
> Hello everyone,
>
> I have a question about checkpointing on dataset.
>
> It seems in 2.1.0 that there is a Dataset.checkpoint(), however unlike RDD
> there is no Dataset.isCheckpointed(). [...]
Hello everyone,
I have a question about checkpointing on dataset.
It seems in 2.1.0 that there is a Dataset.checkpoint(), however unlike RDD
there is no Dataset.isCheckpointed().
I wonder if Dataset.checkpoint is just syntactic sugar for
Dataset.rdd.checkpoint.
When I do:
Dataset.checkpoint; Dataset.rdd.isCheckpointed
the result is false.
>
> scala> df.show()
> +------+---+
> |    _1| _2|
> +------+---+
> | Scala| 35|
> |Python| 30|
> |     R| 15|
> |  Java| 20|
> +------+---+
>
>
> scala> df.rdd.isCheckpointed
> res4: Boolean = false
>
> scala> df.rdd.count()
> res5: Long = 4
Even after the count, the RDD is still not reported as "checkpointed".
Do you have any idea why? (knowing that the checkpoint file is created)
Best regards,
Bernard JESOP
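(A self-contained version of the session above, assuming a local Spark 2.1
shell, for anyone who wants to reproduce it. The checkpoint files do appear
on disk; only the isCheckpointed flag on Dataset.rdd stays false, because
checkpoint() returns a new Dataset and leaves df itself untouched:)

spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints") // example path
import spark.implicits._

val df = Seq(("Scala", 35), ("Python", 30), ("R", 15), ("Java", 20)).toDF
val dfc = df.checkpoint()   // eager by default: writes the checkpoint files now

df.rdd.isCheckpointed       // false, even though the files exist
dfc.rdd.isCheckpointed      // also false; to benefit from the checkpoint,
                            // keep using the returned Dataset dfc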
It seems to be because of this issue:
https://issues.apache.org/jira/browse/SPARK-10925
I added a checkpoint, as suggested, to break the lineage and it worked.
Best regards,
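(For the archives, a sketch of the shape of that fix, assuming the df / dfAgg
names from the messages below; checkpoint() materializes dfAgg and cuts the
lineage, so the join no longer sees the same attribute on both sides:)

spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints") // example path
val dfAgg = df.groupBy("S_ID")
  .agg(org.apache.spark.sql.functions.count("userName").as("usersCount"))
  .checkpoint()   // breaks the lineage, per the suggestion in SPARK-10925
val dfRes = df.join(dfAgg, Seq("S_ID"), "left_outer")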
2017-07-04 17:26 GMT+02:00 Bernard Jesop :
> Thanks Didac,
>
> My bad, actually this code is incomplete; it should also include the aggregation.
> [...] by S_ID.
>
> I guess that you are looking for something more like the following example:
> dfAgg = df.groupBy("S_ID")
>   .agg(org.apache.spark.sql.functions.count("userName").as("usersCount"),
>        org.apache.spark.sql.functions.collect_set("city").as("List..."))
Hello, I don't understand my error message.
Basically, all I am doing is:
- dfAgg = df.groupBy("S_ID")
- dfRes = df.join(dfAgg, Seq("S_ID"), "left_outer")
However I get this AnalysisException:
Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved
attribute(s) S_ID#1903L missing from [...]
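(To make the failure reproducible, a minimal sketch with made-up data and the
assumed columns S_ID / userName / city; joining a DataFrame with an aggregate
derived from itself is the pattern that trips SPARK-10925 on affected
versions:)

import org.apache.spark.sql.functions.collect_set
import spark.implicits._

val df = Seq((1L, "alice", "Paris"), (1L, "bob", "Lyon"), (2L, "carol", "Nice"))
  .toDF("S_ID", "userName", "city")

val dfAgg = df.groupBy("S_ID").agg(collect_set("city").as("cities"))
// on affected versions this join can fail with
// "resolved attribute(s) S_ID#... missing from ..."
val dfRes = df.join(dfAgg, Seq("S_ID"), "left_outer")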