Github user Matzz commented on a diff in the pull request:
https://github.com/apache/spark/pull/18717#discussion_r197476588
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -2759,6 +2760,34 @@ class Dataset[T] private[sql](
this
}
+ /**
+ * Persist this Dataset with the default storage level (`MEMORY_AND_DISK`).
+ * @param eager If true, persist the Dataset eagerly.
+ * @group basic
+ * @since 2.3.0
+ */
+ def persist(eager: Boolean): this.type = {
+ persist()
+ if (eager) queryExecution.toRdd.foreachPartition(_ => {})
+ this
+ }
+
+ /**
+ * Persist this Dataset with the given storage level.
+ * @param eager If true, persist the Dataset eagerly.
+ * @param newLevel One of: `MEMORY_ONLY`, `MEMORY_AND_DISK`, `MEMORY_ONLY_SER`,
+ * `MEMORY_AND_DISK_SER`, `DISK_ONLY`, `MEMORY_ONLY_2`,
+ * `MEMORY_AND_DISK_2`, etc.
+ *
+ * @group basic
+ * @since 2.3.0
+ */
+ def persist(eager: Boolean, newLevel: StorageLevel): this.type = {
+ persist(newLevel)
+ if (eager) queryExecution.toRdd.foreachPartition(_ => {})
--- End diff --
I think it's worth checking whether `.foreachPartition(_ => {})` always works. I
tested it on a plain RDD and it didn't trigger evaluation; I had to replace
that construct with `rdd.foreachPartition(_.foreach(_ => {}))`.
However, maybe it works differently when you operate on `queryExecution`.
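
To illustrate the laziness concern without a Spark cluster, here is a minimal sketch using plain Scala iterators. It only models a partition as a lazy `Iterator`; the names `LazyIteratorDemo`, `run`, `ignored`, and `drained` are made up for this example, and it does not show what Spark does for a persisted Dataset, where the caching layer may consume the iterator itself.

```scala
object LazyIteratorDemo {
  // Hypothetical helper: builds a lazy "partition" and reports how many
  // elements the upstream map actually computed after `consume` ran.
  def run(consume: Iterator[Int] => Unit): Int = {
    var evaluated = 0
    // Iterator.map is lazy: the closure runs only when an element is pulled.
    val partition: Iterator[Int] = Iterator(1, 2, 3).map { x =>
      evaluated += 1
      x
    }
    consume(partition)
    evaluated
  }

  // Mirrors foreachPartition(_ => {}): the iterator is never pulled.
  val ignored: Int = run(_ => ())

  // Mirrors foreachPartition(_.foreach(_ => {})): every element is pulled.
  val drained: Int = run(_.foreach(_ => ()))

  def main(args: Array[String]): Unit =
    println(s"ignored=$ignored drained=$drained")
}
```

If the same pattern holds for the RDD returned by `queryExecution.toRdd`, draining the iterator explicitly would be the safer choice, but that needs to be verified against a real persisted Dataset.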