Github user Matzz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18717#discussion_r197476588
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -2759,6 +2760,34 @@ class Dataset[T] private[sql](
         this
       }
     
    +  /**
    +   * Persist this Dataset with the default storage level 
(`MEMORY_AND_DISK`).
    +   * @param eager If true, persist the Dataset eagerly.
    +   * @group basic
    +   * @since 2.3.0
    +   */
    +  def persist(eager: Boolean): this.type = {
    +    persist()
    +    if (eager) queryExecution.toRdd.foreachPartition(_ => {})
    +    this
    +  }
    +
    +  /**
    +   * Persist this Dataset with the given storage level.
    +   * @param eager If true, persist the Dataset eagerly.
    +   * @param newLevel One of: `MEMORY_ONLY`, `MEMORY_AND_DISK`, 
`MEMORY_ONLY_SER`,
    +   *                 `MEMORY_AND_DISK_SER`, `DISK_ONLY`, `MEMORY_ONLY_2`,
    +   *                 `MEMORY_AND_DISK_2`, etc.
    +   *
    +   * @group basic
    +   * @since 2.3.0
    +   */
    +  def persist(eager: Boolean, newLevel: StorageLevel): this.type = {
    +    persist(newLevel)
    +    if (eager) queryExecution.toRdd.foreachPartition(_ => {})
    --- End diff --
    
    I think it's worth to check if `.foreachPartition(_ => {})` always works. I 
tested that on plain RDD and it didn't trigger evaluation. I had to replace 
that construct with: `rdd.foreachPartition(_.foreach(_ => {}))`
    However, maybe it works differently if you operate on queryExecution


---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to