This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new 865c88f  [MINOR][DOC] Add note regarding proper usage of QueryExecution.toRdd
865c88f is described below

commit 865c88f9c735b15dd1a0d275533f086665e8abd8
Author: Jungtaek Lim (HeartSaVioR) <kabh...@gmail.com>
AuthorDate: Tue Feb 19 09:42:21 2019 +0800

    [MINOR][DOC] Add note regarding proper usage of QueryExecution.toRdd
    
    ## What changes were proposed in this pull request?
    
    This proposes adding a note on `QueryExecution.toRdd` regarding Spark's internal optimizations that callers need to be aware of.
    
    ## How was this patch tested?
    
    This patch is a documentation change.
    
    Closes #23822 from HeartSaVioR/MINOR-doc-add-note-query-execution-to-rdd.
    
    Authored-by: Jungtaek Lim (HeartSaVioR) <kabh...@gmail.com>
    Signed-off-by: Hyukjin Kwon <gurwls...@apache.org>
---
 .../scala/org/apache/spark/sql/execution/QueryExecution.scala | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala
index 72499aa..49d6acf 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala
@@ -85,7 +85,16 @@ class QueryExecution(
     prepareForExecution(sparkPlan)
   }
 
-  /** Internal version of the RDD. Avoids copies and has no schema */
+  /**
+   * Internal version of the RDD. Avoids copies and has no schema.
+   * Note for callers: Spark may apply various optimizations, including object reuse: this means
+   * a row is valid only within the iteration in which it is retrieved. Avoid storing rows and
+   * accessing them after the iteration. (Calling `collect()` is one known bad usage.)
+   * If you want to store these rows in a collection, apply a converter or copy each row so
+   * that a new object is produced per iteration.
+   * Given QueryExecution is not a public class, end users are discouraged from using this:
+   * please use `Dataset.rdd` instead, where the conversion will be applied.
+   */
   lazy val toRdd: RDD[InternalRow] = executedPlan.execute()
 
   /**
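
For readers skimming this commit, here is a minimal, hypothetical sketch (not part of the
commit; the object name and example data are made up) of the usage the new note describes:
copy each InternalRow before collecting rows obtained from `QueryExecution.toRdd`, or use
`Dataset.rdd`, which applies the conversion to external rows for you.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.InternalRow

object ToRddNoteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("toRdd-note").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

    // QueryExecution.toRdd exposes internal rows; the same row object may be
    // reused across the iteration, so collecting it directly is the bad usage
    // the note warns about.
    val internal: RDD[InternalRow] = df.queryExecution.toRdd

    // If the internal RDD is really needed, copy each row before storing it.
    val copied: Array[InternalRow] = internal.map(_.copy()).collect()

    // Preferred for end users: Dataset.rdd, where the conversion is applied.
    val rows = df.rdd.collect()

    println(s"copied ${copied.length} internal rows and ${rows.length} external rows")
    spark.stop()
  }
}

Calling `collect()` directly on `toRdd` is exactly the pattern the note discourages, since
the reused row object can make every collected element appear to hold the last row's values.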


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
