xuechendi commented on a change in pull request #34396:
URL: https://github.com/apache/spark/pull/34396#discussion_r747208132



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/Columnar.scala
##########
@@ -458,6 +462,34 @@ case class RowToColumnarExec(child: SparkPlan) extends RowToColumnarTransition {
     // This avoids calling `schema` in the RDD closure, so that we don't need to include the entire
     // plan (this) in the closure.
     val localSchema = this.schema
+    if (enableArrowColumnVector) {
+      val maxRecordsPerBatch = SQLConf.get.arrowMaxRecordsPerBatch
+      val timeZoneId = SQLConf.get.sessionLocalTimeZone
+      return child.execute().mapPartitionsInternal { rowIterator =>
+        val context = TaskContext.get()
+        val allocator = ArrowUtils.getDefaultAllocator
+        val bytesIterator = ArrowConverters
+          .toBatchIterator(rowIterator, localSchema, maxRecordsPerBatch, timeZoneId, context)

Review comment:
       @HyukjinKwon, I looked into your PR, and I am not sure what the difference is between your mapInArrow and pandas_udf, so let me explain how my PR compares to pandas_udf here, to avoid any misunderstanding of your PR.
   
   Compared to passing InternalRows to pandas_udf, yes, the implementation here is very similar, but the purpose is very different. My PR is about supporting the RowToColumnarExec SparkPlan with Arrow. By doing so, downstream operators such as sort, join, aggregate, and project/filter can do their computation on the Arrow format instead of on OnHeapColumnVector/OffHeapColumnVector.
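   
   To make that part concrete, here is a minimal sketch (not code from this PR; the helper name is hypothetical) of how a loaded Arrow VectorSchemaRoot could be exposed to those downstream operators as a ColumnarBatch backed by ArrowColumnVector, without copying into On/OffHeapColumnVector:
   
       import scala.collection.JavaConverters._
       
       import org.apache.arrow.vector.VectorSchemaRoot
       import org.apache.spark.sql.vectorized.{ArrowColumnVector, ColumnVector, ColumnarBatch}
       
       // Hypothetical helper, assuming the Arrow record batches produced by
       // ArrowConverters.toBatchIterator have already been loaded back into an
       // Arrow VectorSchemaRoot (how they get there is outside this sketch).
       def arrowRootToColumnarBatch(root: VectorSchemaRoot): ColumnarBatch = {
         // Wrap each Arrow FieldVector in ArrowColumnVector, Spark's ColumnVector
         // implementation that reads Arrow memory directly without copying.
         val columns = root.getFieldVectors.asScala
           .map(fieldVector => new ArrowColumnVector(fieldVector))
           .toArray[ColumnVector]
         val batch = new ColumnarBatch(columns)
         batch.setNumRows(root.getRowCount)
         batch
       }
   
   The sort/join/aggregate/project operators could then consume that data through the usual ColumnarBatch/ColumnVector interfaces.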
   
   And I think pandas_udf is meant for data transformation using pandas' built-in computation or by calling other third-party Python packages.
   
   I'm not sure if that answers your question?




