[GitHub] [spark] Ngone51 commented on a diff in pull request #38064: [SPARK-40622][SQL][CORE]Result of a single task in collect() must fit in 2GB

GitBox Wed, 12 Oct 2022 23:47:56 -0700


Ngone51 commented on code in PR #38064:
URL: https://github.com/apache/spark/pull/38064#discussion_r994222831



##########
sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala:
##########
@@ -2181,6 +2184,29 @@ class DatasetSuite extends QueryTest
   }
 }
 
+class DatasetLargeResultCollectingSuite extends QueryTest
+  with SharedSparkSession {
+
+  override protected def sparkConf: SparkConf = 
super.sparkConf.set(MAX_RESULT_SIZE.key, "4g")
+  test("collect data with single partition larger than 2GB bytes array limit") 
{
+    import org.apache.spark.sql.functions.udf
+
+    val genData = udf((id: Long, bytesSize: Int) => {
+      val rand = new Random(id)
+      val arr = new Array[Byte](bytesSize)
+      rand.nextBytes(arr)
+      arr
+    })
+
+    spark.udf.register("genData", genData.asNondeterministic())
+    // create data of size ~3GB in single partition, which exceeds the byte 
array limit
+    // random gen to make sure it's poorly compressed
+    val df = spark.range(0, 3000, 1, 1).selectExpr("id", s"genData(id, 
1000000) as data")
+    val res = df.queryExecution.executedPlan.executeCollect()

Review Comment:
   Will this take much memory on the driver?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] Ngone51 commented on a diff in pull request #38064: [SPARK-40622][SQL][CORE]Result of a single task in collect() must fit in 2GB

Reply via email to