advancedxy commented on code in PR #451:
URL: https://github.com/apache/datafusion-comet/pull/451#discussion_r1607484041


##########
spark/src/test/scala/org/apache/comet/DataGenerator.scala:
##########
@@ -95,4 +102,55 @@ class DataGenerator(r: Random) {
       Range(0, n).map(_ => r.nextLong())
   }
 
+  // Generate a random row according to the schema. The string fields in the struct can be
+  // configured to generate custom strings by passing a stringGen function. Other types are
+  // delegated to Spark's RandomDataGenerator.
+  def generateRow(schema: StructType, stringGen: Option[() => String] = None): Row = {
+    val fields = mutable.ArrayBuffer.empty[Any]
+    schema.fields.foreach { f =>
+      f.dataType match {
+        case ArrayType(childType, nullable) =>
+          val data = if (f.nullable && r.nextFloat() <= PROBABILITY_OF_NULL) {
+            null
+          } else {
+            val arr = mutable.ArrayBuffer.empty[Any]
+            val n = 1 // r.nextInt(10)
+            var i = 0
+            val generator = RandomDataGenerator.forType(childType, nullable, r)
+            assert(generator.isDefined, "Unsupported type")
+            val gen = generator.get
+            while (i < n) {
+              arr += gen()
+              i += 1
+            }
+            arr.toSeq
+          }
+          fields += data
+        case StructType(children) =>
+          fields += generateRow(StructType(children), stringGen)
+        case StringType if stringGen.isDefined =>
+          val gen = stringGen.get
+          val data = if (f.nullable && r.nextFloat() <= PROBABILITY_OF_NULL) {
+            null
+          } else {
+            gen()
+          }
+          fields += data
+        case _ =>
+          val generator = RandomDataGenerator.forType(f.dataType, f.nullable, r)
+          assert(generator.isDefined, "Unsupported type")
+          val gen = generator.get
+          fields += gen()
+      }
+    }
+    Row.fromSeq(fields.toSeq)
+  }
+
+  def generateRows(

Review Comment:
   Yes. I will use this in the hash function tests. Also, it should be useful for other randomized-input tests, since it makes it much easier to generate a DataFrame with multiple columns.



##########
spark/src/test/scala/org/apache/comet/DataGenerator.scala:
##########
@@ -95,4 +102,55 @@ class DataGenerator(r: Random) {
       Range(0, n).map(_ => r.nextLong())
   }
 
+  // Generate a random row according to the schema. The string fields in the struct can be
+  // configured to generate custom strings by passing a stringGen function. Other types are
+  // delegated to Spark's RandomDataGenerator.
+  def generateRow(schema: StructType, stringGen: Option[() => String] = None): Row = {
+    val fields = mutable.ArrayBuffer.empty[Any]
+    schema.fields.foreach { f =>
+      f.dataType match {
+        case ArrayType(childType, nullable) =>
+          val data = if (f.nullable && r.nextFloat() <= PROBABILITY_OF_NULL) {
+            null
+          } else {
+            val arr = mutable.ArrayBuffer.empty[Any]
+            val n = 1 // r.nextInt(10)
+            var i = 0
+            val generator = RandomDataGenerator.forType(childType, nullable, r)
+            assert(generator.isDefined, "Unsupported type")
+            val gen = generator.get
+            while (i < n) {
+              arr += gen()
+              i += 1
+            }
+            arr.toSeq
+          }
+          fields += data
+        case StructType(children) =>
+          fields += generateRow(StructType(children), stringGen)
+        case StringType if stringGen.isDefined =>
+          val gen = stringGen.get
+          val data = if (f.nullable && r.nextFloat() <= PROBABILITY_OF_NULL) {
+            null
+          } else {
+            gen()
+          }
+          fields += data
+        case _ =>
+          val generator = RandomDataGenerator.forType(f.dataType, f.nullable, r)
+          assert(generator.isDefined, "Unsupported type")

Review Comment:
   `RandomDataGenerator.forType(f.dataType, f.nullable, r)` already handles null generation for nullable fields, so I don't think we need to handle it here.
   
   See: https://github.com/apache/spark/blob/master/sql/catalyst/src/test/scala/org/apache/spark/sql/RandomDataGenerator.scala#L380-L392
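
   For context, the linked lines follow roughly this pattern (a paraphrased sketch, not the exact Spark source; the wrapper name and null probability are illustrative): when `nullable` is true, the value generator returned by `forType` is wrapped so that it already yields `null` some fraction of the time.

   ```scala
   import scala.util.Random

   // Sketch of the nullability wrapping done inside RandomDataGenerator.forType.
   def wrapWithNulls(valueGen: () => Any, nullable: Boolean, rand: Random): () => Any = {
     if (nullable) {
       // Occasionally emit null instead of a generated value.
       () => if (rand.nextFloat() <= 0.1f) null else valueGen()
     } else {
       valueGen
     }
   }
   ```

   Since the default case in `generateRow` just invokes the generator returned by `forType`, nulls for those fields already come from this wrapping.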



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

