Re: [PR] [SPARK-49723][SQL] Add Variant metrics to the JSON File Scan node [spark]

via GitHub Wed, 25 Sep 2024 14:07:55 -0700


gene-db commented on code in PR #48172:
URL: https://github.com/apache/spark/pull/48172#discussion_r1775990841



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala:
##########
@@ -72,9 +74,71 @@ case class PartitionedFile(
   }
 }
 
+/**
+ * Class used to store statistical data that is collected during a file scan 
and could be used to
+ * update the SQL metrics of the scan node. More members could be added to 
this class to to collect
+ * metrics related to new features.
+ */
+case class FileScanMetrics(
+    topLevelVariantMetrics: Option[VariantMetrics] = None,
+    nestedVariantMetrics: Option[VariantMetrics] = None) {
+
+  // Update SQL metrics with the data in topLevelVariantMetrics and 
nestedVariantMetrics
+  def updateSQLMetrics(sqlMetrics: Option[Map[String, SQLMetric]]): Unit = {
+    sqlMetrics match {
+      case Some(sqlMetrics) =>
+        topLevelVariantMetrics match {

Review Comment:
   I feel like `VariantConstructionMetrics` should have another method like 
`updateSQLMetrics`, which takes in a `Option[VariantMetrics]` and a 
`Map[String, SQLMetric]`, and does this update there.
   
   Then, here, we could just call that method to do this update work. We might 
need to distinguish between nested and top-level, but that is the general idea.



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala:
##########
@@ -83,7 +147,9 @@ class FileScanRDD(
     val readSchema: StructType,
     val metadataColumns: Seq[AttributeReference] = Seq.empty,
     metadataExtractors: Map[String, PartitionedFile => Any] = Map.empty,
-    options: FileSourceOptions = new 
FileSourceOptions(CaseInsensitiveMap(Map.empty)))
+    options: FileSourceOptions = new 
FileSourceOptions(CaseInsensitiveMap(Map.empty)),
+    fileScanMetricsObject: Option[FileScanMetrics] = None,

Review Comment:
   NIT: let's rename this to `fileScanMetrics`. I don't think we need to say 
"Object".



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala:
##########
@@ -230,6 +296,13 @@ class FileScanRDD(
 
       /** Advances to the next file. Returns true if a new non-empty iterator 
is available. */
       private def nextIterator(): Boolean = {
+        // This function is called at the end of each file and before the 
first file. Therefore,
+        // we update the Variant Metrics at the end of each JSON file and 
reset the
+        // variant metrics objects to process the next file.
+        fileScanMetricsObject match {

Review Comment:
   I think this can be simplified to something like:
   ```
   fileScanMetrics.foreach(_.updateSQLMetrics(sqlMetrics))
   ```
   



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala:
##########
@@ -609,6 +611,21 @@ case class FileSourceScanExec(
     override val disableBucketedScan: Boolean = false)
   extends FileSourceScanLike {
 
+  /** SQL metrics generated for JSON scans involving variants. */
+  protected lazy val variantBuilderMetrics: Map[String, SQLMetric] =

Review Comment:
   Maybe let's call this `variantSQLMetrics` or something, since it is similar 
to `topLevelVariantMetrics`. If we say `SQLMetrics` it makes it more obvious 
they are related to `SQLMetric`.



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala:
##########
@@ -230,6 +296,13 @@ class FileScanRDD(
 
       /** Advances to the next file. Returns true if a new non-empty iterator 
is available. */
       private def nextIterator(): Boolean = {
+        // This function is called at the end of each file and before the 
first file. Therefore,
+        // we update the Variant Metrics at the end of each JSON file and 
reset the

Review Comment:
   NIT: This is not specific to variant anymore.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-49723][SQL] Add Variant metrics to the JSON File Scan node [spark]

Reply via email to