gene-db commented on code in PR #48172:
URL: https://github.com/apache/spark/pull/48172#discussion_r1775990841
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala:
##########
@@ -72,9 +74,71 @@ case class PartitionedFile(
}
}
+/**
+ * Class used to store statistical data that is collected during a file scan
and could be used to
+ * update the SQL metrics of the scan node. More members could be added to
this class to to collect
+ * metrics related to new features.
+ */
+case class FileScanMetrics(
+ topLevelVariantMetrics: Option[VariantMetrics] = None,
+ nestedVariantMetrics: Option[VariantMetrics] = None) {
+
+ // Update SQL metrics with the data in topLevelVariantMetrics and
nestedVariantMetrics
+ def updateSQLMetrics(sqlMetrics: Option[Map[String, SQLMetric]]): Unit = {
+ sqlMetrics match {
+ case Some(sqlMetrics) =>
+ topLevelVariantMetrics match {
Review Comment:
I feel like `VariantConstructionMetrics` should have another method like
`updateSQLMetrics`, which takes in a `Option[VariantMetrics]` and a
`Map[String, SQLMetric]`, and does this update there.
Then, here, we could just call that method to do this update work. We might
need to distinguish between nested and top-level, but that is the general idea.
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala:
##########
@@ -83,7 +147,9 @@ class FileScanRDD(
val readSchema: StructType,
val metadataColumns: Seq[AttributeReference] = Seq.empty,
metadataExtractors: Map[String, PartitionedFile => Any] = Map.empty,
- options: FileSourceOptions = new
FileSourceOptions(CaseInsensitiveMap(Map.empty)))
+ options: FileSourceOptions = new
FileSourceOptions(CaseInsensitiveMap(Map.empty)),
+ fileScanMetricsObject: Option[FileScanMetrics] = None,
Review Comment:
NIT: let's rename this to `fileScanMetrics`. I don't think we need to say
"Object".
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala:
##########
@@ -230,6 +296,13 @@ class FileScanRDD(
/** Advances to the next file. Returns true if a new non-empty iterator
is available. */
private def nextIterator(): Boolean = {
+ // This function is called at the end of each file and before the
first file. Therefore,
+ // we update the Variant Metrics at the end of each JSON file and
reset the
+ // variant metrics objects to process the next file.
+ fileScanMetricsObject match {
Review Comment:
I think this can be simplified to something like:
```
fileScanMetrics.foreach(_.updateSQLMetrics(sqlMetrics))
```
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala:
##########
@@ -609,6 +611,21 @@ case class FileSourceScanExec(
override val disableBucketedScan: Boolean = false)
extends FileSourceScanLike {
+ /** SQL metrics generated for JSON scans involving variants. */
+ protected lazy val variantBuilderMetrics: Map[String, SQLMetric] =
Review Comment:
Maybe let's call this `variantSQLMetrics` or something, since it is similar
to `topLevelVariantMetrics`. If we say `SQLMetrics` it makes it more obvious
they are related to `SQLMetric`.
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala:
##########
@@ -230,6 +296,13 @@ class FileScanRDD(
/** Advances to the next file. Returns true if a new non-empty iterator
is available. */
private def nextIterator(): Boolean = {
+ // This function is called at the end of each file and before the
first file. Therefore,
+ // we update the Variant Metrics at the end of each JSON file and
reset the
Review Comment:
NIT: This is not specific to variant anymore.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]