Re: [PR] [SPARK-49723][SQL] Add Variant metrics to the JSON File Scan node [spark]

via GitHub Mon, 23 Sep 2024 08:53:40 -0700


gene-db commented on code in PR #48172:
URL: https://github.com/apache/spark/pull/48172#discussion_r1771664820



##########
common/variant/src/main/java/org/apache/spark/types/variant/VariantBuilder.java:
##########
@@ -60,13 +60,21 @@ public static Variant parseJson(String json, boolean 
allowDuplicateKeys) throws
   /**
    * Similar {@link #parseJson(String, boolean)}, but takes a JSON parser 
instead of string input.
    */
-  public static Variant parseJson(JsonParser parser, boolean 
allowDuplicateKeys)
-      throws IOException {
+  public static Variant parseJson(JsonParser parser, boolean 
allowDuplicateKeys,
+                                  VariantMetrics variantMetrics) throws 
IOException {
     VariantBuilder builder = new VariantBuilder(allowDuplicateKeys);
-    builder.buildJson(parser);
+    builder.buildJson(parser, variantMetrics, 0);
+    variantMetrics.variantCount += 1;
+    Variant v = builder.result();
+    variantMetrics.byteSize += v.value.length + v.metadata.length;
     return builder.result();
   }
 
+  public static Variant parseJson(JsonParser parser, boolean 
allowDuplicateKeys)

Review Comment:
   There should be a method comment for this public method.



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala:
##########
@@ -225,7 +241,8 @@ class JacksonParser(
    * Create a converter which converts the JSON documents held by the 
`JsonParser`
    * to a value according to a desired schema.

Review Comment:
   Can you add comments and explain what `isRoot` and `isChildOfRoot` are? Why 
do we need both?



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala:
##########
@@ -119,15 +121,28 @@ class JacksonParser(
 
   private val variantAllowDuplicateKeys = 
SQLConf.get.getConf(SQLConf.VARIANT_ALLOW_DUPLICATE_KEYS)
 
-  protected final def parseVariant(parser: JsonParser): VariantVal = {
+  // isTopLevel helps in distinguishing between top level variant and nested 
variants for metrics.
+  private final def parseVariant(isTopLevel: Boolean)(parser: JsonParser): 
VariantVal = {
     // Skips `FIELD_NAME` at the beginning. This check is adapted from 
`parseJsonToken`, but we
     // cannot directly use the function here because it also handles the 
`VALUE_NULL` token and
     // returns null (representing a SQL NULL). Instead, we want to return a 
variant null.
     if (parser.getCurrentToken == FIELD_NAME) {
       parser.nextToken()
     }
     try {
-      val v = VariantBuilder.parseJson(parser, variantAllowDuplicateKeys)
+      val v = if (isTopLevel) {
+        topLevelVariantMetrics match {

Review Comment:
   I think the 2 match clauses are the same, right? Maybe we could do something 
like
   ```
   val v = if (isTopLevel) topLevelVariantMetrics else nestedVariantMetrics
   v match {
     case Some(metrics) => ...
     case None => ...
   }
   ```
   
   I think this would reduce code duplication?



##########
common/variant/src/main/java/org/apache/spark/types/variant/VariantBuilder.java:
##########
@@ -60,13 +60,21 @@ public static Variant parseJson(String json, boolean 
allowDuplicateKeys) throws
   /**
    * Similar {@link #parseJson(String, boolean)}, but takes a JSON parser 
instead of string input.

Review Comment:
   We should add a comment about the `variantMetrics` parameter.



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala:
##########
@@ -83,9 +85,59 @@ class FileScanRDD(
     val readSchema: StructType,
     val metadataColumns: Seq[AttributeReference] = Seq.empty,
     metadataExtractors: Map[String, PartitionedFile => Any] = Map.empty,
-    options: FileSourceOptions = new 
FileSourceOptions(CaseInsensitiveMap(Map.empty)))
+    options: FileSourceOptions = new 
FileSourceOptions(CaseInsensitiveMap(Map.empty)),
+    topLevelVariantMetrics: Option[VariantMetrics] = None,

Review Comment:
   I wonder if we should have a more generic metrics parameter, instead of 3 
variant-specific parameters? The variant ones can be contained within the 
generic metrics "container", and in the future other metrics could potentially 
go into that. I just think it might be strange to have 3 variant-specific 
parameters, when many rdds may not even have any variants at all.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-49723][SQL] Add Variant metrics to the JSON File Scan node [spark]

Reply via email to