gene-db commented on code in PR #48172:
URL: https://github.com/apache/spark/pull/48172#discussion_r1771664820
##########
common/variant/src/main/java/org/apache/spark/types/variant/VariantBuilder.java:
##########
@@ -60,13 +60,21 @@ public static Variant parseJson(String json, boolean
allowDuplicateKeys) throws
/**
* Similar {@link #parseJson(String, boolean)}, but takes a JSON parser
instead of string input.
*/
- public static Variant parseJson(JsonParser parser, boolean
allowDuplicateKeys)
- throws IOException {
+ public static Variant parseJson(JsonParser parser, boolean
allowDuplicateKeys,
+ VariantMetrics variantMetrics) throws
IOException {
VariantBuilder builder = new VariantBuilder(allowDuplicateKeys);
- builder.buildJson(parser);
+ builder.buildJson(parser, variantMetrics, 0);
+ variantMetrics.variantCount += 1;
+ Variant v = builder.result();
+ variantMetrics.byteSize += v.value.length + v.metadata.length;
return builder.result();
}
+ public static Variant parseJson(JsonParser parser, boolean
allowDuplicateKeys)
Review Comment:
There should be a method comment for this public method.
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala:
##########
@@ -225,7 +241,8 @@ class JacksonParser(
* Create a converter which converts the JSON documents held by the
`JsonParser`
* to a value according to a desired schema.
Review Comment:
Can you add comments and explain what `isRoot` and `isChildOfRoot` are? Why
do we need both?
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala:
##########
@@ -119,15 +121,28 @@ class JacksonParser(
private val variantAllowDuplicateKeys =
SQLConf.get.getConf(SQLConf.VARIANT_ALLOW_DUPLICATE_KEYS)
- protected final def parseVariant(parser: JsonParser): VariantVal = {
+ // isTopLevel helps in distinguishing between top level variant and nested
variants for metrics.
+ private final def parseVariant(isTopLevel: Boolean)(parser: JsonParser):
VariantVal = {
// Skips `FIELD_NAME` at the beginning. This check is adapted from
`parseJsonToken`, but we
// cannot directly use the function here because it also handles the
`VALUE_NULL` token and
// returns null (representing a SQL NULL). Instead, we want to return a
variant null.
if (parser.getCurrentToken == FIELD_NAME) {
parser.nextToken()
}
try {
- val v = VariantBuilder.parseJson(parser, variantAllowDuplicateKeys)
+ val v = if (isTopLevel) {
+ topLevelVariantMetrics match {
Review Comment:
I think the 2 match clauses are the same, right? Maybe we could do something
like
```
val v = if (isTopLevel) topLevelVariantMetrics else nestedVariantMetrics
v match {
case Some(metrics) => ...
case None => ...
}
```
I think this would reduce code duplication?
##########
common/variant/src/main/java/org/apache/spark/types/variant/VariantBuilder.java:
##########
@@ -60,13 +60,21 @@ public static Variant parseJson(String json, boolean
allowDuplicateKeys) throws
/**
* Similar {@link #parseJson(String, boolean)}, but takes a JSON parser
instead of string input.
Review Comment:
We should add a comment about the `variantMetrics` parameter.
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala:
##########
@@ -83,9 +85,59 @@ class FileScanRDD(
val readSchema: StructType,
val metadataColumns: Seq[AttributeReference] = Seq.empty,
metadataExtractors: Map[String, PartitionedFile => Any] = Map.empty,
- options: FileSourceOptions = new
FileSourceOptions(CaseInsensitiveMap(Map.empty)))
+ options: FileSourceOptions = new
FileSourceOptions(CaseInsensitiveMap(Map.empty)),
+ topLevelVariantMetrics: Option[VariantMetrics] = None,
Review Comment:
I wonder if we should have a more generic metrics parameter, instead of 3
variant-specific parameters? The variant ones can be contained within the
generic metrics "container", and in the future other metrics could potentially
go into that. I just think it might be strange to have 3 variant-specific
parameters, when many rdds may not even have any variants at all.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]