voonhous commented on code in PR #18062:
URL: https://github.com/apache/hudi/pull/18062#discussion_r2940988271


##########
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieStorageConfig.java:
##########
@@ -168,6 +168,36 @@ public class HoodieStorageConfig extends HoodieConfig {
       .withDocumentation("Control whether to write bloom filter or not. Default true. "
           + "We can set to false in non bloom index cases for CPU resource saving.");
 
+  public static final ConfigProperty<Boolean> PARQUET_VARIANT_WRITE_SHREDDING_ENABLED = ConfigProperty
+      .key("hoodie.parquet.variant.write.shredding.enabled")
+      .defaultValue(true)
+      .sinceVersion("1.1.0")
+      .withDocumentation("Controls whether variant columns are written in shredded format. "
+          + "When enabled (default), variant columns with shredding information in the schema will be written "
+          + "in shredded format with typed_value columns. When disabled, variant columns are always written "
+          + "in unshredded format regardless of the schema. "
+          + "Equivalent to Spark's spark.sql.variant.writeShredding.enabled.");
+
+  public static final ConfigProperty<String> PARQUET_VARIANT_FORCE_SHREDDING_SCHEMA_FOR_TEST = ConfigProperty
+      .key("hoodie.parquet.variant.force.shredding.schema.for.test")
+      .noDefaultValue()
+      .markAdvanced()
+      .sinceVersion("1.1.0")
+      .withDocumentation("Forces a specific shredding schema for all variant columns, intended for testing. "

Review Comment:
   As of now, shredding is entirely determined by Spark, which applies heuristics to decide what to shred.
   
   The shredding schema is inferred here:
   
https://github.com/apache/spark/blob/cbbbd41bd5d95c94cfca0b4bafcfbe90df4d7e0e/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/InferVariantShreddingSchema.scala#L35
   
   FWIU, the shredding logic is as follows:
   
   1. Only Variant columns at the top level or nested inside structs are candidates. Variants inside arrays or maps are never shredded.
   2. Spark buffers up to 4096 rows (or 64MB) per batch and uses that batch to infer what to shred.
   
   As for the specifics of what they are doing, I haven't really analysed that yet, but it's out of scope for this PR anyway.
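   
   For context, a minimal, hypothetical sketch of how the Spark-side and Hudi-side switches would line up on a write path (assumes Spark 4.x where `parse_json`/variant are available; the table name, path, and other required Hudi write options are illustrative or omitted):
   
   ```scala
   import org.apache.spark.sql.SparkSession
   
   val spark = SparkSession.builder()
     .appName("variant-shredding-sketch")
     // Spark-side switch referenced in the config docs above.
     .config("spark.sql.variant.writeShredding.enabled", "true")
     .getOrCreate()
   
   // Top-level variant column, so it is a shredding candidate per point 1 above.
   val df = spark.sql("""SELECT parse_json('{"a": 1, "b": "hello"}') AS v""")
   
   df.write
     .format("hudi")
     // Hudi-side switch added in this PR; defaults to true per the diff above.
     .option("hoodie.parquet.variant.write.shredding.enabled", "true")
     .option("hoodie.table.name", "variant_tbl") // illustrative name
     .mode("overwrite")
     .save("/tmp/variant_tbl") // illustrative path
   ```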
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
