voonhous commented on issue #17744:
URL: https://github.com/apache/hudi/issues/17744#issuecomment-3822090390

   An example of root-level shredding in PySpark:
   
   ```python
   from pyspark.sql import SparkSession
   
   # Initialize Spark session
   spark = SparkSession.builder \
       .master("local") \
       .appName("VariantShreddingExample") \
       .getOrCreate()
   
   
   # Enable reading shredded data (if not enabled by default)
   spark.conf.set("spark.sql.variant.allowReadingShredded", "true")
   
   # Enable variant shredding on write (equivalent to the Scala snippet)
   spark.conf.set("spark.sql.variant.writeShredding.enabled", "true")
   
   # Force a specific shredding schema for testing
   spark.conf.set("spark.sql.variant.forceShreddingSchemaForTest", "bigint")
   
   # Create a test DataFrame with a mix of variant values (int, string, object)
   print("--- Creating DataFrame with SQL ---")
   df = spark.sql("""
       select case
           when id = 0 then parse_json('100')
           when id = 1 then parse_json('"hello_world"')
           when id = 2 then parse_json('{"A": 1, "c": 1.23}')
       end as v 
       from range(3)
   """)
   # Write to SHREDDED Parquet
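   # With the forced "bigint" schema, only the integer row is expected to shred
   # into a typed_value column; the string and object rows should fall back to
   # the binary value column (assumption based on the variant shredding spec).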
   output_path = "/tmp/delta-data/shredded_test_root_level_schema_pyspark"
   print(f"--- Writing to {output_path} ---")
   df.write.mode("overwrite").parquet(output_path)
   
   # Read back and show
   print("--- Reading back from Parquet ---")
   read_df = spark.read.parquet(output_path)
   read_df.show(truncate=False)
   read_df.printSchema()
   
   # Disable variant shredding for the unshredded write
   spark.conf.set("spark.sql.variant.writeShredding.enabled", "false")
   
   # Clear the forced shredding schema
   spark.conf.set("spark.sql.variant.forceShreddingSchemaForTest", "")
   
   # Write to UNSHREDDED Parquet
   output_path = "/tmp/delta-data/unshredded_test_root_level_schema_pyspark"
   print(f"--- Writing to {output_path} ---")
   df.write.mode("overwrite").parquet(output_path)
   
   # Read back and show
   print("--- Reading back from Parquet ---")
   read_df = spark.read.parquet(output_path)
   read_df.show(truncate=False)
   read_df.printSchema()
   ```
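
   To confirm what was actually written, you can inspect the physical Parquet schema of each output directory. Below is a minimal sketch, assuming pyarrow is installed and can open the files Spark produced; based on the Parquet variant shredding spec, the shredded output should expose a `typed_value` sub-field for `v`, while the unshredded output should only carry the binary `metadata`/`value` pair.
   
   ```python
   # Minimal sketch (assumes pyarrow is available): print the physical Parquet
   # schema of the shredded vs. unshredded outputs written above.
   import glob
   import pyarrow.parquet as pq
   
   for path in ("/tmp/delta-data/shredded_test_root_level_schema_pyspark",
                "/tmp/delta-data/unshredded_test_root_level_schema_pyspark"):
       files = sorted(glob.glob(f"{path}/*.parquet"))
       if not files:
           print(f"No Parquet files found under {path}")
           continue
       print(f"--- Physical schema for {path} ---")
       # Expectation (assumption, per the shredding spec): the shredded file has
       # a typed_value column for v; the unshredded one does not.
       print(pq.ParquetFile(files[0]).schema)
   ```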

