voonhous commented on issue #17744:
URL: https://github.com/apache/hudi/issues/17744#issuecomment-3822090390
Example of root-level shredding:
```
from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder \
    .master("local") \
    .appName("VariantShreddingExample") \
    .getOrCreate()
# Enable reading shredded data (if not enabled by default)
spark.conf.set("spark.sql.variant.allowReadingShredded", "true")
# Enable shredded writes (equivalent to the Scala snippet)
spark.conf.set("spark.sql.variant.writeShredding.enabled", "true")
# Force a specific shredding schema for testing
spark.conf.set("spark.sql.variant.forceShreddingSchemaForTest", "bigint")
# Build a three-row variant column: an integer, a string, and an object.
# With the forced `bigint` schema, only the integer row is expected to land in
# the typed column; the other rows should stay in the untyped binary value.
print("--- Creating DataFrame with SQL ---")
df = spark.sql("""
    select case
        when id = 0 then parse_json('100')
        when id = 1 then parse_json('"hello_world"')
        when id = 2 then parse_json('{"A": 1, "c": 1.23}')
    end as v
    from range(3)
""")
# Write to SHREDDED Parquet
output_path = "/tmp/delta-data/shredded_test_root_level_schema_pyspark"
print(f"--- Writing to {output_path} ---")
df.write.mode("overwrite").parquet(output_path)
# Read back and show
print("--- Reading back from Parquet ---")
read_df = spark.read.parquet(output_path)
read_df.show(truncate=False)
read_df.printSchema()
# Disable shredded writes for the unshredded comparison
spark.conf.set("spark.sql.variant.writeShredding.enabled", "false")
# Clear the forced shredding schema
spark.conf.set("spark.sql.variant.forceShreddingSchemaForTest", "")
# Write to UNSHREDDED Parquet
output_path = "/tmp/delta-data/unshredded_test_root_level_schema_pyspark"
print(f"--- Writing to {output_path} ---")
df.write.mode("overwrite").parquet(output_path)
# Read back and show
print("--- Reading back from Parquet ---")
read_df = spark.read.parquet(output_path)
read_df.show(truncate=False)
read_df.printSchema()
```
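
Spark should report `v` as a plain `variant` column for both directories, so `printSchema()` alone probably won't tell the two outputs apart. A rough way to look at the files themselves is sketched below. It assumes `pyarrow` is installed (it is not used in the script above), and both helper names, `list_part_files` and `print_physical_schema`, are made up for illustration.

```python
from pathlib import Path


def list_part_files(output_dir):
    """Return the Parquet data files under output_dir, sorted by name."""
    # Spark names data files part-*.parquet; the _SUCCESS marker and the
    # hidden .crc sidecar files do not match this pattern and are skipped.
    return sorted(Path(output_dir).glob("part-*.parquet"))


def print_physical_schema(output_dir):
    """Print the Parquet-level schema of the first data file."""
    # Assumption: pyarrow is available; imported lazily so the listing
    # helper above still works without it.
    import pyarrow.parquet as pq

    parts = list_part_files(output_dir)
    if not parts:
        raise FileNotFoundError(f"no part-*.parquet files under {output_dir}")
    # ParquetFile.schema is the physical Parquet schema stored in the file
    # footer, not Spark's logical view of the column.
    print(pq.ParquetFile(str(parts[0])).schema)
```

Calling `print_physical_schema("/tmp/delta-data/shredded_test_root_level_schema_pyspark")` and then the same for the unshredded path should make the difference visible: going by the Parquet Variant shredding layout, the shredded file would be expected to carry a `typed_value` field next to `metadata` and `value` inside `v`, while the unshredded one would only have `metadata` and `value`. I have not checked this against every pyarrow version, so treat it as a starting point.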