Re: [I] Variant Data Type Support [iceberg]

via GitHub Fri, 12 Sep 2025 17:25:22 -0700


soumilshah1995 commented on issue #10392:
URL: https://github.com/apache/iceberg/issues/10392#issuecomment-3287238779


   HI  @aihuaxu 
   ```
   from pyspark.sql import SparkSession
   import os
   import sys
   
   os.environ["JAVA_HOME"] = 
"/opt/homebrew/Cellar/openjdk@17/17.0.14/libexec/openjdk.jdk/Contents/Home"
   SUBMIT_ARGS = (
       "--packages org.apache.iceberg:iceberg-spark-runtime-4.0_2.13:1.10.0 "
       "--repositories https://repo1.maven.org/maven2 "
       "pyspark-shell"
   )
   os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS
   os.environ["PYSPARK_PYTHON"] = sys.executable
   
   local_warehouse_path = "/Users/soumilshah/Desktop/warehouse"
   
   spark = (
       SparkSession.builder
       .config("spark.sql.extensions", 
"org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
       .config("spark.sql.catalog.dev", "org.apache.iceberg.spark.SparkCatalog")
       .config("spark.sql.catalog.dev.type", "hadoop")
       .config("spark.sql.catalog.dev.warehouse", local_warehouse_path)
       .config("spark.sql.catalog.dev.format-version", "3")
       .config("spark.sql.join.preferSortMergeJoin", "false")
       .getOrCreate()
   )
   spark.sql("""
   CREATE TABLE IF NOT EXISTS dev.default.test_table (
       id BIGINT,
       properties VARIANT
   )
   USING iceberg
   TBLPROPERTIES (
     'format-version' = '3'
   )
   """)
   
   
   # Complex JSON string data
   json_string = '''{
       "user": {
           "name": "Alice",
           "age": 30,
           "hobbies": ["reading", "swimming", "hiking"],
           "address": {
               "city": "Wonderland",
               "zip": "12345"
           }
       },
       "status": "active",
       "tags": ["admin", "user", "editor"]
   }'''
   
   # Create DataFrame to insert
   df = spark.createDataFrame([(1, json_string)], ["id", "properties"])
   
   # Cast JSON string column to VARIANT type
   df = df.withColumn("properties", col("properties").cast("variant"))
   
   # Append data to Iceberg table
   df.writeTo("dev.default.test_table").append()
   
   # Query nested fields in VARIANT column
   spark.sql("""
   SELECT
     id,
     variant_get(properties, '$.user.name', 'string') as user_name,
     variant_get(properties, '$.user.age', 'int') as user_age,
     variant_get(properties, '$.user.address.city', 'string') as city,
     variant_get(properties, '$.status', 'string') as status
   FROM dev.default.test_table
   """).show(truncate=False)
   
   +---+---------+--------+----+------+
   |id |user_name|user_age|city|status|
   +---+---------+--------+----+------+
   |1  |NULL     |NULL    |NULL|NULL  |
   |1  |NULL     |NULL    |NULL|NULL  |
   +---+---------+--------+----+------+
   ```
   
   Any pointers on why I am seeing NULL on querying varient types


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [I] Variant Data Type Support [iceberg]

Reply via email to