bgt-cdedels opened a new issue #2456:
URL: https://github.com/apache/iceberg/issues/2456


   In my Spark jobs I am reading from JSON and merging into Iceberg tables, and I would like those tables to have NOT NULL constraints. However, when loading data from JSON, Spark does not enforce the schema's nullability constraints. To work around this I have found two alternatives:
   
   ```
   from pyspark.sql.functions import col

   # Read with the explicit schema, then drop rows where the key is null.
   input_dataset = spark.read.schema(my_schema).json("s3://my_bucket/my_folder/").filter(col("my_key").isNotNull())
   # Round-trip through an RDD so the non-nullable schema is actually applied.
   input_dataset = spark.createDataFrame(input_dataset.rdd, schema=my_schema)
   ```
   or
   ```
   # Passed to spark-submit: --conf spark.sql.storeAssignmentPolicy=LEGACY
   from pyspark.sql.functions import col

   input_dataset = spark.read.schema(my_schema).json("s3://my_bucket/my_folder/").filter(col("my_key").isNotNull())
   ```
   
   The first option is very slow, adding 40-60 minutes to my Spark application's processing time because of the round trip through an RDD. The second option is too permissive: spark.sql.storeAssignmentPolicy=LEGACY reverts to the pre-3.0 store assignment behavior and disables all of the compatibility checks, not just the nullability check. I have noticed that there is a configuration option named spark.sql.iceberg.check-nullability in the code. I'd like to propose that this option also be honored in the AssignmentAlignmentSupport trait, so that writers can bypass NOT NULL constraints while preserving the other compatibility checks (a rough sketch follows the link below).
   
   
https://github.com/apache/iceberg/blob/apache-iceberg-0.11.1/spark3-extensions/src/main/scala/org/apache/spark/sql/catalyst/analysis/AssignmentAlignmentSupport.scala#L152
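   
   To make the proposal concrete, here is a minimal sketch of the kind of gate I have in mind. Everything here other than the spark.sql.iceberg.check-nullability property name is an assumption: checkNullabilityEnabled and validateNullability are hypothetical helper names, the error message is invented, and the real check sits inline in the trait around the line linked above.
   
   ```
   package org.apache.spark.sql.catalyst.analysis
   
   import org.apache.spark.sql.AnalysisException
   import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression}
   import org.apache.spark.sql.internal.SQLConf
   
   object NullabilityCheckSketch {
     // Hypothetical helper: read the existing session property, defaulting
     // to today's strict behavior when it is unset.
     private def checkNullabilityEnabled: Boolean =
       SQLConf.get.getConfString("spark.sql.iceberg.check-nullability", "true").toBoolean
   
     // Raise the nullability error only when the flag is enabled; the type
     // compatibility checks elsewhere in the trait stay untouched.
     def validateNullability(tableAttr: AttributeReference, expr: Expression): Unit = {
       if (checkNullabilityEnabled && expr.nullable && !tableAttr.nullable) {
         throw new AnalysisException(
           s"Cannot write nullable values to non-null column '${tableAttr.name}'")
       }
     }
   }
   ```
   
   If adopted, --conf spark.sql.iceberg.check-nullability=false would then skip only the NOT NULL check during a MERGE, instead of relaxing every store assignment check the way storeAssignmentPolicy=LEGACY does.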
   
   Thanks for your consideration.
   

