bgt-cdedels opened a new issue #2456:
URL: https://github.com/apache/iceberg/issues/2456
In my Spark jobs I am reading from JSON and merging into Iceberg. In my
Iceberg tables I would like to have NOT NULL constraints. However, when
loading data from JSON, Spark doesn't enforce the nullability constraints
of the supplied schema. To work around this I have discovered two alternatives:
```
from pyspark.sql.functions import col

input_dataset = spark.read.schema(my_schema).json("s3://my_bucket/my_folder/").filter(col("my_key").isNotNull())
# Re-apply the schema via an RDD round-trip so its nullability flags stick;
# this deserializes every row, which is what makes this option slow.
input_dataset = spark.createDataFrame(input_dataset.rdd, schema=my_schema)
```
or
```
# Submitted with: --conf spark.sql.storeAssignmentPolicy=LEGACY
input_dataset = spark.read.schema(my_schema).json("s3://my_bucket/my_folder/").filter(col("my_key").isNotNull())
```
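For reference, a minimal sketch of applying the same policy when building the session instead of passing it to spark-submit, assuming you control the SparkSession construction:
```
from pyspark.sql import SparkSession

# LEGACY reverts to the pre-3.0 store-assignment behavior, which (per the
# workaround above) skips the strict nullability check for the whole
# session rather than for a single write.
spark = (
    SparkSession.builder
    .config("spark.sql.storeAssignmentPolicy", "LEGACY")
    .getOrCreate()
)
```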
The first option is very slow, adding 40-60 minutes to the processing time
of my Spark application. The second option seems too permissive. I have
noticed that there is a configuration option named
`spark.sql.iceberg.check-nullability` in the code. I'd like to propose that this
option also be honored in the `AssignmentAlignmentSupport` trait, allowing writers
to bypass NOT NULL constraints while preserving the other compatibility checks:
https://github.com/apache/iceberg/blob/apache-iceberg-0.11.1/spark3-extensions/src/main/scala/org/apache/spark/sql/catalyst/analysis/AssignmentAlignmentSupport.scala#L152
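For illustration, a hedged sketch of how this might look from the writer's side. The conf key exists in the Iceberg source, but honoring it during MERGE alignment is precisely what this issue requests, so the effect shown is hypothetical, and the table and view names are made up:
```
# Hypothetical effect if this proposal were adopted: only the nullability
# check would be skipped during MERGE INTO alignment; the other type
# compatibility checks would still apply.
spark.conf.set("spark.sql.iceberg.check-nullability", "false")
input_dataset.createOrReplaceTempView("updates")
spark.sql("""
    MERGE INTO my_catalog.db.my_table t
    USING updates s
    ON t.my_key = s.my_key
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```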
Thanks for your consideration.