aokolnychyi opened a new pull request #3132:
URL: https://github.com/apache/iceberg/pull/3132


   I think we can improve the way we handle our Spark configuration right now. 
Specifically, I see a few issues.
   
   - Our SQL configs are scattered across a number of classes.
   
   For example, we have some SQL configs in `SparkUtil`, `IcebergSource`, 
`SparkWriteBuilder`, `Spark3Util`. I think having a separate class for such 
constants would make sense, given that we already have `SparkReadOptions` and 
`SparkWriteOptions`.
   
   - We repeat the same code to parse the read/write options, session conf, 
table metadata in multiple places.
   
   For example, we duplicate the code for getting the write file format, split 
related configs, whether to check nullability and order of incoming columns, 
etc. This happens partially because we have Spark 2 and Spark 3 integrations 
but also because we have multiple scans/writers in Spark 3. With upcoming 
support for merge-on-read, there will be even more classes where we would need 
to parse the same arguments.
   
   - The config resolution logic complicates classes where it is defined.
   
   Our writer and scan builders are already complicated so removing the config 
resolution logic and dedicating a separate class for that seems like a 
promising idea.
   
   - We have inconsistent precedence order for Spark configs.
   
   Historically, we interpreted whether a read/write option should take 
precedence over the session config inconsistently. At some point, we had a 
[discussion](https://github.com/apache/iceberg/pull/2248#discussion_r580693954) 
and reached a consensus that the most common way to think about it is read 
option -> session conf -> table metadata. There are still places where the 
session config overrides options. I think we should finally fix that and be 
consistent even though it may be a behavior change.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to