aokolnychyi opened a new pull request #3132: URL: https://github.com/apache/iceberg/pull/3132
I think we can improve the way we handle our Spark configuration right now. Specifically, I see a few issues. - Our SQL configs are scattered across a number of classes. For example, we have some SQL configs in `SparkUtil`, `IcebergSource`, `SparkWriteBuilder`, `Spark3Util`. I think having a separate class for such constants would make sense, given that we already have `SparkReadOptions` and `SparkWriteOptions`. - We repeat the same code to parse the read/write options, session conf, table metadata in multiple places. For example, we duplicate the code for getting the write file format, split related configs, whether to check nullability and order of incoming columns, etc. This happens partially because we have Spark 2 and Spark 3 integrations but also because we have multiple scans/writers in Spark 3. With upcoming support for merge-on-read, there will be even more classes where we would need to parse the same arguments. - The config resolution logic complicates classes where it is defined. Our writer and scan builders are already complicated so removing the config resolution logic and dedicating a separate class for that seems like a promising idea. - We have inconsistent precedence order for Spark configs. Historically, we interpreted whether a read/write option should take precedence over the session config inconsistently. At some point, we had a [discussion](https://github.com/apache/iceberg/pull/2248#discussion_r580693954) and reached a consensus that the most common way to think about it is read option -> session conf -> table metadata. There are still places where the session config overrides options. I think we should finally fix that and be consistent even though it may be a behavior change. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
