Here is the PR: https://github.com/apache/iceberg/pull/3132 <https://github.com/apache/iceberg/pull/3132>
- Anton > On 16 Sep 2021, at 10:45, Anton Okolnychyi <aokolnyc...@apple.com.INVALID> > wrote: > > Hey devs, > > I think we can improve the way we handle our Spark configuration right now. > Specifically, I see a few issues. > > Our SQL configs are scattered across a number of classes. > > For example, we have some SQL configs in SparkUtil, IcebergSource, > SparkWriteBuilder, Spark3Util. I think having a separate class for such > constants would make sense, given that we already have SparkReadOptions and > SparkWriteOptions. > > We repeat the same code to parse the read/write options, session conf, table > metadata in multiple places. > > For example, we duplicate the code for getting the write file format, split > related configs, whether to check nullability and order of incoming columns, > etc. This happens partially because we have Spark 2 and Spark 3 integrations > but also because we have multiple scans/writers in Spark 3. With upcoming > support for merge-on-read, there will be even more classes where we would > need to parse the same arguments. > > The config resolution logic complicates classes where it is defined. > > Our writer and scan builders are already complicated so removing the config > resolution logic and dedicating a separate class for that seems like a > promising idea. > > We have inconsistent precedence order for Spark configs. > > Historically, we interpreted whether a read/write option should take > precedence over the corresponding session config inconsistently. At some > point, we had a discussion [1] and reached consensus that the most common way > to think about it is read option -> session conf -> table metadata. There are > still places where the session config overrides options. I think we should > finally fix that and be consistent even though it may be a behavior change. > > I’ve put together a PR to cover these points and I’d appreciate your > feedback. I mention this on the dev list because the last point is a behavior > change and I want to make sure everyone is OK with that. > > Anton > > [1] - https://github.com/apache/iceberg/pull/2248#discussion_r580693954 > <https://github.com/apache/iceberg/pull/2248#discussion_r580693954>