Here is the PR:

https://github.com/apache/iceberg/pull/3132 
<https://github.com/apache/iceberg/pull/3132>

- Anton

> On 16 Sep 2021, at 10:45, Anton Okolnychyi <aokolnyc...@apple.com.INVALID> 
> wrote:
> 
> Hey devs,
> 
> I think we can improve the way we handle our Spark configuration right now. 
> Specifically, I see a few issues.
> 
> Our SQL configs are scattered across a number of classes.
> 
> For example, we have some SQL configs in SparkUtil, IcebergSource, 
> SparkWriteBuilder, Spark3Util. I think having a separate class for such 
> constants would make sense, given that we already have SparkReadOptions and 
> SparkWriteOptions.
> 
> We repeat the same code to parse the read/write options, session conf, table 
> metadata in multiple places.
> 
> For example, we duplicate the code for getting the write file format, split 
> related configs, whether to check nullability and order of incoming columns, 
> etc. This happens partially because we have Spark 2 and Spark 3 integrations 
> but also because we have multiple scans/writers in Spark 3. With upcoming 
> support for merge-on-read, there will be even more classes where we would 
> need to parse the same arguments.
> 
> The config resolution logic complicates classes where it is defined.
> 
> Our writer and scan builders are already complicated so removing the config 
> resolution logic and dedicating a separate class for that seems like a 
> promising idea.
> 
> We have inconsistent precedence order for Spark configs.
> 
> Historically, we interpreted whether a read/write option should take 
> precedence over the corresponding session config inconsistently. At some 
> point, we had a discussion [1] and reached consensus that the most common way 
> to think about it is read option -> session conf -> table metadata. There are 
> still places where the session config overrides options. I think we should 
> finally fix that and be consistent even though it may be a behavior change.
> 
> I’ve put together a PR to cover these points and I’d appreciate your 
> feedback. I mention this on the dev list because the last point is a behavior 
> change and I want to make sure everyone is OK with that.
> 
> Anton
> 
> [1] - https://github.com/apache/iceberg/pull/2248#discussion_r580693954 
> <https://github.com/apache/iceberg/pull/2248#discussion_r580693954>

Reply via email to