We are talking about DELETE/UPDATE/MERGE operations. These are supported
only through SQL; there is no DataFrame API for them.* Write options are
therefore not applicable, and SQLConf is the only available mechanism for
overriding the table property.
For reference, we currently support setting the distribution mode via write
option, SQLConf, and table property. It seems to me that
https://github.com/apache/iceberg/pull/6838/ is a precedent for what I'd
like to do.
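
To illustrate, distribution mode can already be set at all three levels
today (the SQLConf key below is my best recollection of what PR #6838
added, and the table name is made up; treat both as assumptions):

```sql
-- Table-level default (table property)
ALTER TABLE db.tbl SET TBLPROPERTIES ('write.distribution-mode' = 'hash');

-- Session-level override (SQLConf; key name assumed from PR #6838)
SET spark.sql.iceberg.distribution-mode = range;

-- A per-write option can also be passed from the DataFrame API, e.g.
-- df.writeTo("db.tbl").option("distribution-mode", "none").append()
```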

* It would be of interest to support performing DELETE/UPDATE/MERGE from
DataFrames, but that is a whole other topic.
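
Concretely, the kind of thing I'd like to be able to do looks like this
(the SET key is hypothetical; the actual conf name would be settled in the
PR review, and the table names are made up):

```sql
-- Hypothetical session conf overriding the table's write.merge.mode.
SET spark.sql.iceberg.merge-mode = merge-on-read;

MERGE INTO db.target t
USING db.updates s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```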


On Wed, Jul 26, 2023 at 12:04 PM Ryan Blue <b...@tabular.io> wrote:

> I think we should aim to have the same behavior across properties that are
> set in SQL conf, table config, and write options. Having SQL conf override
> table config for this doesn't make sense to me. If the need is to override
> table configuration, then write options are the right way to do it.
>
> On Wed, Jul 26, 2023 at 10:10 AM Wing Yew Poon <wyp...@cloudera.com.invalid>
> wrote:
>
>> I was on vacation.
>> Currently, write modes (copy-on-write/merge-on-read) can only be set as
>> table properties, and default to copy-on-write. We have a customer who
>> wants to use copy-on-write for certain Spark jobs that write to some
>> Iceberg table and merge-on-read for other Spark jobs writing to the same
>> table, because of the write characteristics of those jobs. This seems like
>> a use case that should be supported. The only way they can do this
>> currently is to toggle the table property as needed before doing the
>> writes. This is not a sustainable workaround.
>> Hence, I think it would be useful to be able to configure the write mode
>> as a SQLConf. I also disagree that the table property should always win;
>> if it did, there would be no way to override it. The existing behavior in
>> SparkConfParser is to use the option if set, else use the session conf if
>> set, else use the table property. This applies across the board.
>> - Wing Yew
>>
>> On Sun, Jul 16, 2023 at 4:48 PM Ryan Blue <b...@tabular.io> wrote:
>>
>>> Yes, I agree that there is value for administrators from having some
>>> things exposed as Spark SQL configuration. That gets much harder when you
>>> want to use the SQLConf for table-level settings, though. For example, the
>>> target split size is something that was an engine setting in the Hadoop
>>> world, even though it makes no sense to use the same setting across vastly
>>> different tables --- think about joining a fact table with a dimension
>>> table.
>>>
>>> Settings like write mode are table-level settings. It matters what is
>>> downstream of the table. You may want to set a *default* write mode, but
>>> the table-level setting should always win. Currently, there are limits to
>>> overriding the write mode in SQL. That's why we should add hints. For
>>> anything beyond that, I think we need to discuss what you're trying to do.
>>> If it's to override a table-level setting with a SQL global, then we should
>>> understand the use case better.
>>>
>>> On Fri, Jul 14, 2023 at 6:09 PM Wing Yew Poon
>>> <wyp...@cloudera.com.invalid> wrote:
>>>
>>>> Also, in the case of write mode (I mean write.delete.mode,
>>>> write.update.mode, write.merge.mode), these cannot be set as options
>>>> currently; they are only settable as table properties.
>>>>
>>>> On Fri, Jul 14, 2023 at 5:58 PM Wing Yew Poon <wyp...@cloudera.com>
>>>> wrote:
>>>>
>>>>> I think that different use cases benefit from or even require
>>>>> different solutions. I think enabling options in Spark SQL is helpful, but
>>>>> allowing some configurations to be done in SQLConf is also helpful.
>>>>> For Cheng Pan's use case (to disable locality), I think providing a
>>>>> conf (which can be added to spark-defaults.conf by a cluster admin) is
>>>>> useful.
>>>>> For my customer's use case (
>>>>> https://github.com/apache/iceberg/pull/7790), being able to set the
>>>>> write mode per Spark job (where right now it can only be set as a table
>>>>> property) is useful. Allowing this to be done in the SQL with an
>>>>> option/hint could also work, but as I understand it, Szehon's PR (
>>>>> https://github.com/apache/spark/pull/416830) is only applicable to
>>>>> reads, not writes.
>>>>>
>>>>> - Wing Yew
>>>>>
>>>>>
>>>>> On Thu, Jul 13, 2023 at 1:04 AM Cheng Pan <pan3...@gmail.com> wrote:
>>>>>
>>>>>> Ryan, I understand that an option should be job-specific, and that
>>>>>> introducing an OPTIONS HINT can give Spark SQL capabilities similar to
>>>>>> those of the DataFrame API.
>>>>>>
>>>>>> My point is, some of the Iceberg options should not be job-specific.
>>>>>>
>>>>>> For example, Iceberg has an option “locality” which can only be set at
>>>>>> the job level, whereas Spark has a configuration
>>>>>> “spark.shuffle.reduceLocality.enabled” which can be set at the cluster
>>>>>> level. This gap blocks Spark administrators from migrating to Iceberg,
>>>>>> because they cannot disable locality at the cluster level.
>>>>>>
>>>>>> So, what is the principle in Iceberg for classifying a configuration
>>>>>> as a SQLConf or an OPTION?
>>>>>>
>>>>>> Thanks,
>>>>>> Cheng Pan
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> > On Jul 5, 2023, at 16:26, Cheng Pan <pan3...@gmail.com> wrote:
>>>>>> >
>>>>>> > I would argue that the SQLConf way is more in line with Spark
>>>>>> user/administrator habits.
>>>>>> >
>>>>>> > It’s common practice for Spark administrators to set configurations
>>>>>> > in spark-defaults.conf at the cluster level, and when users run into
>>>>>> > issues with their Spark SQL jobs, the first question they ask is
>>>>>> > usually: can it be fixed by adding a Spark configuration?
>>>>>> >
>>>>>> > The OPTIONS way adds a learning burden for Spark users, and how would
>>>>>> > Spark administrators set such options at the cluster level?
>>>>>> >
>>>>>> > Thanks,
>>>>>> > Cheng Pan
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> >> On Jun 17, 2023, at 04:01, Wing Yew Poon
>>>>>> <wyp...@cloudera.com.INVALID> wrote:
>>>>>> >>
>>>>>> >> Hi,
>>>>>> >> I recently put up a PR,
>>>>>> https://github.com/apache/iceberg/pull/7790, to allow the write mode
>>>>>> (copy-on-write/merge-on-read) to be specified in SQLConf. The use case is
>>>>>> explained in the PR.
>>>>>> >> Cheng Pan has an open PR,
>>>>>> https://github.com/apache/iceberg/pull/7733, to allow locality to be
>>>>>> specified in SQLConf.
>>>>>> >> In the recent past, https://github.com/apache/iceberg/pull/6838/
>>>>>> was a PR to allow the write distribution mode to be specified in SQLConf.
>>>>>> This was merged.
>>>>>> >> Cheng Pan asks if there is any guidance on when we should allow
>>>>>> configs to be specified in SQLConf.
>>>>>> >> Thanks,
>>>>>> >> Wing Yew
>>>>>> >>
>>>>>> >> ps. The above open PRs could use reviews by committers.
>>>>>> >>
>>>>>> >
>>>>>>
>>>>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>
>
> --
> Ryan Blue
> Tabular
>
