How do we deal with forward compatibility? Consider the case where Spark
adds a new "property" (capability string). An existing data source may in
fact support that behavior, but because the string was not defined when the
source was written, the source never declares it. The new version of Spark
would then treat that data source as not supporting the property and throw
an exception.
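
One possible way to handle this, purely a sketch with hypothetical names
(nothing like this is a settled API), is to pair every capability string
with a default that Spark assumes when a table does not mention it at all,
so older sources keep working when new strings are introduced:

// Sketch only: each capability carries a default that applies when the
// table predates the string and therefore never declared it.
object TableCapabilities {
  // Must be declared explicitly by the table.
  val CONTINUOUS_STREAMING = "continuous-streaming"
  // Added later; assumed supported unless the table opts out.
  val READ_MISSING_COLUMNS_AS_NULL = "read-missing-columns-as-null"

  private val defaults: Map[String, Boolean] = Map(
    CONTINUOUS_STREAMING -> false,
    READ_MISSING_COLUMNS_AS_NULL -> true
  )

  // declared: what the table actually reported, if anything.
  def isSupported(declared: Map[String, Boolean], capability: String): Boolean =
    declared.getOrElse(capability, defaults.getOrElse(capability, false))
}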


On Fri, Nov 9, 2018 at 9:11 AM Ryan Blue <rb...@netflix.com> wrote:

> I'd have two places. First, a class that defines properties supported and
> identified by Spark, like the SQLConf definitions. Second, in documentation
> for the v2 table API.
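>
> As a rough sketch (hypothetical names, mirroring how SQLConf centralizes
> its keys), that class could be little more than a set of constants:
>
> object TableCapability {
>   val CONTINUOUS_STREAMING = "continuous-streaming"
>   val READ_MISSING_COLUMNS_AS_NULL = "read-missing-columns-as-null"
>   // All strings Spark recognizes, so typos can be caught in one place.
>   val knownCapabilities: Set[String] =
>     Set(CONTINUOUS_STREAMING, READ_MISSING_COLUMNS_AS_NULL)
> }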
>
> On Fri, Nov 9, 2018 at 9:00 AM Felix Cheung <felixcheun...@hotmail.com>
> wrote:
>
>> One question is where will the list of capability strings be defined?
>>
>>
>> ------------------------------
>> *From:* Ryan Blue <rb...@netflix.com.invalid>
>> *Sent:* Thursday, November 8, 2018 2:09 PM
>> *To:* Reynold Xin
>> *Cc:* Spark Dev List
>> *Subject:* Re: DataSourceV2 capability API
>>
>>
>> Yes, we currently use traits that have methods. But something like
>> “supports reading missing columns” doesn’t need to carry any methods. The
>> other case is where we don’t have an object to test for a trait
>> (scan.isInstanceOf[SupportsBatch]) until pushdown is done and we have a
>> Scan. Getting to that point can be expensive, so a capability check lets
>> us fail faster.
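>>
>> To make the cost difference concrete, a sketch (the method names are
>> assumptions about the eventual v2 API, and "batch-scan" is a hypothetical
>> capability string):
>>
>> // Trait check: only possible after building the scan and completing
>> // pushdown, which may already have done expensive work.
>> val scan = table.newScanBuilder(options).build()
>> if (!scan.isInstanceOf[SupportsBatch]) {
>>   throw new AnalysisException("Table does not support batch scans")
>> }
>>
>> // Capability check: answerable from the table alone, before pushdown.
>> if (!table.isSupported("batch-scan")) {
>>   throw new AnalysisException("Table does not support batch scans")
>> }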
>>
>> On Thu, Nov 8, 2018 at 1:54 PM Reynold Xin <r...@databricks.com> wrote:
>>
>>> This is currently accomplished by having traits that data sources can
>>> extend, as well as runtime exceptions, right? It's hard to argue one way
>>> vs. the other without knowing how things will evolve (e.g., how many
>>> different capabilities there will be).
>>>
>>>
>>> On Thu, Nov 8, 2018 at 12:50 PM Ryan Blue <rb...@netflix.com.invalid>
>>> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> I’d like to propose an addition to DataSourceV2 tables: a capability
>>>> API. It would allow Spark to ask a table whether it supports a given
>>>> capability:
>>>>
>>>> val table = catalog.load(identifier)
>>>> val supportsContinuous = table.isSupported("continuous-streaming")
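>>>>
>>>> On the table side this could be as small as one method with a safe
>>>> default (a sketch, not a settled signature):
>>>>
>>>> trait Table {
>>>>   def name: String
>>>>   // False unless the implementation explicitly declares the capability.
>>>>   def isSupported(capability: String): Boolean = false
>>>> }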
>>>>
>>>> There are a couple of use cases for this. First, we want to be able to
>>>> fail fast when a user tries to stream a table that doesn’t support
>>>> streaming. The design of our read implementation doesn’t necessarily
>>>> allow this: if we want to share the same “scan” across streaming and
>>>> batch, then we need to “branch” in the API after that point, which is at
>>>> odds with failing fast. A capability check would let us fail fast and
>>>> keep that concern out of the read design.
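>>>>
>>>> As a sketch, the streaming path could then check the capability up
>>>> front, at analysis time, before any scan exists (names hypothetical):
>>>>
>>>> if (isStreaming && !table.isSupported("continuous-streaming")) {
>>>>   throw new AnalysisException(
>>>>     s"Table ${table.name} does not support continuous streaming")
>>>> }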
>>>>
>>>> I also want to use capabilities to change the behavior of some
>>>> validation rules. The rule that validates appends, for example, doesn’t
>>>> allow a write that is missing an optional column. That’s because the
>>>> current v1 sources don’t support reading when columns are missing. But
>>>> Iceberg does support reading a missing column as nulls, so that users can
>>>> add a column to a table without breaking a scheduled job that populates the
>>>> table. To fix this problem, I would use a table capability, like
>>>> read-missing-columns-as-null.
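>>>>
>>>> In the append validation rule, that branch might look roughly like this
>>>> (a sketch; it ignores the nullability checks the real rule would also
>>>> need to do):
>>>>
>>>> // Columns present in the table but absent from the query's output.
>>>> val missing = table.schema.fieldNames.toSet -- query.output.map(_.name)
>>>> if (missing.nonEmpty &&
>>>>     !table.isSupported("read-missing-columns-as-null")) {
>>>>   throw new AnalysisException(
>>>>     s"Cannot write to table, missing columns: ${missing.mkString(", ")}")
>>>> }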
>>>>
>>>> Any comments on this approach?
>>>>
>>>> rb
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
