So the external catalogue is really more of a connector to a different source of truth for table listings? That makes more sense.
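To check my understanding, here is roughly how I picture the catalog, rather than the USING string, deciding the implementation. This is only a sketch with names I made up; it is not the actual TableCatalog API from #21306:

```scala
// A minimal sketch of how I'm reading the proposal. These names are my own
// simplification, not the actual API from PR #21306.

// The table returned by the catalog carries whatever the catalog decided
// about storage; the USING string only travels along as a property.
case class Table(name: String, properties: Map[String, String])

trait TableCatalog {
  // Everything from CREATE TABLE, including USING as e.g. a "provider"
  // property, is handed to the catalog, which decides what to do with it.
  def createTable(
      name: String,
      schema: Map[String, String],          // column name -> type, simplified
      properties: Map[String, String]): Table
}

// A catalog that is only a connector to Cassandra's own table listings.
// It can reject or ignore USING, because Cassandra is the only storage it has.
class CassandraCatalog extends TableCatalog {
  override def createTable(
      name: String,
      schema: Map[String, String],
      properties: Map[String, String]): Table = {
    properties.get("provider").filter(_ != "cassandra").foreach { p =>
      throw new IllegalArgumentException(s"Cassandra tables can't be stored as '$p'")
    }
    // ... issue the CQL CREATE TABLE against the cluster here ...
    Table(name, properties)
  }
}

// By contrast, a catalog wrapping today's ExternalCatalog would keep the
// provider property and return a Table whose ReadSupport/WriteSupport match
// it, so existing parquet/orc/json tables keep working as they do today.
```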
On Tue, Aug 21, 2018 at 6:16 PM Ryan Blue <rb...@netflix.com> wrote:

> I don’t understand why a Cassandra Catalogue wouldn’t be able to store metadata references for a parquet table just as a Hive Catalogue can store references to a C* data source.
>
> Sorry for the confusion. I’m not talking about a catalog that stores its information in Cassandra. I’m talking about a catalog that talks directly to Cassandra to expose Cassandra tables. It wouldn’t make sense because Cassandra doesn’t store data in Parquet files. The reason you’d want a catalog like this is to maintain a single source of truth for that metadata, instead of keeping the canonical metadata for a table in the system that manages that table and other metadata in Spark’s metastore catalog.
>
> You could certainly build a catalog implementation that stores its data in Cassandra or JDBC and supports the same tables that Spark does today. That’s just not what I’m talking about here.
>
> On Mon, Aug 20, 2018 at 7:31 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>
>> I'm not sure I follow what the discussion topic is here.
>>
>> > For example, a Cassandra catalog or a JDBC catalog that exposes tables in those systems will definitely not support users marking tables with the “parquet” data source.
>>
>> I don't understand why a Cassandra Catalogue wouldn't be able to store metadata references for a parquet table just as a Hive Catalogue can store references to a C* data source. We currently store HiveMetastore data in a C* table, and this allows us to store tables with any underlying format even though the catalogue's implementation is written in C*.
>>
>> Is the idea here that a table can't have multiple underlying formats in a given catalogue? And that USING can then be used on read to force a particular format?
>>
>> > I think other catalogs should be able to choose what to do with the USING string
>>
>> This makes sense to me, but I'm not sure why any catalogue would want to ignore this?
>>
>> It would be helpful to me to have a few examples written out, if that is possible, with Old Implementation and New Implementation.
>>
>> Thanks for your time,
>> Russ
>>
>> On Mon, Aug 20, 2018 at 11:33 AM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>
>>> Thanks for posting this discussion to the dev list. It would be great to hear what everyone thinks about the idea that USING should be a catalog-specific storage configuration.
>>>
>>> Related to this, I’ve updated the catalog PR, #21306 <https://github.com/apache/spark/pull/21306>, to include an implementation that translates from the v2 TableCatalog API <https://github.com/apache/spark/pull/21306/files#diff-a9d913d11630b965ef5dd3d3a02ca452> to the current catalog API. That shows how this would fit together with v1, at least for the catalog part. This will enable all of the new query plans to be written to the TableCatalog API, even if they end up using the default v1 catalog.
>>>
>>> On Mon, Aug 20, 2018 at 12:19 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I have been trying to follow up on `USING` syntax support, since that currently looks unsupported whereas the `format` API supports it. I have been trying to understand why, and talked with Ryan.
>>>>
>>>> Ryan knows all the details, and he and I thought it would be good to post this here - I just started to look into this.
>>>>
>>>> Here is Ryan's response:
>>>>
>>>> > USING is currently used to select the underlying data source implementation directly. The string passed in USING, or format in the DF API, is used to resolve an implementation class.
>>>>
>>>> > The existing catalog supports tables that specify their data source implementation, but this will not be the case for all catalogs when Spark adds multiple catalog support. For example, a Cassandra catalog or a JDBC catalog that exposes tables in those systems will definitely not support users marking tables with the “parquet” data source. The catalog must have the ability to determine the data source implementation. That’s why I think it is valuable to think of the current ExternalCatalog as one that can track tables with any read/write implementation. Other catalogs can’t and won’t do that.
>>>>
>>>> > In the catalog v2 API <https://github.com/apache/spark/pull/21306> I’ve proposed, everything from CREATE TABLE is passed to the catalog. Then the catalog determines what source to use and returns a Table instance that uses some class for its ReadSupport and WriteSupport implementation. An ExternalCatalog exposed through that API would receive the USING or format string as a table property and would return a Table that uses the correct ReadSupport, so tables stored in an ExternalCatalog will work as they do today.
>>>>
>>>> > I think other catalogs should be able to choose what to do with the USING string. An Iceberg <https://github.com/Netflix/iceberg> catalog might use this to determine the underlying file format, which could be parquet, orc, or avro. Or, a JDBC catalog might use it for the underlying table implementation in the DB. This would make the property more of a storage hint for the catalog, which is going to determine the read/write implementation anyway.
>>>>
>>>> > For cases where there is no catalog involved, the current plan is to use the reflection-based approach from v1 with the USING or format string. In v2, that should resolve a ReadSupportProvider, which is used to create a ReadSupport directly from options. I think this is a good approach for backward compatibility, but it can’t provide the same features as a catalog-based table. Catalogs are how we have decided to build reliable behavior for CTAS and the other standard logical plans <https://docs.google.com/document/d/1gYm5Ji2Mge3QBdOliFV5gSPTKlX4q1DCBXIkiyMv62A/edit?ts=5a987801#heading=h.m45webtwxf2d>. CTAS is a create and then an insert, and a write implementation alone can’t provide that create operation.
>>>>
>>>> I was targeting the last case (where there is no catalog involved) in particular. I was thinking that approach is also good since `USING` syntax compatibility should be kept anyway - this should reduce the migration cost as well. I was wondering what you all think about this.
>>>> If you think the last case should be supported anyway, I was thinking we could just proceed with it orthogonally. If you think other issues should be resolved first, I think we (at least I) should take a look at the set of catalog APIs.
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>
> --
> Ryan Blue
> Software Engineer
> Netflix
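PS: to make the "old implementation vs. new implementation" examples requested earlier in the thread a little more concrete, below is very rough pseudocode for how I currently read the two CTAS paths. Every name is made up for illustration; none of this is the actual planner or DataSource code.

```scala
// Old / no-catalog path: the USING (or format) string resolves a source
// implementation class by reflection, and CTAS is effectively just a write;
// there is no separate "create" step for Spark to manage.
def ctasToday(provider: String, options: Map[String, String]): Unit = {
  // simplified: real resolution also maps short names like "parquet" to classes
  val sourceClass: Class[_] = Class.forName(provider)
  // instantiate sourceClass, hand it the options, and write
}

// Proposed / catalog path: CREATE TABLE is a catalog call (USING travels
// along as a property), and the insert is a write to the returned Table,
// which already knows its ReadSupport/WriteSupport. That split is what lets
// a catalog give CTAS a reliable create-then-insert behavior.
def ctasProposed(
    catalog: TableCatalog,               // the TableCatalog trait from the sketch above
    name: String,
    schema: Map[String, String],
    properties: Map[String, String]): Unit = {
  val table = catalog.createTable(name, schema, properties) // create
  // write to `table` via its WriteSupport                   // then insert
}
```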