So the external catalogue is really more of a connector to a different source of truth for table listings? That makes more sense.
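To check my understanding, here is roughly how I picture the catalog, rather than the USING string, deciding the implementation. This is only a sketch with names I made up; it is not the actual TableCatalog API from #21306:

```scala
// A minimal sketch of how I'm reading the proposal. These names are my own
// simplification, not the actual API from PR #21306.

// The table returned by the catalog carries whatever the catalog decided
// about storage; the USING string only travels along as a property.
case class Table(name: String, properties: Map[String, String])

trait TableCatalog {
  // Everything from CREATE TABLE, including USING as e.g. a "provider"
  // property, is handed to the catalog, which decides what to do with it.
  def createTable(
      name: String,
      schema: Map[String, String],          // column name -> type, simplified
      properties: Map[String, String]): Table
}

// A catalog that is only a connector to Cassandra's own table listings.
// It can reject or ignore USING, because Cassandra is the only storage it has.
class CassandraCatalog extends TableCatalog {
  override def createTable(
      name: String,
      schema: Map[String, String],
      properties: Map[String, String]): Table = {
    properties.get("provider").filter(_ != "cassandra").foreach { p =>
      throw new IllegalArgumentException(s"Cassandra tables can't be stored as '$p'")
    }
    // ... issue the CQL CREATE TABLE against the cluster here ...
    Table(name, properties)
  }
}

// By contrast, a catalog wrapping today's ExternalCatalog would keep the
// provider property and return a Table whose ReadSupport/WriteSupport match
// it, so existing parquet/orc/json tables keep working as they do today.
```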
On Tue, Aug 21, 2018 at 6:16 PM Ryan Blue <rb...@netflix.com> wrote:

> I don’t understand why a Cassandra Catalogue wouldn’t be able to store metadata references for a parquet table just as a Hive Catalogue can store references to a C* data source.
>
> Sorry for the confusion. I’m not talking about a catalog that stores its information in Cassandra. I’m talking about a catalog that talks directly to Cassandra to expose Cassandra tables. It wouldn’t make sense because Cassandra doesn’t store data in Parquet files. The reason you’d want a catalog like this is to maintain a single source of truth for that metadata, instead of keeping the canonical metadata for a table in the system that manages that table and other metadata in Spark’s metastore catalog.
>
> You could certainly build a catalog implementation that stores its data in Cassandra or JDBC and supports the same tables that Spark does today. That’s just not what I’m talking about here.
>
> On Mon, Aug 20, 2018 at 7:31 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>
>> I'm not sure I follow what the discussion topic is here.
>>
>> > For example, a Cassandra catalog or a JDBC catalog that exposes tables in those systems will definitely not support users marking tables with the “parquet” data source.
>>
>> I don't understand why a Cassandra Catalogue wouldn't be able to store metadata references for a parquet table just as a Hive Catalogue can store references to a C* data source. We currently store HiveMetastore data in a C* table, and this allows us to store tables with any underlying format even though the catalogue's implementation is written in C*.
>>
>> Is the idea here that a table can't have multiple underlying formats in a given catalogue? And that USING can then be used on read to force a particular format?
>>
>> > I think other catalogs should be able to choose what to do with the USING string
>>
>> This makes sense to me, but I'm not sure why any catalogue would want to ignore this?
>>
>> It would be helpful to me to have a few examples written out, if that is possible, with Old Implementation and New Implementation.
>>
>> Thanks for your time,
>> Russ
>>
>> On Mon, Aug 20, 2018 at 11:33 AM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>
>>> Thanks for posting this discussion to the dev list. It would be great to hear what everyone thinks about the idea that USING should be a catalog-specific storage configuration.
>>>
>>> Related to this, I’ve updated the catalog PR, #21306 <https://github.com/apache/spark/pull/21306>, to include an implementation that translates from the v2 TableCatalog API <https://github.com/apache/spark/pull/21306/files#diff-a9d913d11630b965ef5dd3d3a02ca452> to the current catalog API. That shows how this would fit together with v1, at least for the catalog part. This will enable all of the new query plans to be written to the TableCatalog API, even if they end up using the default v1 catalog.
>>>
>>> On Mon, Aug 20, 2018 at 12:19 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I have been trying to follow up on `USING` syntax support, since that currently looks unsupported whereas the `format` API supports it. I have been trying to understand why, and talked with Ryan.
>>>>
>>>> Ryan knows all the details, and he and I thought it would be good to post this here - I just started to look into this.
>>>>
>>>> Here is Ryan's response:
>>>>
>>>> > USING is currently used to select the underlying data source implementation directly. The string passed in USING, or format in the DF API, is used to resolve an implementation class.
>>>>
>>>> > The existing catalog supports tables that specify their data source implementation, but this will not be the case for all catalogs when Spark adds multiple catalog support. For example, a Cassandra catalog or a JDBC catalog that exposes tables in those systems will definitely not support users marking tables with the “parquet” data source. The catalog must have the ability to determine the data source implementation. That’s why I think it is valuable to think of the current ExternalCatalog as one that can track tables with any read/write implementation. Other catalogs can’t and won’t do that.
>>>>
>>>> > In the catalog v2 API <https://github.com/apache/spark/pull/21306> I’ve proposed, everything from CREATE TABLE is passed to the catalog. Then the catalog determines what source to use and returns a Table instance that uses some class for its ReadSupport and WriteSupport implementation. An ExternalCatalog exposed through that API would receive the USING or format string as a table property and would return a Table that uses the correct ReadSupport, so tables stored in an ExternalCatalog will work as they do today.
>>>>
>>>> > I think other catalogs should be able to choose what to do with the USING string. An Iceberg <https://github.com/Netflix/iceberg> catalog might use this to determine the underlying file format, which could be parquet, orc, or avro. Or, a JDBC catalog might use it for the underlying table implementation in the DB. This would make the property more of a storage hint for the catalog, which is going to determine the read/write implementation anyway.
>>>>
>>>> > For cases where there is no catalog involved, the current plan is to use the reflection-based approach from v1 with the USING or format string. In v2, that should resolve a ReadSupportProvider, which is used to create a ReadSupport directly from options. I think this is a good approach for backward compatibility, but it can’t provide the same features as a catalog-based table. Catalogs are how we have decided to build reliable behavior for CTAS and the other standard logical plans <https://docs.google.com/document/d/1gYm5Ji2Mge3QBdOliFV5gSPTKlX4q1DCBXIkiyMv62A/edit?ts=5a987801#heading=h.m45webtwxf2d>. CTAS is a create and then an insert, and a write implementation alone can’t provide that create operation.
>>>>
>>>> I was targeting the last case (where there is no catalog involved) in particular. I was thinking that approach is also good since `USING` syntax compatibility should be kept anyway - this should reduce the migration cost as well. I was wondering what you all think about this.
>>>> If you think the last case should be supported anyway, I was thinking we could just proceed with it orthogonally. If you think other issues should be resolved first, I think we (at least I) should take a look at the set of catalog APIs.
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>
> --
> Ryan Blue
> Software Engineer
> Netflix
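PS: to make the "old implementation vs. new implementation" examples requested earlier in the thread a little more concrete, below is very rough pseudocode for how I currently read the two CTAS paths. Every name is made up for illustration; none of this is the actual planner or DataSource code.

```scala
// Old / no-catalog path: the USING (or format) string resolves a source
// implementation class by reflection, and CTAS is effectively just a write;
// there is no separate "create" step for Spark to manage.
def ctasToday(provider: String, options: Map[String, String]): Unit = {
  // simplified: real resolution also maps short names like "parquet" to classes
  val sourceClass: Class[_] = Class.forName(provider)
  // instantiate sourceClass, hand it the options, and write
}

// Proposed / catalog path: CREATE TABLE is a catalog call (USING travels
// along as a property), and the insert is a write to the returned Table,
// which already knows its ReadSupport/WriteSupport. That split is what lets
// a catalog give CTAS a reliable create-then-insert behavior.
def ctasProposed(
    catalog: TableCatalog,               // the TableCatalog trait from the sketch above
    name: String,
    schema: Map[String, String],
    properties: Map[String, String]): Unit = {
  val table = catalog.createTable(name, schema, properties) // create
  // write to `table` via its WriteSupport                   // then insert
}
```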