Hi Ryan,

Thanks for the explanation! It sheds light on a few areas but also raises
some questions.

My main conclusion on the Kafka connector side is to keep v1 as the
default, give users some time to migrate to v2, and delete v1 later once v2
is stable (which makes sense from my perspective).

The interesting part is that the Kafka micro-batch source already uses v2
by default, and I don't fully understand how that fits into this plan.
Please see this test:
https://github.com/apache/spark/blob/54da3bbfb2c936827897c52ed6e5f0f428b98e9f/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchSourceSuite.scala#L1084
Since https://issues.apache.org/jira/browse/SPARK-23362 was merged into
2.4, it shouldn't be a breaking change (I assume the batch part is similar).
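
As far as I can tell from the suite, the v1 reader can still be forced per
source, so even with v2 as the default nothing should break. Roughly (the
config key and provider class are my reading of the test, so please treat
them as assumptions and correct me if I'm wrong):

  // Fall back to the v1 Kafka micro-batch source for this provider.
  spark.conf.set(
    "spark.sql.streaming.disabledV2MicroBatchReaders",
    "org.apache.spark.sql.kafka010.KafkaSourceProvider")

  // The query itself looks the same either way.
  val df = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "topic")
    .load()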

We can continue the discussion about the Kafka batch v1/v2 default on
https://github.com/apache/spark/pull/24738 so we don't flood everybody here.

Please send me an invite to the sync meeting. I'm not sure exactly when it
happens, but I presume it's during the night from a CET perspective. I'll
try to organize my time so I can participate...

BR,
G


On Thu, Jun 20, 2019 at 8:24 PM Ryan Blue <rb...@netflix.com> wrote:

> Hi Gabor,
>
> First, a little context... one of the goals of DSv2 is to standardize the
> behavior of SQL operations in Spark. For example, running CTAS when a table
> exists will fail, not take some action depending on what the source
> chooses, like drop & CTAS, inserting, or failing.
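>
> As a sketch (the table name and column are made up, not from any real
> deployment):
>
>   spark.sql("CREATE TABLE t USING parquet AS SELECT 1 AS id")
>   spark.sql("CREATE TABLE t USING parquet AS SELECT 2 AS id")  // t exists
>
> Under v1 the outcome of the second statement depends on the source; under
> v2 it consistently fails with a table-already-exists error.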
>
> Unfortunately, this means that DSv1 can't be easily replaced because it
> has behavior differences between sources. In addition, we're not really
> sure how DSv1 works in all cases -- it really depends on what seemed
> reasonable to authors at the time. For example, we don't have a good
> understanding of how file-based tables behave (those not backed by a
> Metastore). There are also changes that we know are breaking and are okay
> with, like only inserting safe casts when writing with v2.
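>
> For illustration only (assume a table t with a single int column i; the
> names are mine):
>
>   spark.sql("INSERT INTO t SELECT 'not-a-number'")
>
> With v1 semantics this gets a silent string-to-int cast (producing null);
> with v2 the write should be rejected at analysis time, because that cast
> isn't safe.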
>
> Because of this, we can't just replace v1 with v2 transparently, so the
> plan is to allow deployments to migrate to v2 in stages. Here's the plan:
> 1. Use v1 by default so all existing queries work as they do today for
> identifiers like `db.table`
> 2. Allow users to add additional v2 catalogs that will be used when
> identifiers explicitly start with a catalog name, like
> `test_catalog.db.table` (see the sketch after this list)
> 3. Add a v2 catalog that delegates to the session catalog, so that v2
> read/write implementations can be used, but are stored just like v1 tables
> in the session catalog
> 4. Add a setting to use a v2 catalog as the default. Setting this would
> use a v2 catalog for all identifiers without a catalog, like `db.table`
> 5. Add a way for a v2 catalog to return a table that gets converted to v1.
> This is what `CatalogTableAsV2` does in #24768
> <https://github.com/apache/spark/pull/24768>.
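>
> As a rough sketch of how points 2 and 4 would look from the user side (the
> config keys are what I expect the catalog plugin API to expose and the
> catalog class is made up, so treat both as assumptions):
>
>   // Point 2: register an additional v2 catalog named test_catalog.
>   spark.conf.set("spark.sql.catalog.test_catalog", "com.example.TestCatalog")
>   // Identifiers that start with the catalog name resolve against it.
>   val t = spark.table("test_catalog.db.table")
>
>   // Point 4: make a v2 catalog the default for bare db.table identifiers.
>   spark.conf.set("spark.sql.defaultCatalog", "test_catalog")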
>
> PR #24768 <https://github.com/apache/spark/pull/24768> implements the
> rest of these changes. Specifically, we initially used the default catalog
> for v2 sources, but that causes namespace problems, so we need the v2
> session catalog (point #3) as the default when there is no default v2
> catalog.
>
> I hope that answers your question. If not, I'm happy to answer follow-ups
> and we can add this as a topic in the next v2 sync on Wednesday. I'm also
> planning on talking about metadata columns or function push-down from the
> Kafka v2 PR at that sync, so you may want to attend.
>
> rb
>
>
> On Thu, Jun 20, 2019 at 4:45 AM Gabor Somogyi <gabor.g.somo...@gmail.com>
> wrote:
>
>> Hi All,
>>
>>   I've taken a look at the code and docs to find out when the DSv1
>> sources have to be removed (where a DSv2 replacement is implemented). After
>> some digging I've found DSv1 sources that have already been removed, but in
>> some cases v1 and v2 still exist in parallel.
>>
>> Can somebody please tell me what's the overall plan in this area?
>>
>> BR,
>> G
>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
