Re: DataSourceV2 sync notes - 20 Feb 2019

2019-03-05 Thread Stavros Kontopoulos
Thanks Ryan!

Re: DataSourceV2 sync notes - 20 Feb 2019

2019-03-05 Thread Ryan Blue
Everyone is welcome to join this discussion. Just send me an e-mail to get
added to the invite.

Stavros, I'll add you.

rb

Re: DataSourceV2 sync notes - 20 Feb 2019

2019-03-05 Thread Stavros Kontopoulos
Thanks for the update. Is this meeting open for other people to join?

Stavros

DataSourceV2 sync notes - 20 Feb 2019

2019-02-21 Thread Ryan Blue
Here are my notes from the DSv2 sync last night. As always, if you have
corrections, please reply with them. And if you’d like to be included on
the invite to participate in the next sync (6 March), send me an email.

Here’s a quick summary of the topics where we had consensus last night:

   - The behavior of v1 sources needs to be documented before we can
   come up with a migration plan
   - Spark 3.0 should include DSv2, even if it would delay the release
   (pending community discussion and vote)
   - Design for the v2 Catalog plugin system
   - V2 catalog approach of separate TableCatalog, FunctionCatalog, and
   ViewCatalog interfaces
   - Common v2 Table metadata should be schema, partitioning, and a
   string-map of properties, leaving out sort order for now. (Ready to
   vote on the metadata SPIP; a rough sketch of these interfaces follows
   this list.)
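
[Ed: To make the consensus items above concrete, here is a rough Scala
sketch of the shapes under discussion. The trait and method names follow
the proposals, but the signatures are illustrative assumptions, not the
final API.]

  import org.apache.spark.sql.types.StructType

  // Simplified identifier; the multi-catalog proposal defines the real one.
  case class Identifier(namespace: Seq[String], name: String)

  // Agreed-on common v2 table metadata: schema, partitioning, and a
  // string-map of properties. Sort order is intentionally left out for now.
  trait Table {
    def name: String
    def schema: StructType
    def partitioning: Seq[String] // partition expressions, simplified to strings
    def properties: java.util.Map[String, String]
  }

  // Catalog capabilities split into separate interfaces rather than one
  // monolithic catalog API; implementations pick what they support.
  trait CatalogPlugin {
    def initialize(name: String, options: java.util.Map[String, String]): Unit
  }

  trait TableCatalog extends CatalogPlugin {
    def loadTable(ident: Identifier): Table // throws if the table doesn't exist
    def createTable(
        ident: Identifier,
        schema: StructType,
        partitioning: Seq[String],
        properties: java.util.Map[String, String]): Table
    def dropTable(ident: Identifier): Boolean
  }

  trait FunctionCatalog extends CatalogPlugin { /* function lookup methods */ }
  trait ViewCatalog extends CatalogPlugin { /* view metadata methods */ }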

*Topics*:

   - Issues raised by ORC v2 commit
   - Migration to v2 sources
   - Roadmap and current blockers
   - Catalog plugin system
   - Catalog API separate interfaces approach
   - Catalog API metadata (schema, partitioning, and properties)
   - Public catalog API proposal

*Notes*:

   - Issues raised by ORC v2 commit
  - Ryan: Disabled the change to use v2 by default in the overwrite plans
  PR: tests rely on CTAS, which is not implemented in v2.
  - Wenchen: suggested using a StagedTable to work around CTAS not being
  finished. TableProvider could create a staged table.
  - Ryan: Using StagedTable doesn’t make sense to me. It was intended
  to solve a different problem (atomicity). Adding an interface to create
  a staged table either requires the same metadata as CTAS or requires a
  blank staged table, which isn’t the same concept: these staged tables
  would behave entirely differently from the ones for atomic operations.
  Better to spend time getting CTAS done and work through the long-term
  plan than to hack around it.
  - The second issue raised by the ORC work: how to support tables that
  use different validation rules.
  - Ryan: What Gengliang’s PRs are missing is a clear definition of
  what tables require different validation and what that validation should
  be. In some cases, CTAS is validated against existing data [Ed: this is
  PreprocessTableCreation] and in some cases, Append has no validation
  because the table doesn’t exist. What isn’t clear is when these
  validations are applied.
  - Ryan: Without knowing exactly how v1 works, we can’t mirror that
  behavior in v2. Building a way to turn off validation is going to be
  needed, but is insufficient without knowing when to apply it.
  - Ryan: We also don’t know if it will make sense to maintain all of
  these rules to mimic v1 behavior. In v1, CTAS and Append can both write
  to existing tables, but use different rules to validate. What are the
  differences between them? It is unlikely that Spark will support both as
  options, if that is even possible. [Ed: see later discussion on
  migration that continues this.]
  - Gengliang: Using SaveMode is an option.
  - Ryan: Using SaveMode only appears to fix this, but doesn’t actually
  test v2. Using SaveMode appears to work because it disables all
  validation and uses code from v1 that will “create” tables by writing.
  But this isn’t helpful for the v2 goal of having defined and reliable
  behavior.
  - Gengliang: SaveMode is not correctly translated. Append could mean
  AppendData or CTAS.
  - Ryan: This is why we need to focus on finishing the v2 plans: so we
  can correctly translate each SaveMode into the right plan. That depends
  on having a catalog for CTAS and for checking whether a table exists.
  [Ed: a sketch of this translation follows this section.]
  - Wenchen: Catalog doesn’t support path tables, so how does this help?
  - Ryan: The multi-catalog identifiers proposal includes a way to pass
  paths as CatalogIdentifiers. [Ed: see PathIdentifier]. This allows a
  catalog implementation to handle path-based tables. The identifier will
  also have a method to test whether the identifier is a path identifier,
  and catalogs are not required to support path identifiers.
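
[Ed: As a hypothetical illustration of the SaveMode translation discussed
above, here is a Scala sketch of resolving a v1 SaveMode into distinct v2
plans once a catalog can answer existence checks. It reuses the
TableCatalog and Identifier shapes sketched earlier; the plan names and
logic are assumptions for illustration, not Spark's actual analyzer code.]

  import org.apache.spark.sql.SaveMode

  sealed trait V2WritePlan
  case class AppendData(ident: Identifier) extends V2WritePlan
  case class CreateTableAsSelect(ident: Identifier) extends V2WritePlan
  case class ReplaceTableAsSelect(ident: Identifier) extends V2WritePlan

  // With a catalog that reports table existence, v1's ambiguous
  // SaveMode.Append can be split into AppendData vs. CTAS deterministically.
  def tableExists(catalog: TableCatalog, ident: Identifier): Boolean =
    try { catalog.loadTable(ident); true } catch { case _: Exception => false }

  def resolveSaveMode(
      catalog: TableCatalog,
      ident: Identifier,
      mode: SaveMode): V2WritePlan = mode match {
    case SaveMode.Append if tableExists(catalog, ident) => AppendData(ident)
    case SaveMode.Append    => CreateTableAsSelect(ident)
    case SaveMode.Overwrite => ReplaceTableAsSelect(ident)
    case other =>
      // ErrorIfExists and Ignore are omitted from this sketch for brevity.
      throw new UnsupportedOperationException(s"Not sketched: $other")
  }
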
   - Migration to v2 sources
  - Hyukjin: Once the ORC upgrade is done, how will we move from v1 to
  v2?
  - Ryan: We will need to develop v1 and v2 in parallel. There are many
  code paths in v1 and we don’t know exactly what they do. We first need
  to understand their behavior, and only then can we make a migration
  plan.
  - Hyukjin: What if there are many behavior differences? Will this
  require an API to opt in for each one?
  - Ryan: Without knowing how v1 behaves, we can only speculate. But I
  don’t think that we will want to support many of these special cases.
  That is a lot of work and maintenance.
  - Gengliang: When can we change the default to v2? Until we change
  the default, v2 is not tested. The v2 work is blocked by this.
  - Ryan: v2 work should not be