A quick question about failure modes, as a casual observer of the DSv2
effort:

I was considering filing a JIRA ticket about enhancing the DataFrameReader
to include the failure *reason* in addition to the corrupt record when the
mode is PERMISSIVE. So if you are loading a CSV, for example, and a value
cannot be automatically cast to the type you specify in the schema, you'll
get the corrupt record in the column configured by
columnNameOfCorruptRecord, but you'll also get some detail about what
exactly made the record corrupt, perhaps in a new column specified by
something like columnNameOfCorruptReason.
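
For illustration, here is a rough sketch of what I have in mind. The
columnNameOfCorruptReason option and the _corrupt_reason column are the
hypothetical parts of the proposal; everything else is the existing
PERMISSIVE behavior:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.types._

  val spark = SparkSession.builder().master("local[*]").getOrCreate()

  // Existing behavior: rows with values that cannot be cast to the schema
  // are kept, and the raw record lands in the configured corrupt column.
  val schema = StructType(Seq(
    StructField("id", IntegerType),
    StructField("price", DoubleType),
    StructField("_corrupt_record", StringType),
    StructField("_corrupt_reason", StringType)   // hypothetical new column
  ))

  val df = spark.read
    .schema(schema)
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "_corrupt_record")
    // proposed option, does not exist today:
    .option("columnNameOfCorruptReason", "_corrupt_reason")
    .csv("/path/to/data.csv")   // placeholder path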

Is this an enhancement that would be possible in DSv2?

On Fri, Sep 6, 2019 at 6:28 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> Here are my notes from the latest sync. Feel free to reply with
> clarifications if I’ve missed anything.
>
> *Attendees*:
>
> Ryan Blue
> John Zhuge
> Russell Spitzer
> Matt Cheah
> Gengliang Wang
> Priyanka Gomatam
> Holden Karau
>
> *Topics*:
>
>    - DataFrameWriterV2 insert vs append (recap)
>    - ANSI and strict modes for inserting casts
>    - Separating identifier resolution from table lookup
>    - Open PRs
>       - SHOW NAMESPACES - https://github.com/apache/spark/pull/25601
>       - DataFrameWriterV2 - https://github.com/apache/spark/pull/25681
>       - TableProvider API update -
>       https://github.com/apache/spark/pull/25651
>       - UPDATE - https://github.com/apache/spark/pull/25626
>
> *Discussion*:
>
>    - DataFrameWriterV2 insert vs append discussion recapped the agreement
>    from last sync
>    - ANSI and strict modes for inserting casts (a sketch of the behaviors
>    discussed here follows at the end of these notes):
>       - Russell: Failure modes are important. ANSI behavior is to fail at
>       runtime, not analysis time. If a cast is allowed but doesn't throw an
>       exception at runtime, then this can't be considered ANSI behavior.
>       - Gengliang: ANSI adds the cast
>       - Matt: Sounds like there are two conflicting views of the world.
>       Is the default ANSI behavior to insert a cast that may produce NULL,
>       or to fail at runtime?
>       - Ryan: So analysis and runtime behaviors can’t be separate?
>       - Matt: Analysis behavior is influenced by behavior at runtime.
>       Maybe the vote should cover both?
>       - Russell: (linked to the standard) There are 3 steps: if numeric
>       and same type, use the data value. If the value can be rounded or
>       truncated, round or truncate. Otherwise, throw an exception indicating
>       that the value can't be cast. These are runtime requirements.
>       - Ryan: Another consideration is that we can make Spark more
>       permissive, but can’t make Spark more strict in future releases.
>       - Matt: v1 silently corrupts data
>       - Russell: ANSI is fine, as long as the runtime matches (is ANSI).
>       Don’t tell people it’s ANSI and not do ANSI completely.
>       - Gengliang: people are concerned about long-running jobs failing
>       at the end
>       - Ryan: That’s okay because they can change the defaults: use
>       strict analysis-time validation, or allow casts to produce NULL values.
>       - Matt: As long as this is well documented, it should be fine
>       - Ryan: Can we run tests to find out what exactly the behavior is?
>       - Gengliang: sqlfiddle.com
>       - Russell ran tests in MySQL and Postgres. Both threw runtime
>       failures.
>       - Matt: Let’s move on, but add the runtime behavior to the VOTE
>    - Identifier resolution and table lookup
>       - Ryan: recent changes merged identifier resolution and table
>       lookup together, because identifiers owned by the session catalog need
>       to be loaded to find out whether to use v1 or v2 plans. I think these
>       should be separated so that identifier resolution happens
>       independently, to ensure the two tasks don't end up being done at the
>       same time and over-complicating the analyzer.
>    - SHOW NAMESPACES - Ready for final review
>    - DataFrameWriterV2:
>       - Ryan: Tests failed after passing on the PR. Anyone know why that
>       would happen?
>       - Gengliang: the tests failed in the Maven build
>       - Holden: PR validation runs the SBT tests
>    - TableProvider API update: skipped because Wenchen didn’t make it
>    - UPDATE support PR
>       - Ryan: There is a PR to add a SQL UPDATE command, but it delegates
>       entirely to the data source, which seems strange.
>       - Matt: What is Spark’s purpose here? Why would Spark parse a SQL
>       statement only to pass it entirely to another engine?
>       - Ryan: It does make sense to do this. If Spark eventually supports
>       MERGE INTO and other row-level operations, then it makes sense to push
>       down the operation to some sources, like JDBC. I just find it backward
>       to add the pushdown API before adding an implementation that handles
>       this inside Spark; pushdown is usually an optimization.
>       - Russell: Would this be safe? Spark retries lots of operations.
>       - Ryan: I think it would be safe because Spark won’t retry
>       top-level operations and this is a single method call. Nothing would get
>       retried.
>       - Ryan: I’ll ask what the PR author’s use case is. Maybe that would
>       help clarify why this is a good idea.
>
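> For reference, a minimal sketch of the three behaviors discussed above,
> written against the spark.sql.storeAssignmentPolicy setting (LEGACY / ANSI /
> STRICT) available in Spark 3.x; the table name and values are illustrative:
>
>   import org.apache.spark.sql.SparkSession
>
>   val spark = SparkSession.builder().master("local[*]").getOrCreate()
>   spark.sql("CREATE TABLE target (i INT) USING parquet")
>
>   // LEGACY: a plain cast is inserted and the overflowing value is silently
>   // wrapped (the v1 "silent corruption" mentioned above).
>   // ANSI: a cast is still inserted at analysis time, but the overflow
>   // raises an error at runtime, matching the MySQL/Postgres tests.
>   // STRICT: analysis fails up front because BIGINT cannot be stored in INT.
>   spark.conf.set("spark.sql.storeAssignmentPolicy", "ANSI")
>   spark.sql("INSERT INTO target VALUES (12345678987654321)") // fails at runtime
>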
> --
> Ryan Blue
> Software Engineer
> Netflix
>
