Ah yes, on rereading the original email I see that the sync discussion was
different. Thanks for the clarification! I’ll file a JIRA about PERMISSIVE.

On Mon, Sep 9, 2019 at 6:05 AM Wenchen Fan <cloud0...@gmail.com> wrote:

> Hi Nicholas,
>
> You are talking about a different thing. The PERMISSIVE mode is the
> failure mode for reading text-based data sources (JSON, CSV, etc.). It's
> not the general failure mode for Spark table insertion.
>
> I agree with you that the PERMISSIVE mode is hard to use. Feel free to
> open a JIRA ticket if you have some better ideas.
>
> Thanks,
> Wenchen
>
> On Mon, Sep 9, 2019 at 12:46 AM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> A quick question about failure modes, as a casual observer of the DSv2
>> effort:
>>
>> I was considering filing a JIRA ticket about enhancing the
>> DataFrameReader to include the failure *reason* in addition to the
>> corrupt record when the mode is PERMISSIVE. So if you are loading a CSV,
>> for example, and a value cannot be automatically cast to the type you
>> specify in the schema, you'd still get the corrupt record in the column
>> configured by columnNameOfCorruptRecord, but you'd also get some detail
>> about what exactly made the record corrupt, perhaps in a new column
>> specified by something like columnNameOfCorruptReason.
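>>
>> To make the idea concrete, here's a rough spark-shell sketch. Only
>> columnNameOfCorruptRecord exists today; columnNameOfCorruptReason is the
>> hypothetical new option I'm proposing, and the schema, column names, and
>> path are made up for illustration:
>>
>>   import org.apache.spark.sql.types._
>>
>>   // The corrupt-record column must be declared in the schema for
>>   // PERMISSIVE mode to populate it.
>>   val schema = new StructType()
>>     .add("id", IntegerType)
>>     .add("price", DoubleType)
>>     .add("_corrupt_record", StringType)
>>
>>   val df = spark.read
>>     .schema(schema)
>>     .option("mode", "PERMISSIVE")
>>     .option("columnNameOfCorruptRecord", "_corrupt_record")
>>     // Proposed (hypothetical) option: a column explaining *why* the record
>>     // is corrupt, e.g. "price: 'abc' cannot be cast to DoubleType".
>>     // .option("columnNameOfCorruptReason", "_corrupt_reason")
>>     .csv("/path/to/data.csv")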
>>
>> Is this an enhancement that would be possible in DSv2?
>>
>> On Fri, Sep 6, 2019 at 6:28 PM Ryan Blue <rb...@netflix.com.invalid>
>> wrote:
>>
>>> Here are my notes from the latest sync. Feel free to reply with
>>> clarifications if I’ve missed anything.
>>>
>>> *Attendees*:
>>>
>>> Ryan Blue
>>> John Zhuge
>>> Russell Spitzer
>>> Matt Cheah
>>> Gengliang Wang
>>> Priyanka Gomatam
>>> Holden Karau
>>>
>>> *Topics*:
>>>
>>>    - DataFrameWriterV2 insert vs append (recap)
>>>    - ANSI and strict modes for inserting casts
>>>    - Separating identifier resolution from table lookup
>>>    - Open PRs
>>>       - SHOW NAMESPACES - https://github.com/apache/spark/pull/25601
>>>       - DataFrameWriterV2 - https://github.com/apache/spark/pull/25681
>>>       - TableProvider API update -
>>>       https://github.com/apache/spark/pull/25651
>>>       - UPDATE - https://github.com/apache/spark/pull/25626
>>>
>>> *Discussion*:
>>>
>>>    - DataFrameWriterV2 insert vs append discussion recapped the
>>>    agreement from last sync
>>>    - ANSI and strict modes for inserting casts:
>>>       - Russell: Failure modes are important. ANSI behavior is to fail
>>>       at runtime, not analysis time. If a cast is allowed but doesn't
>>>       throw an exception at runtime, then this can't be considered ANSI
>>>       behavior.
>>>       - Gengliang: ANSI adds the cast
>>>       - Matt: Sounds like there are two conflicting views of the world.
>>>       Is the default ANSI behavior to insert a cast that may produce
>>>       NULL, or to fail at runtime?
>>>       - Ryan: So analysis and runtime behaviors can’t be separate?
>>>       - Matt: Analysis behavior is influenced by behavior at runtime.
>>>       Maybe the vote should cover both?
>>>       - Russell: (linked to the standard) There are 3 steps: if the value
>>>       is numeric and of the same type, use the data value; if the value
>>>       can be rounded or truncated, round or truncate it; otherwise, throw
>>>       an exception that the value can't be cast. These are runtime
>>>       requirements (see the sketch after these notes).
>>>       - Ryan: Another consideration is that we can make Spark more
>>>       permissive, but can’t make Spark more strict in future releases.
>>>       - Matt: v1 silently corrupts data
>>>       - Russell: ANSI is fine, as long as the runtime matches (is
>>>       ANSI). Don’t tell people it’s ANSI and not do ANSI completely.
>>>       - Gengliang: people are concerned about long-running jobs failing
>>>       at the end
>>>       - Ryan: That's okay because they can change the defaults: use
>>>       strict analysis-time validation, or allow casts to produce NULL
>>>       values.
>>>       - Matt: As long as this is well documented, it should be fine
>>>       - Ryan: Can we run tests to find out what exactly the behavior is?
>>>       - Gengliang: sqlfiddle.com
>>>       - Russell ran tests in MySQL and Postgres. Both threw runtime
>>>       failures.
>>>       - Matt: Let’s move on, but add the runtime behavior to the VOTE
>>>    - Identifier resolution and table lookup
>>>       - Ryan: Recent changes merged identifier resolution and table
>>>       lookup together, because identifiers owned by the session catalog
>>>       need to be loaded to find out whether to use v1 or v2 plans. I
>>>       think these should be separated so that identifier resolution
>>>       happens independently; otherwise the two tasks end up being done
>>>       at the same time, which over-complicates the analyzer.
>>>    - SHOW NAMESPACES - Ready for final review
>>>    - DataFrameWriterV2:
>>>       - Ryan: Tests failed after passing on the PR. Anyone know why
>>>       that would happen?
>>>       - Gengliang: tests failed in Maven
>>>       - Holden: PR validation runs SBT tests
>>>    - TableProvider API update: skipped because Wenchen didn’t make it
>>>    - UPDATE support PR
>>>       - Ryan: There is a PR to add a SQL UPDATE command, but it
>>>       delegates entirely to the data source, which seems strange (see the
>>>       hypothetical sketch after these notes).
>>>       - Matt: What is Spark’s purpose here? Why would Spark parse a SQL
>>>       statement only to pass it entirely to another engine?
>>>       - Ryan: It does make sense to do this. If Spark eventually
>>>       supports MERGE INTO and other row-level operations, then it makes
>>>       sense to push down the operation to some sources, like JDBC. I just
>>>       find it backward to add the pushdown API before adding an
>>>       implementation that handles this inside Spark; pushdown is usually
>>>       an optimization.
>>>       - Russell: Would this be safe? Spark retries lots of operations.
>>>       - Ryan: I think it would be safe because Spark won't retry
>>>       top-level operations and this is a single method call. Nothing
>>>       would get retried.
>>>       - Ryan: I’ll ask what the PR author’s use case is. Maybe that
>>>       would help clarify why this is a good idea.
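>>>
>>> For reference, a rough spark-shell sketch of the three insert-cast
>>> behaviors discussed above (the table name and value are made up for
>>> illustration; no particular config name is assumed):
>>>
>>>   // assumes a spark-shell session with `spark` in scope
>>>   spark.sql("CREATE TABLE target (id INT) USING parquet")
>>>
>>>   // Inserting a value that cannot be converted to INT:
>>>   spark.sql("INSERT INTO target VALUES ('not-a-number')")
>>>
>>>   // 1. Legacy v1 behavior: an implicit cast is added during analysis and
>>>   //    the bad value silently becomes NULL ("v1 silently corrupts data").
>>>   // 2. ANSI behavior per the standard's three steps: the value can't be
>>>   //    used as-is or rounded/truncated, so the write fails at runtime
>>>   //    (matching what Russell observed in MySQL and Postgres).
>>>   // 3. Strict analysis-time validation: the query is rejected up front
>>>   //    because a string can't be safely stored into an INT column.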
>>>
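>>> And a purely hypothetical sketch of what "delegate the whole UPDATE to
>>> the source" could look like. The trait and method names below are
>>> invented for illustration; they are not the API proposed in the PR:
>>>
>>>   import org.apache.spark.sql.sources.Filter
>>>
>>>   // Hypothetical mixin for a v2 table: Spark parses and resolves
>>>   //   UPDATE t SET col = value WHERE <condition>
>>>   // and then makes a single call into the source instead of planning the
>>>   // rewrite itself (which is why retries shouldn't be a concern).
>>>   trait SupportsDelegatedUpdate {
>>>     // assignments: column name -> new value (rendered as literal strings
>>>     //              here for simplicity)
>>>     // condition:   filters translated from the WHERE clause
>>>     def updateWhere(assignments: Map[String, String],
>>>                     condition: Array[Filter]): Unit
>>>   }
>>>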
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
