+1 (non-binding)

Cheers,
Fokko
On Thu, Nov 28, 2019 at 03:47, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:

> +1
>
> Bests,
> Dongjoon.
>
> On Tue, Nov 26, 2019 at 3:52 PM Takeshi Yamamuro <linguin....@gmail.com> wrote:
>
>> Yea, +1, that looks pretty reasonable to me.
>>
>> > Here I'm proposing to hold off the PostgreSQL dialect. Let's remove it from the codebase before it's too late. Currently we only have 3 features under the PostgreSQL dialect:
>>
>> I personally think we could at least stop work on the dialect until 3.0 is released.
>>
>> On Wed, Nov 27, 2019 at 2:41 AM Gengliang Wang <gengliang.w...@databricks.com> wrote:
>>
>>> +1 with the practical proposal.
>>>
>>> To me, the major concern is that the code base becomes complicated while the PostgreSQL dialect has very limited features. I tried introducing one big flag `spark.sql.dialect` and isolating the related code in #25697 <https://github.com/apache/spark/pull/25697>, but it seems hard to keep clean. Furthermore, the PostgreSQL dialect configuration overlaps with the ANSI mode, which can be confusing at times.
>>>
>>> Gengliang
>>>
>>> On Tue, Nov 26, 2019 at 8:57 AM Xiao Li <lix...@databricks.com> wrote:
>>>
>>>> +1
>>>>
>>>>> One particular negative effect has been that new postgresql tests add well over an hour to tests,
>>>>
>>>> Adding PostgreSQL tests improves the test coverage of Spark SQL. We should continue to do this by importing more test cases; the quality of Spark depends heavily on its test coverage. We can further parallelize the test execution to reduce the test time.
>>>>
>>>>> Migrating PostgreSQL workloads to Spark SQL
>>>>
>>>> This should not be our current focus. Full compatibility with PostgreSQL is not achievable in the near future. We should focus on adding features that are useful to the Spark community. PostgreSQL is a good reference, but we do not need to follow it blindly. We have already closed multiple related JIRAs that tried to add PostgreSQL features that are not commonly used.
>>>>
>>>> Cheers,
>>>>
>>>> Xiao
>>>>
>>>> On Tue, Nov 26, 2019 at 8:30 AM Maciej Szymkiewicz <mszymkiew...@gmail.com> wrote:
>>>>
>>>>> I think it is important to distinguish between two different concepts:
>>>>>
>>>>> - Adherence to standards and their well-established implementations.
>>>>> - Enabling migrations from some product X to Spark.
>>>>>
>>>>> While these two problems are related, they are independent, and one can be achieved without the other.
>>>>>
>>>>> - The former approach doesn't imply that all features of the SQL standard (or of a specific implementation) are provided. It is sufficient that the commonly used features that are implemented are standard-compliant. Therefore, if an end user applies some well-known pattern, things will work as expected.
>>>>>
>>>>> In my personal opinion that's something that is worth the required development resources, and in general should happen within the project.
>>>>>
>>>>> - The latter one is more complicated. First of all, the premise that one can "migrate PostgreSQL workloads to Spark" seems flawed. While both Spark and PostgreSQL evolve, and probably have more in common today than a few years ago, they're not even close enough to pretend that one can be a replacement for the other.
>>>>> In contrast, existing compatibility layers between major vendors make sense because the feature disparity (at least when it comes to core functionality) is usually minimal. And that doesn't even touch the problem that PostgreSQL provides extensively used extension points that enable a broad and evolving ecosystem (what should we do about continuous queries? Should Structured Streaming provide some compatibility layer as well?).
>>>>>
>>>>> More realistically, Spark could provide a compatibility layer with some analytical tools that themselves provide some PostgreSQL compatibility, but these are not always fully compatible with upstream PostgreSQL, nor do they necessarily follow the latest PostgreSQL development.
>>>>>
>>>>> Furthermore, a compatibility layer can, within certain limits (i.e. the availability of the required primitives), be maintained as a separate project, without putting more strain on existing resources. Effectively, what we care about here is whether we can translate a certain SQL string into a logical or physical plan.
>>>>>
>>>>> On 11/26/19 3:26 PM, Wenchen Fan wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> Recently we started an effort to achieve feature parity between Spark and PostgreSQL: https://issues.apache.org/jira/browse/SPARK-27764
>>>>>
>>>>> This has gone very well. We've added many missing features (parser rules, built-in functions, etc.) to Spark, and also corrected several inappropriate behaviors of Spark to follow the SQL standard and PostgreSQL. Many thanks to all the people who have contributed to it!
>>>>>
>>>>> There are several cases when adding a PostgreSQL feature:
>>>>> 1. Spark doesn't have this feature: just add it.
>>>>> 2. Spark has this feature, but the behavior is different:
>>>>> 2.1 Spark's behavior doesn't make sense: change it to follow the SQL standard and PostgreSQL, with a legacy config to restore the old behavior.
>>>>> 2.2 Spark's behavior makes sense but violates the SQL standard: change the behavior to follow the SQL standard and PostgreSQL when ANSI mode is enabled (default false).
>>>>> 2.3 Spark's behavior makes sense and doesn't violate the SQL standard: add the PostgreSQL behavior under the PostgreSQL dialect (the default is the Spark native dialect).
>>>>>
>>>>> The PostgreSQL dialect itself is a good idea. It can help users migrate PostgreSQL workloads to Spark. Other databases have this strategy too. For example, DB2 provides an Oracle dialect <https://www.ibm.com/developerworks/data/library/techarticle/dm-0907oracleappsondb2/index.html>.
>>>>>
>>>>> However, there are so many differences between Spark and PostgreSQL, including SQL parsing, type coercion, function/operator behavior, data types, etc., that I'm afraid we may spend a lot of effort on it, and make the Spark codebase pretty complicated, yet still not be able to provide a usable PostgreSQL dialect.
>>>>>
>>>>> Furthermore, it's not clear to me how many users have the requirement of migrating PostgreSQL workloads. I think it's much more important to make Spark ANSI-compliant first, which doesn't need that much work.
>>>>>
>>>>> Recently I've seen multiple PRs adding PostgreSQL cast functions while our own cast function is not ANSI-compliant yet. This makes me think that we should do something to properly prioritize ANSI mode over other dialects.
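As a point of reference for the ANSI mode discussed above, a minimal spark-shell sketch, assuming the `spark.sql.ansi.enabled` flag that Spark 3.0 eventually shipped (the exact config name was still in flux at the time of this thread):

    // Minimal sketch, assuming the spark.sql.ansi.enabled config of Spark 3.0+.
    spark.conf.set("spark.sql.ansi.enabled", "true")
    spark.sql("SELECT CAST('abc' AS INT)").show()
    // ANSI mode: the malformed cast raises a runtime error, per the SQL standard.

    spark.conf.set("spark.sql.ansi.enabled", "false")
    spark.sql("SELECT CAST('abc' AS INT)").show()
    // Default (legacy) behavior: the cast silently returns NULL.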
>>>>> Here I'm proposing to hold off the PostgreSQL dialect. Let's remove it from the codebase before it's too late. Currently we only have 3 features under the PostgreSQL dialect:
>>>>> 1. When casting string to boolean, `t`, `tr`, `tru`, `yes`, etc. are also allowed as true strings.
>>>>> 2. `date - date` returns an interval in Spark (the SQL standard behavior), but returns an int in PostgreSQL.
>>>>> 3. `int / int` returns a double in Spark, but returns an int in PostgreSQL (there is no standard here).
>>>>>
>>>>> We should still add PostgreSQL features that Spark doesn't have, or where Spark's behavior violates the SQL standard. But for the others, let's just update the answer files of the PostgreSQL tests.
>>>>>
>>>>> Any comments are welcome!
>>>>>
>>>>> Thanks,
>>>>> Wenchen
>>>>>
>>>>> --
>>>>> Best regards,
>>>>> Maciej
>>
>> --
>> ---
>> Takeshi Yamamuro
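For concreteness, the three dialect-only behaviors Wenchen lists can be checked from a spark-shell session. A minimal sketch of the Spark-native defaults, with the PostgreSQL results noted in comments (exact output formatting varies by Spark version):

    // Spark-native behavior for the three features; PostgreSQL noted for contrast.
    spark.sql("SELECT DATE '2019-11-27' - DATE '2019-11-26'").show()
    // Spark (SQL-standard): an interval of 1 day. PostgreSQL: the integer 1.

    spark.sql("SELECT 7 / 2").show()
    // Spark: 3.5 (double). PostgreSQL: 3 (integer division; no standard here).

    spark.sql("SELECT CAST('tru' AS BOOLEAN)").show()
    // PostgreSQL accepts prefixes of 'true' ('t', 'tr', 'tru', ...) as true;
    // Spark-native cast does not recognize 'tru' and yields NULL (non-ANSI).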