+1 with the practical proposal.
To me, the major concern is that the codebase becomes complicated, while
the PostgreSQL dialect has very limited features. I tried introducing one
big flag `spark.sql.dialect` and isolating the related code in #25697
<https://github.com/apache/spark/pull/25697>, but it seems hard to keep it
clean. Furthermore, the PostgreSQL dialect configuration overlaps with the
ANSI mode, which can sometimes be confusing.

Gengliang

On Tue, Nov 26, 2019 at 8:57 AM Xiao Li <lix...@databricks.com> wrote:

> +1
>
>
>> One particular negative effect has been that new postgresql tests add
>> well over an hour to tests,
>
>
> Adding PostgreSQL tests improves the test coverage of Spark SQL. We should
> continue to do this by importing more test cases. The quality of Spark
> highly depends on the test coverage. We can further parallelize the test
> execution to reduce the test time.
>
> Migrating PostgreSQL workloads to Spark SQL
>
>
> This should not be our current focus. Full compatibility with PostgreSQL is
> not achievable in the near future. We should focus on adding features that
> are useful to the Spark community. PostgreSQL is a good reference, but we do
> not need to blindly follow it. We have already closed multiple related JIRAs
> that tried to add PostgreSQL features that are not commonly used.
>
> Cheers,
>
> Xiao
>
>
> On Tue, Nov 26, 2019 at 8:30 AM Maciej Szymkiewicz <mszymkiew...@gmail.com>
> wrote:
>
>> I think it is important to distinguish between two different concepts:
>>
>>    - Adherence to standards and their well established implementations.
>>    - Enabling migrations from some product X to Spark.
>>
>> While these two problems are related, they are independent and one can be
>> achieved without the other.
>>
>>    - The former doesn't imply that all features of the SQL standard (or of
>>    a specific implementation) are provided. It is sufficient that the
>>    commonly used features that are implemented are standard compliant.
>>    Therefore, if an end user applies some well-known pattern, things will
>>    work as expected.
>>
>>    In my personal opinion, that's something that is worth the required
>>    development resources and, in general, should happen within the project.
>>
>>
>>    - The latter one is more complicated. First of all, the premise that
>>    one can "migrate PostgreSQL workloads to Spark" seems to be flawed. While
>>    both Spark and PostgreSQL evolve, and probably have more in common today
>>    than a few years ago, they are not close enough for one to pretend to be
>>    a replacement for the other. In contrast, existing compatibility layers
>>    between major vendors make sense, because the feature disparity (at least
>>    when it comes to core functionality) is usually minimal. And that doesn't
>>    even touch the problem that PostgreSQL provides extensively used extension
>>    points that enable a broad and evolving ecosystem (what should we do about
>>    continuous queries? Should Structured Streaming provide some compatibility
>>    layer as well?).
>>
>>    More realistically, Spark could provide a compatibility layer with
>>    some analytical tools that themselves provide some PostgreSQL
>>    compatibility, but these are not always fully compatible with upstream
>>    PostgreSQL, nor do they necessarily follow the latest PostgreSQL
>>    development.
>>
>>    Furthermore, a compatibility layer can be, within certain limits (i.e.
>>    the availability of required primitives), maintained as a separate
>>    project, without putting more strain on existing resources. Effectively,
>>    what we care about here is whether we can translate a given SQL string
>>    into a logical or physical plan (a minimal sketch follows below).
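>>
>>    A minimal sketch of that idea, just for illustration: the
>>    translateFromPostgres rewrite step below is purely hypothetical and
>>    would be owned by such an external layer; only the parser access is an
>>    existing Spark API.
>>
>>        import org.apache.spark.sql.SparkSession
>>
>>        val spark = SparkSession.builder().master("local[*]").getOrCreate()
>>
>>        // Hypothetical rewrite step: map a PostgreSQL-flavored query to
>>        // Spark SQL text. Identity here, just to show where it would sit.
>>        def translateFromPostgres(pgSql: String): String = pgSql
>>
>>        // What the layer ultimately needs: turn the (rewritten) SQL text
>>        // into a Spark logical plan that the engine can analyze and run.
>>        val plan = spark.sessionState.sqlParser.parsePlan(
>>          translateFromPostgres("SELECT 1"))
>>        println(plan.treeString)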
>>
>>
>> On 11/26/19 3:26 PM, Wenchen Fan wrote:
>>
>> Hi all,
>>
>> Recently we started an effort to achieve feature parity between Spark and
>> PostgreSQL: https://issues.apache.org/jira/browse/SPARK-27764
>>
>> This has gone very well. We've added many missing features (parser rules,
>> built-in functions, etc.) to Spark, and also corrected several
>> inappropriate behaviors of Spark to follow the SQL standard and PostgreSQL.
>> Many thanks to all the people who contributed to it!
>>
>> There are several cases when adding a PostgreSQL feature (a rough config
>> sketch follows the list):
>> 1. Spark doesn't have this feature: just add it.
>> 2. Spark has this feature, but the behavior is different:
>>     2.1 Spark's behavior doesn't make sense: change it to follow the SQL
>> standard and PostgreSQL, with a legacy config to restore the old behavior.
>>     2.2 Spark's behavior makes sense but violates the SQL standard: change
>> the behavior to follow the SQL standard and PostgreSQL when ANSI mode is
>> enabled (default: false).
>>     2.3 Spark's behavior makes sense and doesn't violate the SQL standard:
>> add the PostgreSQL behavior under the PostgreSQL dialect (the default is
>> the Spark native dialect).
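>>
>> A rough sketch of how these cases map to configs in spark-shell. The
>> `spark.sql.dialect` flag comes from #25697; the ANSI flag name and the
>> dialect values below are assumptions for illustration only and may differ:
>>
>>     // case 2.2: follow the SQL standard when ANSI mode is on
>>     spark.conf.set("spark.sql.ansi.enabled", "true")    // assumed flag name
>>     // case 2.3: opt in to the PostgreSQL behaviors
>>     spark.conf.set("spark.sql.dialect", "PostgreSQL")   // assumed value
>>     // default: Spark native dialect
>>     spark.conf.set("spark.sql.dialect", "Spark")        // assumed value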
>>
>> The PostgreSQL dialect itself is a good idea. It can help users migrate
>> PostgreSQL workloads to Spark. Other databases have this strategy too. For
>> example, DB2 provides an Oracle dialect
>> <https://www.ibm.com/developerworks/data/library/techarticle/dm-0907oracleappsondb2/index.html>.
>>
>> However, there are so many differences between Spark and PostgreSQL,
>> including SQL parsing, type coercion, function/operator behavior, data
>> types, etc. I'm afraid that we may spend a lot of effort on it, make the
>> Spark codebase pretty complicated, and still not be able to provide a
>> usable PostgreSQL dialect.
>>
>> Furthermore, it's not clear to me how many users have the requirement of
>> migrating PostgreSQL workloads. I think it's much more important to make
>> Spark ANSI-compliant first, which doesn't need that much work.
>>
>> Recently I've seen multiple PRs adding PostgreSQL cast functions, while our
>> own cast function is not ANSI-compliant yet. This makes me think that we
>> should do something to properly prioritize ANSI mode over other dialects.
>>
>> Here I'm proposing to hold off the PostgreSQL dialect. Let's remove it
>> from the codebase before it's too late. Currently we only have 3 features
>> under the PostgreSQL dialect (illustrated below):
>> 1. When casting string to boolean, `t`, `tr`, `tru`, `yes`, ... are also
>> allowed as true strings.
>> 2. `date - date` returns an interval in Spark (SQL standard behavior), but
>> returns an int in PostgreSQL.
>> 3. `int / int` returns a double in Spark, but returns an int in PostgreSQL
>> (there is no standard behavior here).
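>>
>> For illustration, the three differences look roughly like this in a
>> spark-shell session (the results in the comments just restate the claims
>> above, not verified output):
>>
>>     spark.sql("SELECT CAST('tru' AS BOOLEAN)").show()
>>     // PostgreSQL dialect: true (prefixes like 't', 'tr', 'tru' accepted)
>>
>>     spark.sql("SELECT DATE '2019-11-30' - DATE '2019-11-26'").show()
>>     // Spark: an interval of 4 days (SQL standard); PostgreSQL: 4
>>
>>     spark.sql("SELECT 7 / 2").show()
>>     // Spark: 3.5 (double); PostgreSQL: 3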
>>
>> We should still add PostgreSQL features that Spark doesn't have, or where
>> Spark's behavior violates the SQL standard. But for the others, let's just
>> update the answer files of the PostgreSQL tests.
>>
>> Any comments are welcome!
>>
>> Thanks,
>> Wenchen
>>
>> --
>> Best regards,
>> Maciej
>>
>>
>
