Without knowing much about it, I have had the same question: how important
is this, really, and does it justify the effort? One particular negative
effect has been that the new PostgreSQL tests add well over an hour to the
test runs, IIRC. So I tend to agree about drawing a reasonable line on
compatibility and maybe focusing elsewhere

On Tue, Nov 26, 2019, 8:26 AM Wenchen Fan <cloud0...@gmail.com> wrote:

> Hi all,
>
> Recently we started an effort to achieve feature parity between Spark and
> PostgreSQL: https://issues.apache.org/jira/browse/SPARK-27764
>
> This has gone very well. We've added many missing features (parser rules,
> built-in functions, etc.) to Spark, and also corrected several
> inappropriate Spark behaviors to follow the SQL standard and PostgreSQL.
> Many thanks to all the people who have contributed to it!
>
> There are several cases when adding a PostgreSQL feature (see the config
> sketch after this list):
> 1. Spark doesn't have this feature: just add it.
> 2. Spark has this feature, but the behavior is different:
>     2.1 Spark's behavior doesn't make sense: change it to follow the SQL
> standard and PostgreSQL, with a legacy config to restore the old behavior.
>     2.2 Spark's behavior makes sense but violates the SQL standard: change
> the behavior to follow the SQL standard and PostgreSQL when ANSI mode is
> enabled (default false).
>     2.3 Spark's behavior makes sense and doesn't violate the SQL standard:
> add the PostgreSQL behavior under the PostgreSQL dialect (the default is
> the Spark native dialect).
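>
> As a rough illustration of how 2.1-2.3 compose, here is a minimal Scala
> sketch. The exact config names (spark.sql.ansi.enabled, spark.sql.dialect)
> are my assumptions based on what is currently on master, not a final API:
>
>     import org.apache.spark.sql.SparkSession
>
>     val spark = SparkSession.builder()
>       .master("local[*]")
>       .appName("dialect-demo")
>       .getOrCreate()
>
>     // Case 2.2: opt in to SQL-standard behavior (off by default).
>     spark.conf.set("spark.sql.ansi.enabled", "true")
>
>     // Case 2.3: opt in to PostgreSQL-specific behavior; the default
>     // dialect is Spark's native one.
>     spark.conf.set("spark.sql.dialect", "PostgreSQL")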
>
> The PostgreSQL dialect itself is a good idea. It can help users migrate
> PostgreSQL workloads to Spark. Other databases have this strategy too. For
> example, DB2 provides an Oracle dialect
> <https://www.ibm.com/developerworks/data/library/techarticle/dm-0907oracleappsondb2/index.html>.
>
> However, there are so many differences between Spark and PostgreSQL,
> including SQL parsing, type coercion, function/operator behavior, data
> types, etc. I'm afraid that we may spend a lot of effort on it, make
> the Spark codebase pretty complicated, and still not be able to provide a
> usable PostgreSQL dialect.
>
> Furthermore, it's not clear to me how many users actually need to migrate
> PostgreSQL workloads. I think it's much more important to make Spark
> ANSI-compliant first, which doesn't require as much work.
>
> Recently I've seen multiple PRs adding PostgreSQL cast functions, while
> our own cast function is not ANSI-compliant yet. This makes me think that
> we should properly prioritize ANSI mode over other dialects.
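>
> For example, continuing the session from the sketch above (the ANSI
> behavior shown is what the SQL standard requires, not what Spark does
> today, and the config name is again my assumption):
>
>     // Spark today: an invalid string-to-int cast quietly yields NULL.
>     spark.sql("SELECT CAST('abc' AS INT)").show()  // -> NULL
>
>     // Under ANSI mode, the standard requires a runtime error instead.
>     spark.conf.set("spark.sql.ansi.enabled", "true")
>     spark.sql("SELECT CAST('abc' AS INT)").show()  // -> should fail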
>
> Here I'm proposing to hold off the PostgreSQL dialect. Let's remove it
> from the codebase before it's too late. Currently we only have 3 features
> under the PostgreSQL dialect (illustrated in the sketch after this list):
> 1. when casting string to boolean, `t`, `tr`, `tru`, `yes`, .. are also
> accepted as true strings.
> 2. `date - date` returns an interval in Spark (the SQL standard behavior),
> but returns an int in PostgreSQL.
> 3. `int / int` returns a double in Spark, but returns an int in PostgreSQL.
> (There is no standard behavior here.)
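>
> Concretely, with the same spark session under default settings (non-ANSI,
> Spark native dialect); the expected outputs are my reading of the two
> behaviors above, so treat this as a sketch:
>
>     // 1. String-to-boolean cast: the PostgreSQL dialect also accepts
>     //    prefixes of 'true' such as 'tru'.
>     spark.sql("SELECT CAST('tru' AS BOOLEAN)").show()
>     //   Spark native: NULL; PostgreSQL dialect: true
>
>     // 2. date - date
>     spark.sql("SELECT DATE '2019-11-27' - DATE '2019-11-26'").show()
>     //   Spark: interval of 1 day (SQL standard); PostgreSQL: the int 1
>
>     // 3. int / int
>     spark.sql("SELECT 7 / 2").show()
>     //   Spark: 3.5 (double); PostgreSQL: 3 (int, truncating division)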
>
> We should still add PostgreSQL features that Spark doesn't have, or where
> Spark's behavior violates the SQL standard. But for the others, let's just
> update the answer files of the PostgreSQL tests.
>
> Any comments are welcome!
>
> Thanks,
> Wenchen
>
