+1 for the practical proposal. To me, the major concern is that the codebase becomes complicated, while the PostgreSQL dialect has very limited features. I tried introducing one big flag `spark.sql.dialect` and isolating the related code in #25697 <https://github.com/apache/spark/pull/25697>, but it seems hard to keep it clean. Furthermore, the PostgreSQL dialect configuration overlaps with the ANSI mode, which can be confusing sometimes.
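To illustrate the overlap, here is a minimal sketch of what a user would end up setting, assuming the `spark.sql.dialect` flag from that PR together with the existing ANSI flag `spark.sql.ansi.enabled`; the exact config values and the query are illustrative only:

```scala
import org.apache.spark.sql.SparkSession

object DialectOverlapSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dialect-overlap-sketch")
      .master("local[*]")
      .getOrCreate()

    // The big dialect flag proposed in #25697; the "PostgreSQL" value is an
    // assumption based on that PR, and the default would be the Spark dialect.
    spark.conf.set("spark.sql.dialect", "PostgreSQL")

    // ANSI mode is controlled by a separate flag. Both flags change
    // cast/arithmetic semantics, which is where the confusing overlap comes from.
    spark.conf.set("spark.sql.ansi.enabled", "true")

    // With both flags set, it is not obvious which semantics apply to a query
    // whose behavior differs between ANSI SQL and PostgreSQL.
    spark.sql("SELECT CAST('tru' AS BOOLEAN)").show()

    spark.stop()
  }
}
```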
Gengliang

On Tue, Nov 26, 2019 at 8:57 AM Xiao Li <lix...@databricks.com> wrote:

> +1
>
>> One particular negative effect has been that new postgresql tests add
>> well over an hour to tests,
>
> Adding postgresql tests is for improving the test coverage of Spark SQL.
> We should continue to do this by importing more test cases. The quality of
> Spark highly depends on the test coverage. We can further parallelize the
> test execution to reduce the test time.
>
> Migrating PostgreSQL workloads to Spark SQL
>
> This should not be our current focus. In the near future, it is impossible
> to be fully compatible with PostgreSQL. We should focus on adding features
> that are useful to the Spark community. PostgreSQL is a good reference, but
> we do not need to blindly follow it. We have already closed multiple related
> JIRAs that tried to add PostgreSQL features that are not commonly used.
>
> Cheers,
>
> Xiao
>
> On Tue, Nov 26, 2019 at 8:30 AM Maciej Szymkiewicz <mszymkiew...@gmail.com>
> wrote:
>
>> I think it is important to distinguish between two different concepts:
>>
>> - Adherence to standards and their well-established implementations.
>> - Enabling migrations from some product X to Spark.
>>
>> While these two problems are related, they are independent and one can
>> be achieved without the other.
>>
>> - The former approach doesn't imply that all features of the SQL standard
>> (or its specific implementation) are provided. It is sufficient that the
>> commonly used features that are implemented are standard compliant.
>> Therefore, if an end user applies some well-known pattern, things will
>> work as expected.
>>
>> In my personal opinion that's something that is worth the required
>> development resources, and in general should happen within the project.
>>
>> - The latter one is more complicated. First of all, the premise that one
>> can "migrate PostgreSQL workloads to Spark" seems to be flawed. While
>> both Spark and PostgreSQL evolve, and probably have more in common today
>> than a few years ago, they're not even close enough to pretend that one
>> can be a replacement for the other. In contrast, existing compatibility
>> layers between major vendors make sense, because feature disparity (at
>> least when it comes to core functionality) is usually minimal. And that
>> doesn't even touch the problem that PostgreSQL provides extensively used
>> extension points that enable a broad and evolving ecosystem (what should
>> we do about continuous queries? Should Structured Streaming provide some
>> compatibility layer as well?).
>>
>> More realistically, Spark could provide a compatibility layer with some
>> analytical tools that themselves provide some PostgreSQL compatibility,
>> but these are not always fully compatible with upstream PostgreSQL, nor
>> do they necessarily follow the latest PostgreSQL development.
>>
>> Furthermore, a compatibility layer can, within certain limits (i.e.
>> availability of required primitives), be maintained as a separate project,
>> without putting more strain on existing resources. Effectively, what we
>> care about here is whether we can translate a certain SQL string into a
>> logical or physical plan.
>>
>> On 11/26/19 3:26 PM, Wenchen Fan wrote:
>>
>> Hi all,
>>
>> Recently we started an effort to achieve feature parity between Spark and
>> PostgreSQL: https://issues.apache.org/jira/browse/SPARK-27764
>>
>> This has gone very well. We've added many missing features (parser rules,
>> built-in functions, etc.)
>> to Spark, and also corrected several inappropriate behaviors of Spark to
>> follow the SQL standard and PostgreSQL. Many thanks to all the people who
>> contributed to it!
>>
>> There are several cases when adding a PostgreSQL feature:
>> 1. Spark doesn't have this feature: just add it.
>> 2. Spark has this feature, but the behavior is different:
>>   2.1 Spark's behavior doesn't make sense: change it to follow the SQL
>> standard and PostgreSQL, with a legacy config to restore the old behavior.
>>   2.2 Spark's behavior makes sense but violates the SQL standard: change
>> the behavior to follow the SQL standard and PostgreSQL when ANSI mode is
>> enabled (default false).
>>   2.3 Spark's behavior makes sense and doesn't violate the SQL standard:
>> add the PostgreSQL behavior under the PostgreSQL dialect (the default is
>> the Spark native dialect).
>>
>> The PostgreSQL dialect itself is a good idea. It can help users migrate
>> PostgreSQL workloads to Spark. Other databases have this strategy too.
>> For example, DB2 provides an Oracle dialect
>> <https://www.ibm.com/developerworks/data/library/techarticle/dm-0907oracleappsondb2/index.html>.
>>
>> However, there are so many differences between Spark and PostgreSQL,
>> including SQL parsing, type coercion, function/operator behavior, data
>> types, etc. I'm afraid that we may spend a lot of effort on it, and make
>> the Spark codebase pretty complicated, but still not be able to provide a
>> usable PostgreSQL dialect.
>>
>> Furthermore, it's not clear to me how many users have the requirement of
>> migrating PostgreSQL workloads. I think it's much more important to make
>> Spark ANSI-compliant first, which doesn't need that much work.
>>
>> Recently I've seen multiple PRs adding PostgreSQL cast functions, while
>> our own cast function is not ANSI-compliant yet. This makes me think that
>> we should do something to properly prioritize ANSI mode over other
>> dialects.
>>
>> Here I'm proposing to hold off the PostgreSQL dialect. Let's remove it
>> from the codebase before it's too late. Currently we only have 3 features
>> under the PostgreSQL dialect:
>> 1. When casting string to boolean, `t`, `tr`, `tru`, `yes`, .. are also
>> allowed as true strings.
>> 2. `date - date` returns an interval in Spark (SQL standard behavior),
>> but returns an int in PostgreSQL.
>> 3. `int / int` returns double in Spark, but returns int in PostgreSQL
>> (there is no standard).
>>
>> We should still add PostgreSQL features that Spark doesn't have, or where
>> Spark's behavior violates the SQL standard. But for the others, let's
>> just update the answer files of the PostgreSQL tests.
>>
>> Any comments are welcome!
>>
>> Thanks,
>> Wenchen
>>
>> --
>> Best regards,
>> Maciej
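For reference, here is a minimal sketch of the 3 dialect-specific behaviors Wenchen lists above, written as plain Spark SQL queries run from Scala; the per-dialect results in the comments are taken from the descriptions in this thread, not from verified runs:

```scala
import org.apache.spark.sql.SparkSession

object PostgreSQLDialectBehaviors {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("pgsql-dialect-behaviors")
      .master("local[*]")
      .getOrCreate()

    // 1. String-to-boolean cast: the PostgreSQL dialect also accepts
    //    prefixes such as 't', 'tr', 'tru', 'yes' as true strings.
    spark.sql("SELECT CAST('tru' AS BOOLEAN)").show()

    // 2. date - date: returns an interval in Spark (SQL standard behavior),
    //    while PostgreSQL returns an int (the number of days).
    spark.sql("SELECT DATE'2019-11-27' - DATE'2019-11-26'").show()

    // 3. int / int: returns double in Spark, int in PostgreSQL
    //    (there is no standard behavior here).
    spark.sql("SELECT 3 / 2").show()

    spark.stop()
  }
}
```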