Is this a candidate for the version 1.X/2.0 split?

2015-12-09 16:29 GMT-08:00 Michael Armbrust <mich...@databricks.com>:

> Yeah, I would like to address any actual gaps in functionality that are
> present.
>
> On Wed, Dec 9, 2015 at 4:24 PM, Cristian Opris <cristian.b.op...@gmail.com>
> wrote:
>
>> The reason I'm asking is because it's important in larger projects to be
>> able to stick to a particular programming style. Some people are more
>> comfortable with SQL, others might find the DF API more suitable, but it's
>> important to have full expressivity in both to make it easier to adopt one
>> approach rather than have to mix and match to achieve full functionality.
>>
>> On 9 December 2015 at 19:41, Xiao Li <gatorsm...@gmail.com> wrote:
>>
>>> That sounds great! When it is decided, please let us know and we can add
>>> more features and make it ANSI SQL compliant.
>>>
>>> Thank you!
>>>
>>> Xiao Li
>>>
>>> 2015-12-09 11:31 GMT-08:00 Michael Armbrust <mich...@databricks.com>:
>>>
>>>> I don't plan to abandon HiveQL compatibility, but I'd like to see us
>>>> move towards something with more SQL compliance (perhaps just newer
>>>> versions of the HiveQL parser). Exactly which parser will do that for us
>>>> is under investigation.
>>>>
>>>> On Wed, Dec 9, 2015 at 11:02 AM, Xiao Li <gatorsm...@gmail.com> wrote:
>>>>
>>>>> Hi, Michael,
>>>>>
>>>>> Does that mean SQLContext will be built on HiveQL in the near future?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Xiao Li
>>>>>
>>>>> 2015-12-09 10:36 GMT-08:00 Michael Armbrust <mich...@databricks.com>:
>>>>>
>>>>>> I think that it is generally good to have parity when the
>>>>>> functionality is useful. However, in some cases various features are
>>>>>> there just to maintain compatibility with other systems. For example,
>>>>>> CACHE TABLE is eager because Shark's cache table was. df.cache() is
>>>>>> lazy because Spark's cache is. Does that mean that we need to add some
>>>>>> eager caching mechanism to dataframes to have parity? Probably not;
>>>>>> users can just call .count() if they want to force materialization.
>>>>>>
>>>>>> Regarding the differences between HiveQL and the SQLParser, I think
>>>>>> we should get rid of the SQL parser. It's kind of a hack that I built
>>>>>> just so that there was some SQL story for people who didn't compile
>>>>>> with Hive. Moving forward, I'd like to see the distinction between the
>>>>>> HiveContext and SQLContext removed so that we can standardize on a
>>>>>> single parser. For this reason I'd be opposed to spending a lot of
>>>>>> dev/reviewer time on adding features there.
>>>>>>
>>>>>> On Wed, Dec 9, 2015 at 8:34 AM, Cristian O
>>>>>> <cristian.b.op...@googlemail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I was wondering what the "official" view is on feature parity
>>>>>>> between SQL and DF APIs. Docs are pretty sparse on the SQL front, and
>>>>>>> it seems that some features are only supported at various times in
>>>>>>> only one of the Spark SQL dialect, the HiveQL dialect and the DF API.
>>>>>>> DF.cube(), DISTRIBUTE BY, CACHE LAZY are some examples.
>>>>>>>
>>>>>>> Is there an explicit goal of having consistent support for all
>>>>>>> features in both DF and SQL?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Cristian