Re: SQL language vs DataFrame API

Stephen Boesch Wed, 09 Dec 2015 17:02:38 -0800

Is this a candidate for the version 1.X/2.0 split?

2015-12-09 16:29 GMT-08:00 Michael Armbrust <mich...@databricks.com>:


> Yeah, I would like to address any actual gaps in functionality that are
> present.
>
> On Wed, Dec 9, 2015 at 4:24 PM, Cristian Opris <cristian.b.op...@gmail.com> wrote:
>
>> The reason I'm asking is because it's important in larger projects to be
>> able to stick to a particular programming style. Some people are more
>> comfortable with SQL, others might find the DF api more suitable, but it's
>> important to have full expressivity in both to make it easier to adopt one
>> approach rather than have to mix and match to achieve full functionality.
>>
>> On 9 December 2015 at 19:41, Xiao Li <gatorsm...@gmail.com> wrote:
>>
>>> That sounds great! When it is decided, please let us know and we can add
>>> more features and make it ANSI SQL compliant.
>>>
>>> Thank you!
>>>
>>> Xiao Li
>>>
>>>
>>> 2015-12-09 11:31 GMT-08:00 Michael Armbrust <mich...@databricks.com>:
>>>
>>>> I don't plan to abandon HiveQL compatibility, but I'd like to see us
>>>> move towards something with more SQL compliance (perhaps just newer
>>>> versions of the HiveQL parser).  Exactly which parser will do that for us
>>>> is under investigation.
>>>>
>>>> On Wed, Dec 9, 2015 at 11:02 AM, Xiao Li <gatorsm...@gmail.com> wrote:
>>>>
>>>>> Hi, Michael,
>>>>>
>>>>> Does that mean SQLContext will be built on HiveQL in the near future?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Xiao Li
>>>>>
>>>>>
>>>>> 2015-12-09 10:36 GMT-08:00 Michael Armbrust <mich...@databricks.com>:
>>>>>
>>>>>> I think that it is generally good to have parity when the
>>>>>> functionality is useful.  However, in some cases various features are
>>>>>> there just to maintain compatibility with other systems.  For example,
>>>>>> CACHE TABLE is eager because Shark's cache table was.  df.cache() is
>>>>>> lazy because Spark's cache is.  Does that mean that we need to add some
>>>>>> eager caching mechanism to dataframes to have parity?  Probably not;
>>>>>> users can just call .count() if they want to force materialization.
>>>>>>
>>>>>> Regarding the differences between HiveQL and the SQLParser, I think
>>>>>> we should get rid of the SQL parser.  It's kind of a hack that I built
>>>>>> just so that there was some SQL story for people who didn't compile with
>>>>>> Hive.  Moving forward, I'd like to see the distinction between the
>>>>>> HiveContext and SQLContext removed so that we can standardize on a single
>>>>>> parser.  For this reason I'd be opposed to spending a lot of dev/reviewer
>>>>>> time on adding features there.
>>>>>>
>>>>>> On Wed, Dec 9, 2015 at 8:34 AM, Cristian O <cristian.b.op...@googlemail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I was wondering what the "official" view is on feature parity
>>>>>>> between SQL and DF apis. Docs are pretty sparse on the SQL front, and it
>>>>>>> seems that some features are supported, at various times, in only one of
>>>>>>> the Spark SQL dialect, the HiveQL dialect and the DF API. DF.cube(),
>>>>>>> DISTRIBUTE BY and CACHE LAZY are some examples.
>>>>>>>
>>>>>>> Is there an explicit goal of having consistent support for all
>>>>>>> features in both DF and SQL ?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Cristian
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
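
For reference, the eager vs. lazy caching distinction raised in the thread
(CACHE TABLE is eager, df.cache() is lazy, and .count() forces
materialization) can be sketched with a toy model. This is not Spark code:
LazyCachedFrame and cache_table_eager are hypothetical names made up for
illustration, and the sketch only mimics the behavior described above under
the assumption that a cached result is stored the first time an action runs.

```python
class LazyCachedFrame:
    """Toy stand-in for a DataFrame. Hypothetical; not a Spark class."""

    def __init__(self, compute):
        self._compute = compute        # zero-argument callable producing the rows
        self._cache_requested = False  # set by cache(), nothing is computed yet
        self._cached_rows = None       # filled in by the first action after cache()
        self.computations = 0          # how many times the rows were recomputed

    def cache(self):
        # Lazy: only records the intent to cache; no rows are computed here.
        self._cache_requested = True
        return self

    def _rows(self):
        # Serve from the cache if it has already been materialized.
        if self._cached_rows is not None:
            return self._cached_rows
        self.computations += 1
        rows = list(self._compute())
        if self._cache_requested:
            self._cached_rows = rows
        return rows

    def count(self):
        # An action: computes the rows (or reads the cache) and counts them.
        return len(self._rows())

    @property
    def is_materialized(self):
        return self._cached_rows is not None


def cache_table_eager(frame):
    """Hypothetical helper mimicking eager caching: mark, then materialize at once."""
    frame.cache()
    frame.count()  # force materialization, as suggested in the thread
    return frame
```

In this model, calling cache() alone leaves is_materialized False until
count() runs once; cache_table_eager() simply combines the two steps, which
is the sense in which the thread argues a separate eager API is unnecessary.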
