We just made the repo public: https://github.com/databricks/spark-pandas
On Tue, Mar 26, 2019 at 1:20 AM, Timothee Hunter <timhun...@databricks.com> wrote:

> To add more details to what Reynold mentioned: as you said, there are
> going to be some slight differences between Pandas and Spark in any case,
> simply because Spark needs to know the return types of the functions. In
> your case, you would need to slightly refactor your apply method to the
> following (in Python 3) to add type hints:
>
> ```
> def f(x) -> float: return x * 3.0
> df['col3'] = df['col1'].apply(f)
> ```
>
> This has the benefit of keeping your code fully compliant with both
> pandas and pyspark. We will share more information in the future.
>
> Tim
>
> On Tue, Mar 26, 2019 at 8:08 AM Hyukjin Kwon <gurwls223@gmail.com> wrote:
>
>> BTW, I am working on the documentation related to this subject at
>> https://issues.apache.org/jira/browse/SPARK-26022 to describe the
>> differences.
>>
>> On Tue, Mar 26, 2019 at 3:34 PM, Reynold Xin <rxin@databricks.com> wrote:
>>
>>> We have some early stuff there but not quite ready to talk about it in
>>> public yet (I hope soon though). Will shoot you a separate email on it.
>>>
>>> On Mon, Mar 25, 2019 at 11:32 PM Abdeali Kothari <abdealikothari@gmail.com> wrote:
>>>
>>>> Thanks for the reply Reynold - has this shim project started?
>>>> I'd love to contribute to it, as I have already started making a bunch
>>>> of helper functions to do something similar for my current task and
>>>> would prefer not to do it in isolation.
>>>> I was considering making a git repo and pushing stuff there just this
>>>> morning, but if there are already folks working on it, I'd prefer to
>>>> collaborate.
>>>>
>>>> Note - I'm not recommending we make the logical plan mutable (as I am
>>>> scared of that too!). I think there are other ways of handling that,
>>>> but we can go into details later.
>>>>
>>>> On Tue, Mar 26, 2019 at 11:58 AM Reynold Xin <rxin@databricks.com> wrote:
>>>>
>>>>> We have been thinking about some of these issues. Some of them are
>>>>> harder to do, e.g. Spark DataFrames are fundamentally immutable, and
>>>>> making the logical plan mutable is a significant deviation from the
>>>>> current paradigm that might confuse the hell out of some users. We are
>>>>> considering building a shim layer as a separate project on top of
>>>>> Spark (so we can make rapid releases based on feedback) just to test
>>>>> this out and see how well it could work in practice.
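To make the shim-layer idea above a bit more concrete, here is a minimal illustrative sketch. It is hypothetical (the class and method names are invented, not the API of the spark-pandas repo): a thin, mutable wrapper that holds an immutable Spark DataFrame and simply rebinds it on pandas-style column assignment, so the logical plan itself never has to become mutable.

```
# Hypothetical sketch only - not the spark-pandas API. Illustrates the
# shim-layer idea: a thin, mutable wrapper around an immutable Spark
# DataFrame, so pandas-style column assignment maps onto withColumn.
from pyspark.sql import SparkSession


class PandasLikeFrame:
    def __init__(self, sdf):
        self._sdf = sdf  # the current (immutable) Spark DataFrame

    def __getitem__(self, name):
        # Column access returns a Spark Column, so expressions like
        # frame['col1'] * 3.0 keep working.
        return self._sdf[name]

    def __setitem__(self, name, col):
        # "Mutation" just rebinds the wrapper to a new immutable DataFrame;
        # the underlying plan is never modified in place.
        self._sdf = self._sdf.withColumn(name, col)

    def to_spark(self):
        return self._sdf


spark = SparkSession.builder.getOrCreate()
df = PandasLikeFrame(spark.createDataFrame([(1.0, 'a'), (2.0, 'b')], ['col1', 'col2']))
df['col3'] = df['col1'] * 3.0  # pandas-style setter, Spark plan underneath
df.to_spark().show()
```

In this sketch only the wrapper is mutated, never the Spark DataFrame, which is one way to offer the setter syntax requested below without deviating from the immutable-plan paradigm.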
>>>>> On Mon, Mar 25, 2019 at 11:04 PM Abdeali Kothari <abdealikothari@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>> I was doing some spark to pandas (and vice versa) conversion because
>>>>>> some of the pandas code we have doesn't work on huge data, and some
>>>>>> spark code works very slowly on small data.
>>>>>>
>>>>>> It was nice to see that pyspark has some similar syntax for the common
>>>>>> pandas operations that the python community is used to:
>>>>>>
>>>>>> GroupBy aggs: df.groupby(['col2']).agg({'col2': 'count'}).show()
>>>>>> Column selects: df[['col1', 'col2']]
>>>>>> Row filters: df[df['col1'] < 3.0]
>>>>>>
>>>>>> I was wondering about a bunch of other functions in pandas which seemed
>>>>>> common, and thought there must have been a discussion about them in the
>>>>>> community - hence this thread.
>>>>>>
>>>>>> I was wondering whether there has been discussion on adding the
>>>>>> following functions:
>>>>>>
>>>>>> *Column setters*:
>>>>>> In pandas:
>>>>>> df['col3'] = df['col1'] * 3.0
>>>>>> while I do the following in PySpark:
>>>>>> df = df.withColumn('col3', df['col1'] * 3.0)
>>>>>>
>>>>>> *Column apply()*:
>>>>>> In pandas:
>>>>>> df['col3'] = df['col1'].apply(lambda x: x * 3.0)
>>>>>> while I do the following in PySpark:
>>>>>> df = df.withColumn('col3', F.udf(lambda x: x * 3.0, 'float')(df['col1']))
>>>>>>
>>>>>> I understand that this one cannot be as simple as in pandas because of
>>>>>> the output type that's needed here, but it could be done like:
>>>>>> df['col3'] = df['col1'].apply((lambda x: x * 3.0), 'float')
>>>>>>
>>>>>> Multi-column in pandas is:
>>>>>> df['col3'] = df[['col1', 'col2']].apply(lambda x: x.col1 * 3.0, axis=1)
>>>>>>
>>>>>> Maybe this could be done in pyspark as follows, or, if we could pass a
>>>>>> pyspark.sql.Row directly, it would be similar (?):
>>>>>> df['col3'] = df[['col1', 'col2']].apply((lambda col1, col2: col1 * 3.0), 'float')
>>>>>>
>>>>>> *Rename*:
>>>>>> In pandas:
>>>>>> df.rename(columns={...})
>>>>>> while I do the following in PySpark:
>>>>>> df.toDF(*[{'col2': 'col3'}.get(i, i) for i in df.columns])
>>>>>>
>>>>>> *To dictionary*:
>>>>>> In pandas:
>>>>>> df.to_dict(orient='list')
>>>>>> while I do the following in PySpark:
>>>>>> {f.name: [row[i] for row in df.collect()] for i, f in enumerate(df.schema.fields)}
>>>>>>
>>>>>> I thought I'd start the discussion with these and come back to some of
>>>>>> the others I see that could be helpful.
>>>>>>
>>>>>> *Note*: (with the column functions in mind) I understand that the
>>>>>> DataFrame itself cannot be modified, and I am not suggesting we change
>>>>>> that or any underlying principle - just trying to add syntactic sugar
>>>>>> here.
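The type-hint approach Tim describes at the top of the thread can cover the apply() case above. Below is a minimal sketch under stated assumptions: apply_column and _TYPE_MAP are hypothetical helpers (not an existing API) that read the function's return annotation and map it to a Spark SQL type string for the UDF, so the same annotated function stays usable with pandas' .apply.

```
# Hypothetical sketch of the type-hint idea - apply_column is not an
# existing API. The return annotation supplies the UDF's return type, so
# the same function works with pandas .apply and with a PySpark UDF.
import typing

from pyspark.sql import functions as F

# Assumed mapping from Python annotations to Spark SQL type strings.
_TYPE_MAP = {float: 'double', int: 'bigint', str: 'string', bool: 'boolean'}


def apply_column(sdf, out_col, func, *in_cols):
    # Read the declared return type, e.g. float -> 'double'.
    return_type = _TYPE_MAP[typing.get_type_hints(func)['return']]
    udf = F.udf(func, return_type)
    return sdf.withColumn(out_col, udf(*[sdf[c] for c in in_cols]))


def f(x) -> float:
    return x * 3.0

# pandas:  df['col3'] = df['col1'].apply(f)
# PySpark: df = apply_column(df, 'col3', f, 'col1')
```

This is only one possible bridge; the point is that the annotation, rather than an extra argument, carries the return type that Spark needs.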