Nice, will test it out. +1

On Tue, Mar 26, 2019, 22:38 Reynold Xin <r...@databricks.com> wrote:
> We just made the repo public: https://github.com/databricks/spark-pandas
>
> On Tue, Mar 26, 2019 at 1:20 AM, Timothee Hunter <timhun...@databricks.com> wrote:
>
>> To add more details to what Reynold mentioned: as you said, there are going
>> to be some slight differences between Pandas and Spark in any case, simply
>> because Spark needs to know the return types of the functions. In your case,
>> you would need to slightly refactor your apply method to the following
>> (in Python 3) to add type hints:
>>
>> ```
>> def f(x) -> float: return x * 3.0
>> df['col3'] = df['col1'].apply(f)
>> ```
>>
>> This has the benefit of keeping your code fully compliant with both
>> pandas and pyspark. We will share more information in the future.
>>
>> Tim
>>
>> On Tue, Mar 26, 2019 at 8:08 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>
>> BTW, I am working on documentation related to this subject at
>> https://issues.apache.org/jira/browse/SPARK-26022 to describe the differences.
>>
>> On Tue, Mar 26, 2019 at 3:34 PM, Reynold Xin <r...@databricks.com> wrote:
>>
>> We have some early stuff there, but we're not quite ready to talk about it in
>> public yet (I hope soon though). Will shoot you a separate email on it.
>>
>> On Mon, Mar 25, 2019 at 11:32 PM Abdeali Kothari <abdealikoth...@gmail.com> wrote:
>>
>> Thanks for the reply, Reynold - has this shim project started?
>> I'd love to contribute to it, as I have started making a bunch of helper
>> functions to do something similar for my current task and would prefer not
>> to do it in isolation. I was considering making a git repo and pushing stuff
>> there just this morning, but if there are already folks working on it, I'd
>> prefer collaborating.
>>
>> Note - I'm not recommending we make the logical plan mutable (as I am
>> scared of that too!). I think there are other ways of handling that - but
>> we can go into details later.
>>
>> On Tue, Mar 26, 2019 at 11:58 AM Reynold Xin <r...@databricks.com> wrote:
>>
>> We have been thinking about some of these issues. Some of them are harder
>> to do; e.g. Spark DataFrames are fundamentally immutable, and making the
>> logical plan mutable would be a significant deviation from the current
>> paradigm that might confuse the hell out of some users. We are considering
>> building a shim layer as a separate project on top of Spark (so we can make
>> rapid releases based on feedback) just to test this out and see how well it
>> could work in practice.
>>
>> On Mon, Mar 25, 2019 at 11:04 PM Abdeali Kothari <abdealikoth...@gmail.com> wrote:
>>
>> Hi,
>> I was doing some Spark to pandas (and vice versa) conversion because some
>> of the pandas code we have doesn't work on huge data, and some Spark code
>> runs very slowly on small data.
>>
>> It was nice to see that pyspark has similar syntax for the common pandas
>> operations that the Python community is used to:
>>
>> GroupBy aggs: df.groupby(['col2']).agg({'col2': 'count'}).show()
>> Column selects: df[['col1', 'col2']]
>> Row filters: df[df['col1'] < 3.0]
>>
>> I was wondering about a bunch of other functions in pandas which seemed
>> common, and thought there must've been a discussion about it in the
>> community - hence started this thread.
>>
>> I was wondering whether there has been discussion on adding the following
>> functions:
>>
>> *Column setters*:
>> In Pandas:
>> df['col3'] = df['col1'] * 3.0
>> While I do the following in PySpark:
>> df = df.withColumn('col3', df['col1'] * 3.0)
>>
>> *Column apply()*:
>> In Pandas:
>> df['col3'] = df['col1'].apply(lambda x: x * 3.0)
>> While I do the following in PySpark:
>> df = df.withColumn('col3', F.udf(lambda x: x * 3.0, 'float')(df['col1']))
>>
>> I understand that this one cannot be as simple as in pandas due to the
>> output type that's needed here, but it could be done like:
>> df['col3'] = df['col1'].apply((lambda x: x * 3.0), 'float')
>>
>> Multi-column in pandas is:
>> df['col3'] = df[['col1', 'col2']].apply(lambda x: x.col1 * 3.0, axis=1)
>> Maybe this can be done in pyspark as follows, or, if we can send a
>> pyspark.sql.Row directly, it would be similar (?):
>> df['col3'] = df[['col1', 'col2']].apply((lambda col1, col2: col1 * 3.0), 'float')
>>
>> *Rename*:
>> In Pandas:
>> df.rename(columns={...})
>> While I do the following in PySpark:
>> df.toDF(*[{'col2': 'col3'}.get(i, i) for i in df.columns])
>>
>> *To Dictionary*:
>> In Pandas:
>> df.to_dict(orient='list')
>> While I do the following in PySpark:
>> {f.name: [row[i] for row in df.collect()] for i, f in enumerate(df.schema.fields)}
>>
>> I thought I'd start the discussion with these and come back to some of
>> the others I see that could be helpful.
>>
>> *Note*: (with the column functions in mind) I understand that the DataFrame
>> cannot be modified, and I am not suggesting we change that nor any
>> underlying principle. Just trying to add syntactic sugar here.
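The "syntactic sugar" asked for in the original mail can be prototyped as a thin wrapper over the existing, immutable DataFrame API. The `SugarFrame` class below is purely hypothetical (it is not part of PySpark, nor of the spark-pandas project linked above); it is only a sketch showing that a pandas-style `__setitem__` and `rename` can delegate to `withColumn` and `toDF`, so nothing in Spark's logical plan has to become mutable.

```
from pyspark.sql import SparkSession


class SugarFrame:
    """Hypothetical pandas-flavoured wrapper around an immutable Spark DataFrame."""

    def __init__(self, sdf):
        self._sdf = sdf  # the wrapped, immutable pyspark.sql.DataFrame

    def __getitem__(self, key):
        if isinstance(key, list):          # df[['col1', 'col2']] -> column selection
            return SugarFrame(self._sdf.select(*key))
        return self._sdf[key]              # df['col1'] -> Column expression

    def __setitem__(self, name, col):
        # df['col3'] = <Column>: rebind the wrapper to a brand-new DataFrame;
        # the original DataFrame and its logical plan are never mutated.
        self._sdf = self._sdf.withColumn(name, col)

    def rename(self, columns):
        # pandas-style rename(columns={'old': 'new'}), implemented via toDF
        return SugarFrame(self._sdf.toDF(*[columns.get(c, c) for c in self._sdf.columns]))

    def show(self, *args, **kwargs):
        return self._sdf.show(*args, **kwargs)


spark = SparkSession.builder.master('local[1]').getOrCreate()
df = SugarFrame(spark.createDataFrame([(1.0, 'a'), (2.0, 'b')], ['col1', 'col2']))
df['col3'] = df['col1'] * 3.0              # column-setter sugar
df.rename({'col2': 'name'}).show()
```

The same delegation trick would extend to an `apply` that takes an explicit return type, which is where the type-hint discussion above comes in.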
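Tim's type-hint suggestion can also be tried with nothing but stock pandas and the standard library: the annotated function runs unchanged under pandas, and the return annotation is the piece of information a Spark-backed implementation could inspect to pick a UDF's return type. A minimal sketch under that assumption (the actual spark-pandas API may read the hint differently):

```
import typing
import pandas as pd


def f(x) -> float:
    return x * 3.0


df = pd.DataFrame({'col1': [1.0, 2.0, 3.0]})
df['col3'] = df['col1'].apply(f)   # plain pandas: the annotation is simply ignored

# What a Spark-backed implementation could do with the same function:
return_type = typing.get_type_hints(f).get('return')
print(return_type)                 # <class 'float'> -> e.g. a double-typed UDF
print(df)
```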