We have been thinking about some of these issues. Some of them are harder
to do: Spark DataFrames are fundamentally immutable, for example, and making
the logical plan mutable would be a significant deviation from the current
paradigm that might confuse the hell out of some users. We are considering
building a shim layer as a separate project on top of Spark (so we can make
rapid releases based on feedback) just to test this out and see how well it
works in practice.
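
To make that concrete, here is a rough sketch of what such a shim might look
like (all names here are hypothetical, not an actual API): a thin wrapper
that holds a reference to an immutable Spark DataFrame and simply rebinds it
when pandas-style assignment syntax is used, so Spark itself stays immutable.

class MutableFrame(object):
    """Hypothetical shim: pandas-style column assignment over Spark."""

    def __init__(self, sdf):
        self._sdf = sdf  # the current (immutable) Spark DataFrame

    def __getitem__(self, key):
        # Column access and filtering delegate to the underlying DataFrame.
        return self._sdf[key]

    def __setitem__(self, name, col):
        # "Mutation" just swaps in a new DataFrame with the column added.
        self._sdf = self._sdf.withColumn(name, col)

    def to_spark(self):
        return self._sdf

Usage would then read like pandas:

mdf = MutableFrame(spark_df)
mdf['col3'] = mdf['col1'] * 3.0
mdf.to_spark().show()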

On Mon, Mar 25, 2019 at 11:04 PM Abdeali Kothari <abdealikoth...@gmail.com>
wrote:

> Hi,
> I was doing some Spark to pandas (and vice versa) conversion because some
> of the pandas code we have doesn't work on huge data, and some Spark code
> runs very slowly on small data.
>
> It was nice to see that PySpark has similar syntax for the common
> pandas operations that the Python community is used to.
>
> GroupBy aggs: df.groupby(['col2']).agg({'col2': 'count'}).show()
> Column selects: df[['col1', 'col2']]
> Row Filters: df[df['col1'] < 3.0]
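>
> For example, all three work today on a small DataFrame (a minimal snippet,
> assuming an active SparkSession named spark):
>
> df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 5.0)], ['col2', 'col1'])
> df.groupby(['col2']).agg({'col2': 'count'}).show()  # count per group
> df[['col1', 'col2']].show()                         # column select
> df[df['col1'] < 3.0].show()                         # row filter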
>
> I was wondering about a bunch of other pandas functions that seemed
> common, and thought there must have been some discussion about them in the
> community, hence this thread.
>
> I was wondering whether there has been discussion on adding the following
> functions:
>
> *Column setters*:
> In Pandas:
> df['col3'] = df['col1'] * 3.0
> While I do the following in PySpark:
> df = df.withColumn('col3', df['col1'] * 3.0)
>
> *Column apply()*:
> In Pandas:
> df['col3'] = df['col1'].apply(lambda x: x * 3.0)
> While I do the following in PySpark:
> df = df.withColumn('col3', F.udf(lambda x: x * 3.0, 'float')(df['col1']))
>
> I understand that this one cannot be as simple as in pandas because of the
> output type that is needed here, but it could be done like:
> df['col3'] = df['col1'].apply((lambda x: x * 3.0), 'float')
>
> The multi-column version in pandas is:
> df['col3'] = df[['col1', 'col2']].apply(lambda x: x.col1 * 3.0, axis=1)
> Maybe this could be done in PySpark as below, or, if we could pass in a
> pyspark.sql.Row directly, it would be similar (?):
> df['col3'] = df[['col1', 'col2']].apply((lambda col1, col2: col1 * 3.0),
> 'float')
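>
> For reference, passing multiple columns to a UDF already works today, so
> the sugar above would mostly be a shorter spelling of (a sketch of the
> current equivalent, not exact semantics):
>
> df = df.withColumn('col3', F.udf(lambda col1, col2: col1 * 3.0,
> 'float')(df['col1'], df['col2']))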
>
> *Rename*:
> In Pandas:
> df.rename(columns={...})
> While I do the following in PySpark:
> df.toDF(*[{'col2': 'col3'}.get(i, i) for i in df.columns])
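>
> For a single column, PySpark's existing withColumnRenamed is already close;
> a pandas-style rename would mainly add the dict form on top of it:
>
> df = df.withColumnRenamed('col2', 'col3')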
>
> *To Dictionary*:
> In Pandas:
> df.to_dict(orient='list')
> While I do the following in PySpark:
> {f.name: [row[i] for row in df.collect()] for i, f in
> enumerate(df.schema.fields)}
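>
> The shortest route I know of today goes through toPandas(), which of course
> collects everything to the driver first:
>
> df.toPandas().to_dict(orient='list')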
>
> I thought I'd start the discussion with these and come back to some of the
> others I see that could be helpful.
>
> *Note*: (with the column functions in mind) I understand that the
> DataFrame itself cannot be modified, and I am not suggesting we change that
> or any underlying principle. I am just trying to add syntactic sugar here.
>
>
