Thanks for the reply Reynold - has this shim project started? I'd love to contribute to it, as I have already started writing a bunch of helper functions that do something similar for my current task, and I would prefer not to do that in isolation. I was considering making a git repo and pushing my code there just this morning - but if folks are already working on it, I'd prefer to collaborate.
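To give an idea of what I mean, my helpers are little more than thin wrappers over existing DataFrame methods - roughly along these lines (an illustrative sketch only; the function names are just placeholders I picked, not a proposed API):

    from pyspark.sql import DataFrame

    def rename(df: DataFrame, columns: dict) -> DataFrame:
        # pandas-style df.rename(columns={...}) via toDF()
        return df.toDF(*[columns.get(c, c) for c in df.columns])

    def with_columns(df: DataFrame, **exprs) -> DataFrame:
        # assign several derived columns in one call, a bit like pandas df.assign()
        for name, col in exprs.items():
            df = df.withColumn(name, col)
        return df

    def to_dict(df: DataFrame) -> dict:
        # pandas df.to_dict(orient='list') equivalent; note this collects to the driver
        rows = df.collect()
        return {c: [row[c] for row in rows] for c in df.columns}

    # e.g. df = with_columns(df, col3=df['col1'] * 3.0)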
Note - I'm not recommending we make the logical plan mutable (I am scared of that too!). I think there are other ways of handling that (I've put a very rough sketch of one such approach at the bottom of this mail) - but we can go into details later.

On Tue, Mar 26, 2019 at 11:58 AM Reynold Xin <r...@databricks.com> wrote:

> We have been thinking about some of these issues. Some of them are harder
> to do, e.g. Spark DataFrames are fundamentally immutable, and making the
> logical plan mutable is a significant deviation from the current paradigm
> that might confuse the hell out of some users. We are considering building
> a shim layer as a separate project on top of Spark (so we can make rapid
> releases based on feedback) just to test this out and see how well it
> could work in practice.
>
> On Mon, Mar 25, 2019 at 11:04 PM Abdeali Kothari <abdealikoth...@gmail.com>
> wrote:
>
>> Hi,
>> I was doing some Spark-to-pandas (and vice versa) conversion because some
>> of our pandas code doesn't work on huge data, and some of our Spark code
>> runs very slowly on small data.
>>
>> It was nice to see that PySpark has similar syntax for the common pandas
>> operations that the Python community is used to:
>>
>> GroupBy aggs: df.groupby(['col2']).agg({'col2': 'count'}).show()
>> Column selects: df[['col1', 'col2']]
>> Row filters: df[df['col1'] < 3.0]
>>
>> I was wondering about a bunch of other functions in pandas that seem
>> common, and thought there must have been a discussion about them in the
>> community - hence this thread.
>>
>> I was wondering whether there has been any discussion on adding the
>> following functions:
>>
>> *Column setters*:
>> In pandas:
>> df['col3'] = df['col1'] * 3.0
>> While I do the following in PySpark:
>> df = df.withColumn('col3', df['col1'] * 3.0)
>>
>> *Column apply()*:
>> In pandas:
>> df['col3'] = df['col1'].apply(lambda x: x * 3.0)
>> While I do the following in PySpark:
>> df = df.withColumn('col3', F.udf(lambda x: x * 3.0, 'float')(df['col1']))
>>
>> I understand that this one cannot be as simple as in pandas because of
>> the output type that is needed here. But it could be done like:
>> df['col3'] = df['col1'].apply((lambda x: x * 3.0), 'float')
>>
>> The multi-column version in pandas is:
>> df['col3'] = df[['col1', 'col2']].apply(lambda x: x.col1 * 3.0, axis=1)
>> Maybe this can be done in PySpark as follows, or, if we can pass a
>> pyspark.sql.Row directly, it would be similar (?):
>> df['col3'] = df[['col1', 'col2']].apply((lambda col1, col2: col1 * 3.0),
>> 'float')
>>
>> *Rename*:
>> In pandas:
>> df.rename(columns={...})
>> While I do the following in PySpark:
>> df.toDF(*[{'col2': 'col3'}.get(i, i) for i in df.columns])
>>
>> *To dictionary*:
>> In pandas:
>> df.to_dict(orient='list')
>> While I do the following in PySpark:
>> {f.name: [row[i] for row in df.collect()] for i, f in
>> enumerate(df.schema.fields)}
>>
>> I thought I'd start the discussion with these and come back to some of
>> the other functions I see that could be helpful.
>>
>> *Note*: (with the column functions in mind) I understand that a DataFrame
>> cannot be modified, and I am not suggesting we change that or any
>> underlying principle - just trying to add syntactic sugar here.
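PS - the rough sketch I mentioned above: one way a shim could offer pandas-style item assignment without touching Spark's immutability is a thin wrapper that simply rebinds itself to a new (still immutable) DataFrame on each assignment. Purely illustrative - the class and method names here are made up, not a proposed API:

    from pyspark.sql import DataFrame

    class FrameShim:
        """Wraps an ordinary immutable DataFrame; assignment rebinds the wrapper."""

        def __init__(self, df: DataFrame):
            self._df = df  # always a plain, immutable PySpark DataFrame

        def __getitem__(self, key):
            # delegate column access / selection / filtering to the wrapped DataFrame
            return self._df[key]

        def __setitem__(self, name, col):
            # df['col3'] = ... produces a *new* DataFrame via withColumn and
            # rebinds the wrapper to it - no logical plan is ever mutated
            self._df = self._df.withColumn(name, col)

        def to_spark(self) -> DataFrame:
            return self._df

    # usage: sdf = FrameShim(df); sdf['col3'] = sdf['col1'] * 3.0

So the DataFrame and its logical plan stay immutable; only the wrapper's internal reference changes.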