Nice, will test it out. +1

On Tue, Mar 26, 2019, 22:38 Reynold Xin <r...@databricks.com> wrote:
> We just made the repo public: https://github.com/databricks/spark-pandas
>
> On Tue, Mar 26, 2019 at 1:20 AM, Timothee Hunter <timhun...@databricks.com> wrote:
>
>> To add more details to what Reynold mentioned: as you said, there are going
>> to be some slight differences between Pandas and Spark in any case, simply
>> because Spark needs to know the return types of the functions. In your case,
>> you would need to slightly refactor your apply method to the following
>> (in Python 3) to add type hints:
>>
>> ```
>> def f(x) -> float: return x * 3.0
>> df['col3'] = df['col1'].apply(f)
>> ```
>>
>> This has the benefit of keeping your code fully compliant with both
>> pandas and pyspark. We will share more information in the future.
>>
>> Tim
>>
>> On Tue, Mar 26, 2019 at 8:08 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>
>> BTW, I am working on documentation related to this subject at
>> https://issues.apache.org/jira/browse/SPARK-26022 to describe the differences.
>>
>> On Tue, Mar 26, 2019 at 3:34 PM, Reynold Xin <r...@databricks.com> wrote:
>>
>> We have some early stuff there, but we're not quite ready to talk about it in
>> public yet (I hope soon though). Will shoot you a separate email on it.
>>
>> On Mon, Mar 25, 2019 at 11:32 PM Abdeali Kothari <abdealikoth...@gmail.com> wrote:
>>
>> Thanks for the reply, Reynold - has this shim project started?
>> I'd love to contribute to it, as I have started making a bunch of helper
>> functions to do something similar for my current task and would prefer not
>> to do it in isolation. I was considering making a git repo and pushing stuff
>> there just this morning, but if there are already folks working on it, I'd
>> prefer collaborating.
>>
>> Note - I'm not recommending we make the logical plan mutable (as I am
>> scared of that too!). I think there are other ways of handling that - but
>> we can go into details later.
>>
>> On Tue, Mar 26, 2019 at 11:58 AM Reynold Xin <r...@databricks.com> wrote:
>>
>> We have been thinking about some of these issues. Some of them are harder
>> to do; e.g. Spark DataFrames are fundamentally immutable, and making the
>> logical plan mutable would be a significant deviation from the current
>> paradigm that might confuse the hell out of some users. We are considering
>> building a shim layer as a separate project on top of Spark (so we can make
>> rapid releases based on feedback) just to test this out and see how well it
>> could work in practice.
>>
>> On Mon, Mar 25, 2019 at 11:04 PM Abdeali Kothari <abdealikoth...@gmail.com> wrote:
>>
>> Hi,
>> I was doing some Spark to pandas (and vice versa) conversion because some
>> of the pandas code we have doesn't work on huge data, and some Spark code
>> runs very slowly on small data.
>>
>> It was nice to see that pyspark has similar syntax for the common pandas
>> operations that the Python community is used to:
>>
>> GroupBy aggs: df.groupby(['col2']).agg({'col2': 'count'}).show()
>> Column selects: df[['col1', 'col2']]
>> Row filters: df[df['col1'] < 3.0]
>>
>> I was wondering about a bunch of other functions in pandas which seemed
>> common, and thought there must've been a discussion about it in the
>> community - hence started this thread.
>>
>> I was wondering whether there has been discussion on adding the following
>> functions:
>>
>> *Column setters*:
>> In Pandas:
>> df['col3'] = df['col1'] * 3.0
>> While I do the following in PySpark:
>> df = df.withColumn('col3', df['col1'] * 3.0)
>>
>> *Column apply()*:
>> In Pandas:
>> df['col3'] = df['col1'].apply(lambda x: x * 3.0)
>> While I do the following in PySpark:
>> df = df.withColumn('col3', F.udf(lambda x: x * 3.0, 'float')(df['col1']))
>>
>> I understand that this one cannot be as simple as in pandas due to the
>> output type that's needed here, but it could be done like:
>> df['col3'] = df['col1'].apply((lambda x: x * 3.0), 'float')
>>
>> Multi-column in pandas is:
>> df['col3'] = df[['col1', 'col2']].apply(lambda x: x.col1 * 3.0, axis=1)
>> Maybe this can be done in pyspark as follows, or, if we can send a
>> pyspark.sql.Row directly, it would be similar (?):
>> df['col3'] = df[['col1', 'col2']].apply((lambda col1, col2: col1 * 3.0), 'float')
>>
>> *Rename*:
>> In Pandas:
>> df.rename(columns={...})
>> While I do the following in PySpark:
>> df.toDF(*[{'col2': 'col3'}.get(i, i) for i in df.columns])
>>
>> *To Dictionary*:
>> In Pandas:
>> df.to_dict(orient='list')
>> While I do the following in PySpark:
>> {f.name: [row[i] for row in df.collect()] for i, f in enumerate(df.schema.fields)}
>>
>> I thought I'd start the discussion with these and come back to some of
>> the others I see that could be helpful.
>>
>> *Note*: (with the column functions in mind) I understand that the DataFrame
>> cannot be modified, and I am not suggesting we change that nor any
>> underlying principle. Just trying to add syntactic sugar here.
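The "syntactic sugar" asked for in the original mail can be prototyped as a thin wrapper over the existing, immutable DataFrame API. The `SugarFrame` class below is purely hypothetical (it is not part of PySpark, nor of the spark-pandas project linked above); it is only a sketch showing that a pandas-style `__setitem__` and `rename` can delegate to `withColumn` and `toDF`, so nothing in Spark's logical plan has to become mutable.

```
from pyspark.sql import SparkSession


class SugarFrame:
    """Hypothetical pandas-flavoured wrapper around an immutable Spark DataFrame."""

    def __init__(self, sdf):
        self._sdf = sdf  # the wrapped, immutable pyspark.sql.DataFrame

    def __getitem__(self, key):
        if isinstance(key, list):          # df[['col1', 'col2']] -> column selection
            return SugarFrame(self._sdf.select(*key))
        return self._sdf[key]              # df['col1'] -> Column expression

    def __setitem__(self, name, col):
        # df['col3'] = <Column>: rebind the wrapper to a brand-new DataFrame;
        # the original DataFrame and its logical plan are never mutated.
        self._sdf = self._sdf.withColumn(name, col)

    def rename(self, columns):
        # pandas-style rename(columns={'old': 'new'}), implemented via toDF
        return SugarFrame(self._sdf.toDF(*[columns.get(c, c) for c in self._sdf.columns]))

    def show(self, *args, **kwargs):
        return self._sdf.show(*args, **kwargs)


spark = SparkSession.builder.master('local[1]').getOrCreate()
df = SugarFrame(spark.createDataFrame([(1.0, 'a'), (2.0, 'b')], ['col1', 'col2']))
df['col3'] = df['col1'] * 3.0              # column-setter sugar
df.rename({'col2': 'name'}).show()
```

The same delegation trick would extend to an `apply` that takes an explicit return type, which is where the type-hint discussion above comes in.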
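Tim's type-hint suggestion can also be tried with nothing but stock pandas and the standard library: the annotated function runs unchanged under pandas, and the return annotation is the piece of information a Spark-backed implementation could inspect to pick a UDF's return type. A minimal sketch under that assumption (the actual spark-pandas API may read the hint differently):

```
import typing
import pandas as pd


def f(x) -> float:
    return x * 3.0


df = pd.DataFrame({'col1': [1.0, 2.0, 3.0]})
df['col3'] = df['col1'].apply(f)   # plain pandas: the annotation is simply ignored

# What a Spark-backed implementation could do with the same function:
return_type = typing.get_type_hints(f).get('return')
print(return_type)                 # <class 'float'> -> e.g. a double-typed UDF
print(df)
```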