We just made the repo public: https://github.com/databricks/spark-pandas
On Tue, Mar 26, 2019 at 1:20 AM, Timothee Hunter <timhun...@databricks.com> wrote:

> To add more details to what Reynold mentioned: as you said, there are
> going to be some slight differences between Pandas and Spark in any case,
> simply because Spark needs to know the return types of the functions. In
> your case, you would need to slightly refactor your apply method to the
> following (in Python 3) to add type hints:
>
> ```
> def f(x) -> float: return x * 3.0
> df['col3'] = df['col1'].apply(f)
> ```
>
> This has the benefit of keeping your code fully compliant with both
> pandas and pyspark. We will share more information in the future.
>
> Tim
>
> On Tue, Mar 26, 2019 at 8:08 AM Hyukjin Kwon <gurwls223@gmail.com> wrote:
>
>> BTW, I am working on the documentation related to this subject at
>> https://issues.apache.org/jira/browse/SPARK-26022 to describe the
>> differences.
>>
>> On Tue, Mar 26, 2019 at 3:34 PM, Reynold Xin <rxin@databricks.com> wrote:
>>
>>> We have some early stuff there but not quite ready to talk about it in
>>> public yet (I hope soon though). Will shoot you a separate email on it.
>>>
>>> On Mon, Mar 25, 2019 at 11:32 PM Abdeali Kothari <abdealikothari@gmail.com> wrote:
>>>
>>>> Thanks for the reply Reynold - has this shim project started?
>>>> I'd love to contribute to it, as I have already started making a bunch
>>>> of helper functions to do something similar for my current task and
>>>> would prefer not to do it in isolation.
>>>> I was considering making a git repo and pushing stuff there just this
>>>> morning, but if there are already folks working on it, I'd prefer to
>>>> collaborate.
>>>>
>>>> Note - I'm not recommending we make the logical plan mutable (as I am
>>>> scared of that too!). I think there are other ways of handling that,
>>>> but we can go into details later.
>>>>
>>>> On Tue, Mar 26, 2019 at 11:58 AM Reynold Xin <rxin@databricks.com> wrote:
>>>>
>>>>> We have been thinking about some of these issues. Some of them are
>>>>> harder to do, e.g. Spark DataFrames are fundamentally immutable, and
>>>>> making the logical plan mutable is a significant deviation from the
>>>>> current paradigm that might confuse the hell out of some users. We are
>>>>> considering building a shim layer as a separate project on top of
>>>>> Spark (so we can make rapid releases based on feedback) just to test
>>>>> this out and see how well it could work in practice.
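To make the shim-layer idea above a bit more concrete, here is a minimal illustrative sketch. It is hypothetical (the class and method names are invented, not the API of the spark-pandas repo): a thin, mutable wrapper that holds an immutable Spark DataFrame and simply rebinds it on pandas-style column assignment, so the logical plan itself never has to become mutable.

```
# Hypothetical sketch only - not the spark-pandas API. Illustrates the
# shim-layer idea: a thin, mutable wrapper around an immutable Spark
# DataFrame, so pandas-style column assignment maps onto withColumn.
from pyspark.sql import SparkSession


class PandasLikeFrame:
    def __init__(self, sdf):
        self._sdf = sdf  # the current (immutable) Spark DataFrame

    def __getitem__(self, name):
        # Column access returns a Spark Column, so expressions like
        # frame['col1'] * 3.0 keep working.
        return self._sdf[name]

    def __setitem__(self, name, col):
        # "Mutation" just rebinds the wrapper to a new immutable DataFrame;
        # the underlying plan is never modified in place.
        self._sdf = self._sdf.withColumn(name, col)

    def to_spark(self):
        return self._sdf


spark = SparkSession.builder.getOrCreate()
df = PandasLikeFrame(spark.createDataFrame([(1.0, 'a'), (2.0, 'b')], ['col1', 'col2']))
df['col3'] = df['col1'] * 3.0  # pandas-style setter, Spark plan underneath
df.to_spark().show()
```

In this sketch only the wrapper is mutated, never the Spark DataFrame, which is one way to offer the setter syntax requested below without deviating from the immutable-plan paradigm.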
>>>>> On Mon, Mar 25, 2019 at 11:04 PM Abdeali Kothari <abdealikothari@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>> I was doing some spark to pandas (and vice versa) conversion because
>>>>>> some of the pandas code we have doesn't work on huge data, and some
>>>>>> spark code works very slowly on small data.
>>>>>>
>>>>>> It was nice to see that pyspark has some similar syntax for the common
>>>>>> pandas operations that the python community is used to:
>>>>>>
>>>>>> GroupBy aggs: df.groupby(['col2']).agg({'col2': 'count'}).show()
>>>>>> Column selects: df[['col1', 'col2']]
>>>>>> Row filters: df[df['col1'] < 3.0]
>>>>>>
>>>>>> I was wondering about a bunch of other functions in pandas which seemed
>>>>>> common, and thought there must have been a discussion about them in the
>>>>>> community - hence this thread.
>>>>>>
>>>>>> I was wondering whether there has been discussion on adding the
>>>>>> following functions:
>>>>>>
>>>>>> *Column setters*:
>>>>>> In pandas:
>>>>>> df['col3'] = df['col1'] * 3.0
>>>>>> while I do the following in PySpark:
>>>>>> df = df.withColumn('col3', df['col1'] * 3.0)
>>>>>>
>>>>>> *Column apply()*:
>>>>>> In pandas:
>>>>>> df['col3'] = df['col1'].apply(lambda x: x * 3.0)
>>>>>> while I do the following in PySpark:
>>>>>> df = df.withColumn('col3', F.udf(lambda x: x * 3.0, 'float')(df['col1']))
>>>>>>
>>>>>> I understand that this one cannot be as simple as in pandas because of
>>>>>> the output type that's needed here, but it could be done like:
>>>>>> df['col3'] = df['col1'].apply((lambda x: x * 3.0), 'float')
>>>>>>
>>>>>> Multi-column in pandas is:
>>>>>> df['col3'] = df[['col1', 'col2']].apply(lambda x: x.col1 * 3.0, axis=1)
>>>>>>
>>>>>> Maybe this could be done in pyspark as follows, or, if we could pass a
>>>>>> pyspark.sql.Row directly, it would be similar (?):
>>>>>> df['col3'] = df[['col1', 'col2']].apply((lambda col1, col2: col1 * 3.0), 'float')
>>>>>>
>>>>>> *Rename*:
>>>>>> In pandas:
>>>>>> df.rename(columns={...})
>>>>>> while I do the following in PySpark:
>>>>>> df.toDF(*[{'col2': 'col3'}.get(i, i) for i in df.columns])
>>>>>>
>>>>>> *To dictionary*:
>>>>>> In pandas:
>>>>>> df.to_dict(orient='list')
>>>>>> while I do the following in PySpark:
>>>>>> {f.name: [row[i] for row in df.collect()] for i, f in enumerate(df.schema.fields)}
>>>>>>
>>>>>> I thought I'd start the discussion with these and come back to some of
>>>>>> the others I see that could be helpful.
>>>>>>
>>>>>> *Note*: (with the column functions in mind) I understand that the
>>>>>> DataFrame itself cannot be modified, and I am not suggesting we change
>>>>>> that or any underlying principle - just trying to add syntactic sugar
>>>>>> here.
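The type-hint approach Tim describes at the top of the thread can cover the apply() case above. Below is a minimal sketch under stated assumptions: apply_column and _TYPE_MAP are hypothetical helpers (not an existing API) that read the function's return annotation and map it to a Spark SQL type string for the UDF, so the same annotated function stays usable with pandas' .apply.

```
# Hypothetical sketch of the type-hint idea - apply_column is not an
# existing API. The return annotation supplies the UDF's return type, so
# the same function works with pandas .apply and with a PySpark UDF.
import typing

from pyspark.sql import functions as F

# Assumed mapping from Python annotations to Spark SQL type strings.
_TYPE_MAP = {float: 'double', int: 'bigint', str: 'string', bool: 'boolean'}


def apply_column(sdf, out_col, func, *in_cols):
    # Read the declared return type, e.g. float -> 'double'.
    return_type = _TYPE_MAP[typing.get_type_hints(func)['return']]
    udf = F.udf(func, return_type)
    return sdf.withColumn(out_col, udf(*[sdf[c] for c in in_cols]))


def f(x) -> float:
    return x * 3.0

# pandas:  df['col3'] = df['col1'].apply(f)
# PySpark: df = apply_column(df, 'col3', f, 'col1')
```

This is only one possible bridge; the point is that the annotation, rather than an extra argument, carries the return type that Spark needs.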