Re: On adding applyInArrow to groupBy and cogroup
Sounds good, I'll review the PR.

On Fri, 3 Nov 2023 at 14:08, Abdeali Kothari wrote:

> Seeing more support for Arrow-based functions would be great. It gives
> more control to application developers, and Pandas just becomes one of
> the available options.
>
> On Fri, 3 Nov 2023, 21:23 Luca Canali wrote:
>
>> Hi Enrico,
>>
>> +1 on supporting Arrow on par with Pandas. Besides the frameworks and
>> libraries that you mentioned, I add awkward array, a library used in
>> High Energy Physics.
>>
>> (For those interested, more details on how we tested awkward array with
>> Spark back when mapInArrow was introduced can be found at
>> https://github.com/LucaCanali/Miscellaneous/blob/master/Spark_Notes/Spark_MapInArrow.md)
>>
>> Cheers,
>> Luca
>>
>> From: Enrico Minack
>> Sent: Thursday, October 26, 2023 15:33
>> To: dev
>> Subject: On adding applyInArrow to groupBy and cogroup
>>
>> Hi devs,
>>
>> PySpark allows transforming a DataFrame via a Pandas *and* an Arrow API:
>>
>>     df.mapInArrow(map_arrow, schema="...")
>>     df.mapInPandas(map_pandas, schema="...")
>>
>> For df.groupBy(...) and df.groupBy(...).cogroup(...), there is *only* a
>> Pandas interface, no Arrow interface:
>>
>>     df.groupBy("id").applyInPandas(apply_pandas, schema="...")
>>
>> Providing a pure Arrow interface allows user code to use *any*
>> Arrow-based data framework, not only Pandas, e.g. Polars. Adding Arrow
>> interfaces reduces the need to add more framework-specific support.
>>
>> We need your thoughts on whether PySpark should support Arrow on a par
>> with Pandas, or not: https://github.com/apache/spark/pull/38624
>>
>> Cheers,
>> Enrico
Re: On adding applyInArrow to groupBy and cogroup
Seeing more support for Arrow-based functions would be great. It gives
more control to application developers, and Pandas just becomes one of
the available options.

On Fri, 3 Nov 2023, 21:23 Luca Canali wrote:

> [...]
RE: On adding applyInArrow to groupBy and cogroup
Hi Enrico,

+1 on supporting Arrow on par with Pandas. Besides the frameworks and
libraries that you mentioned, I add awkward array, a library used in
High Energy Physics.

(For those interested, more details on how we tested awkward array with
Spark back when mapInArrow was introduced can be found at
https://github.com/LucaCanali/Miscellaneous/blob/master/Spark_Notes/Spark_MapInArrow.md)

Cheers,
Luca

From: Enrico Minack
Sent: Thursday, October 26, 2023 15:33
To: dev
Subject: On adding applyInArrow to groupBy and cogroup

> [...]
Re: On adding applyInArrow to groupBy and cogroup
I'm definitely +1 to include this.

- It seems like an odd feature-parity gap to have a map function but no
  group-apply function.
- There's currently no way to use large Arrow types with applyInPandas,
  which can lead to errors from hitting the 2 GiB max string/binary
  array size. I have a PR to Arrow that would let Spark use the large
  types for Pandas operations, but that still hasn't been merged, will
  require another Arrow release, and will then require a lot of updates
  on the Spark side to accommodate. Hyukjin already knocked most of this
  out in a closed PR, so hopefully it won't take too long to resurrect
  that after the Arrow PR. Until then, the only way to support large
  string/binary types would be through a direct applyInArrow function.
- "Just use applyInPandas and convert that to Arrow" and "just use
  mapInArrow and do a custom manual grouping" seem like odd arguments
  against including it: performance is the main reason to use other
  libraries like Polars (or simply to avoid a needless extra conversion
  to Pandas), and by the same reasoning there would have been no reason
  to ever create mapInArrow or applyInPandas, respectively.
- Based on multiple people commenting on the PR, it doesn't seem to be
  such a niche corner case. And aren't there plenty of performance
  improvements optimizing corner cases anyway?

Adam

On Thu, Oct 26, 2023 at 9:35 AM Enrico Minack wrote:

> [...]

--
Adam Binford