I am not a Spark committer and haven't been working on Spark for a while. However, I was heavily involved in the original cogroup work and we are using cogroup functionality pretty heavily and I want to give my two cents here.
I think this is a nice improvement and I hope someone from the PySpark side can take a look at this. On Mon, Feb 6, 2023 at 5:29 AM Santosh Pingale <santosh.ping...@adyen.com.invalid> wrote: > Created a PR: https://github.com/apache/spark/pull/39902 > > > On 24 Jan 2023, at 15:04, Santosh Pingale <santosh.ping...@adyen.com> > wrote: > > Hey all > > I have an interesting problem in hand. We have cases where we want to pass > multiple(20 to 30) data frames to cogroup.applyInPandas function. > > RDD currently supports cogroup with upto 4 dataframes > (ZippedPartitionsRDD4) where as cogroup with pandas can handle only 2 > dataframes (with ZippedPartitionsRDD2). In our use case, we do not have > much control over how many data frames we may need in the > cogroup.applyInPandas function. > > To achieve this, we can: > (a) Implement ZippedPartitionsRDD5, > ZippedPartitionsRDD..ZippedPartitionsRDD30..ZippedPartitionsRDD50 with > respective iterators, serializers and so on. This ensures we keep type > safety intact but a lot more boilerplate code has to be written to achieve > this. > (b) Do not use cogroup.applyInPandas, rather use RDD.keyBy.cogroup and > then getItem in a nested fashion. Then convert data to pandas df in the > python function. This looks like a good workaround but mistakes are very > easy to happen. We also don't look at typesafety here from user's point of > view. > (c) Implement ZippedPartitionsRDDN and NaryLike with childrenNodes type > set to Seq[T] which allows for arbitrary number of children to be set. Here > we have very little boilerplate but we sacrifice type safety. > (d) ... some new suggestions... ? > > I have done preliminary work on option (c). It works like a charm but > before I proceed, is my concern about sacrificed type safety overblown, and > do we have an approach (d)? > (a) is something that is too much of an investment for it to be useful. > (b) is okay enough workaround, but it is not very efficient. > > >