I have opened two PRs: One that tries to maintain backwards compatibility: https://github.com/apache/spark/pull/39902 <https://github.com/apache/spark/pull/39902> One that breaks the API to make it cleaner: https://github.com/apache/spark/pull/40122 <https://github.com/apache/spark/pull/40122>
Note this API has been marked experimental so imagining breaking changes is a possibility at the moment, whether we do it or not in practice is something we need to decide. > On 7 Feb 2023, at 22:52, Li Jin <ice.xell...@gmail.com> wrote: > > I am not a Spark committer and haven't been working on Spark for a while. > However, I was heavily involved in the original cogroup work and we are using > cogroup functionality pretty heavily and I want to give my two cents here. > > I think this is a nice improvement and I hope someone from the PySpark side > can take a look at this. > > On Mon, Feb 6, 2023 at 5:29 AM Santosh Pingale > <santosh.ping...@adyen.com.invalid> wrote: > Created a PR: https://github.com/apache/spark/pull/39902 > <https://github.com/apache/spark/pull/39902> > > >> On 24 Jan 2023, at 15:04, Santosh Pingale <santosh.ping...@adyen.com >> <mailto:santosh.ping...@adyen.com>> wrote: >> >> Hey all >> >> I have an interesting problem in hand. We have cases where we want to pass >> multiple(20 to 30) data frames to cogroup.applyInPandas function. >> >> RDD currently supports cogroup with upto 4 dataframes (ZippedPartitionsRDD4) >> where as cogroup with pandas can handle only 2 dataframes (with >> ZippedPartitionsRDD2). In our use case, we do not have much control over how >> many data frames we may need in the cogroup.applyInPandas function. >> >> To achieve this, we can: >> (a) Implement ZippedPartitionsRDD5, >> ZippedPartitionsRDD..ZippedPartitionsRDD30..ZippedPartitionsRDD50 with >> respective iterators, serializers and so on. This ensures we keep type >> safety intact but a lot more boilerplate code has to be written to achieve >> this. >> (b) Do not use cogroup.applyInPandas, rather use RDD.keyBy.cogroup and then >> getItem in a nested fashion. Then convert data to pandas df in the python >> function. This looks like a good workaround but mistakes are very easy to >> happen. We also don't look at typesafety here from user's point of view. >> (c) Implement ZippedPartitionsRDDN and NaryLike with childrenNodes type set >> to Seq[T] which allows for arbitrary number of children to be set. Here we >> have very little boilerplate but we sacrifice type safety. >> (d) ... some new suggestions... ? >> >> I have done preliminary work on option (c). It works like a charm but before >> I proceed, is my concern about sacrificed type safety overblown, and do we >> have an approach (d)? >> (a) is something that is too much of an investment for it to be useful. (b) >> is okay enough workaround, but it is not very efficient. >> >
signature.asc
Description: Message signed with OpenPGP