Created a PR: https://github.com/apache/spark/pull/39902
> On 24 Jan 2023, at 15:04, Santosh Pingale <santosh.ping...@adyen.com> wrote:
>
> Hey all
>
> I have an interesting problem at hand. We have cases where we want to pass
> multiple (20 to 30) data frames to the cogroup.applyInPandas function.
>
> RDD currently supports cogroup with up to 4 dataframes (ZippedPartitionsRDD4),
> whereas cogroup with pandas can handle only 2 dataframes (with
> ZippedPartitionsRDD2). In our use case, we do not have much control over how
> many data frames we may need in the cogroup.applyInPandas function.
>
> To achieve this, we can:
> (a) Implement ZippedPartitionsRDD5, ZippedPartitionsRDD6, and so on up to
> ZippedPartitionsRDD30 or even ZippedPartitionsRDD50, with the respective
> iterators, serializers, and so on. This keeps type safety intact, but a lot
> of boilerplate code has to be written to achieve it.
> (b) Do not use cogroup.applyInPandas; instead use RDD.keyBy.cogroup and then
> getItem in a nested fashion, converting the data to a pandas DataFrame inside
> the Python function (a sketch follows below the quote). This looks like a
> workable approach, but mistakes are easy to make, and it offers the user no
> type safety.
> (c) Implement ZippedPartitionsRDDN and NaryLike with the children's type set
> to Seq[T], which allows an arbitrary number of children. Here we have very
> little boilerplate, but we sacrifice type safety.
> (d) ... some new suggestions... ?
>
> I have done preliminary work on option (c). It works like a charm, but before
> I proceed: is my concern about sacrificed type safety overblown, and do we
> have an approach (d)?
> (a) is too much of an investment to be useful. (b) is an acceptable
> workaround, but it is not very efficient.
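
For anyone who needs this before the PR lands, here is a minimal sketch of workaround (b). It assumes three small illustrative DataFrames sharing an "id" key column (names and data are made up); PySpark's groupWith (the multi-RDD alias for cogroup) sidesteps the nested getItem access that chained pairwise cogroups would otherwise need:

    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Illustrative inputs; in practice these would be the 20-30 frames.
    df1 = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "v1"])
    df2 = spark.createDataFrame([(1, "c")], ["id", "v2"])
    df3 = spark.createDataFrame([(2, "d")], ["id", "v3"])

    # Key every DataFrame's rows on the grouping column, then cogroup
    # them all at once instead of chaining pairwise cogroups.
    rdds = [df.rdd.keyBy(lambda row: row["id"]) for df in (df1, df2, df3)]
    cogrouped = rdds[0].groupWith(*rdds[1:])  # (key, (iter1, iter2, iter3))

    def apply_per_key(entry):
        key, groups = entry
        # One pandas DataFrame per input; may be empty for absent keys.
        pdfs = [pd.DataFrame([r.asDict() for r in g]) for g in groups]
        # ... user logic over pdfs goes here; return something picklable ...
        return key, [len(p) for p in pdfs]

    result = cogrouped.map(apply_per_key).collect()

Note that this runs through the plain pickle-based RDD path rather than the Arrow-based applyInPandas machinery, which is consistent with the observation above that (b) is not very efficient.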