Re: pyspark loop optimization

2022-01-12 Thread Ramesh Natarajan
Sorry for the confusion between cume_dist and percent_rank. I was playing around with both to see whether the difference in how they are computed mattered, and I must have pasted the percent_rank version accidentally. My requirement is to compute cume_dist. I have a dataframe with a bunch of columns (10+ columns)
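
The snippet ends here, but one way to sketch what is being asked for, computing cume_dist per column while ignoring NULLs, in a single pass rather than a loop, is to partition each window on whether the value is NULL. Column names and data below are placeholders, not from the original post:

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()

    # Placeholder data; the real dataframe has 10+ columns.
    df = spark.createDataFrame(
        [(1.0, None), (2.0, 5.0), (None, 7.0), (4.0, 9.0)],
        ["col_a", "col_b"])

    def cume_dist_without_nulls(c):
        # Putting the NULL rows in their own window partition means the
        # non-NULL rows get a cume_dist computed over non-NULL values only;
        # NULL rows get a NULL result instead of skewing the distribution.
        w = Window.partitionBy(F.col(c).isNull()).orderBy(F.col(c))
        return (F.when(F.col(c).isNotNull(), F.cume_dist().over(w))
                 .alias(f"{c}_cume_dist"))

    # A single select carrying all the window expressions keeps the query
    # plan flat, instead of growing it column by column in a Python loop.
    result = df.select("*", *[cume_dist_without_nulls(c)
                              for c in ["col_a", "col_b"]])
    result.show()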

Re: pyspark loop optimization

2022-01-11 Thread Gourav Sengupta
Hi, I am not sure what you are trying to achieve here; aren't cume_dist and percent_rank different? If I am following your question correctly, you are looking to filter out NULLs before applying the function to the dataframe, and I think it will be fine if you just create another
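
The message is truncated, so the exact suggestion is unclear; one plausible reading (create a separate dataframe with the NULLs filtered out, then apply the function to it) might look like this, with a placeholder column name:

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1.0,), (None,), (3.0,)], ["col_a"])

    # Drop the NULL rows into a separate dataframe first, then compute
    # cume_dist over the non-NULL rows only.
    non_null_df = df.where(F.col("col_a").isNotNull())
    ranked = non_null_df.withColumn(
        "col_a_cume_dist",
        F.cume_dist().over(Window.orderBy("col_a")))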

pyspark loop optimization

2022-01-10 Thread Ramesh Natarajan
I want to compute cume_dist on a bunch of columns in a Spark dataframe, but I want to remove NULL values before doing so. I have this loop in pyspark. While it works, I see the driver running at 100% while the executors are idle for the most part. I am reading that running a loop is an anti-pattern
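
The archived snippet cuts off before the code, but a minimal sketch of the loop shape being described (placeholder column names, not the original code) might be:

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1.0, 5.0), (2.0, None), (None, 7.0)], ["col_a", "col_b"])

    # Each withColumn call appends another projection and window to the
    # logical plan, so the driver spends its time re-analyzing an
    # ever-growing plan while the executors mostly sit idle. Note also
    # that Window.orderBy without partitionBy pulls all rows into a
    # single partition.
    for c in ["col_a", "col_b"]:
        df = df.withColumn(
            f"{c}_cume_dist",
            F.cume_dist().over(Window.orderBy(F.col(c))))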