Re: How to add a row number column with out reordering my data frame

2022-01-11 Thread Andrew Davidson
Thanks! I will take a look. Andy From: Gourav Sengupta Date: Tuesday, January 11, 2022 at 8:42 AM To: Andrew Davidson Cc: Andrew Davidson, "user @spark" Subject: Re: How to add a row number column with out reordering my data frame Hi, I do not think we need to do any of that. Please try

Re: How to add a row number column with out reordering my data frame

2022-01-11 Thread Gourav Sengupta
Hi, I do not think we need to do any of that. Please try repartitionByRange; Spark 3 also has adaptive query execution, with configurations to handle skew. Regards, Gourav On Tue, Jan 11, 2022 at 4:21 PM Andrew Davidson wrote: > Hi Gourav > > > > When I join I get OOM. To address this my
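For readers following the suggestion above, here is a minimal sketch of the adaptive query execution settings that appear to be meant. The two keys are standard Spark 3 SQL options; the values, the `AQE_SETTINGS` name, and the `to_submit_args` helper are assumptions for illustration, since the message does not say which configurations to change. The other half of the suggestion, `DataFrame.repartitionByRange`, is a DataFrame method and is not shown here.

```python
# Spark 3 settings related to adaptive query execution (AQE).
# The values below are illustrative assumptions, not taken from the thread.
AQE_SETTINGS = {
    "spark.sql.adaptive.enabled": "true",           # turn AQE on
    "spark.sql.adaptive.skewJoin.enabled": "true",  # let AQE split skewed join partitions
}


def to_submit_args(settings):
    """Render settings as `--conf key=value` arguments for spark-submit (hypothetical helper)."""
    args = []
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(AQE_SETTINGS)))
```

The same keys can also be set on a session builder with `.config(key, value)`; whether they resolve the OOM in this particular job is not established in the thread.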

Re: How to add a row number column with out reordering my data frame

2022-01-11 Thread Andrew Davidson
Hi Gourav, when I join I get an OOM error. To address this, my thought was to split my tables into small batches of rows, then join the batches together, then use union. My assumption is that union is a narrow transformation and as such requires fewer resources. Let's say I have 5 data frames I want to join
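The batch-then-union idea described above can be sketched in plain Python. This is not Spark code and says nothing about memory use or shuffles; it only shows that joining batch by batch and concatenating the pieces gives the same rows as one large join. Tables are modeled as dicts from row id to value, and all function names are hypothetical.

```python
def batch_row_ids(row_ids, batch_size):
    """Split the ordered row ids into consecutive batches."""
    return [row_ids[i:i + batch_size] for i in range(0, len(row_ids), batch_size)]


def join_batch(tables, ids):
    """Inner join on row id, restricted to the ids in this batch."""
    return [
        (rid, *[table[rid] for table in tables])
        for rid in ids
        if all(rid in table for table in tables)
    ]


def batched_join(tables, batch_size):
    """Join each batch of rows across all tables, then union (concatenate) the results."""
    row_ids = list(tables[0])  # dicts keep insertion order, so row order is preserved
    result = []
    for ids in batch_row_ids(row_ids, batch_size):
        result.extend(join_batch(tables, ids))
    return result


if __name__ == "__main__":
    # Five small "data frames", each with one value column keyed by row id.
    tables = [{f"id{r}": t * 10 + r for r in range(4)} for t in range(5)]
    for row in batched_join(tables, batch_size=2):
        print(row)
```

In Spark, each batch would still need to be selected from every table before joining, so the per-batch filtering cost is part of the trade-off.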

Re: How to add a row number column with out reordering my data frame

2022-01-10 Thread Gourav Sengupta
Hi, I am a bit confused here. It is not entirely clear to me why you are creating the row numbers, or how creating the row numbers helps you with the joins. Can you please explain with some sample data? Regards, Gourav On Fri, Jan 7, 2022 at 1:14 AM Andrew Davidson wrote: > Hi > > > > I am

How to add a row number column with out reordering my data frame

2022-01-06 Thread Andrew Davidson
Hi, I am trying to work through an OOM error. I have 10411 files. I want to select a single column from each file and then join them into a single table. The files have a unique row id; however, it is a very long string. The data file with just the name and column of interest is about 470 M. The