Hi,
I see similar behaviour in a nearly identical scenario in my deployment as
well. I am using Scala, so the behaviour is not limited to PySpark.
In my case, 9 out of 10 partitions are of similar size (~38 GB each),
while the final one is significantly larger (~59 GB).
Prime number
Hi,
I am using the hiveContext.sql() method to select data from a source table
and insert it into Parquet tables.
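For reference, the pattern described above can be sketched in Scala roughly as follows. This is a minimal sketch, assuming a Hive-enabled Spark 1.x deployment where HiveContext is available; the table names `source_table` and `parquet_table` are hypothetical placeholders, not names from this thread:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object ParquetInsertSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("parquet-insert"))
    val hiveContext = new HiveContext(sc)

    // Select from the source table and write the rows into a Parquet-backed
    // Hive table. The destination table is assumed to already exist,
    // created with STORED AS PARQUET.
    hiveContext.sql(
      "INSERT INTO TABLE parquet_table SELECT * FROM source_table")

    sc.stop()
  }
}
```

When comparing on-disk sizes between writers, one setting worth checking on the Spark side is `spark.sql.parquet.compression.codec`, since the codec chosen by Spark may differ from the one Impala uses.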
The query executed from Spark takes about 3x more disk space to write
the same number of rows than when it is fired from Impala.
Just wondering if this is normal behaviour, and if there's a way to reduce
the space used.