Re: java.lang.OutOfMemoryError: GC overhead limit exceeded when using UDFs in SparkSQL (Spark 2.0.0)

2016-08-09 Thread Zoltan Fedor
Thanks, that makes sense. So it must be that this queue - which is kept because of the UDF - is the one running out of memory, because without the UDF field there is no out-of-memory error, and the UDF field is pretty small, so it is unlikely that it alone would take us above the memory limit. In either case,
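
(For illustration, a minimal pyspark sketch of the isolation test described above; the table name, column names, and my_udf body are invented placeholders, not the actual job from the thread.)

from pyspark.sql.types import StringType

spark.udf.register("my_udf", lambda s: s.upper(), StringType())

# No Python UDF involved: no row queue is kept, and this runs fine.
spark.sql("SELECT col_a, col_b FROM big_table").show()

# With the UDF, col_a and col_b are buffered on the JVM side while col_c
# makes the round trip through the Python workers; at scale that buffer
# can be what drives the "GC overhead limit exceeded" error.
spark.sql("SELECT col_a, col_b, my_udf(col_c) AS c FROM big_table").show()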

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded when using UDFs in SparkSQL (Spark 2.0.0)

2016-08-09 Thread Davies Liu
When you have a Python UDF, only the inputs to the UDF are passed into the Python process, but all other fields that are used together with the result of the UDF are kept in a queue and then joined with the result from Python. The length of this queue depends on the number of rows under processing by Python (or
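
(To make the mechanism concrete, here is a toy Python model of the queue described above. It is not Spark's actual implementation; the function name and batch size are invented.)

from collections import deque
from itertools import islice

def eval_python_udf(rows, udf, udf_input_index, batch_size=100):
    # Toy model: whole rows are buffered in a JVM-side queue while only
    # the UDF's input column is shipped to the Python worker, one batch
    # at a time; results are then re-joined with the buffered rows in
    # order. The queue grows with the rows "in flight" in Python.
    rows = iter(rows)
    queue = deque()
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            break
        queue.extend(batch)                                 # buffer all fields
        results = [udf(r[udf_input_index]) for r in batch]  # Python side
        for res in results:
            yield (*queue.popleft(), res)                   # join result to row

# Example: list(eval_python_udf([(1, "a"), (2, "b")], str.upper, 1))
# yields [(1, "a", "A"), (2, "b", "B")].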

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded when using UDFs in SparkSQL (Spark 2.0.0)

2016-08-09 Thread Zoltan Fedor
> Does this mean you only have 1.6G memory for the executor (the rest left for Python)? > The cached table could take 1.5G, which means almost nothing is left for other things. True. I have also tried with memoryOverhead set to 800 (10% of the 8 GB memory), but no difference. The "GC overhead limit
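
(For reference, the setting being discussed is spark.yarn.executor.memoryOverhead, specified in megabytes when running on YARN; a sketch of the submit command, with the script name as a placeholder:)

spark-submit \
  --master yarn \
  --executor-memory 8g \
  --conf spark.yarn.executor.memoryOverhead=800 \
  my_job.py

Note that 800 MB is also roughly what YARN reserves by default for an 8 GB executor (max(384 MB, 10% of executor memory)), which may be why setting it explicitly made no difference here.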

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded when using UDFs in SparkSQL (Spark 2.0.0)

2016-08-08 Thread Davies Liu
On Mon, Aug 8, 2016 at 2:24 PM, Zoltan Fedor wrote: > Hi all, > > I have an interesting issue trying to use UDFs from SparkSQL in Spark 2.0.0 > using pyspark. > > There is a big table (5.6 billion rows, 450 GB in memory) loaded into 300 > executors' memory in SparkSQL,

java.lang.OutOfMemoryError: GC overhead limit exceeded when using UDFs in SparkSQL (Spark 2.0.0)

2016-08-08 Thread Zoltan Fedor
Hi all, I have an interesting issue trying to use UDFs from SparkSQL in Spark 2.0.0 using pyspark. There is a big table (5.6 billion rows, 450 GB in memory) loaded into 300 executors' memory in SparkSQL, on which we would do some calculations using UDFs in pyspark. If I run my SQL on only a
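
(A minimal pyspark sketch of the setup being described, with placeholder names; the real table, columns, and UDF logic are not shown in the thread.)

from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-oom-repro").getOrCreate()

# Cache the large table in executor memory (~450 GB across 300 executors).
spark.sql("CACHE TABLE big_table")

# Register a Python UDF and use it from SparkSQL.
spark.udf.register("my_udf", lambda s: s.strip().lower(), StringType())

# Works on a small slice of the data, but fails with
# "GC overhead limit exceeded" when run against the full table.
spark.sql("SELECT my_udf(col_c) AS c FROM big_table").show()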