Thanks, that makes sense.
So it must be this queue - which is kept because of the UDF - that is
running out of memory, because without the UDF field there is no
out-of-memory error, and the UDF field is pretty small - it is unlikely
that it alone would take us above the memory limit.
In either case, when you have a Python UDF, only the inputs to the UDF
are passed into the Python process, but all other fields that are used
together with the result of the UDF are kept in a queue and then joined
with the result from Python. The length of this queue depends on the
number of rows being processed by Python (or
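The queueing behaviour described above can be sketched in plain Python. This is only an illustration of the mechanism, not Spark's actual implementation; `eval_with_udf`, `udf_col`, and `batch_size` are made-up names:

```python
from collections import deque

def eval_with_udf(rows, udf, udf_col, batch_size=2):
    """Sketch: only the UDF input column is 'sent to Python'; all other
    fields of the in-flight rows wait in a queue until the UDF results
    come back and are rejoined row by row."""
    pending = deque()   # non-UDF fields of rows awaiting UDF results
    batch = []          # UDF inputs buffered for the "Python worker"
    out = []

    def flush():
        for result in map(udf, batch):      # UDF results come back in order
            joined = pending.popleft()      # rejoin with the buffered fields
            joined["udf_result"] = result
            out.append(joined)
        batch.clear()

    for row in rows:
        pending.append({k: v for k, v in row.items() if k != udf_col})
        batch.append(row[udf_col])
        if len(batch) == batch_size:
            flush()
    flush()                                 # process the final partial batch
    return out

rows = [{"id": i, "payload": "x" * 10, "text": "abc"} for i in range(5)]
result = eval_with_udf(rows, str.upper, "text")
```

Note that it is the wide non-UDF fields in `pending`, not the small UDF input itself, that occupy the queue - which is consistent with a small UDF field still pushing an executor over its memory limit.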
> Does this mean you only have 1.6G memory for executor (others left for
Python) ?
> The cached table could take 1.5G, it means almost nothing left for other
things.
True. I have also tried with memoryOverhead set to 800 (10% of the
8 GB memory), but it made no difference. The "GC overhead limit
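For reference, the overhead setting discussed above corresponds to Spark-on-YARN's executor memory overhead; a sketch of how it is passed (using the values from this thread; `my_job.py` is a placeholder, and the setting is in MB, reserved outside the JVM heap where the pyspark workers live):

```shell
# Sketch: Spark 2.0 on YARN. spark.yarn.executor.memoryOverhead is in MB
# and is carved out of the container outside the JVM heap, which is where
# the Python worker processes for UDFs take their memory from.
spark-submit \
  --executor-memory 8g \
  --conf spark.yarn.executor.memoryOverhead=800 \
  my_job.py
```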
On Mon, Aug 8, 2016 at 2:24 PM, Zoltan Fedor wrote:
Hi all,
I have an interesting issue trying to use UDFs from SparkSQL in Spark 2.0.0
using pyspark.
There is a big table (5.6 billion rows, 450 GB in memory) loaded into 300
executors' memory in SparkSQL, on which we would do some calculations using
UDFs in pyspark.
If I run my SQL on only a