Re: Parquet problems

2015-07-22 Thread Anders Arpteg
No, never really resolved the problem, except by increasing the PermGen
space, which only partially solved it. Still have to restart the job
multiple times to make the whole job complete (it stores intermediate
results).
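
A minimal sketch of the restart/resume pattern described above, assuming an
existing SparkContext (sc), hypothetical input paths, and a placeholder
transformation; the real job reads about 700 sources of roughly 70 columns
each:

  import org.apache.hadoop.fs.{FileSystem, Path}
  import org.apache.spark.sql.SQLContext

  val sqlContext = new SQLContext(sc)
  val fs = FileSystem.get(sc.hadoopConfiguration)

  // Hypothetical source list and output root (placeholders).
  val sources = Seq("hdfs:///data/source-000", "hdfs:///data/source-001")
  val outputRoot = "hdfs:///intermediate"

  for (src <- sources) {
    val out = s"$outputRoot/${src.split('/').last}"
    // Skip sources whose intermediate output already exists, so a restarted
    // run resumes after the last successfully written result.
    if (!fs.exists(new Path(out + "/_SUCCESS"))) {
      val df = sqlContext.read.parquet(src) // replace with the real transformation
      df.write.parquet(out)
    }
  }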

The parquet data sources have about 70 columns, and yes Cheng, it works
fine when only loading a small sample of the data.

Thankful for any hints,
Anders

On Wed, Jul 22, 2015 at 5:29 PM Cheng Lian lian.cs@gmail.com wrote:

  How many columns are there in these Parquet files? Could you load a small
 portion of the original large dataset successfully?

 Cheng


 On 6/25/15 5:52 PM, Anders Arpteg wrote:

 Yes, both the driver and the executors. Works a little bit better with
 more space, but still a leak that will cause failure after a number of
  reads. There are about 700 different data sources that need to be loaded,
 lots of data...

  On Thu, 25 Jun 2015 at 08:02, Sabarish Sasidharan
 sabarish.sasidha...@manthan.com wrote:

 Did you try increasing the perm gen for the driver?

 Regards
 Sab

 On 24-Jun-2015 4:40 pm, Anders Arpteg arp...@spotify.com wrote:

 When reading large (and many) datasets with the Spark 1.4.0 DataFrames
 parquet reader (the org.apache.spark.sql.parquet format), the following
 exceptions are thrown:

  Exception in thread task-result-getter-0
 Exception: java.lang.OutOfMemoryError thrown from the
 UncaughtExceptionHandler in thread task-result-getter-0
 Exception in thread task-result-getter-3 java.lang.OutOfMemoryError:
 PermGen space
 Exception in thread task-result-getter-1 java.lang.OutOfMemoryError:
 PermGen space
 Exception in thread task-result-getter-2 java.lang.OutOfMemoryError:
 PermGen space


  and many more like these from different threads. I've tried increasing
 the PermGen space using the -XX:MaxPermSize VM setting, but even after
 tripling the space, the same errors occur. I've also tried storing
 intermediate results, and am able to get the full job completed by running
 it multiple times and starting from the last successful intermediate result.
 There seems to be some memory leak in the parquet format. Any hints on how
 to fix this problem?

  Thanks,
 Anders




Re: Parquet problems

2015-07-22 Thread Cheng Lian
How many columns are there in these Parquet files? Could you load a 
small portion of the original large dataset successfully?


Cheng
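
One quick way to check that, assuming a standard SQLContext and a placeholder
path for one of the sources:

  val df = sqlContext.read.parquet("hdfs:///data/source-000")
  df.printSchema()        // confirm the wide (~70 column) schema reads back
  df.limit(1000).count()  // materialize only a small sample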

On 6/25/15 5:52 PM, Anders Arpteg wrote:


Yes, both the driver and the executors. Works a little bit better with 
more space, but still a leak that will cause failure after a number of 
reads. There are about 700 different data sources that need to be 
loaded, lots of data...



On Thu, 25 Jun 2015 at 08:02, Sabarish Sasidharan 
sabarish.sasidha...@manthan.com wrote:


Did you try increasing the perm gen for the driver?

Regards
Sab

On 24-Jun-2015 4:40 pm, Anders Arpteg arp...@spotify.com wrote:

When reading large (and many) datasets with the Spark 1.4.0
DataFrames parquet reader (the org.apache.spark.sql.parquet
format), the following exceptions are thrown:

Exception in thread task-result-getter-0
Exception: java.lang.OutOfMemoryError thrown from the
UncaughtExceptionHandler in thread task-result-getter-0
Exception in thread task-result-getter-3
java.lang.OutOfMemoryError: PermGen space
Exception in thread task-result-getter-1
java.lang.OutOfMemoryError: PermGen space
Exception in thread task-result-getter-2
java.lang.OutOfMemoryError: PermGen space

and many more like these from different threads. I've tried
increasing the PermGen space using the -XX:MaxPermSize VM
setting, but even after tripling the space, the same errors
occur. I've also tried storing intermediate results, and am
able to get the full job completed by running it multiple
times and starting from the last successful intermediate
result. There seems to be some memory leak in the parquet
format. Any hints on how to fix this problem?

Thanks,
Anders





Re: Parquet problems

2015-07-22 Thread Jerry Lam
Hi guys,

I noticed that too. Anders, can you confirm that it works on the Spark 1.5
snapshot? This is what I tried in the end. It seems to be a 1.4 issue.

Best Regards,

Jerry

On Wed, Jul 22, 2015 at 11:46 AM, Anders Arpteg arp...@spotify.com wrote:

  No, never really resolved the problem, except by increasing the PermGen
 space, which only partially solved it. Still have to restart the job
 multiple times to make the whole job complete (it stores intermediate
 results).

 The parquet data sources have about 70 columns, and yes Cheng, it works
 fine when only loading a small sample of the data.

 Thankful for any hints,
 Anders

 On Wed, Jul 22, 2015 at 5:29 PM Cheng Lian lian.cs@gmail.com wrote:

  How many columns are there in these Parquet files? Could you load a
 small portion of the original large dataset successfully?

 Cheng


 On 6/25/15 5:52 PM, Anders Arpteg wrote:

 Yes, both the driver and the executors. Works a little bit better with
 more space, but still a leak that will cause failure after a number of
 reads. There are about 700 different data sources that need to be loaded,
 lots of data...

   On Thu, 25 Jun 2015 at 08:02, Sabarish Sasidharan
  sabarish.sasidha...@manthan.com wrote:

 Did you try increasing the perm gen for the driver?

 Regards
 Sab

 On 24-Jun-2015 4:40 pm, Anders Arpteg arp...@spotify.com wrote:

 When reading large (and many) datasets with the Spark 1.4.0 DataFrames
 parquet reader (the org.apache.spark.sql.parquet format), the following
 exceptions are thrown:

  Exception in thread task-result-getter-0
 Exception: java.lang.OutOfMemoryError thrown from the
 UncaughtExceptionHandler in thread task-result-getter-0
 Exception in thread task-result-getter-3 java.lang.OutOfMemoryError:
 PermGen space
 Exception in thread task-result-getter-1 java.lang.OutOfMemoryError:
 PermGen space
 Exception in thread task-result-getter-2 java.lang.OutOfMemoryError:
 PermGen space


  and many more like these from different threads. I've tried
 increasing the PermGen space using the -XX:MaxPermSize VM setting, but even
 after tripling the space, the same errors occur. I've also tried storing
 intermediate results, and am able to get the full job completed by running
 it multiple times and starting from the last successful intermediate result.
 There seems to be some memory leak in the parquet format. Any hints on how
 to fix this problem?

  Thanks,
 Anders




Re: Parquet problems

2015-07-22 Thread Michael Misiewicz
For what it's worth, my dataset has around 85 columns in Parquet format as
well. I have tried bumping PermGen up to 512m, but I'm still getting
errors in the driver thread.
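
For reference, a sketch of one way to raise PermGen on both the driver and the
executors; the sizes and app name are illustrative, not a verified fix. Driver
JVM options generally have to be supplied at submit time, e.g.
--conf spark.driver.extraJavaOptions=-XX:MaxPermSize=512m (in yarn-cluster mode
this is applied to the application master container that hosts the driver),
while the executor option can also be set programmatically before the context
is created:

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("parquet-load") // illustrative
    .set("spark.executor.extraJavaOptions", "-XX:MaxPermSize=512m")
  val sc = new SparkContext(conf)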

On Wed, Jul 22, 2015 at 1:20 PM, Jerry Lam chiling...@gmail.com wrote:

 Hi guys,

 I noticed that too. Anders, can you confirm that it works on the Spark 1.5
 snapshot? This is what I tried in the end. It seems to be a 1.4 issue.

 Best Regards,

 Jerry

 On Wed, Jul 22, 2015 at 11:46 AM, Anders Arpteg arp...@spotify.com
 wrote:

 No, never really resolved the problem, except by increasing the PermGen
 space, which only partially solved it. Still have to restart the job
 multiple times to make the whole job complete (it stores intermediate
 results).

 The parquet data sources have about 70 columns, and yes Cheng, it works
 fine when only loading a small sample of the data.

 Thankful for any hints,
 Anders

 On Wed, Jul 22, 2015 at 5:29 PM Cheng Lian lian.cs@gmail.com wrote:

  How many columns are there in these Parquet files? Could you load a
 small portion of the original large dataset successfully?

 Cheng


 On 6/25/15 5:52 PM, Anders Arpteg wrote:

 Yes, both the driver and the executors. Works a little bit better with
 more space, but still a leak that will cause failure after a number of
 reads. There are about 700 different data sources that need to be loaded,
 lots of data...

   On Thu, 25 Jun 2015 at 08:02, Sabarish Sasidharan
  sabarish.sasidha...@manthan.com wrote:

 Did you try increasing the perm gen for the driver?

 Regards
 Sab

 On 24-Jun-2015 4:40 pm, Anders Arpteg arp...@spotify.com wrote:

 When reading large (and many) datasets with the Spark 1.4.0 DataFrames
 parquet reader (the org.apache.spark.sql.parquet format), the following
 exceptions are thrown:

  Exception in thread task-result-getter-0
 Exception: java.lang.OutOfMemoryError thrown from the
 UncaughtExceptionHandler in thread task-result-getter-0
 Exception in thread task-result-getter-3 java.lang.OutOfMemoryError:
 PermGen space
 Exception in thread task-result-getter-1 java.lang.OutOfMemoryError:
 PermGen space
 Exception in thread task-result-getter-2 java.lang.OutOfMemoryError:
 PermGen space


  and many more like these from different threads. I've tried increasing
 the PermGen space using the -XX:MaxPermSize VM setting, but even after
 tripling the space, the same errors occur. I've also tried storing
 intermediate results, and am able to get the full job completed by running
 it multiple times and starting from the last successful intermediate result.
 There seems to be some memory leak in the parquet format. Any hints on how
 to fix this problem?

  Thanks,
 Anders





Re: Parquet problems

2015-07-22 Thread Michael Misiewicz
Hi Anders,

Did you ever get to the bottom of this issue? I'm encountering it too, but
only in yarn-cluster mode running on Spark 1.4.0. I was thinking of
trying 1.4.1 today.

Michael

On Thu, Jun 25, 2015 at 5:52 AM, Anders Arpteg arp...@spotify.com wrote:

 Yes, both the driver and the executors. Works a little bit better with
 more space, but still a leak that will cause failure after a number of
 reads. There are about 700 different data sources that need to be loaded,
 lots of data...

 On Thu, 25 Jun 2015 at 08:02, Sabarish Sasidharan sabarish.sasidha...@manthan.com
 wrote:

 Did you try increasing the perm gen for the driver?

 Regards
 Sab
 On 24-Jun-2015 4:40 pm, Anders Arpteg arp...@spotify.com wrote:

 When reading large (and many) datasets with the Spark 1.4.0 DataFrames
 parquet reader (the org.apache.spark.sql.parquet format), the following
 exceptions are thrown:

 Exception in thread task-result-getter-0
 Exception: java.lang.OutOfMemoryError thrown from the
 UncaughtExceptionHandler in thread task-result-getter-0
 Exception in thread task-result-getter-3 java.lang.OutOfMemoryError:
 PermGen space
 Exception in thread task-result-getter-1 java.lang.OutOfMemoryError:
 PermGen space
 Exception in thread task-result-getter-2 java.lang.OutOfMemoryError:
 PermGen space

 and many more like these from different threads. I've tried increasing
 the PermGen space using the -XX:MaxPermSize VM setting, but even after
 tripling the space, the same errors occur. I've also tried storing
 intermediate results, and am able to get the full job completed by running
 it multiple times and starting from the last successful intermediate result.
 There seems to be some memory leak in the parquet format. Any hints on how
 to fix this problem?

 Thanks,
 Anders




Re: Parquet problems

2015-06-25 Thread Anders Arpteg
Yes, both the driver and the executors. Works a little bit better with more
space, but still a leak that will cause failure after a number of reads.
There are about 700 different data sources that need to be loaded, lots of
data...

On Thu, 25 Jun 2015 at 08:02, Sabarish Sasidharan sabarish.sasidha...@manthan.com
wrote:

 Did you try increasing the perm gen for the driver?

 Regards
 Sab
 On 24-Jun-2015 4:40 pm, Anders Arpteg arp...@spotify.com wrote:

 When reading large (and many) datasets with the Spark 1.4.0 DataFrames
 parquet reader (the org.apache.spark.sql.parquet format), the following
 exceptions are thrown:

 Exception in thread task-result-getter-0
 Exception: java.lang.OutOfMemoryError thrown from the
 UncaughtExceptionHandler in thread task-result-getter-0
 Exception in thread task-result-getter-3 java.lang.OutOfMemoryError:
 PermGen space
 Exception in thread task-result-getter-1 java.lang.OutOfMemoryError:
 PermGen space
 Exception in thread task-result-getter-2 java.lang.OutOfMemoryError:
 PermGen space

 and many more like these from different threads. I've tried increasing
 the PermGen space using the -XX:MaxPermSize VM setting, but even after
 tripling the space, the same errors occur. I've also tried storing
 intermediate results, and am able to get the full job completed by running
 it multiple times and starting from the last successful intermediate result.
 There seems to be some memory leak in the parquet format. Any hints on how
 to fix this problem?

 Thanks,
 Anders




Re: Parquet problems

2015-06-25 Thread Sabarish Sasidharan
Did you try increasing the perm gen for the driver?

Regards
Sab
On 24-Jun-2015 4:40 pm, Anders Arpteg arp...@spotify.com wrote:

 When reading large (and many) datasets with the Spark 1.4.0 DataFrames
 parquet reader (the org.apache.spark.sql.parquet format), the following
 exceptions are thrown:

 Exception in thread task-result-getter-0
 Exception: java.lang.OutOfMemoryError thrown from the
 UncaughtExceptionHandler in thread task-result-getter-0
 Exception in thread task-result-getter-3 java.lang.OutOfMemoryError:
 PermGen space
 Exception in thread task-result-getter-1 java.lang.OutOfMemoryError:
 PermGen space
 Exception in thread task-result-getter-2 java.lang.OutOfMemoryError:
 PermGen space

 and many more like these from different threads. I've tried increasing the
 PermGen space using the -XX:MaxPermSize VM setting, but even after tripling
 the space, the same errors occur. I've also tried storing intermediate
 results, and am able to get the full job completed by running it multiple
 times and starting from the last successful intermediate result. There seems
 to be some memory leak in the parquet format. Any hints on how to fix this
 problem?

 Thanks,
 Anders



Parquet problems

2015-06-24 Thread Anders Arpteg
When reading large (and many) datasets with the Spark 1.4.0 DataFrames
parquet reader (the org.apache.spark.sql.parquet format), the following
exceptions are thrown:

Exception in thread task-result-getter-0
Exception: java.lang.OutOfMemoryError thrown from the
UncaughtExceptionHandler in thread task-result-getter-0
Exception in thread task-result-getter-3 java.lang.OutOfMemoryError:
PermGen space
Exception in thread task-result-getter-1 java.lang.OutOfMemoryError:
PermGen space
Exception in thread task-result-getter-2 java.lang.OutOfMemoryError:
PermGen space

and many more like these from different threads. I've tried increasing the
PermGen space using the -XX:MaxPermSize VM setting, but even after tripling
the space, the same errors occur. I've also tried storing intermediate
results, and am able to get the full job completed by running it multiple
times and starting from the last successful intermediate result. There seems
to be some memory leak in the parquet format. Any hints on how to fix this
problem?

Thanks,
Anders
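
For completeness, the read pattern that triggers the errors above looks roughly
like the following sketch; the paths are placeholders, and the explicit format
string is the long form of the built-in Parquet source named above
(sqlContext.read.parquet(path) is the equivalent shorthand in 1.4):

  import org.apache.spark.sql.SQLContext

  val sqlContext = new SQLContext(sc)
  val paths = Seq("hdfs:///data/a", "hdfs:///data/b") // ~700 sources in practice

  val frames = paths.map { p =>
    sqlContext.read.format("org.apache.spark.sql.parquet").load(p)
  }
  // Actions like count() go through the task-result-getter threads where the
  // OutOfMemoryErrors above were reported.
  frames.foreach(_.count())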