Re: Parquet problems
No, never really resolved the problem, except by increasing the PermGen space, which only partially solved it. Still have to restart the job multiple times to make the whole job complete (it stores intermediate results). The Parquet data sources have about 70 columns, and yes Cheng, it works fine when only loading a small sample of the data.

Thankful for any hints,
Anders

On Wed, Jul 22, 2015 at 5:29 PM, Cheng Lian <lian.cs@gmail.com> wrote:
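A minimal sketch of that resume-style workaround, under stated assumptions (the paths, the job name, and the 700-source list are all hypothetical, not the actual job): each source writes to its own intermediate output, and a restarted run skips any source whose output already exists.

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object ResumableParquetJob {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("resumable-parquet-job"))
        val sqlContext = new SQLContext(sc)
        val fs = FileSystem.get(sc.hadoopConfiguration)

        // Hypothetical stand-in for the ~700 sources mentioned above.
        val sources = (1 to 700).map(i => s"/data/source-$i")

        for (src <- sources) {
          val out = new Path(s"/intermediate/${new Path(src).getName}")
          // Skip sources whose intermediate output already exists, so a
          // restarted run resumes after the last successful source.
          if (!fs.exists(out)) {
            sqlContext.read.parquet(src).write.parquet(out.toString)
          }
        }
      }
    }

A more robust variant would test for the _SUCCESS marker inside each output directory rather than the bare path, since an interrupted write can leave a partial directory behind.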
Re: Parquet problems
How many columns are there in these Parquet files? Could you load a small portion of the original large dataset successfully?

Cheng

On 6/25/15 5:52 PM, Anders Arpteg wrote:
Re: Parquet problems
Hi guys,

I noticed that too. Anders, can you confirm that it works on the Spark 1.5 snapshot? That is what I ended up trying, and it seems to be a 1.4 issue.

Best Regards,
Jerry

On Wed, Jul 22, 2015 at 11:46 AM, Anders Arpteg <arp...@spotify.com> wrote:
Re: Parquet problems
For what it's worth, my data set has around 85 columns in Parquet format as well. I have tried bumping the PermGen up to 512m, but I'm still getting errors in the driver thread.

On Wed, Jul 22, 2015 at 1:20 PM, Jerry Lam <chiling...@gmail.com> wrote:
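For anyone comparing settings, a hedged sketch (assuming a Java 7-era JVM, where -XX:MaxPermSize still applies): executor-side options never reach the driver, and the driver's PermGen has to be set when its JVM is launched rather than from application code.

    import org.apache.spark.SparkConf

    // PermGen can be raised for the executors from application code...
    val conf = new SparkConf()
      .setAppName("parquet-job") // hypothetical app name
      .set("spark.executor.extraJavaOptions", "-XX:MaxPermSize=512m")

    // ...but the driver JVM is already running by the time this executes,
    // so its PermGen must be set at launch instead, e.g.:
    //   spark-submit --driver-java-options "-XX:MaxPermSize=512m" ...
    // or via spark.driver.extraJavaOptions in spark-defaults.conf.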
Re: Parquet problems
Hi Anders,

Did you ever get to the bottom of this issue? I'm encountering it too, but only in yarn-cluster mode running on Spark 1.4.0. I was thinking of trying 1.4.1 today.

Michael

On Thu, Jun 25, 2015 at 5:52 AM, Anders Arpteg <arp...@spotify.com> wrote:
Re: Parquet problems
Yes, both the driver and the executors. It works a little better with more space, but there is still a leak that will cause failure after a number of reads. There are about 700 different data sources that need to be loaded. Lots of data...

On Thu, 25 Jun 2015 08:02, Sabarish Sasidharan <sabarish.sasidha...@manthan.com> wrote:
Re: Parquet problems
Did you try increasing the perm gen for the driver?

Regards,
Sab

On 24-Jun-2015 4:40 pm, Anders Arpteg <arp...@spotify.com> wrote:
Parquet problems
When reading large (and many) datasets with the Spark 1.4.0 DataFrames Parquet reader (the org.apache.spark.sql.parquet format), the following exceptions are thrown:

    Exception in thread "task-result-getter-0" Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "task-result-getter-0"
    Exception in thread "task-result-getter-3" java.lang.OutOfMemoryError: PermGen space
    Exception in thread "task-result-getter-1" java.lang.OutOfMemoryError: PermGen space
    Exception in thread "task-result-getter-2" java.lang.OutOfMemoryError: PermGen space

and many more like these from different threads. I've tried increasing the PermGen space using the -XX:MaxPermSize VM setting, but even after tripling the space, the same errors occur. I've also tried storing intermediate results, and am able to get the full job to complete by running it multiple times, restarting from the last successful intermediate result. There seems to be a memory leak in the Parquet reader. Any hints on how to fix this problem?

Thanks,
Anders
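For reference, a minimal sketch of the read path being discussed (the input path is made up, and an existing SparkContext sc is assumed). In Spark 1.4, the explicit format name and the parquet shorthand are equivalent:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // Shorthand Parquet reader...
    val df1 = sqlContext.read.parquet("/data/example")

    // ...equivalent to naming the data source format explicitly.
    val df2 = sqlContext.read
      .format("org.apache.spark.sql.parquet")
      .load("/data/example")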