For what it's worth, my data set has around 85 columns in Parquet format as well. I have tried bumping the permgen up to 512m but I'm still getting errors in the driver thread.
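For reference, this is roughly how I'm passing the setting. The `spark.driver.extraJavaOptions` / `spark.executor.extraJavaOptions` keys are the standard Spark 1.x way to reach the JVM flags; the class and jar names below are just placeholders for my job:

```shell
# Raise PermGen on both the driver and executor JVMs (Spark 1.x on Java 7;
# -XX:MaxPermSize does not exist on Java 8+, where PermGen was removed).
# 512m is the value I last tried; class/jar names are hypothetical.
spark-submit \
  --conf spark.driver.extraJavaOptions=-XX:MaxPermSize=512m \
  --conf spark.executor.extraJavaOptions=-XX:MaxPermSize=512m \
  --class com.example.MyJob \
  my-job-assembly.jar
```

Happy to hear if anyone is setting this differently.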
On Wed, Jul 22, 2015 at 1:20 PM, Jerry Lam <chiling...@gmail.com> wrote:

> Hi guys,
>
> I noticed that too. Anders, can you confirm that it works on the Spark 1.5
> snapshot? That is what I tried in the end. It seems to be a 1.4 issue.
>
> Best Regards,
>
> Jerry
>
> On Wed, Jul 22, 2015 at 11:46 AM, Anders Arpteg <arp...@spotify.com> wrote:
>
>> No, never really resolved the problem, except by increasing the permgen
>> space, which only partially solved it. I still have to restart the job
>> multiple times to make the whole job complete (it stores intermediate
>> results).
>>
>> The parquet data sources have about 70 columns, and yes Cheng, it works
>> fine when only loading a small sample of the data.
>>
>> Thankful for any hints,
>> Anders
>>
>> On Wed, Jul 22, 2015 at 5:29 PM Cheng Lian <lian.cs....@gmail.com> wrote:
>>
>>> How many columns are there in these Parquet files? Could you load a
>>> small portion of the original large dataset successfully?
>>>
>>> Cheng
>>>
>>> On 6/25/15 5:52 PM, Anders Arpteg wrote:
>>>
>>> Yes, both the driver and the executors. It works a little better with
>>> more space, but there is still a leak that causes failure after a number
>>> of reads. There are about 700 different data sources that need to be
>>> loaded, lots of data...
>>>
>>> tor 25 jun 2015 08:02 Sabarish Sasidharan
>>> <sabarish.sasidha...@manthan.com> skrev:
>>>
>>>> Did you try increasing the perm gen for the driver?
>>>>
>>>> Regards
>>>> Sab
>>>>
>>>> On 24-Jun-2015 4:40 pm, "Anders Arpteg" <arp...@spotify.com> wrote:
>>>>
>>>>> When reading large (and many) datasets with the Spark 1.4.0 DataFrames
>>>>> parquet reader (the org.apache.spark.sql.parquet format), the following
>>>>> exceptions are thrown:
>>>>>
>>>>> Exception in thread "task-result-getter-0"
>>>>> Exception: java.lang.OutOfMemoryError thrown from the
>>>>> UncaughtExceptionHandler in thread "task-result-getter-0"
>>>>> Exception in thread "task-result-getter-3" java.lang.OutOfMemoryError:
>>>>> PermGen space
>>>>> Exception in thread "task-result-getter-1" java.lang.OutOfMemoryError:
>>>>> PermGen space
>>>>> Exception in thread "task-result-getter-2" java.lang.OutOfMemoryError:
>>>>> PermGen space
>>>>>
>>>>> and many more like these from different threads. I've tried increasing
>>>>> the PermGen space using the -XX:MaxPermSize VM setting, but even after
>>>>> tripling the space, the same errors occur. I've also tried storing
>>>>> intermediate results, and am able to get the full job completed by
>>>>> running it multiple times and restarting from the last successful
>>>>> intermediate result. There seems to be some memory leak in the parquet
>>>>> format. Any hints on how to fix this problem?
>>>>>
>>>>> Thanks,
>>>>> Anders