For what it's worth, my data set has around 85 columns in Parquet format as
well. I have tried bumping the permgen up to 512m but I'm still getting
errors in the driver thread.
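
For reference, here's roughly how I'm passing it (a sketch; the jar and
class names below are just placeholders):

  spark-submit \
    --conf "spark.driver.extraJavaOptions=-XX:MaxPermSize=512m" \
    --conf "spark.executor.extraJavaOptions=-XX:MaxPermSize=512m" \
    --class com.example.MyJob my-job.jar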

On Wed, Jul 22, 2015 at 1:20 PM, Jerry Lam <chiling...@gmail.com> wrote:

> Hi guys,
>
> I noticed that too. Anders, can you confirm that it works on Spark 1.5
> snapshot? This is what I ended up trying. It seems to be a 1.4 issue.
>
> Best Regards,
>
> Jerry
>
> On Wed, Jul 22, 2015 at 11:46 AM, Anders Arpteg <arp...@spotify.com>
> wrote:
>
>> No, never really resolved the problem, except by increasing the PermGen
>> space, which only partially solved it. Still have to restart the job
>> multiple times to make the whole job complete (it stores intermediate
>> results).
>>
>> The parquet data sources have about 70 columns, and yes Cheng, it works
>> fine when only loading a small sample of the data.
>>
>> Thankful for any hints,
>> Anders
>>
>> On Wed, Jul 22, 2015 at 5:29 PM Cheng Lian <lian.cs....@gmail.com> wrote:
>>
>>>  How many columns are there in these Parquet files? Could you load a
>>> small portion of the original large dataset successfully?
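>>>
>>> For a quick check, something along these lines might do (the path and
>>> sample fraction are placeholders):
>>>
>>>   sqlContext.read.parquet("/path/to/data").sample(false, 0.01).count()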
>>>
>>> Cheng
>>>
>>>
>>> On 6/25/15 5:52 PM, Anders Arpteg wrote:
>>>
>>> Yes, both the driver and the executors. Works a little bit better with
>>> more space, but still a leak that will cause failure after a number of
>>> reads. There are about 700 different data sources that need to be loaded,
>>> lots of data...
>>>
>>> On Thu, 25 Jun 2015 at 08:02, Sabarish Sasidharan <
>>> sabarish.sasidha...@manthan.com> wrote:
>>>
>>>> Did you try increasing the perm gen for the driver?
>>>>
>>>> Regards
>>>> Sab
>>>>
>>>> On 24-Jun-2015 4:40 pm, "Anders Arpteg" <arp...@spotify.com> wrote:
>>>>
>>>>> When reading large (and many) datasets with the Spark 1.4.0 DataFrames
>>>>> parquet reader (the org.apache.spark.sql.parquet format), the following
>>>>> exceptions are thrown:
>>>>>
>>>>> Exception in thread "task-result-getter-0"
>>>>> Exception: java.lang.OutOfMemoryError thrown from the
>>>>> UncaughtExceptionHandler in thread "task-result-getter-0"
>>>>> Exception in thread "task-result-getter-3" java.lang.OutOfMemoryError:
>>>>> PermGen space
>>>>> Exception in thread "task-result-getter-1" java.lang.OutOfMemoryError:
>>>>> PermGen space
>>>>> Exception in thread "task-result-getter-2" java.lang.OutOfMemoryError:
>>>>> PermGen space
>>>>>
>>>>> and many more like these from different threads. I've tried increasing
>>>>> the PermGen space using the -XX:MaxPermSize VM setting, but even after
>>>>> tripling the space, the same errors occur. I've also tried storing
>>>>> intermediate results, and am able to get the full job completed by
>>>>> running it multiple times and restarting from the last successful
>>>>> intermediate result (roughly sketched below). There seems to be a
>>>>> memory leak in the parquet format. Any hints on how to fix this
>>>>> problem?
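>>>>>
>>>>> A minimal sketch of that workaround, assuming a sourcePaths list, an
>>>>> intermediateDir, and the usual sc/sqlContext are in scope (all names
>>>>> made up for illustration):
>>>>>
>>>>>   import org.apache.hadoop.fs.{FileSystem, Path}
>>>>>
>>>>>   val fs = FileSystem.get(sc.hadoopConfiguration)
>>>>>   sourcePaths.foreach { src =>
>>>>>     // Skip sources already written by a previous (partial) run
>>>>>     val out = s"$intermediateDir/${src.split('/').last}"
>>>>>     if (!fs.exists(new Path(out))) {
>>>>>       sqlContext.read.parquet(src).write.parquet(out)
>>>>>     }
>>>>>   }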
>>>>>
>>>>>  Thanks,
>>>>> Anders
>>>>>
>>>>
>
