[ 
https://issues.apache.org/jira/browse/PARQUET-580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15242245#comment-15242245
 ] 

Piyush Narang commented on PARQUET-580:
---------------------------------------

PR for this: https://github.com/apache/parquet-mr/pull/339

> Potentially unnecessary creation of large int[] in IntList for columns that 
> aren't used
> ---------------------------------------------------------------------------------------
>
>                 Key: PARQUET-580
>                 URL: https://issues.apache.org/jira/browse/PARQUET-580
>             Project: Parquet
>          Issue Type: Bug
>            Reporter: Piyush Narang
>            Priority: Minor
>
> While importing a dataset with a lot of unused columns (a few thousand), we 
> noticed that the IntList class constructor was allocating a lot of 
> unnecessary int arrays (each 64K in size). The heap footprint of all those 
> int[]s came to around 2GB and caused some jobs to OOM. This allocation seems 
> unnecessary for columns that might never be used. 
> It is also unclear whether 64K is the right size to start with. A potential 
> improvement would be to allocate the int[]s in IntList so that their size 
> ramps up gradually. Rather than creating 64K arrays every time (which is 
> potentially wasteful if there are only a few hundred bytes), we could start 
> with a 4K int[], allocate an 8K int[] when it fills up, and so on until we 
> reach 64K, at which point the behavior is the same as the current 
> implementation.
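
The ramp-up idea in the description could look roughly like the sketch below. It is not the actual IntList code and not necessarily what the PR does; the class name RampingIntList, its fields and methods, and the choice to count slab sizes in array elements are all assumptions made for illustration. Nothing is allocated until the first add(), so unused columns cost no int[] at all, and slab sizes double from 4K up to the 64K cap.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a lazily allocated, size-ramping int list. Not the parquet-mr IntList. */
class RampingIntList {
    // Assumed sizes, counted in int elements: start at 4K and cap at 64K as suggested in the issue.
    static final int INITIAL_SLAB_SIZE = 4 * 1024;
    static final int MAX_SLAB_SIZE = 64 * 1024;

    private final List<int[]> slabs = new ArrayList<>();
    private int[] currentSlab;                       // stays null until the first add()
    private int currentSlabPos;
    private int nextSlabSize = INITIAL_SLAB_SIZE;
    private int size;

    /** Appends a value, allocating a new (possibly larger) slab only when needed. */
    void add(int value) {
        if (currentSlab == null || currentSlabPos == currentSlab.length) {
            allocateSlab();
        }
        currentSlab[currentSlabPos++] = value;
        size++;
    }

    private void allocateSlab() {
        currentSlab = new int[nextSlabSize];
        slabs.add(currentSlab);
        currentSlabPos = 0;
        // Double the size for the next slab until the 64K cap is reached.
        nextSlabSize = Math.min(nextSlabSize * 2, MAX_SLAB_SIZE);
    }

    /** Returns the value at the given index by walking the slabs in order. */
    int get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + size);
        }
        for (int[] slab : slabs) {
            if (index < slab.length) {
                return slab[index];
            }
            index -= slab.length;
        }
        throw new AssertionError("unreachable");
    }

    int size() {
        return size;
    }

    int slabCount() {
        return slabs.size();
    }

    /** Total number of int elements allocated so far across all slabs. */
    long allocatedInts() {
        long total = 0;
        for (int[] slab : slabs) {
            total += slab.length;
        }
        return total;
    }
}
```

With this layout, a column that never receives a value holds only the empty list object, and a column with a few hundred values holds a single 4K slab instead of a 64K one. Once the cap is reached, every further slab is 64K, which matches the current behavior described in the issue.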



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
