[
https://issues.apache.org/jira/browse/PARQUET-580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gabor Szadovszky updated PARQUET-580:
-------------------------------------
Fix Version/s: 1.8.2
> Potentially unnecessary creation of large int[] in IntList for columns that
> aren't used
> ---------------------------------------------------------------------------------------
>
> Key: PARQUET-580
> URL: https://issues.apache.org/jira/browse/PARQUET-580
> Project: Parquet
> Issue Type: Bug
> Reporter: Piyush Narang
> Assignee: Piyush Narang
> Priority: Minor
> Fix For: 1.9.0, 1.8.2
>
>
> While importing a dataset with a lot of columns (a few thousand), most of which
> weren't being used, we noticed that we ended up allocating a lot of unnecessary
> int arrays (each 64K entries, i.e. 256KB), because an IntList object is created
> for every column. The heap footprint of all those int[]s came to around 2GB
> (256KB x a few thousand columns), and caused some jobs to OOM. This allocation
> seems unnecessary for columns that might never be used.
> It's also worth asking whether 64K is the right size to start off with. A
> potential improvement would be to allocate these int[]s in IntList so that
> their size ramps up gradually: rather than creating 64K-entry arrays every time
> (which is wasteful if only a few hundred values are ever written), we could
> start with, say, a 4K int[], then allocate an 8K int[] when it fills up, and so
> on, doubling until we reach 64K (at which point the behavior matches the
> current implementation).
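> A minimal sketch of the combined idea (lazy first allocation plus doubling slab
> sizes), assuming a simplified, standalone IntList; the class shape, field
> names, and constants below are illustrative, not the actual parquet-mr
> internals:
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
>
> // Illustrative sketch only: slabs start at 4K ints and double up to the 64K
> // cap; nothing is allocated until the first add(), so unused columns are free.
> public class IntList {
>   private static final int INITIAL_SLAB_SIZE = 4 * 1024; // 4K ints
>   private static final int MAX_SLAB_SIZE = 64 * 1024;    // 64K ints (current fixed size)
>
>   private final List<int[]> slabs = new ArrayList<>();
>   private int[] currentSlab;        // null until the first add(): lazy allocation
>   private int currentSlabSize = 0;  // size used for the most recent slab
>   private int currentIndex = 0;     // next free position in currentSlab
>
>   public void add(int value) {
>     if (currentSlab == null || currentIndex == currentSlab.length) {
>       allocateSlab();
>     }
>     currentSlab[currentIndex++] = value;
>   }
>
>   private void allocateSlab() {
>     // Ramp up: 4K -> 8K -> 16K -> 32K -> 64K, then stay at 64K.
>     currentSlabSize = (currentSlabSize == 0)
>         ? INITIAL_SLAB_SIZE
>         : Math.min(currentSlabSize * 2, MAX_SLAB_SIZE);
>     currentSlab = new int[currentSlabSize];
>     slabs.add(currentSlab);
>     currentIndex = 0;
>   }
>
>   public int size() {
>     int count = 0;
>     for (int[] slab : slabs) {
>       count += (slab == currentSlab) ? currentIndex : slab.length;
>     }
>     return count;
>   }
> }
> {code}
> With this scheme a column that receives only a few hundred values allocates a
> single 16KB slab (4K ints) instead of 256KB, and a column that is never written
> to allocates nothing at all.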
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)