[jira] [Updated] (PARQUET-580) Potentially unnecessary creation of large int[] in IntList for columns that aren't used

Piyush Narang (JIRA) Thu, 14 Apr 2016 18:03:35 -0700

     [ 
https://issues.apache.org/jira/browse/PARQUET-580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Piyush Narang updated PARQUET-580:
----------------------------------
    Description: 
For a dataset we were trying to import that had a lot of columns (a few
thousand) that weren't being used, we ended up allocating a lot of unnecessary
int arrays (each 64K in size), because an IntList object is created for every
column. The heap footprint of all those int[]s turned out to be around 2GB
(and caused some jobs to OOM). This seems unnecessary for columns that might
not be used.

I'm also wondering whether 64K is the right size to start with. A potential
improvement would be to allocate these int[]s in IntList in a way that slowly
ramps up their size. Rather than creating arrays of size 64K at a time (which
is potentially wasteful if there are only a few hundred bytes), we could
create, say, a 4K int[], then an 8K int[] when it fills up, and so on until we
reach 64K (at which point the behavior is the same as the current
implementation).
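
To make the proposal concrete, here is a rough sketch of how such a list might
look. It is not the actual Parquet IntList: the class name RampingIntList, its
methods, and the helper allocatedInts() are hypothetical, and it assumes the
"4K" and "64K" figures count ints rather than bytes. It combines both ideas
from the description: nothing is allocated until the first value is added, and
slab sizes double from 4K up to a 64K cap.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a lazily allocated, ramping int list; not Parquet's IntList. */
final class RampingIntList {
    // Assumption: sizes are counted in ints, not bytes.
    static final int INITIAL_SLAB_SIZE = 4 * 1024;  // proposed starting size
    static final int MAX_SLAB_SIZE = 64 * 1024;     // fixed slab size described in the report

    private final List<int[]> fullSlabs = new ArrayList<>();
    private int[] currentSlab;                      // stays null until the first add()
    private int currentSlabPos;
    private int nextSlabSize = INITIAL_SLAB_SIZE;
    private int size;

    /** Appends a value, allocating a slab only when one is actually needed. */
    void add(int value) {
        if (currentSlab == null) {
            allocateSlab();                         // first write: unused columns never get here
        } else if (currentSlabPos == currentSlab.length) {
            fullSlabs.add(currentSlab);             // current slab is full, keep it and grow
            allocateSlab();
        }
        currentSlab[currentSlabPos++] = value;
        size++;
    }

    /** Allocates the next slab and doubles the following slab size, capped at MAX_SLAB_SIZE. */
    private void allocateSlab() {
        currentSlab = new int[nextSlabSize];
        currentSlabPos = 0;
        nextSlabSize = Math.min(nextSlabSize * 2, MAX_SLAB_SIZE);
    }

    /** Returns the value at the given index by walking the slabs in insertion order. */
    int get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index: " + index);
        }
        int remaining = index;
        for (int[] slab : fullSlabs) {
            if (remaining < slab.length) {
                return slab[remaining];
            }
            remaining -= slab.length;
        }
        return currentSlab[remaining];
    }

    /** Number of values added so far. */
    int size() {
        return size;
    }

    /** Total number of int slots currently allocated across all slabs. */
    int allocatedInts() {
        int total = (currentSlab == null) ? 0 : currentSlab.length;
        for (int[] slab : fullSlabs) {
            total += slab.length;
        }
        return total;
    }
}
```

With this shape, a column that never receives a value holds no int[] at all,
and a column with only a few values holds a single 4K slab. Once the slab size
reaches 64K, every further slab is 64K, which matches the current behavior.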

  was:
Noticed that for a dataset that we were trying to import that had a lot of 
columns (few thousand) that weren't being used, we ended up allocating a lot of 
unnecessary int arrays (each 64K in size) in the IntList class constructor. 
Heap footprint for all those int[]s turned out to be around 2GB or so (and 
results in some jobs OOMing). This seems unnecessary for columns that might not 
be used. 

Also wondering if 64K is the right size to start off with. Wondering if a 
potential improvement is if we could allocate these int[]s in IntList in a way 
that slowly ramps up their size. So rather than create arrays of size 64K at a 
time (which is potentially wasteful if there are only a few hundred bytes), we 
could create say a 4K int[], then when it fills up an 8K[] and so on till we 
reach 64K (at which point the behavior is the same as the current 
implementation).


> Potentially unnecessary creation of large int[] in IntList for columns that 
> aren't used
> ---------------------------------------------------------------------------------------
>
>                 Key: PARQUET-580
>                 URL: https://issues.apache.org/jira/browse/PARQUET-580
>             Project: Parquet
>          Issue Type: Bug
>            Reporter: Piyush Narang
>            Priority: Minor
>
> For a dataset we were trying to import that had a lot of columns (a few
> thousand) that weren't being used, we ended up allocating a lot of
> unnecessary int arrays (each 64K in size), because an IntList object is
> created for every column. The heap footprint of all those int[]s turned out
> to be around 2GB (and caused some jobs to OOM). This seems unnecessary for
> columns that might not be used.
>
> I'm also wondering whether 64K is the right size to start with. A potential
> improvement would be to allocate these int[]s in IntList in a way that slowly
> ramps up their size. Rather than creating arrays of size 64K at a time (which
> is potentially wasteful if there are only a few hundred bytes), we could
> create, say, a 4K int[], then an 8K int[] when it fills up, and so on until
> we reach 64K (at which point the behavior is the same as the current
> implementation).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
