[ 
https://issues.apache.org/jira/browse/IMPALA-7294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang updated IMPALA-7294:
-----------------------------------
    Fix Version/s: Impala 2.13.0

> TABLESAMPLE clause allocates arrays based on total file count instead of 
> selected partitions
> --------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-7294
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7294
>             Project: IMPALA
>          Issue Type: Bug
>    Affects Versions: Impala 3.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Minor
>             Fix For: Impala 2.13.0, Impala 3.1.0
>
>
> The HdfsTable.getFilesSample function takes a list of input partitions to 
> sample files from, but then, when allocating an array to sample into, sizes 
> that array based on the total file count across all partitions. This is an 
> unnecessarily large array, which is expensive to allocate (may cause full GC 
> when the heap is fragmented). The code claims this to be an optimization:
> {code}
>     // Use max size to avoid looping over inputParts for the exact size.
> {code}
> ...but I think the loop over inputParts is likely to be trivial here, since 
> we'll loop over them anyway later in the function, so they will already be 
> pulled into CPU cache, etc. Sizing by the selected partitions is also 
> necessary for fine-grained metadata loading in the impalad -- for a large 
> table with many partitions, we don't want to load the file lists of all 
> partitions just to tablesample from one partition.
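The proposed fix can be sketched as follows. This is a minimal illustration, not Impala's actual HdfsTable.getFilesSample code; the class and method names below are hypothetical stand-ins.

```java
import java.util.Arrays;
import java.util.List;

// Minimal sketch (hypothetical names): size the sampling array by the exact
// file count of the selected partitions, not the table-wide total.
class SampleSizing {
    // Stand-in for summing each selected partition's file-descriptor count.
    static int exactFileCount(List<Integer> partitionFileCounts) {
        // This extra pass over the input partitions is cheap: the function
        // iterates over them again later anyway, so they end up in CPU cache.
        int total = 0;
        for (int n : partitionFileCounts) {
            total += n;
        }
        return total;
    }

    public static void main(String[] args) {
        // Suppose the table holds 1000 files overall, but the TABLESAMPLE
        // clause selected only three partitions with 4, 7, and 2 files.
        List<Integer> selected = Arrays.asList(4, 7, 2);
        // Allocate 13 slots instead of 1000, avoiding a needlessly large
        // allocation that could trigger a full GC on a fragmented heap.
        long[] samples = new long[exactFileCount(selected)];
        System.out.println(samples.length);
    }
}
```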



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
