[
https://issues.apache.org/jira/browse/IMPALA-7294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Quanlong Huang updated IMPALA-7294:
-----------------------------------
Fix Version/s: Impala 2.13.0
> TABLESAMPLE clause allocates arrays based on total file count instead of
> selected partitions
> --------------------------------------------------------------------------------------------
>
> Key: IMPALA-7294
> URL: https://issues.apache.org/jira/browse/IMPALA-7294
> Project: IMPALA
> Issue Type: Bug
> Affects Versions: Impala 3.0
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
> Priority: Minor
> Fix For: Impala 2.13.0, Impala 3.1.0
>
>
> The HdfsTable.getFilesSample function takes a list of input partitions to
> sample files from, but when allocating the array to sample into, it sizes
> that array based on the total file count across all partitions in the table.
> This makes the array unnecessarily large and expensive to allocate (it may
> trigger a full GC when the heap is fragmented). The code claims this to be
> an optimization:
> {code}
> // Use max size to avoid looping over inputParts for the exact size.
> {code}
> ...but I think a loop over inputParts is likely to be trivial here: we loop
> over the same partitions later in the function anyway, so they will already
> be pulled into CPU cache, etc. Sizing from the selected partitions is also
> necessary for fine-grained metadata loading in the impalad -- for a large
> table with many partitions, we don't want to load the file lists of all
> partitions just to tablesample from one partition.
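The proposed fix can be sketched as follows. This is a minimal, self-contained illustration, not the actual Impala code: the `Partition` class and its `getNumFileDescriptors()` method stand in for Impala's real partition type, and `exactSampleCapacity` stands in for the sizing logic inside `HdfsTable.getFilesSample`. The point is simply to size the array by summing file counts over the *selected* partitions rather than using the table-wide total.

```java
import java.util.Arrays;
import java.util.List;

public class SampleSizing {
    // Hypothetical stand-in for Impala's partition type; only the
    // per-partition file count matters for this illustration.
    static class Partition {
        private final int numFiles;
        Partition(int numFiles) { this.numFiles = numFiles; }
        int getNumFileDescriptors() { return numFiles; }
    }

    // Size the sampling array from the selected partitions only,
    // instead of the total file count across the whole table.
    static int exactSampleCapacity(List<Partition> inputParts) {
        int totalFiles = 0;
        for (Partition p : inputParts) {
            totalFiles += p.getNumFileDescriptors();
        }
        return totalFiles;
    }

    public static void main(String[] args) {
        // Two selected partitions with 3 and 5 files: allocate for 8,
        // regardless of how many files the rest of the table holds.
        List<Partition> selected =
                Arrays.asList(new Partition(3), new Partition(5));
        System.out.println(exactSampleCapacity(selected)); // prints 8
    }
}
```

The extra pass over `inputParts` is O(number of selected partitions), which is negligible next to the sampling loop that follows it.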
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)