Todd Lipcon created IMPALA-7294:
-----------------------------------
Summary: TABLESAMPLE clause allocates arrays based on total file
count instead of selected partitions
Key: IMPALA-7294
URL: https://issues.apache.org/jira/browse/IMPALA-7294
Project: IMPALA
Issue Type: Bug
Affects Versions: Impala 3.0
Reporter: Todd Lipcon
Assignee: Todd Lipcon
The HdfsTable.getFilesSample function takes a list of input partitions to
sample files from, but when allocating the array to sample into, it sizes that
array based on the total file count across all partitions of the table. The
resulting array is unnecessarily large and expensive to allocate (it may
trigger a full GC when the heap is fragmented). The code claims this is an
optimization:
{code}
// Use max size to avoid looping over inputParts for the exact size.
{code}
...but I think the loop over inputParts would be trivial here: we loop over
them again later in the function anyway, so they will already be in CPU cache.
Sizing based only on the selected partitions is also necessary for
fine-grained metadata loading in the impalad -- for a large table with many
partitions, we don't want to load the file lists of all partitions just to
TABLESAMPLE from one partition.
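A minimal sketch of the intended sizing (the Partition class and fileCount
field here are illustrative assumptions, not Impala's actual API): sum the
file counts over just the partitions passed in, rather than using the
table-wide file count.

```java
import java.util.List;

class SampleSizing {
    // Hypothetical stand-in for a partition's file metadata.
    static class Partition {
        final int fileCount;
        Partition(int fileCount) { this.fileCount = fileCount; }
    }

    // Size the sample array from the selected partitions only.
    // One extra pass over inputParts is cheap relative to allocating
    // an array sized for every file in the table.
    static int sampleArraySize(List<Partition> inputParts) {
        int total = 0;
        for (Partition p : inputParts) {
            total += p.fileCount;
        }
        return total;
    }
}
```

With this, sampling from a single partition of a large table allocates an
array proportional to that partition's file count, not the whole table's.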
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)