Todd Lipcon created IMPALA-7294:
-----------------------------------
Summary: TABLESAMPLE clause allocates arrays based on total file
count instead of selected partitions
Key: IMPALA-7294
URL: https://issues.apache.org/jira/browse/IMPALA-7294
Project: IMPALA
Issue Type: Bug
Affects Versions: Impala 3.0
Reporter: Todd Lipcon
Assignee: Todd Lipcon
The HdfsTable.getFilesSample function takes a list of input partitions to
sample files from, but when allocating the array to sample into, it sizes that
array based on the total file count across all partitions of the table. The
resulting array is unnecessarily large and expensive to allocate (it may
trigger a full GC when the heap is fragmented). The code claims this is an
optimization:
{code}
// Use max size to avoid looping over inputParts for the exact size.
{code}
...but I think the loop over inputParts would be trivial here: we loop over
them again later in the function anyway, so they will already be in CPU cache.
Sizing based only on the selected partitions is also necessary for
fine-grained metadata loading in the impalad -- for a large table with many
partitions, we don't want to load the file lists of all partitions just to
TABLESAMPLE from one partition.
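A minimal sketch of the intended sizing (the Partition class and fileCount
field here are illustrative assumptions, not Impala's actual API): sum the
file counts over just the partitions passed in, rather than using the
table-wide file count.

```java
import java.util.List;

class SampleSizing {
    // Hypothetical stand-in for a partition's file metadata.
    static class Partition {
        final int fileCount;
        Partition(int fileCount) { this.fileCount = fileCount; }
    }

    // Size the sample array from the selected partitions only.
    // One extra pass over inputParts is cheap relative to allocating
    // an array sized for every file in the table.
    static int sampleArraySize(List<Partition> inputParts) {
        int total = 0;
        for (Partition p : inputParts) {
            total += p.fileCount;
        }
        return total;
    }
}
```

With this, sampling from a single partition of a large table allocates an
array proportional to that partition's file count, not the whole table's.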
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)