Hello Tianyi Wang, Vuk Ercegovac,

I'd like you to do a code review. Please visit


to review the following change.

Change subject: IMPALA-7294. TABLESAMPLE should not allocate array based on 
total table file count

IMPALA-7294. TABLESAMPLE should not allocate array based on total table file 

This changes HdfsTable.getFilesSample() to allocate its intermediate
sampling array based on the number of files in the selected
(post-pruning) partitions, rather than the total number of files in the
table. While the former behavior was correct (the total file count is of
course an upper bound on the pruned file count), it was an unnecessarily
large allocation, which has some downsides around garbage collection.

In addition, this is important for the LocalCatalog implementation of
table sampling, since we do not want to have to load all partition file
lists in order to compute a sample over a pruned subset of partitions.

The original code indicated that this was an optimization to avoid
looping over the partition list an extra time. However, typical
partition lists are relatively small even in the worst case (order of
100k) and looping over 100k in-memory Java objects is not likely to be
the bottleneck in planning any query. This is especially true
considering that we loop over that same list later in the function
anyway, so we probably aren't saving page faults or LLC cache misses

In testing this change I noticed that the existing test for TABLESAMPLE
didn't test TABLESAMPLE when applied in conjunction with a predicate.
I added a new dimension to the test which employs a predicate which
prunes some partitions to ensure that the code works in that case.
I also added coverage of the "100%" sampling parameter as a sanity check
that it returns the same results as a non-sampled query.

Change-Id: I0248d89bcd9dd4ff8b4b85fef282c19e3fe9bdd5
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M tests/query_test/test_tablesample.py
2 files changed, 29 insertions(+), 13 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/36/10936/1
To view, visit http://gerrit.cloudera.org:8080/10936
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I0248d89bcd9dd4ff8b4b85fef282c19e3fe9bdd5
Gerrit-Change-Number: 10936
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon <t...@apache.org>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <vercego...@cloudera.com>

Reply via email to