Hello Michael Ho, Todd Lipcon, Tim Armstrong, Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12097

to look at the new patch set (#8).

Change subject: IMPALA-7980: Fix spinning because of buggy num_unqueued_files_.
......................................................................

IMPALA-7980: Fix spinning because of buggy num_unqueued_files_.

This commit removes num_unqueued_files_ and replaces it with a more
tightly scoped and easier-to-reason-about
remaining_scan_range_submissions_ variable. This variable (and its
predecessor) is used to signal to scanner threads that they may exit
(instead of spinning) because no more scan ranges will ever be added
for them to work on. In practice, most scanner implementations never
call AddDiskIoRanges() after IssueInitialRanges(). The exception is
sequence files and Avro, which share a common base class. Instead of
incrementing and decrementing this counter in a variety of paths, this
commit makes the common case simple (set to 1 initially; decrement at
the exit points of IssueInitialRanges()) and handles the complicated
sequence-file case within base-sequence-scanner.cc.
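
In rough outline, the new scheme looks like the sketch below. Only the
names remaining_scan_range_submissions_, IssueInitialRanges(), and
AddDiskIoRanges() come from the actual change; the surrounding class,
the atomic type, and the method bodies are assumptions for
illustration, not the real Impala code:

  // Minimal, self-contained sketch of the counter pattern.
  #include <atomic>
  #include <cstdio>

  class ScanNodeSketch {
   public:
    // Common case: exactly one batch of submissions, from
    // IssueInitialRanges().
    ScanNodeSketch() : remaining_scan_range_submissions_(1) {}

    void IssueInitialRanges() {
      // ... submit the initial scan ranges ...
      // Every exit path decrements the counter exactly once.
      remaining_scan_range_submissions_.fetch_sub(1);
    }

    // A scanner thread that finds no queued range may exit once this
    // returns false, because no further AddDiskIoRanges() call can occur.
    bool MoreRangesMayArrive() const {
      return remaining_scan_range_submissions_.load() > 0;
    }

   private:
    std::atomic<int> remaining_scan_range_submissions_;
  };

  int main() {
    ScanNodeSketch node;
    node.IssueInitialRanges();
    std::printf("more ranges may arrive: %d\n", node.MoreRangesMayArrive());
    return 0;
  }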

Note that this is not the first instance of a subtle bug
in this code. The following two JIRAs (and corresponding
commits) are fundamentally similar bugs:
    IMPALA-3798: Disable per-split filtering for sequence-based scanners
    IMPALA-1730: reduce scanner thread spinning windows

We ran into this bug when running TPC-DS query 1 on scale factor 10,000
(10TB) on a 140-node cluster with replica_preference=remote, where we
observed very high system CPU usage for some of the scan nodes:

  HDFS_SCAN_NODE (id=6):(Total: 59s107ms, non-child: 59s107ms, % non-child: 100.00%)
    - BytesRead: 80.50 MB (84408563)
    - ScannerThreadsSysTime: 36m17s

Using 36 minutes of system time in only 1 minute of wall-clock time
required ~30 threads to be spinning in the kernel. We were able to use
perf to find a lot of usage of futex_wait() and pthread_cond_wait().
Eventually, we figured out that ScannerThreads, once started, loop
forever looking for work. The case where there is no work is supposed
to be rare, and the scanner threads are supposed to exit once
num_unqueued_files_ reaches 0, but in some cases that counter isn't
appropriately decremented.
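
In simplified form, each scanner thread's main loop behaves like the
sketch below (the queue, the sleep, and everything other than the
num_unqueued_files_ check are assumptions for illustration, not the
actual Impala code). If the counter is never decremented to 0, an idle
thread wakes up repeatedly, finds nothing to do, and never exits:

  // Standalone sketch of a spinning scanner-thread loop.
  #include <atomic>
  #include <chrono>
  #include <deque>
  #include <mutex>
  #include <optional>
  #include <thread>

  std::mutex lock_;
  std::deque<int> queued_ranges_;           // stand-in for queued scan ranges
  std::atomic<int> num_unqueued_files_{0};  // files whose ranges may still arrive

  void ScannerThread() {
    while (true) {
      std::optional<int> range;
      {
        std::lock_guard<std::mutex> l(lock_);
        if (!queued_ranges_.empty()) {
          range = queued_ranges_.front();
          queued_ranges_.pop_front();
        }
      }
      if (range) {
        // ... process the scan range ...
        continue;
      }
      // No work right now; it is only safe to exit if no more ranges
      // can ever arrive.
      if (num_unqueued_files_.load() == 0) return;
      // Otherwise wait briefly and retry. If the counter is never
      // decremented (the bug), this thread spins here forever.
      std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
  }

  int main() {
    // With the counter correctly at 0 and no queued ranges, the thread
    // exits immediately instead of spinning.
    std::thread t(ScannerThread);
    t.join();
    return 0;
  }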

The reproduction is any query that uses runtime filters to filter out
entire files. Something like:

  set RUNTIME_FILTER_WAIT_TIME_MS=10000;
  select count(*)
  from customer
  join customer_address on c_current_addr_sk = ca_address_sk
  where ca_street_name="DoesNotExist" and c_last_name="DoesNotExist";

triggers this behavior. This code path is covered by several existing
tests, most directly in test_runtime_filters.py:test_file_filtering().
Interestingly, though this wastes cycles, query results are unaffected.

I initially fixed this bug with a point fix that handled the case when
runtime filters caused files to be skipped and added assertions that
checked that num_unqueued_files_ was decremented to zero when queries
finished. Doing this led me, somewhat slowly, to both finding similar
bugs in other parts of the code (HdfsTextScanner::IssueInitialRanges had
the same bug if the entire file was skipped) and fighting with races on
the assertion itself. I eventually concluded that there's really no
shared synchronization between progress_.Done() and num_unqueued_files_.
The same conclusion holds for the current implementation, so it doesn't
include such assertions.

I added a metric for how many times the scanners run through their
loop without doing any work and observed it to be non-zero
for a query from tests/query_test/test_runtime_filters.py:test_wait_time.

To measure the effect of this, I set up a cluster of 9 impalads and
1 coordinator, running against an entirely remote HDFS. The machines
were r4.4xlarge and the remote disks were EBS st1 volumes, though
everything was likely buffer cached. I ran TPCDS-Q1 with
RUNTIME_FILTER_WAIT_TIME_MS=2000 against tpcds_1000_decimal_parquet
10 times. The biggest observable change is that ScannerThreadsSysTime
went from 5.6 seconds per query to 1.9 seconds per query. (I ran the
text profiles through the old-fashioned one-liner:
  grep ScannerThreadsSysTime profiles | \
    awk '/ms/ { x += $3/1000 } /ns/ { x += $3/1000000 } END { print x }'
)
The query-time effect was quite small (the fastest query was 3.373s
with the change and 3.82s without, though the averages were closer),
but the extra work was clearly visible in the profiles.

Along the way, I renamed HdfsScanNode::file_type_counts_ to
HdfsScanNode::file_type_counts_lock_, because
HdfsScanNodeBase::file_type_counts_ also exists and is something
entirely different.

This bug was co-debugged by Todd Lipcon, Joe McDonnell, and Philip
Zeyliger.

Change-Id: I133de13238d3d05c510e2ff771d48979125735b1
---
M be/src/exec/base-sequence-scanner.cc
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/hdfs-scan-node.cc
M be/src/exec/hdfs-scan-node.h
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-text-scanner.cc
M tests/query_test/test_runtime_filters.py
8 files changed, 82 insertions(+), 50 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/97/12097/8
--
To view, visit http://gerrit.cloudera.org:8080/12097
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I133de13238d3d05c510e2ff771d48979125735b1
Gerrit-Change-Number: 12097
Gerrit-PatchSet: 8
Gerrit-Owner: Philip Zeyliger <phi...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <joemcdonn...@cloudera.com>
Gerrit-Reviewer: Michael Ho <k...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <phi...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
