[ https://issues.apache.org/jira/browse/IMPALA-1730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16758595#comment-16758595 ]
ASF subversion and git services commented on IMPALA-1730:
---------------------------------------------------------

Commit a8e30506aafef14646d95a56fb87cf7c28d259d6 in impala's branch refs/heads/master from Philip Zeyliger
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=a8e3050 ]

IMPALA-7980: Fix spinning because of buggy num_unqueued_files_.

This commit removes num_unqueued_files_ and replaces it with a more
tightly scoped, easier-to-reason-about remaining_scan_range_submissions_
variable. This variable (like its predecessor) signals to scanner threads
that they may exit, instead of spinning, because no more scan ranges will
ever be submitted for them to process.

In practice, most scanner implementations never call AddDiskIoRanges()
after IssueInitialRanges(). The exception is sequence files and Avro,
which share a common base class. Instead of incrementing and decrementing
the counter along a variety of paths, this commit makes the common case
simple (set to 1 initially; decrement at the exit points of
IssueInitialRanges()), while the complicated sequence-file case is
handled within base-sequence-scanner.cc.

Note that this is not the first instance of a subtle bug in this code.
The following two JIRAs (and corresponding commits) fixed fundamentally
similar bugs:

  IMPALA-3798: Disable per-split filtering for sequence-based scanners
  IMPALA-1730: reduce scanner thread spinning windows

We ran into this bug when running TPC-DS query 1 at scale factor 10,000
(10TB) on a 140-node cluster with replica_preference=remote. We observed
very high system CPU usage for some of the scan nodes:

  HDFS_SCAN_NODE (id=6):(Total: 59s107ms, non-child: 59s107ms, % non-child: 100.00%)
     - BytesRead: 80.50 MB (84408563)
     - ScannerThreadsSysTime: 36m17s

Using 36 minutes of system time in only 1 minute of wall-clock time
required roughly 30 threads to be spinning in the kernel.
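To make the counter's role concrete, here is a minimal, self-contained sketch of the pattern the commit message describes: a "remaining submissions" counter guarded by the same mutex and condition variable as the range queue, set to 1 initially and decremented once submission is finished, so idle consumer threads block and then exit rather than spin. All names (ScanRangeQueue, AddRanges, DoneSubmitting, GetNextRange) are hypothetical simplifications, not Impala's actual classes or API.

    #include <atomic>
    #include <condition_variable>
    #include <deque>
    #include <iostream>
    #include <mutex>
    #include <thread>
    #include <vector>

    // Hypothetical sketch; the real logic lives in Impala's hdfs-scan-node.cc.
    class ScanRangeQueue {
     public:
      // Producer side: analogous to AddDiskIoRanges().
      void AddRanges(int n) {
        std::lock_guard<std::mutex> l(mu_);
        for (int i = 0; i < n; ++i) ranges_.push_back(i);
        cv_.notify_all();
      }

      // Analogous to decrementing remaining_scan_range_submissions_ at the
      // exit points of IssueInitialRanges(): no further submissions will come.
      void DoneSubmitting() {
        std::lock_guard<std::mutex> l(mu_);
        --remaining_scan_range_submissions_;
        cv_.notify_all();  // Wake idle scanner threads so they can exit.
      }

      // Consumer side: returns false when the thread should exit because no
      // range will ever arrive; the thread blocks here instead of spinning.
      bool GetNextRange(int* range) {
        std::unique_lock<std::mutex> l(mu_);
        cv_.wait(l, [this] {
          return !ranges_.empty() || remaining_scan_range_submissions_ == 0;
        });
        if (ranges_.empty()) return false;  // Submissions done: time to exit.
        *range = ranges_.front();
        ranges_.pop_front();
        return true;
      }

     private:
      std::mutex mu_;
      std::condition_variable cv_;
      std::deque<int> ranges_;
      // Set to 1 initially, matching the common case described above.
      int remaining_scan_range_submissions_ = 1;
    };

    int main() {
      ScanRangeQueue q;
      std::atomic<int> processed{0};
      std::vector<std::thread> scanners;
      for (int i = 0; i < 4; ++i) {
        scanners.emplace_back([&] {
          int r;
          while (q.GetNextRange(&r)) ++processed;  // Blocks, never busy-loops.
        });
      }
      q.AddRanges(100);
      q.DoneSubmitting();  // Scanner threads drain the queue, then exit.
      for (auto& t : scanners) t.join();
      std::cout << processed.load() << std::endl;
      return 0;
    }

Because both the queue and the counter are read under one mutex, a thread that sees an empty queue and a zero counter can safely conclude no work will ever arrive, which is exactly the guarantee the buggy num_unqueued_files_ failed to provide.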
We were able to use perf to find heavy use of futex_wait() and
pthread_cond_wait(). Eventually, we figured out that ScannerThreads,
once started, loop forever looking for work. The case where there is no
work is supposed to be rare, and the scanner threads are supposed to
exit once num_unqueued_files_ reaches 0, but in some cases that counter
wasn't appropriately decremented.

The reproduction is any query that uses runtime filters to filter out
entire files. Something like:

  set RUNTIME_FILTER_WAIT_TIME_MS=10000;
  select count(*) from customer join customer_address
    on c_current_addr_sk = ca_address_sk
  where ca_street_name="DoesNotExist" and c_last_name="DoesNotExist";

triggers this behavior. This code path is covered by several existing
tests, most directly in test_runtime_filters.py:test_file_filtering().
Interestingly, though this bug wastes cycles, query results are
unaffected.

I initially fixed this bug with a point fix that handled the case where
runtime filters caused files to be skipped, and added assertions
checking that num_unqueued_files_ was decremented to zero when queries
finished. Doing this led me, somewhat slowly, both to finding similar
bugs in other parts of the code (HdfsTextScanner::IssueInitialRanges()
had the same bug if the entire file was skipped) and to fighting races
on the assertion itself. I eventually concluded that there is really no
shared synchronization between progress_.Done() and num_unqueued_files_.
The same conclusion holds for the current implementation, so there are
no assertions.

I added a metric for how many times the scanners run through their loop
without doing any work, and observed it to be non-zero for a query from
tests/query_test/test_runtime_filters.py:test_wait_time.

To measure the effect of this change, I set up a cluster of 9 impalads
and 1 coordinator, running against an entirely remote HDFS. The machines
were r4.4xlarge and the remote disks were EBS st1 volumes, though
everything was likely buffer cached.
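The shape of the bug class described above is an early return that skips the counter decrement. A tiny sketch contrasting a buggy and a fixed IssueInitialRanges()-style function (hypothetical, heavily simplified names; not Impala's actual signatures) shows why file-skipping via runtime filters left the counter stuck:

    #include <cassert>
    #include <functional>
    #include <iostream>
    #include <vector>

    // Hypothetical stand-in for the scan node's shared state.
    struct ScanNode {
      int remaining_scan_range_submissions = 1;
      std::vector<int> issued_ranges;
    };

    // Buggy shape: an early return on "all files filtered out" skips the
    // decrement, so scanner threads never learn that no work is coming.
    void IssueInitialRangesBuggy(ScanNode* node, const std::vector<int>& files,
                                 const std::function<bool(int)>& keep) {
      std::vector<int> kept;
      for (int f : files) if (keep(f)) kept.push_back(f);
      if (kept.empty()) return;  // BUG: counter never reaches 0.
      for (int f : kept) node->issued_ranges.push_back(f);
      --node->remaining_scan_range_submissions;
    }

    // Fixed shape: the decrement happens on every exit path, whether or not
    // any file survived runtime-filter pruning.
    void IssueInitialRangesFixed(ScanNode* node, const std::vector<int>& files,
                                 const std::function<bool(int)>& keep) {
      for (int f : files) {
        if (keep(f)) node->issued_ranges.push_back(f);
      }
      --node->remaining_scan_range_submissions;
    }

    int main() {
      auto drop_all = [](int) { return false; };  // Filter skips every file.
      std::vector<int> files = {1, 2, 3};

      ScanNode buggy;
      IssueInitialRangesBuggy(&buggy, files, drop_all);
      assert(buggy.remaining_scan_range_submissions == 1);  // Stuck: spin.

      ScanNode fixed;
      IssueInitialRangesFixed(&fixed, files, drop_all);
      assert(fixed.remaining_scan_range_submissions == 0);  // Threads exit.

      std::cout << buggy.remaining_scan_range_submissions << " "
                << fixed.remaining_scan_range_submissions << std::endl;
      return 0;
    }

Concentrating the decrement at a single, unconditional point is the same idea the commit applies: rather than patching each early-exit path, make the common case have exactly one place where the counter can go wrong.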
I ran TPCDS-Q1 with RUNTIME_FILTER_WAIT_TIME_MS=2000 against
tpcds_1000_decimal_parquet 10 times. The big observable difference is
that ScannerThreadsSysTime went from 5.6 seconds per query to 1.9
seconds per query. (I ran the text profiles through the old-fashioned:

  grep ScannerThreadsSysTime profiles | \
    awk '/ms/ { x += $3/1000 } /ns/ { x += $3/1000000 } END { print x }'

) The query-time effect was quite small (the fastest query was 3.373s
with the change and 3.82s without it, but the averages were tighter),
though the extra work was visible in the profiles.

I happened to rename HdfsScanNode::file_type_counts_ to
HdfsScanNode::file_type_counts_lock_ because
HdfsScanNodeBase::file_type_counts_ also exists and is totally
different.

This bug was co-debugged by Todd Lipcon, Joe McDonnell, and Philip
Zeyliger.

Change-Id: I133de13238d3d05c510e2ff771d48979125735b1
Reviewed-on: http://gerrit.cloudera.org:8080/12097
Reviewed-by: Philip Zeyliger <phi...@cloudera.com>
Tested-by: Philip Zeyliger <phi...@cloudera.com>

> Reduce the window of spinning for Parquet and base-sequence scanners
> --------------------------------------------------------------------
>
>                 Key: IMPALA-1730
>                 URL: https://issues.apache.org/jira/browse/IMPALA-1730
>             Project: IMPALA
>          Issue Type: Improvement
>    Affects Versions: Product Backlog
>            Reporter: Ippokratis Pandis
>            Assignee: Dan Hecht
>            Priority: Major
>             Fix For: Impala 2.3.0
>
>
> A busy spinning loop for parquet and base-sequence scanner threads can
> happen during these windows:
> 1) Parquet - after the footer range is taken by a scanner thread, but
> before the call to scan_node_->MarkFileDescIssued(desc) in
> HdfsParquetScanner::ProcessFooter(). Note that there is potential
> blocking disk IO in this window.
> I see no reason this window can't be shrunk by moving the
> MarkFileDescIssued() call to right after the footer ranges are issued
> in IssueInitialRanges() [the thread for the footer range will process
> the rest of the file anyway, so we can consider the file issued at this
> point].
> 2) base-sequence - after the header range is taken by a scanner thread,
> but before the file range has been issued in
> BaseSequenceScanner::ProcessSplit() by the call
> scan_node_->AddDiskIoRanges(desc). There are potential blocking
> (remote) disk IOs. This one doesn't look as straightforward to deal
> with.
> Those windows are somewhat mitigated by the fact that there are usually
> multiple files being scanned and they won't have perfectly overlapping
> windows (so there's a good chance there are unstarted ranges, which
> means scanner threads will block rather than busy-loop), but it still
> seems worth shrinking/eliminating these windows as well.
> For additional context see IMPALA-1722.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org