[ https://issues.apache.org/jira/browse/IMPALA-1730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16758595#comment-16758595 ]
ASF subversion and git services commented on IMPALA-1730:
---------------------------------------------------------

Commit a8e30506aafef14646d95a56fb87cf7c28d259d6 in impala's branch refs/heads/master from Philip Zeyliger
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=a8e3050 ]

IMPALA-7980: Fix spinning because of buggy num_unqueued_files_.

This commit removes num_unqueued_files_ and replaces it with a more
tightly scoped, easier-to-reason-about remaining_scan_range_submissions_
variable. This variable (like its predecessor) signals to scanner threads
that they may exit, instead of spinning, because no more scan ranges will
ever be submitted for them to process.

In practice, most scanner implementations never call AddDiskIoRanges()
after IssueInitialRanges(). The exception is sequence files and Avro,
which share a common base class. Instead of incrementing and decrementing
the counter along a variety of paths, this commit makes the common case
simple (set to 1 initially; decrement at the exit points of
IssueInitialRanges()), while the complicated sequence-file case is
handled within base-sequence-scanner.cc.

Note that this is not the first instance of a subtle bug in this code.
The following two JIRAs (and corresponding commits) fixed fundamentally
similar bugs:

  IMPALA-3798: Disable per-split filtering for sequence-based scanners
  IMPALA-1730: reduce scanner thread spinning windows

We ran into this bug when running TPC-DS query 1 at scale factor 10,000
(10TB) on a 140-node cluster with replica_preference=remote. We observed
very high system CPU usage for some of the scan nodes:

  HDFS_SCAN_NODE (id=6):(Total: 59s107ms, non-child: 59s107ms, % non-child: 100.00%)
     - BytesRead: 80.50 MB (84408563)
     - ScannerThreadsSysTime: 36m17s

Using 36 minutes of system time in only 1 minute of wall-clock time
required roughly 30 threads to be spinning in the kernel.
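To make the counter's role concrete, here is a minimal, self-contained sketch of the pattern the commit message describes: a "remaining submissions" counter guarded by the same mutex and condition variable as the range queue, set to 1 initially and decremented once submission is finished, so idle consumer threads block and then exit rather than spin. All names (ScanRangeQueue, AddRanges, DoneSubmitting, GetNextRange) are hypothetical simplifications, not Impala's actual classes or API.

    #include <atomic>
    #include <condition_variable>
    #include <deque>
    #include <iostream>
    #include <mutex>
    #include <thread>
    #include <vector>

    // Hypothetical sketch; the real logic lives in Impala's hdfs-scan-node.cc.
    class ScanRangeQueue {
     public:
      // Producer side: analogous to AddDiskIoRanges().
      void AddRanges(int n) {
        std::lock_guard<std::mutex> l(mu_);
        for (int i = 0; i < n; ++i) ranges_.push_back(i);
        cv_.notify_all();
      }

      // Analogous to decrementing remaining_scan_range_submissions_ at the
      // exit points of IssueInitialRanges(): no further submissions will come.
      void DoneSubmitting() {
        std::lock_guard<std::mutex> l(mu_);
        --remaining_scan_range_submissions_;
        cv_.notify_all();  // Wake idle scanner threads so they can exit.
      }

      // Consumer side: returns false when the thread should exit because no
      // range will ever arrive; the thread blocks here instead of spinning.
      bool GetNextRange(int* range) {
        std::unique_lock<std::mutex> l(mu_);
        cv_.wait(l, [this] {
          return !ranges_.empty() || remaining_scan_range_submissions_ == 0;
        });
        if (ranges_.empty()) return false;  // Submissions done: time to exit.
        *range = ranges_.front();
        ranges_.pop_front();
        return true;
      }

     private:
      std::mutex mu_;
      std::condition_variable cv_;
      std::deque<int> ranges_;
      // Set to 1 initially, matching the common case described above.
      int remaining_scan_range_submissions_ = 1;
    };

    int main() {
      ScanRangeQueue q;
      std::atomic<int> processed{0};
      std::vector<std::thread> scanners;
      for (int i = 0; i < 4; ++i) {
        scanners.emplace_back([&] {
          int r;
          while (q.GetNextRange(&r)) ++processed;  // Blocks, never busy-loops.
        });
      }
      q.AddRanges(100);
      q.DoneSubmitting();  // Scanner threads drain the queue, then exit.
      for (auto& t : scanners) t.join();
      std::cout << processed.load() << std::endl;
      return 0;
    }

Because both the queue and the counter are read under one mutex, a thread that sees an empty queue and a zero counter can safely conclude no work will ever arrive, which is exactly the guarantee the buggy num_unqueued_files_ failed to provide.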
We were able to use perf to find heavy use of futex_wait() and
pthread_cond_wait(). Eventually, we figured out that ScannerThreads,
once started, loop forever looking for work. The case where there is no
work is supposed to be rare, and the scanner threads are supposed to
exit once num_unqueued_files_ reaches 0, but in some cases that counter
wasn't appropriately decremented.

The reproduction is any query that uses runtime filters to filter out
entire files. Something like:

  set RUNTIME_FILTER_WAIT_TIME_MS=10000;
  select count(*) from customer join customer_address
    on c_current_addr_sk = ca_address_sk
  where ca_street_name="DoesNotExist" and c_last_name="DoesNotExist";

triggers this behavior. This code path is covered by several existing
tests, most directly in test_runtime_filters.py:test_file_filtering().
Interestingly, though this bug wastes cycles, query results are
unaffected.

I initially fixed this bug with a point fix that handled the case where
runtime filters caused files to be skipped, and added assertions
checking that num_unqueued_files_ was decremented to zero when queries
finished. Doing this led me, somewhat slowly, both to finding similar
bugs in other parts of the code (HdfsTextScanner::IssueInitialRanges()
had the same bug if the entire file was skipped) and to fighting races
on the assertion itself. I eventually concluded that there is really no
shared synchronization between progress_.Done() and num_unqueued_files_.
The same conclusion holds for the current implementation, so there are
no assertions.

I added a metric for how many times the scanners run through their loop
without doing any work, and observed it to be non-zero for a query from
tests/query_test/test_runtime_filters.py:test_wait_time.

To measure the effect of this change, I set up a cluster of 9 impalads
and 1 coordinator, running against an entirely remote HDFS. The machines
were r4.4xlarge and the remote disks were EBS st1 volumes, though
everything was likely buffer cached.
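The shape of the bug class described above is an early return that skips the counter decrement. A tiny sketch contrasting a buggy and a fixed IssueInitialRanges()-style function (hypothetical, heavily simplified names; not Impala's actual signatures) shows why file-skipping via runtime filters left the counter stuck:

    #include <cassert>
    #include <functional>
    #include <iostream>
    #include <vector>

    // Hypothetical stand-in for the scan node's shared state.
    struct ScanNode {
      int remaining_scan_range_submissions = 1;
      std::vector<int> issued_ranges;
    };

    // Buggy shape: an early return on "all files filtered out" skips the
    // decrement, so scanner threads never learn that no work is coming.
    void IssueInitialRangesBuggy(ScanNode* node, const std::vector<int>& files,
                                 const std::function<bool(int)>& keep) {
      std::vector<int> kept;
      for (int f : files) if (keep(f)) kept.push_back(f);
      if (kept.empty()) return;  // BUG: counter never reaches 0.
      for (int f : kept) node->issued_ranges.push_back(f);
      --node->remaining_scan_range_submissions;
    }

    // Fixed shape: the decrement happens on every exit path, whether or not
    // any file survived runtime-filter pruning.
    void IssueInitialRangesFixed(ScanNode* node, const std::vector<int>& files,
                                 const std::function<bool(int)>& keep) {
      for (int f : files) {
        if (keep(f)) node->issued_ranges.push_back(f);
      }
      --node->remaining_scan_range_submissions;
    }

    int main() {
      auto drop_all = [](int) { return false; };  // Filter skips every file.
      std::vector<int> files = {1, 2, 3};

      ScanNode buggy;
      IssueInitialRangesBuggy(&buggy, files, drop_all);
      assert(buggy.remaining_scan_range_submissions == 1);  // Stuck: spin.

      ScanNode fixed;
      IssueInitialRangesFixed(&fixed, files, drop_all);
      assert(fixed.remaining_scan_range_submissions == 0);  // Threads exit.

      std::cout << buggy.remaining_scan_range_submissions << " "
                << fixed.remaining_scan_range_submissions << std::endl;
      return 0;
    }

Concentrating the decrement at a single, unconditional point is the same idea the commit applies: rather than patching each early-exit path, make the common case have exactly one place where the counter can go wrong.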
I ran TPCDS-Q1 with RUNTIME_FILTER_WAIT_TIME_MS=2000 against
tpcds_1000_decimal_parquet 10 times. The big observable difference is
that ScannerThreadsSysTime went from 5.6 seconds per query to 1.9
seconds per query. (I ran the text profiles through the old-fashioned:

  grep ScannerThreadsSysTime profiles | \
    awk '/ms/ { x += $3/1000 } /ns/ { x += $3/1000000 } END { print x }'

) The query-time effect was quite small (the fastest query was 3.373s
with the change and 3.82s without it, but the averages were tighter),
though the extra work was visible in the profiles.

I happened to rename HdfsScanNode::file_type_counts_ to
HdfsScanNode::file_type_counts_lock_ because
HdfsScanNodeBase::file_type_counts_ also exists and is totally
different.

This bug was co-debugged by Todd Lipcon, Joe McDonnell, and Philip
Zeyliger.

Change-Id: I133de13238d3d05c510e2ff771d48979125735b1
Reviewed-on: http://gerrit.cloudera.org:8080/12097
Reviewed-by: Philip Zeyliger <phi...@cloudera.com>
Tested-by: Philip Zeyliger <phi...@cloudera.com>

> Reduce the window of spinning for Parquet and base-sequence scanners
> --------------------------------------------------------------------
>
>                 Key: IMPALA-1730
>                 URL: https://issues.apache.org/jira/browse/IMPALA-1730
>             Project: IMPALA
>          Issue Type: Improvement
>    Affects Versions: Product Backlog
>            Reporter: Ippokratis Pandis
>            Assignee: Dan Hecht
>            Priority: Major
>             Fix For: Impala 2.3.0
>
>
> A busy spinning loop for parquet and base-sequence scanner threads can
> happen during these windows:
> 1) Parquet - after the footer range is taken by a scanner thread, but
> before the call to scan_node_->MarkFileDescIssued(desc) in
> HdfsParquetScanner::ProcessFooter(). Note that there is potential
> blocking disk IO in this window.
> I see no reason this window can't be shrunk by moving the
> MarkFileDescIssued() call to right after the footer ranges are issued
> in IssueInitialRanges() [the thread for the footer range will process
> the rest of the file anyway, so we can consider the file issued at this
> point].
> 2) base-sequence - after the header range is taken by a scanner thread,
> but before the file range has been issued in
> BaseSequenceScanner::ProcessSplit() by the call
> scan_node_->AddDiskIoRanges(desc). There are potential blocking
> (remote) disk IOs. This one doesn't look as straightforward to deal
> with.
> Those windows are somewhat mitigated by the fact that there are usually
> multiple files being scanned and they won't have perfectly overlapping
> windows (so there's a good chance there are unstarted ranges, which
> means scanner threads will block rather than busy-loop), but it still
> seems worth shrinking/eliminating these windows as well.
> For additional context see IMPALA-1722.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org