[Impala-ASF-CR] IMPALA-9846: Enable Aggregated Runtime Profile by Default

Surya Hebbar (Code Review) Mon, 22 Jun 2026 06:02:37 -0700

Surya Hebbar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/23154 )


Change subject: IMPALA-9846: Enable Aggregated Runtime Profile by Default
......................................................................


Patch Set 31:

> Builds have been passing for both V1 and V2 since late January
> I saw some cases where I am concerned.
> from runtime_filters_mt_dop.test:
> row_regex: RowsReturned: total=7.30K
> How can this pass with the old profile?

Hi Csaba, thank you for noticing this. This inconsistency seems to have 
occurred in runtime_filters_mt_dop.test and spilling.test, in about ~20 lines. 
While updating the patchsets and breaking them up, only those 2 file changes 
were missed. I have verified it again now and will make sure to update them.

> I disagree with the order - the patch should be leaned down to not include 
> trivial test changes before being merged.

I've pulled the newline-only changes out into a separate patch and posted it 
ahead of this change, so the main change shrinks for review. The remaining .* 
changes I've kept here, since they're tied to the total additions (see below).

> There may be some tricky exceptions, but replacing most of the existing lines 
> seems fairly mechanical:
> row_regex: RowsRead: (?:2.43K \(2433\)|total=7.30K \(7300\))
> ->
> match_counter(RowsRead, 2433)

1. On "fairly mechanical":

There are 1891 RUNTIME_PROFILE sections in .test files, so several thousand 
lines contain row_regex. Of them, only about ~150 counter changes are in the 
form suggested here, and these have been updated in this patch. But a 
substantial portion (~few thousand lines) are row_regex or line matches like 
this:

NumModifiedRows: 0

NumDeletedRows: 1

row_regex: Per-Host Resource Estimates: Memory=10MB

row_regex: Cluster Memory Admitted: 50.00 MB

Rewriting all of them at once is hard, as they do not follow this form. Many of 
them also depend on the pretty-printed representation of the profile, so they 
cannot directly be converted to match_counter or a similar approach.

2. For the counter lines, matching a specific instance's value isn't more 
accurate than the total. For distributed counters like 
RowsRead/RowsReturned/Files Rejected, work distribution across instances isn't 
fixed, any instance can process an arbitrary share, so an individual instance's 
value isn't a stable assertion (that's why the older tests effectively just 
checked that some value appeared across instances). The meaningful invariant is 
the total. aggregation(SUM, RowsRead): 7300 is both more meaningful and already 
version-agnostic; match_counter against a per-instance 2433 would assert 
something not guaranteed run-to-run.

I.e.
Instance level matching of 8 vs total match of 23 files

row_regex: Files rejected: (?:8 \(8\)|total=23 \(23\))
aggregation(SUM, Files rejected): 23

So the proposal is the hybrid: aggregation(SUM) for counter checks now (covers 
the common case, no flexible regex); version-flexible regex kept only for 
genuine special cases (isolating one node's value where SUM across nodes adds 
noise); and a small has_counter-style helper added as a separate follow-up for 
those node-isolation cases. That preserves the test inventory this change 
already validated, rather than rewriting a thousand more tests and re-touching 
every converted line a second time. A full move to a thrift-based matcher now 
would mean reworking essentially every profile test again for no functional 
gain, and I'd rather not block enabling V2 on that. If reviewers feel the 
larger approach is required before merge, I'd like to scope it explicitly as a 
separate effort and get a second committer's read, happy to pull in Joe / 
Michael / Kurt so we converge on direction.

> For non-aggregated text profiles this doesn't really need Thrift parsing, the 
> exact values can be parsed from text using values in (). This may be actually 
> a more backport friendly way to do it, as in older branches some tests may 
> still use beeswax client that doesn't have Thrift profile.
> For aggregated profile Thrift is needed, but where we use aggregated profiles 
> we can also assume using HS2 protocol that has Thrift.

Agreed that for V1 the exact values are in the parens, and aggregation(SUM) 
already uses exactly that on the V1 path, no thrift needed there.

One clarification on the thrift path for V2, re your draft 
(https://gerrit.cloudera.org/#/c/24407/): it resolves counters from the 
non-aggregated thrift structure, which carries per-instance profile subtrees. 
In V2 those subtrees no longer exist, that's where the memory consumption / 
performance win comes from for the aggregated profile. The per-instance values 
do survive as arrays inside each TAggCounter, not as instance nodes, so 
matching is possible against the values list rather than an instance subtree:

Fragment F00
AGGREGATION_NODE (id=1)
TRuntimeProfileNode(name='AGGREGATION_NODE (id=1)', num_children=2, 
counters=[TCounter(name='InactiveTotalTime', unit=5, value=0), 
TCounter(name='PeakMemoryUsage', unit=3, value=37910656), 
TCounter(name='RowsReturned', unit=0, value=2000404), 
TCounter(name='RowsReturnedRate', unit=1, value=2428389), 
TCounter(name='TotalTime', unit=5, value=822274640)] [(mean values, not 
instance level)]
... aggregated=TAggregatedRuntimeProfileNode(num_instances=3, 
input_profiles=None, counters=[TAggCounter(name='InactiveTotalTime', unit=5, 
has_value=[True, True, True], values=[0, 0, 0]), 
TAggCounter(name='PeakMemoryUsage', unit=3, has_value=[True, True, True], ...), 
TAggCounter(name='RowsReturned', unit=0, has_value=[True, True, True], 
values=[...])])

In traditional profile, those per-instance values live under per-instance 
Fragment Instance nodes instead, and there is no aggregated array:

Fragment F01
Instance 6545360911b7328e:0107004000000002 (host=surya-Precision-5560:27000)
AGGREGATION_NODE (id=2)
TRuntimeProfileNode(name='AGGREGATION_NODE (id=2)', num_children=2, 
counters=[TCounter(name='InactiveTotalTime', unit=5, value=0), 
TCounter(name='PeakMemoryUsage', unit=3, value=16384), 
TCounter(name='RowsReturned', unit=0, value=1), 
TCounter(name='RowsReturnedRate', unit=1, value=11), TCounter(name='TotalTime', 
unit=5, value=88872949)])

Two costs make me want this later rather than now. The thrift profile is a 
flattened node tree, so resolving a counter is a linear walk over 
TRuntimeProfileNodes plus a scan of each node's counter list, not faster than 
the current approach, which compiles all row_regex + aggregation patterns into 
one regex and does a single pass over the text. And mixing in thrift means 
fetching and searching both the text and thrift profiles per query, which is 
less clean and much slower than the current approach. So a thrift matcher is 
buildable as the has_counter follow-up, but it doesn't pay for itself for the 
residual handful of lines and isn't needed for the common case aggregation(SUM) 
already covers.

> Another case that seems easily separable are .* removals like in 
> alter-table.test
> most were removed in https://gerrit.cloudera.org/#/c/23864/ , but there are 
> still many around - these have nothing to do with switching to aggregated 
> profile AFAIK

The bulk of the .* removals already went in via 23864. The ones remaining here 
aren't pure cleanup, they're entangled with the new total additions,
e.g.
row_regex: .RowsRead: 2.43K .
becomes
row_regex: RowsRead: (?:2.43K|total=7.30K)
so the .* removal and the total change are the same edit. That's why I've kept 
these in this patch rather than separating them. The newline-only changes I 
have split out and posted ahead of this change so they can merge quickly.

https://gerrit.cloudera.org/24491


--
To view, visit http://gerrit.cloudera.org:8080/23154
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If41d6322361fba82c946efd614cc7d28cb1c36e8
Gerrit-Change-Number: 23154
Gerrit-PatchSet: 31
Gerrit-Owner: Surya Hebbar <[email protected]>
Gerrit-Reviewer: Abhishek Rawat <[email protected]>
Gerrit-Reviewer: Aleksandr Efimov <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Joe McDonnell <[email protected]>
Gerrit-Reviewer: Kurt Deschler <[email protected]>
Gerrit-Reviewer: Michael Smith <[email protected]>
Gerrit-Reviewer: Noemi Pap-Takacs <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Riza Suminto <[email protected]>
Gerrit-Reviewer: Surya Hebbar <[email protected]>
Gerrit-Comment-Date: Mon, 22 Jun 2026 13:01:09 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9846: Enable Aggregated Runtime Profile by Default

Reply via email to