[
https://issues.apache.org/jira/browse/IMPALA-13555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17899743#comment-17899743
]
Surya Hebbar edited comment on IMPALA-13555 at 11/20/24 5:57 PM:
-----------------------------------------------------------------
Consider the following example, to see some limitations and complications
arising from the current aggregation mechanism(i.e. 'labels', 'label_idxs',
'timestamps').
The events are streamed from instances in this exact order.
Instance 1 -
"Open Started" - 2280ms
"Open Finished" - 3270ms
"Last Batch Returned" - 3470ms
"Closed" - 4100ms
Instance 2 -
"Open Started" - 1280ms
"Open Finished" - 1330ms
"First Batch Requested" - 1370ms
"Closed" - 3100ms
Result of aggregation -
labels -
"Open Started" - 0, "Open Finished" - 1, "Last Batch Returned" - 2, "Closed" -
3, "First Batch Requested" - 4
label_idxs -
0 1 2 3
0 1 4 3
timestamps -
2280 3270 3470 4100
1280 1330 1370 3100
Note : The resulting 'labels', 'timestamps' and 'label_idxs' will be consistent
as 'SortEvents' is being called before delivering events.
The proper alignment of timestamps (with zeros substituted) is -
(possible to generate using 'label_idxs')
2280 3270 3470 4100 0
1280 1330 0 3100 1370
The proper order and alignment of timestamps(with zeros substituted) is -
(functionally not possible to generate, even after using 'labels'(map),
'label_idxs' and 'timestamps')
2280 3270 0 3470 4100
1280 1330 1370 0 3100
(note: identifying the order of events, through comparison between timestamps
across instances is wrong)
In this manner, if the first few instances responsible for generating 'labels'
map have missing events, it is functionally not possible to extract the perfect
order.
Hence, the current aggregation mechanism, fails to provide an absolute method
to align and order the missing events' timestamp in many edge cases.
Although a definite ordering is not always possible, I have come up with the
following approach to generate the most probable orderings(using 'label_idxs'
from all instances).
1. Consider each index in 'label_idxs' as a node
2. While maintaining a set of visited nodes
3. Add adjacent indexes in 'label_idxs' as parent(or children), if they exist
in visited nodes(Add to visited nodes, otherwise)
4. The result will be
-> A degenerate binary tree (a successful definite order was found)
-> A single multi-terminal unrooted tree(definite order, except where there are
multiple terminals)
-> A forest of disjoint trees(not functionally possible to find definite order,
need to merge all trees in any order)
5. Level-order traversal of the final tree will provide a perfect order / one
of the most probable orderings.
I believe this would result in a space and time complexity of O\(n).
In addition to the computational overhead, even after such a complex ordering
process, still there is a possibility of not finding the perfect order(when it
is not functionally definite, i.e. consider indexes [0, 1, 2] [3, 4, 5]).
The proposed alternative is to have a predefined order for event labels (unsure
if it is always possible to have a predefined order for events in cases like
AsyncCodegen).
After inspecting all these aspects, I have currently decided to go with the
first option here.
There might be a much simpler or different way through this.
Currently the following mechanism has been implemented, which handles almost
all cases.
[https://gerrit.cloudera.org/c/21683/9/be/src/util/runtime-profile.cc#3040]
was (Author: JIRAUSER299620):
Consider the following example, to see some limitations and complications
arising from the current aggregation mechanism(i.e. 'labels', 'label_idxs',
'timestamps').
The events are streamed from instances in this exact order.
Instance 1 -
"Open Started" - 2280ms
"Open Finished" - 3270ms
"Last Batch Returned" - 3470ms
"Closed" - 4100ms
Instance 2 -
"Open Started" - 1280ms
"Open Finished" - 1330ms
"First Batch Requested" - 1370ms
"Closed" - 3100ms
Result of aggregation -
labels -
"Open Started" - 0, "Open Finished" - 1, "Last Batch Returned" - 2, "Closed" -
3, "First Batch Requested" - 4
label_idxs -
0 1 2 3
0 1 4 3
timestamps -
2280 3270 3470 4100
1280 1330 1370 3100
Note : The resulting 'labels', 'timestamps' and 'label_idxs' will be consistent
as 'SortEvents' is being called before delivering events.
The proper alignment of timestamps (with zeros substituted) is -
(possible to generate using 'label_idxs')
2280 3270 3470 4100 0
1280 1330 0 3100 1370
The proper order and alignment of timestamps(with zeros substituted) is -
(functionally not possible to generate, even after using 'labels'(map),
'label_idxs' and 'timestamps')
2280 3270 0 3470 4100
1280 1330 1370 0 3100
(note: identifying the order of events, through comparison between timestamps
across instances is wrong)
In this manner, if the first few instances responsible for generating 'labels'
map have missing events, it is functionally not possible to extract the perfect
order.
Hence, the current aggregation mechanism, fails to provide an absolute method
to align and order the missing events' timestamp in many edge cases.
Although a definite ordering is not always possible, I have come up with the
following approach to generate the most probable orderings(using 'label_idxs'
from all instances).
1. Consider each index in 'label_idxs' as a node
2. While maintaining a set of visited nodes
3. Add adjacent indexes in 'label_idxs' as parent(or children), if they exist
in visited nodes(Add to visited nodes, otherwise)
4. The result will be
-> A degenerate binary tree (a successful definite order was found)
-> A single multi-terminal unrooted tree(definite order, except where there are
multiple terminals)
-> A forest of disjoint trees(not functionally possible to find definite order,
need to merge all trees in any order)
5. Level-order traversal of the final tree will provide a perfect order / one
of the most probable orderings.
I believe this would result in a space and time complexity of O(n).
In addition to the computational overhead, even after such a complex ordering
process, still there is a possibility of not finding the perfect order(when it
is not functionally definite, i.e. consider indexes [0, 1, 2] [3, 4, 5]).
The proposed alternative is to have a predefined order for event labels (unsure
if it is always possible to have a predefined order for events in cases like
AsyncCodegen).
After inspecting all these aspects, I have currently decided to go with the
first option here.
There might be a much simpler or different way through this.
Currently the following mechanism has been implemented, which handles almost
all cases.
[https://gerrit.cloudera.org/c/21683/9/be/src/util/runtime-profile.cc#3040]
> Enhance order and predictability of reported events in EventSequence
> --------------------------------------------------------------------
>
> Key: IMPALA-13555
> URL: https://issues.apache.org/jira/browse/IMPALA-13555
> Project: IMPALA
> Issue Type: Improvement
> Reporter: Surya Hebbar
> Priority: Major
>
> In some cases, when the instances responsible for generating 'labels' map
> have missing events, it is functionally not possible to extract the perfect
> order.
> The current event aggregation mechanism, fails to provide an absolute method
> to align and order the missing events' timestamp in many edge cases.
> It would be great to enhance EventSequence's constructor and the collection
> mechanism along with a cleanup and performance boost provided by switching to
> a completed version of profile V2.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]