mertak opened a new pull request, #6921:
URL: https://github.com/apache/arrow-datafusion/pull/6921
<div style="box-sizing: border-box;"><div class="edit-comment-hide"
style="box-sizing: border-box;"><task-lists disabled="" sortable=""
style="box-sizing: border-box;"><div class="comment-body markdown-body
js-comment-body soft-wrap css-overflow-wrap-anywhere user-select-contain
d-block" style="box-sizing: border-box; display: block !important; font-family:
-apple-system, BlinkMacSystemFont, "Segoe UI", "Noto Sans",
Helvetica, Arial, sans-serif, "Apple Color Emoji", "Segoe UI
Emoji"; font-size: 14px; line-height: 1.5; overflow-wrap: anywhere; width:
838px; padding: 16px; overflow: visible; color: var(--color-fg-default);"><h1
dir="auto" style="box-sizing: border-box; font-size: 2em; margin-top: 0px
!important; margin-right: 0px; margin-bottom: 16px; margin-left: 0px;
font-weight: var(--base-text-weight-semibold, 600); line-height: 1.25;
padding-bottom: 0.3em; border-bottom: 1px solid var(--borderColor-muted,
var(--color-border-muted));">Rati
onale for this change</h1><p dir="auto" style="box-sizing: border-box;
margin-top: 0px; margin-bottom: 16px;">The planner currently uses<span>
</span><code class="notranslate" style="box-sizing: border-box; font-family:
ui-monospace, SFMono-Regular, "SF Mono", Menlo, Consolas,
"Liberation Mono", monospace; font-size: 11.9px; padding: 0.2em
0.4em; margin: 0px; white-space: break-spaces; background-color:
var(--bgColor-neutral-muted, var(--color-neutral-muted)); border-radius:
6px;">RepartitionExec</code>s for repartitioning even when this results in the
loss of output ordering. This happens when a<span> </span><code
class="notranslate" style="box-sizing: border-box; font-family: ui-monospace,
SFMono-Regular, "SF Mono", Menlo, Consolas, "Liberation
Mono", monospace; font-size: 11.9px; padding: 0.2em 0.4em; margin: 0px;
white-space: break-spaces; background-color: var(--bgColor-neutral-muted,
var(--color-neutral-muted)); border-radius: 6px;">Re
partitionExec</code><span> </span>has multiple input partitions -- it cannot
preserve the ordering of its inputs in this case. This results in
pipeline-breaking<span> </span><code class="notranslate" style="box-sizing:
border-box; font-family: ui-monospace, SFMono-Regular, "SF Mono",
Menlo, Consolas, "Liberation Mono", monospace; font-size: 11.9px;
padding: 0.2em 0.4em; margin: 0px; white-space: break-spaces; background-color:
var(--bgColor-neutral-muted, var(--color-neutral-muted)); border-radius:
6px;">SortExec</code>s being introduced to the plan.</p><p dir="auto"
style="box-sizing: border-box; margin-top: 0px; margin-bottom: 16px;">This rule
checks the children nodes of a<span> </span><code class="notranslate"
style="box-sizing: border-box; font-family: ui-monospace, SFMono-Regular,
"SF Mono", Menlo, Consolas, "Liberation Mono", monospace;
font-size: 11.9px; padding: 0.2em 0.4em; margin: 0px; white-space:
break-spaces; background-color:
var(--bgColor-neutral-muted, var(--color-neutral-muted)); border-radius:
6px;">SortExec</code><span> </span>until ordering is not preserved (such as
until another<span> </span><code class="notranslate" style="box-sizing:
border-box; font-family: ui-monospace, SFMono-Regular, "SF Mono",
Menlo, Consolas, "Liberation Mono", monospace; font-size: 11.9px;
padding: 0.2em 0.4em; margin: 0px; white-space: break-spaces; background-color:
var(--bgColor-neutral-muted, var(--color-neutral-muted)); border-radius:
6px;">SortExec</code><span> </span>or<br style="box-sizing: border-box;"><code
class="notranslate" style="box-sizing: border-box; font-family: ui-monospace,
SFMono-Regular, "SF Mono", Menlo, Consolas, "Liberation
Mono", monospace; font-size: 11.9px; padding: 0.2em 0.4em; margin: 0px;
white-space: break-spaces; background-color: var(--bgColor-neutral-muted,
var(--color-neutral-muted)); border-radius:
6px;">CoalescePartitionsExec</code><span> <
/span>which doesn't maintain ordering). It then iteratively replaces<span>
</span><code class="notranslate" style="box-sizing: border-box; font-family:
ui-monospace, SFMono-Regular, "SF Mono", Menlo, Consolas,
"Liberation Mono", monospace; font-size: 11.9px; padding: 0.2em
0.4em; margin: 0px; white-space: break-spaces; background-color:
var(--bgColor-neutral-muted, var(--color-neutral-muted)); border-radius:
6px;">RepartitionExec</code>s that do not maintain ordering (e.g<span>
</span><code class="notranslate" style="box-sizing: border-box; font-family:
ui-monospace, SFMono-Regular, "SF Mono", Menlo, Consolas,
"Liberation Mono", monospace; font-size: 11.9px; padding: 0.2em
0.4em; margin: 0px; white-space: break-spaces; background-color:
var(--bgColor-neutral-muted, var(--color-neutral-muted)); border-radius:
6px;">RepartitionExec</code>s where input partition count is larger than 1)
with<span> </span><code class="notranslate" style="box-siz
ing: border-box; font-family: ui-monospace, SFMono-Regular, "SF
Mono", Menlo, Consolas, "Liberation Mono", monospace; font-size:
11.9px; padding: 0.2em 0.4em; margin: 0px; white-space: break-spaces;
background-color: var(--bgColor-neutral-muted, var(--color-neutral-muted));
border-radius: 6px;">SortPreservingRepartitionExec</code>s. Doing so often
renders some<span> </span><code class="notranslate" style="box-sizing:
border-box; font-family: ui-monospace, SFMono-Regular, "SF Mono",
Menlo, Consolas, "Liberation Mono", monospace; font-size: 11.9px;
padding: 0.2em 0.4em; margin: 0px; white-space: break-spaces; background-color:
var(--bgColor-neutral-muted, var(--color-neutral-muted)); border-radius:
6px;">SortExec</code>s unnecessary. As an example, this rule turns plan
below</p><div class="highlight highlight-source-sql notranslate
position-relative overflow-auto" dir="auto" style="box-sizing: border-box;
position: relative !important; overflow
: visible !important; margin-bottom: 16px; background-color:
transparent;"><pre class="notranslate" style="box-sizing: border-box;
font-family: ui-monospace, SFMono-Regular, "SF Mono", Menlo,
Consolas, "Liberation Mono", monospace; font-size: 11.9px;
margin-top: 0px; margin-bottom: 0px; overflow-wrap: normal; padding: 16px;
overflow: auto; line-height: 1.45; color: var(--fgColor-default,
var(--color-fg-default)); background-color: var(--bgColor-muted,
var(--color-canvas-subtle)); border-radius: 6px; word-break: normal;"><span
class="pl-s" style="box-sizing: border-box; color:
var(--color-prettylights-syntax-string);"><span class="pl-pds"
style="box-sizing: border-box; color:
var(--color-prettylights-syntax-string);">"</span>SortPreservingMergeExec: [a@0
ASC NULLS LAST]<span class="pl-pds" style="box-sizing: border-box; color:
var(--color-prettylights-syntax-string);">"</span></span>,
<span class="pl-s" style="box-sizing: border-box; color:
var(--color-prettylights-syntax-string);"><span class="pl-pds"
style="box-sizing: border-box; color:
var(--color-prettylights-syntax-string);">"</span> SortExec: expr=[a@0 ASC
NULLS LAST]<span class="pl-pds" style="box-sizing: border-box; color:
var(--color-prettylights-syntax-string);">"</span></span>,
<span class="pl-s" style="box-sizing: border-box; color:
var(--color-prettylights-syntax-string);"><span class="pl-pds"
style="box-sizing: border-box; color:
var(--color-prettylights-syntax-string);">"</span> RepartitionExec:
partitioning=Hash([b@0], 16), input_partitions=16<span class="pl-pds"
style="box-sizing: border-box; color:
var(--color-prettylights-syntax-string);">"</span></span>,
<span class="pl-s" style="box-sizing: border-box; color:
var(--color-prettylights-syntax-string);"><span class="pl-pds"
style="box-sizing: border-box; color:
var(--color-prettylights-syntax-string);">"</span> RepartitionExec:
partitioning=Hash([a@0], 16), input_partitions=1<span class="pl-pds"
style="box-sizing: border-box; color:
var(--color-prettylights-syntax-string);">"</span></span>,
<span class="pl-s" style="box-sizing: border-box; color:
var(--color-prettylights-syntax-string);"><span class="pl-pds"
style="box-sizing: border-box; color:
var(--color-prettylights-syntax-string);">"</span> MemoryExec:
partitions=1, partition_sizes=[(<depends_on_batch_size>)],
output_ordering: [PhysicalSortExpr { expr: Column { name: \<span class="pl-pds"
style="box-sizing: border-box; color:
var(--color-prettylights-syntax-string);">"</span></span>a\<span class="pl-s"
style="box-sizing: border-box; color:
var(--color-prettylights-syntax-string);"><span class="pl-pds"
style="box-sizing: border-box; color:
var(--color-prettylights-syntax-string);">"</span>, index: 0 }, options:
SortOptions { descending: false, nulls_first: false } }]<span class="pl-pds"
style="box-sizing: border-box; color:
var(--color-prettylights-syntax-string);">"</span></span>,</pre></div><p
dir="auto" style="box-sizing: border-box; margin-top: 0px; margin-bottom:
16px;">to</p><div class="highlig
ht highlight-source-sql notranslate position-relative overflow-auto"
dir="auto" style="box-sizing: border-box; position: relative !important;
overflow: visible !important; margin-bottom: 16px; background-color:
transparent;"><pre class="notranslate" style="box-sizing: border-box;
font-family: ui-monospace, SFMono-Regular, "SF Mono", Menlo,
Consolas, "Liberation Mono", monospace; font-size: 11.9px;
margin-top: 0px; margin-bottom: 0px; overflow-wrap: normal; padding: 16px;
overflow: auto; line-height: 1.45; color: var(--fgColor-default,
var(--color-fg-default)); background-color: var(--bgColor-muted,
var(--color-canvas-subtle)); border-radius: 6px; word-break: normal;"><span
class="pl-s" style="box-sizing: border-box; color:
var(--color-prettylights-syntax-string);"><span class="pl-pds"
style="box-sizing: border-box; color:
var(--color-prettylights-syntax-string);">"</span>SortPreservingMergeExec: [a@0
ASC NULLS LAST]<span class="pl-pds" style="box-sizing: border-b
ox; color: var(--color-prettylights-syntax-string);">"</span></span>,
<span class="pl-s" style="box-sizing: border-box; color:
var(--color-prettylights-syntax-string);"><span class="pl-pds"
style="box-sizing: border-box; color:
var(--color-prettylights-syntax-string);">"</span>
SortPreservingRepartitionExec: partitioning=Hash([b@0], 16),
input_partitions=16<span class="pl-pds" style="box-sizing: border-box; color:
var(--color-prettylights-syntax-string);">"</span></span>,
<span class="pl-s" style="box-sizing: border-box; color:
var(--color-prettylights-syntax-string);"><span class="pl-pds"
style="box-sizing: border-box; color:
var(--color-prettylights-syntax-string);">"</span> RepartitionExec:
partitioning=Hash([a@0], 16), input_partitions=1<span class="pl-pds"
style="box-sizing: border-box; color:
var(--color-prettylights-syntax-string);">"</span></span>,
<span class="pl-s" style="box-sizing: border-box; color:
var(--color-prettylights-syntax-string);"><span class="pl-pds"
style="box-sizing: border-box; color:
var(--color-prettylights-syntax-string);">"</span> MemoryExec:
partitions=1, partition_sizes=[<depends_on_batch_size>], output_ordering:
[PhysicalSortExpr { expr: Column { name: \<span class="pl-pds"
style="box-sizing: border-box; color:
var(--color-prettylights-syntax-string);">"</span></span>a\<span class="pl-s"
style="box-sizing: border-box; color:
var(--color-prettylights-syntax-string);"><span class="pl-pds"
style="box-sizing: border-box; color:
var(--color-prettylights-syntax-string);">"</span>, index: 0 }, options:
SortOptions { descending: false, nulls_first: false } }]<span class="pl-pds"
style="box-sizing: border-box; color:
var(--color-prettylights-syntax-string);">"</span></span>,</pre></div><p
dir="auto" style="box-sizing: border-box; margin-top: 0px; margin-bottom:
16px;">During our benchmarks we have
compared above plans to see how the new plan affects performance. Table below
shows different parameters and performance of each version.</p>
batch_size | elapsed_time(mean) for new version | elapsed_time(median) for
new version | elapsed_time(mean) for old version | elapsed_time(median) for old
version
-- | -- | -- | -- | --
25 | 581.791054ms | 581.777708ms | 1.112292862s | 1.117181625s
50 | 373.18927ms | 373.231708ms | 385.533966ms | 385.685291ms
100 | 240.91572ms | 240.570833ms | 239.308708ms | 239.303709ms
1000 | 115.471541ms | 113.817709ms | 100.713324ms | 100.647333ms
<p dir="auto" style="box-sizing: border-box; margin-top: 0px; margin-bottom:
16px;">For all the tests, our test data had<span> </span><code
class="notranslate" style="box-sizing: border-box; font-family: ui-monospace,
SFMono-Regular, "SF Mono", Menlo, Consolas, "Liberation
Mono", monospace; font-size: 11.9px; padding: 0.2em 0.4em; margin: 0px;
white-space: break-spaces; background-color: var(--bgColor-neutral-muted,
var(--color-neutral-muted)); border-radius: 6px;">100_000</code><span>
</span>rows, for each case<span> </span><code class="notranslate"
style="box-sizing: border-box; font-family: ui-monospace, SFMono-Regular,
"SF Mono", Menlo, Consolas, "Liberation Mono", monospace;
font-size: 11.9px; padding: 0.2em 0.4em; margin: 0px; white-space:
break-spaces; background-color: var(--bgColor-neutral-muted,
var(--color-neutral-muted)); border-radius: 6px;">10</code><span> </span>runs
were taken. According to our measurements, the old versi
on performs better after the<span> </span><code class="notranslate"
style="box-sizing: border-box; font-family: ui-monospace, SFMono-Regular,
"SF Mono", Menlo, Consolas, "Liberation Mono", monospace;
font-size: 11.9px; padding: 0.2em 0.4em; margin: 0px; white-space:
break-spaces; background-color: var(--bgColor-neutral-muted,
var(--color-neutral-muted)); border-radius:
6px;">batch_size>100</code><span> </span>threshold. Hence we have decided to
use this rule only to prevent pipeline-breaking in unbounded (streaming) use
cases. In the future, we may utilize the latter plan in more cases when we have
a better optimized<span> </span><code class="notranslate" style="box-sizing:
border-box; font-family: ui-monospace, SFMono-Regular, "SF Mono",
Menlo, Consolas, "Liberation Mono", monospace; font-size: 11.9px;
padding: 0.2em 0.4em; margin: 0px; white-space: break-spaces; background-color:
var(--bgColor-neutral-muted, var(--color-neutral-muted))
; border-radius: 6px;">SortPreservingRepartitionExec</code><span>
</span>available to us.</p><h1 dir="auto" style="box-sizing: border-box;
font-size: 2em; margin: 24px 0px 16px; font-weight:
var(--base-text-weight-semibold, 600); line-height: 1.25; padding-bottom:
0.3em; border-bottom: 1px solid var(--borderColor-muted,
var(--color-border-muted));">What changes are included in this PR?</h1><p
dir="auto" style="box-sizing: border-box; margin-top: 0px; margin-bottom:
16px;">In the file,<span> </span><code class="notranslate" style="box-sizing:
border-box; font-family: ui-monospace, SFMono-Regular, "SF Mono",
Menlo, Consolas, "Liberation Mono", monospace; font-size: 11.9px;
padding: 0.2em 0.4em; margin: 0px; white-space: break-spaces; background-color:
var(--bgColor-neutral-muted, var(--color-neutral-muted)); border-radius:
6px;">replace_repartition_execs.rs</code><span> </span>the optimizer rule is
implemented. Other than that there is a typo fix for<span> </sp
an><code class="notranslate" style="box-sizing: border-box; font-family:
ui-monospace, SFMono-Regular, "SF Mono", Menlo, Consolas,
"Liberation Mono", monospace; font-size: 11.9px; padding: 0.2em
0.4em; margin: 0px; white-space: break-spaces; background-color:
var(--bgColor-neutral-muted, var(--color-neutral-muted)); border-radius:
6px;">CoalescePartitionsExec</code>.</p><h1 dir="auto" style="box-sizing:
border-box; font-size: 2em; margin: 24px 0px 16px; font-weight:
var(--base-text-weight-semibold, 600); line-height: 1.25; padding-bottom:
0.3em; border-bottom: 1px solid var(--borderColor-muted,
var(--color-border-muted));">Are these changes tested?</h1><p dir="auto"
style="box-sizing: border-box; margin-top: 0px; margin-bottom: 16px;">Yes, the
unit tests are included in the<span> </span><code class="notranslate"
style="box-sizing: border-box; font-family: ui-monospace, SFMono-Regular,
"SF Mono", Menlo, Consolas, "Liberation Mono", monospace;
font-size: 11.9px; padding: 0.2em 0.4em; margin: 0px; white-space:
break-spaces; background-color: var(--bgColor-neutral-muted,
var(--color-neutral-muted)); border-radius:
6px;">replace_repartition_execs.rs</code><span> </span>file.</p><h1 dir="auto"
style="box-sizing: border-box; font-size: 2em; margin: 24px 0px 16px;
font-weight: var(--base-text-weight-semibold, 600); line-height: 1.25;
padding-bottom: 0.3em; border-bottom: 1px solid var(--borderColor-muted,
var(--color-border-muted));">Are there any user-facing changes?</h1><p
dir="auto" style="box-sizing: border-box; margin-top: 0px; margin-bottom: 0px
!important;">No</p></div></task-lists></div></div><form
class="js-comment-update" id="issue-1767673674-edit-form" data-type="json"
data-turbo="false"
action="https://github.com/synnada-ai/arrow-datafusion/issues/122"
accept-charset="UTF-8" method="post" style="box-sizing:
border-box;"></form><div class="pr-review-reactions" style="box-sizing:
border-box;"><div data-view-component
="true" class="comment-reactions just-bottom js-reactions-container
js-reaction-buttons-container social-reactions reactions-container
has-reactions d-flex" style="box-sizing: border-box; display: flex;
margin-bottom: 16px; margin-left: 16px;"><reactions-menu tabindex="-1"
data-catalyst="" style="box-sizing: border-box; display: block;"><details
data-action="toggle:reactions-menu#focusFirstItem"
data-target="reactions-menu.details" class="dropdown details-reset
details-overlay d-inline-block new-reactions-dropdown
js-reaction-popover-container js-comment-header-reaction-button"
style="box-sizing: border-box; display: inline-block !important; position:
relative;"><summary data-target="reactions-menu.summary" aria-label="Add or
remove reactions" aria-haspopup="true" data-view-component="true" class="circle
reaction-dropdown-button reaction-dropdown-button--inline btn-invisible btn p-0
mr-1 d-flex flex-justify-center flex-items-center color-bg-subtle border
color-border-muted" style="b
ox-sizing: border-box; display: flex !important; cursor: pointer; position:
relative; padding: 0px !important; font-size: 14px; font-weight:
var(--base-text-weight-medium, 500); line-height: 20px; white-space: nowrap;
vertical-align: middle; user-select: none; border-top-width: !important;
border-right-width: !important; border-bottom-width: !important;
border-left-width: !important; border-top-style: !important;
border-right-style: !important; border-bottom-style: !important;
border-left-style: !important; border-color: var(--borderColor-muted,
var(--color-border-muted)) !important; border-image-source: !important;
border-image-slice: !important; border-image-width: !important;
border-image-outset: !important; border-image-repeat: !important;
border-radius: var(--borderRadius-full, 50%) !important; appearance: none;
color: var(--color-fg-muted); background-color: var(--bgColor-muted,
var(--color-canvas-subtle)) !important; box-shadow: none; transition: color
80ms cubic
-bezier(0.33, 1, 0.68, 1) 0s, background-color, box-shadow, border-color;
justify-content: center !important; align-items: center !important;
margin-right: var(--base-size-4, 4px) !important; width: 26px; height: 26px;
list-style: none;"><svg height="18" aria-hidden="true" viewBox="0 0 16 16"
version="1.1" width="18" data-view-component="true" class="octicon
octicon-smiley social-button-emoji"><path d="M8 0a8 8 0 1 1 0 16A8 8 0 0 1 8
0ZM1.5 8a6.5 6.5 0 1 0 13 0 6.5 6.5 0 0 0-13 0Zm3.82 1.636a.75.75 0 0 1
1.038.175l.007.009c.103.118.22.222.35.31.264.178.683.37 1.285.37.602 0
1.02-.192 1.285-.371.13-.088.247-.192.35-.31l.007-.008a.75.75 0 0 1
1.222.87l-.022-.015c.02.013.021.015.021.015v.001l-.001.002-.002.003-.005.007-.014.019a2.066
2.066 0 0 1-.184.213c-.16.166-.338.316-.53.445-.63.418-1.37.638-2.127.629-.946
0-1.652-.308-2.126-.63a3.331 3.331 0 0
1-.715-.657l-.014-.02-.005-.006-.002-.003v-.002h-.001l.613-.432-.614.43a.75.75
0 0 1 .183-1.044ZM12 7a1 1 0 1 1-2 0 1 1 0 0 1 2 0ZM5 8a1 1
0 1 1 0-2 1 1 0 0 1 0 2Zm5.25 2.25.592.416a97.71 97.71 0 0
0-.592-.416Z"></path></svg></summary></details></reactions-menu><form
class="js-pick-reaction" data-turbo="false"
action="https://github.com/synnada-ai/arrow-datafusion/reactions"
accept-charset="UTF-8" method="post" style="box-sizing: border-box;"><div
class="js-comment-reactions-options d-flex flex-items-center flex-row
flex-wrap" style="box-sizing: border-box; flex-flow: row wrap !important;
align-items: center !important; display: flex !important;"><button
name="input[content]" id="reactions--reaction_button_component-a110cc"
value="ROCKET react" data-button-index-position="6"
data-reaction-label="Rocket" data-reaction-content="rocket"
aria-pressed="false" aria-label="rocket (1): mertak, 05:00PM on June 21"
type="submit" data-view-component="true" class="social-reaction-summary-item
js-reaction-group-button btn-link d-flex no-underline color-fg-muted
flex-items-baseline mr-2" aria-describedby="tooltip-7afd657d-d5eb-4fba
-a49c-9279fda477e7" style="box-sizing: border-box; font-style: inherit;
font-variant: inherit; font-weight: inherit; font-stretch: inherit; font-size:
12px; line-height: 26px; font-family: inherit; font-optical-sizing: inherit;
font-kerning: inherit; font-feature-settings: inherit; font-variation-settings:
inherit; margin: 0px; overflow: visible; text-transform: none; appearance:
none; cursor: pointer; border-radius: 100px; display: flex !important; padding:
0px 4px !important; color: var(--fgColor-muted, var(--color-fg-muted))
!important; text-decoration-line: none !important; text-decoration-thickness:
initial; text-decoration-style: initial; text-decoration-color: initial;
white-space: nowrap; user-select: none; background-color: transparent; border:
1px solid var(--color-border-default, #d2dff0); align-items: baseline
!important; height: 26px; orphans: 2; widows: 2; -webkit-text-stroke-width:
0px;"><g-emoji alias="rocket"
fallback-src="https://github.githubassets.com/images/icon
s/emoji/unicode/1f680.png" class="social-button-emoji" style="box-sizing:
border-box; display: inline-block; min-width: 1ch; font-family: "Apple
Color Emoji", "Segoe UI Emoji", "Segoe UI Symbol";
font-size: 1em !important; font-weight: var(--base-text-weight-normal, 400);
line-height: 1.25; vertical-align: -1px; font-style: normal !important; width:
16px; height: 16px;">🚀</g-emoji><span
class="js-discussion-reaction-group-count" style="box-sizing: border-box;
height: 24px; padding: 0px 4px; margin-left: 2px;">1</span></button><br
class="Apple-interchange-newline">
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]