Denis Tarima created SPARK-46992:
------------------------------------
Summary: Inconsistent results with 'sort', 'cache', and AQE.
Key: SPARK-46992
URL: https://issues.apache.org/jira/browse/SPARK-46992
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 3.5.0, 3.3.2
Reporter: Denis Tarima
With AQE enabled, having 'sort' in the plan changes 'sample' results after
caching.
Moreover, when cached, 'collect' returns records as if it's not cached, which
is inconsistent with 'count' and 'show'.
A script to reproduce:
import spark.implicits._
val df = (1 to 4).toDF("id").sort("id").sample(0.4, 123)
println("NON CACHED:")
println(" count: " + df.count())
println(" collect: " + df.collect().mkString(" "))
df.show()
println("CACHED:")
df.cache().count()
println(" count: " + df.count())
println(" collect: " + df.collect().mkString(" "))
df.show()
df.unpersist()
output:
NON CACHED:
 count: 2
 collect: [1] [4]
+---+
| id|
+---+
|  1|
|  4|
+---+
CACHED:
 count: 3
 collect: [1] [4]
+---+
| id|
+---+
|  1|
|  2|
|  3|
+---+
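One way to check whether AQE is the deciding factor is to rerun the cached steps with spark.sql.adaptive.enabled set to true and then to false, and compare 'count' against the number of rows from 'collect'. The following is a minimal sketch and not part of the original report: the object name AqeSampleCheck, the local master, and the app name are assumptions for illustration, and no particular output is claimed.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical standalone harness; only the DataFrame pipeline comes from the report.
object AqeSampleCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")            // assumption: local run
      .appName("aqe-sample-check")   // assumption: illustrative name
      .getOrCreate()
    import spark.implicits._

    for (aqe <- Seq("true", "false")) {
      // Toggle Adaptive Query Execution for the queries planned below.
      spark.conf.set("spark.sql.adaptive.enabled", aqe)

      // Same pipeline as in the reproduction script.
      val df = (1 to 4).toDF("id").sort("id").sample(0.4, 123)

      df.cache().count()                 // materialize the cache
      val counted = df.count()           // row count reported by count()
      val collected = df.collect().length // row count reported by collect()
      println(s"AQE=$aqe count=$counted collect=$collected consistent=${counted == collected}")

      // Drop the cached data before the next iteration.
      df.unpersist(blocking = true)
    }

    spark.stop()
  }
}
```

If the two numbers differ only when AQE is on, that would support the observation above; if they differ in both runs, AQE is likely not the only factor.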