I've noticed that two queries that return identical results have very
different performance, and I'd be interested in any hints about how to avoid
problems like this.
The DataFrame df contains a string field "series" and an integer field "eday",
the number of days since (or before) the 1970-01-01 epoch.
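For concreteness, df looks roughly like this (values are made up; assumes
import spark.implicits._, or sqlContext.implicits._ on older releases):

    // Illustrative schema only: "series" is a string key, "eday" is an epoch day.
    val df = Seq(
      ("A", 16798),
      ("A", 16799),
      ("B", 16799)
    ).toDF("series", "eday")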
I'm doing some analysis over a sliding date window and, for now, avoiding
UDAFs, so I'm using a self-join. First, I create:
    val laggard = df
      .withColumnRenamed("series", "p_series")
      .withColumnRenamed("eday", "p_eday")
Then, the following query runs in 16s:
df.join(laggard, (df("series") === laggard("p_series")) && (df("eday") ===
(laggard("p_eday") + 1))).count
while the following query runs in 4-6 minutes:

    df.join(laggard, (df("series") === laggard("p_series")) &&
      ((df("eday") - laggard("p_eday")) === 1)).count
It's worth noting that the series term is necessary to keep the query from
doing a complete cartesian product over the data.
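(Presumably the difference would show up in the physical plans from explain()
for the two joins, e.g.:)

    df.join(laggard, (df("series") === laggard("p_series")) &&
      (df("eday") === (laggard("p_eday") + 1))).explain()

    df.join(laggard, (df("series") === laggard("p_series")) &&
      ((df("eday") - laggard("p_eday")) === 1)).explain()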
Ideally, I'd like to look at lags of more than one day, but the following is
equally slow:
df.join(laggard, (df("series") === laggard("p_series")) && (df("eday") -
laggard("p_eday")).between(1,7)).count
Any advice about the general principle at work here would be welcome.
Thanks,
David