I've noticed that two queries, which return identical results, have very different performance. I'd be interested in any hints about how to avoid problems like this.
The DataFrame df contains a string column "series" and an integer column "eday", the number of days since (or before) the 1970-01-01 epoch. I'm doing some analysis over a sliding date window and, for now, avoiding UDAFs, so I'm using a self join. First, I create:

    val laggard = df.withColumnRenamed("series", "p_series").withColumnRenamed("eday", "p_eday")

Then the following query runs in 16s:

    df.join(laggard, (df("series") === laggard("p_series")) && (df("eday") === (laggard("p_eday") + 1))).count

while the following query runs in 4-6 minutes:

    df.join(laggard, (df("series") === laggard("p_series")) && ((df("eday") - laggard("p_eday")) === 1)).count

It's worth noting that the series term is necessary to keep the query from degenerating into a complete cartesian product over the data.

Ideally, I'd like to look at lags of more than one day, but the following is equally slow:

    df.join(laggard, (df("series") === laggard("p_series")) && (df("eday") - laggard("p_eday")).between(1, 7)).count

Any advice about the general principle at work here would be welcome.

Thanks,
David

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Differing-performance-in-self-joins-tp13864.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
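P.S. In case it helps frame the question, here is the kind of rewrite I've been considering for the multi-day lag. My guess is that the fast query is fast because both predicates are equalities, so the planner can use them as join keys, whereas the arithmetic forms can only join on series and must evaluate the subtraction as a filter afterwards. The sketch below (untested; the object name, the toy data, and the column names "lag" and "join_eday" are all mine) turns the between(1, 7) condition into explicit equi-join keys by exploding each laggard row into one copy per candidate lag:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object LagJoinSketch {
  // Returns the number of (row, lagged-row) pairs within a 1..7 day lag,
  // using only equality predicates in the join condition.
  def lagJoinCount(spark: SparkSession): Long = {
    import spark.implicits._

    // Toy stand-in for the real df.
    val df = Seq(("a", 100), ("a", 101), ("a", 103)).toDF("series", "eday")
    val laggard = df.withColumnRenamed("series", "p_series")
                    .withColumnRenamed("eday", "p_eday")

    // One row per (p_series, p_eday + lag) for lag in 1..7, so the range
    // condition becomes a plain equality on join_eday.
    val expanded = laggard
      .withColumn("lag", explode(array((1 to 7).map(lit): _*)))
      .withColumn("join_eday", col("p_eday") + col("lag"))

    // Both predicates are now equalities, mirroring the fast single-lag query.
    df.join(expanded,
      df("series") === expanded("p_series") &&
      df("eday") === expanded("join_eday")).count()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("lag-join-sketch")
      .master("local[*]")
      .getOrCreate()
    println(lagJoinCount(spark)) // 3 qualifying pairs in the toy data
    spark.stop()
  }
}
```

The trade-off, of course, is that the laggard side grows by a factor of the lag-window width before the join.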