Thanks for your suggestions. file.count() takes 7s, so that doesn't seem to be the problem. Moreover, a union with the same code/CSV takes about 15s (SELECT * FROM rooms2 UNION SELECT * FROM rooms3).
The web status page shows that both stages 'count at joins.scala:216' and 'reduce at joins.scala:219' take up the majority of the time. Is this due to bad partitioning or caching? Or is there a problem with the JOIN operator?