The issue might be the GROUP BY, which under certain circumstances can cause a lot of traffic to one node. That transfer naturally shrinks the fewer nodes you have. Have you checked what the Spark UI reports?
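A quick way to see whether one grouping key dominates the shuffle is to count rows per key directly in the notebook. A minimal sketch, reusing the cached df and the date/rtu column names taken from your query below:

    import org.apache.spark.sql.functions.desc

    // Rows per group-by key: if one (date, rtu) pair carries a large share
    // of the 18M rows, the aggregation shuffle funnels that share to one task.
    df.groupBy("date", "rtu")
      .count()
      .orderBy(desc("count"))
      .show(20)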
> On 17. May 2017, at 17:13, Junaid Nasir <jna...@an10.io> wrote:
>
> I have a large data set of 1B records and want to run analytics using Apache
> Spark because of the scaling it provides, but I am seeing an anti-pattern
> here: the more nodes I add to the Spark cluster, the longer completion takes.
> The data store is Cassandra, and queries are run through Zeppelin. I have
> tried many different queries, but even something as simple as
> `dataframe.count()` behaves like this.
>
> Here is the Zeppelin notebook; the temp table has 18M records:
>
> val df = sqlContext
>     .read
>     .format("org.apache.spark.sql.cassandra")
>     .options(Map("table" -> "temp", "keyspace" -> "mykeyspace"))
>     .load().cache()
> df.registerTempTable("table")
>
> %sql
> SELECT first(devid), date, count(1) FROM table GROUP BY date, rtu ORDER BY date
>
> When tested against different numbers of Spark worker nodes, these were the
> results:
>
> Spark nodes    Time
> 4 nodes        22 min 58 sec
> 3 nodes        15 min 49 sec
> 2 nodes        12 min 51 sec
> 1 node         17 min 59 sec
>
> Increasing the number of nodes decreases performance, which should not
> happen, as it defeats the purpose of using Spark.
>
> If you want me to run any query or need further information about the setup,
> please ask. Any clues as to why this is happening would be very helpful; I
> have been stuck on this for two days now. Thank you for your time.
>
> Versions:
>
> Zeppelin: 0.7.1
> Spark: 2.1.0
> Cassandra: 2.2.9
> Connector: datastax:spark-cassandra-connector:2.0.1-s_2.11
>
> Spark cluster specs:
> 6 vCPUs, 32 GB memory = 1 node
>
> Cassandra + Zeppelin server specs:
> 8 vCPUs, 52 GB memory
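For the plain dataframe.count() case mentioned above, it may also be worth pushing the count down to Cassandra instead of pulling every row through Spark first. A sketch using the connector's RDD-side count, with the keyspace and table names assumed from the snippet above:

    import com.datastax.spark.connector._

    // Counts rows per token range inside Cassandra; only the per-range
    // totals travel back to the driver, so there is no shuffle at all.
    val total = sc.cassandraTable("mykeyspace", "temp").cassandraCount()
    println(total)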