The issue might be the group by, which under certain circumstances can cause a lot 
of traffic (a shuffle) to one node. That transfer of course matters less the fewer 
nodes you have.
Have you checked in the Spark UI what it reports?
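
If that is what is happening, a rough way to check (just a sketch, run against the 
df from your notebook below; the date and rtu columns are taken from your query) is 
to look at how evenly the rows are spread over the input partitions and at the 
physical plan of the aggregation:

    // how many input partitions the Cassandra connector created,
    // and how evenly the rows are spread across them
    println(df.rdd.getNumPartitions)
    df.rdd
      .mapPartitionsWithIndex { case (i, rows) => Iterator((i, rows.size)) }
      .collect()
      .foreach { case (i, n) => println(s"partition $i: $n rows") }

    // the physical plan shows the Exchange (shuffle) the group by introduces
    df.groupBy("date", "rtu").count().explain()

If one partition holds most of the rows, or the Stages page of the UI shows one task 
doing most of the shuffle read, that would explain why extra nodes mainly add 
transfer overhead.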

> On 17. May 2017, at 17:13, Junaid Nasir <jna...@an10.io> wrote:
> 
> I have a large data set of 1B records and want to run analytics using Apache 
> Spark because of the scaling it provides, but I am seeing an anti-pattern 
> here: the more nodes I add to the Spark cluster, the longer completion takes. 
> The data store is Cassandra, and queries are run through Zeppelin. I have 
> tried many different queries, but even a simple query like `dataframe.count()` 
> behaves like this. 
> 
> Here is the Zeppelin notebook; the temp table has 18M records: 
> 
> val df = sqlContext
>       .read
>       .format("org.apache.spark.sql.cassandra")
>       .options(Map( "table" -> "temp", "keyspace" -> "mykeyspace"))
>       .load().cache()
>     df.registerTempTable("table")
> 
> %sql 
> SELECT first(devid),date,count(1) FROM table group by date,rtu order by date
> 
> 
> When tested against different numbers of Spark worker nodes, these were the results:
> Spark nodes   Time
> 4 nodes       22 min 58 sec
> 3 nodes       15 min 49 sec
> 2 nodes       12 min 51 sec
> 1 node        17 min 59 sec
> 
> Increasing the number of nodes decreases performance, which should not happen, 
> as it defeats the purpose of using Spark. 
> 
> If you want me to run any query or need further info about the setup, please ask.
> Any clues as to why this is happening would be very helpful; I have been stuck on 
> this for two days now. Thank you for your time.
> 
> 
> Versions
> 
> Zeppelin: 0.7.1
> Spark: 2.1.0
> Cassandra: 2.2.9
> Connector: datastax:spark-cassandra-connector:2.0.1-s_2.11
> 
> Spark cluster specs (per node)
> 
> 6 vCPUs, 32 GB memory
> 
> Cassandra + Zeppelin server specs
> 8 vCPUs, 52 GB memory
> 
