Maybe your master or Zeppelin server is running out of memory, and the more data it receives the more swapping it has to do; that is something worth checking.
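One quick way to look at part of this from inside the notebook (a rough sketch, assuming `sc` is the notebook's SparkContext) is to print each executor's storage memory. Note this only shows Spark's cache memory per executor, not OS-level swapping on the Zeppelin/Cassandra box itself, which you would still check with the usual system tools:

// Rough sketch: print max vs. remaining storage memory for each executor,
// to see whether the cached table is pushing executors toward their limit.
sc.getExecutorMemoryStatus.foreach { case (executor, (maxMem, freeMem)) =>
  println(f"$executor%-30s storage max: ${maxMem / (1024 * 1024)}%6d MB, free: ${freeMem / (1024 * 1024)}%6d MB")
}

If the free numbers collapse as soon as the cached table is touched, memory pressure could well be part of the story.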

On Wed, May 17, 2017 at 11:14 AM -0400, "Junaid Nasir" <jna...@an10.io> wrote:

I have a large data set of 1B records and want to run analytics using Apache Spark because of the scaling it provides, but I am seeing an anti-pattern here: the more nodes I add to the Spark cluster, the longer completion takes. The data store is Cassandra, and queries are run through Zeppelin. I have tried many different queries, but even a simple `dataframe.count()` behaves like this.
Here is the Zeppelin notebook; the temp table has 18M records:


// Read the Cassandra table into a DataFrame, cache it, and expose it to %sql
val df = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "temp", "keyspace" -> "mykeyspace"))
  .load()
  .cache()

df.registerTempTable("table")

%sql
SELECT first(devid), date, count(1) FROM table GROUP BY date, rtu ORDER BY date

When tested against different numbers of Spark worker nodes, these were the results:

Spark nodes    Time
4 nodes        22 min 58 sec
3 nodes        15 min 49 sec
2 nodes        12 min 51 sec
1 node         17 min 59 sec
Increasing the number of nodes decreases performance, which should not happen; it defeats the purpose of using Spark.
If you want me to run any query or need further info about the setup, please ask. Any clues on why this is happening would be very helpful; I have been stuck on this for two days now. Thank you for your time.

**versions**
Zeppelin: 0.7.1
Spark: 2.1.0
Cassandra: 2.2.9
Connector: datastax:spark-cassandra-connector:2.0.1-s_2.11

Spark cluster specs
6 vCPUs, 32 GB memory = 1 node

Cassandra + Zeppelin server specs
8 vCPUs, 52 GB memory