Hello, I am working on a project as part of my research. The system I am building is essentially an in-memory computing system, and I want to compare its performance with Spark. Here is how I conduct the experiments.

For my project, I have a software-defined network (SDN) that lets HPC applications share data, for example by sending and receiving messages through the network. In a word count application, a master reads a 10 GB text file from disk, slices it into small chunks, and distributes the chunks. Each worker fetches some chunks, processes them, and sends the results back over the SDN; the master then collects the results.
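To make the flow concrete, here is a rough single-process sketch of the master's scatter/gather loop, with a thread pool standing in for the workers and the SDN transport. The chunk size and worker count are placeholders, and the real system streams the 10 GB file and handles word boundaries at chunk edges, which this sketch does not:

import java.nio.file.*;
import java.util.*;
import java.util.concurrent.*;

// Sketch of the scatter/gather word count flow: slice the input into
// chunks, hand each chunk to a worker, and merge the partial counts.
// In the real system the chunks travel over the SDN; a thread pool
// stands in for remote workers here.
public class ScatterGatherWordCount {
    static final int CHUNK_SIZE = 64 * 1024 * 1024; // placeholder chunk size

    public static void main(String[] args) throws Exception {
        // For illustration only: a real 10 GB file would be streamed,
        // not read into a single in-memory array.
        byte[] data = Files.readAllBytes(Paths.get(args[0]));

        ExecutorService workers = Executors.newFixedThreadPool(8);
        List<Future<Map<String, Integer>>> partials = new ArrayList<>();

        // Scatter: slice the file and submit each chunk to a worker.
        // (This naive slicing can split a word across two chunks.)
        for (int off = 0; off < data.length; off += CHUNK_SIZE) {
            String chunk = new String(data, off, Math.min(CHUNK_SIZE, data.length - off));
            partials.add(workers.submit(() -> countWords(chunk)));
        }

        // Gather: the master merges the per-chunk counts.
        Map<String, Integer> totals = new HashMap<>();
        for (Future<Map<String, Integer>> f : partials) {
            f.get().forEach((word, n) -> totals.merge(word, n, Integer::sum));
        }
        workers.shutdown();
        System.out.println(totals.size() + " distinct words");
    }

    static Map<String, Integer> countWords(String chunk) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : chunk.split("\\s+")) {
            if (!w.isEmpty()) counts.merge(w, 1, Integer::sum);
        }
        return counts;
    }
}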
To compare with Spark, I run the same word count application. I run Spark in standalone mode, without any cluster manager and without a pre-installed HDFS. I use PBS to reserve nodes, which gives me a list of nodes, and then I simply run Spark on those nodes. Here is the command:

~/SPARK/bin/spark-submit --class word.JavaWordCount --num-executors 1 spark.jar ~/data.txt > ~/wc

Technically, both experiments run under the same conditions: read the file, cut it into small chunks, distribute the chunks, process them, and collect the results.

Do you think this is a reasonable comparison? Or can someone make this claim: "Well, Spark is designed to work on top of HDFS, in which the data is already stored on the nodes, and Spark jobs are submitted to these nodes to take advantage of data locality"?

Any comment is appreciated. Thanks
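P.S. For reference, word.JavaWordCount is essentially the standard word count written against Spark's Java API. A minimal sketch of such a class (assuming Spark 2.x with Java 8 lambdas; my actual class may differ in details):

package word;

import java.util.Arrays;
import scala.Tuple2;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Standard Spark word count: read lines, split into words, count by key.
public final class JavaWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("JavaWordCount");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read the input file given on the command line (e.g. ~/data.txt).
        JavaRDD<String> lines = sc.textFile(args[0]);

        // Emit (word, 1) pairs, then sum the counts per word.
        JavaPairRDD<String, Integer> counts = lines
            .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
            .mapToPair(word -> new Tuple2<>(word, 1))
            .reduceByKey((a, b) -> a + b);

        // Print to stdout; the command above redirects this to ~/wc.
        for (Tuple2<String, Integer> t : counts.collect()) {
            System.out.println(t._1() + "\t" + t._2());
        }

        sc.stop();
    }
}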