Hello,

I am working on a project as part of my research. The system I am working
on is essentially an in-memory computing system, and I want to compare its
performance with Spark. Here is how I conduct the experiments. In my
project, I have a software-defined network (SDN) that allows HPC
applications to share data, e.g., by sending and receiving messages through
this network. For example, in a word count application, a master reads a
10GB text file from the hard drive, slices it into small chunks, and
distributes the chunks. Each worker fetches some chunks, processes them,
and sends the results back through the SDN. Then the master collects the
results.
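To make the flow concrete, here is a minimal single-process sketch of that master/worker pattern in plain Java (all names are hypothetical; the real system distributes the chunks over the SDN rather than iterating locally):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ChunkedWordCount {

    // "Master" step: slice the input lines into fixed-size chunks,
    // standing in for the SDN chunk-distribution step.
    static List<List<String>> slice(List<String> lines, int chunkSize) {
        List<List<String>> chunks = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += chunkSize) {
            chunks.add(lines.subList(i, Math.min(i + chunkSize, lines.size())));
        }
        return chunks;
    }

    // "Worker" step: count the words in one chunk.
    static Map<String, Long> countChunk(List<String> chunk) {
        Map<String, Long> counts = new HashMap<>();
        for (String line : chunk) {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1L, Long::sum);
                }
            }
        }
        return counts;
    }

    // "Master" collect step: merge the partial counts from all workers.
    static Map<String, Long> merge(List<Map<String, Long>> partials) {
        Map<String, Long> total = new HashMap<>();
        for (Map<String, Long> partial : partials) {
            partial.forEach((word, n) -> total.merge(word, n, Long::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("to be or not to be", "to be");
        List<Map<String, Long>> partials = new ArrayList<>();
        for (List<String> chunk : slice(lines, 1)) {
            partials.add(countChunk(chunk)); // each worker processes its chunk
        }
        System.out.println(merge(partials)); // master collects the results
    }
}
```

In the real system each `countChunk` call would run on a different worker node, with the chunks and partial counts travelling over the SDN.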

To compare with Spark, I run a word count application. I run Spark in
standalone mode, without any cluster manager and without a pre-installed
HDFS. I use PBS to reserve nodes, which gives me a list of nodes, and then
I simply run Spark on those nodes. Here is the command I use:
~/SPARK/bin/spark-submit --class word.JavaWordCount  --num-executors 1
spark.jar ~/data.txt  > ~/wc

Technically, these experiments run under the same conditions: read the
file, cut it into small chunks, distribute the chunks, process them, and
collect the results. Do you think this is a reasonable comparison? Or
could someone object with a claim like:
"Well, Spark is designed to work on top of HDFS, in which the data is
already stored on the nodes, and Spark jobs are submitted to these nodes to
take advantage of data locality"?


Any comment is appreciated.

Thanks



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-performance-comparison-for-research-tp16498.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
