Hi Spark Developers,

First, apologies if this doesn't belong on this list, but the comments/praise are relevant to all developers. This is just a small note about what we really like about Spark; we don't mean to start a long discussion thread in this forum, just to share our positive experiences with Spark thus far.
To start, as you can tell, we think the Spark project is amazing and we love it! Having put nearly half a decade's worth of sweat and tears into production Hadoop/MapReduce clusters and application development, it is refreshing to see something arguably simpler and more elegant come along to supersede it. These are the things we love about Spark, and we hope these principles continue:

- the one-command build, make-distribution.sh: simple, clean and ideal for deployment, devops and rebuilding on different environments and nodes (see the P.S. below for the handful of commands we mean).
- not having too much runtime and deploy config: as admins and developers we are sick of setting props like io.sort.mb and mapred.job.shuffle.merge.percent, DFS file locations, temp directories and so on, again and again, every time we deploy a job, a new cluster, a new environment, or even change company.
- a fully built-in stack: one global project for SQL, DataFrames, MLlib etc., with no need to bolt extra projects onto it as with Hive, Hue, HBase etc. This simplifies life and keeps everything in one place.
- single (global) user-based operation: no creation of dedicated hdfs/mapred Unix users, which makes life much simpler.
- simple quick-start daemons, just a master and slaves: not having to worry about the JT, NN, DN, TT, RM, HBase master and so on, and not having to run netstat and jps across hundreds of nodes, makes life much easier.
- proper code versioning, feature releases and release management.
- good, well-organised documentation with good examples.

In addition to the comments above, here is where we hope Spark never ends up:

- tonnes of configuration properties and "go faster" type flags. Hadoop and HBase users will know there is a whole catalogue of properties for regions, caches, network settings, block sizes, etc. Please don't end up like this, for example: https://hadoop.apache.org/docs/r1.0.4/mapred-default.html. It is painful to configure all of it, create a set of properties for each environment, and then tie that into CI and deployment tools.
- ever more daemons and processes to monitor, manipulate, restart and watch crash.
- a project that penalises developers (the very people who will ultimately promote Spark to their managers and budget holders) with expensive training, certification, books and accreditation. Ideally this open source project should stay free: free training = more users = more commercial uptake.

Anyway, those are our thoughts for what they are worth. Keep up the good work; we just had to mention it. Again, sorry if this is not the right place or if there is another forum for this stuff.

Cheers
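
P.S. For anyone reading along who hasn't tried it, the build-and-run flow we are praising above really is just a handful of commands. This is a rough sketch from memory of the Spark 1.x standalone layout; exact script names, locations and flags may differ between versions:

    # one-command build of a deployable distribution tarball
    ./make-distribution.sh --tgz

    # bring up a standalone cluster: one master, plus the
    # slaves listed in conf/slaves
    ./sbin/start-master.sh
    ./sbin/start-slaves.sh

Compare that with standing up a JT/NN/DN/TT stack by hand and you can see why we are so pleased.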