Hi Spark Developers,

First, apologies if this doesn't belong on this list, but the
comments/praise are relevant to all developers. This is just a small note
about what we really like about Spark; we don't mean to start a long
discussion thread in this forum, just to share our positive experiences
with Spark so far.

To start, as you can tell, we think that the Spark project is amazing and
we love it! Having put nearly half a decade's worth of sweat and tears
into production Hadoop and MapReduce clusters and application development,
it's so refreshing to see something arguably simpler and more elegant come
along to supersede it.

These are the things we love about Spark and hope these principles continue:

- the one-command build, make-distribution.sh: simple, clean and ideal for
deployment, devops and rebuilding on different environments and nodes.
- not having too much runtime and deploy config; as admins and developers
we are sick of setting props like io.sort and
mapred.job.shuffle.merge.percent, dfs file locations, temp directories and
so on, over and over, every time we deploy a job, stand up a new cluster or
environment, or even change company.
- a fully built-in stack, one global project for SQL, DataFrames, MLlib
etc., so there is no need to bolt on projects as with Hive, Hue, HBase
etc. This simplifies life and keeps everything in one place (see the small
sketch after this list).
- single (global) user-based operation: no creation of hdfs and mapred
unix users, which makes life much simpler.
- simple quick-start daemons, just a master and slaves: not having to
worry about the JT, NN, DN, TT, RM, HBase master ... and running netstat
and jps across hundreds of clusters makes life much easier.
- proper code versioning, feature releases and release management.
- good, well-organised documentation with good examples.
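
To make the built-in stack point concrete, here is a minimal sketch of
SQL, DataFrames and MLlib all living in one session, with no bolt-on
projects. It assumes Spark 2.x or later (SparkSession) with spark-sql and
spark-mllib on the classpath; the app name, local master URL and column
names are ours for illustration:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.ml.feature.VectorAssembler

  object UnifiedStackSketch {
    def main(args: Array[String]): Unit = {
      // One entry point for the whole stack; local[*] keeps the
      // quick-start feel: a single JVM, no extra daemons to babysit.
      val spark = SparkSession.builder()
        .appName("unified-stack-sketch")  // hypothetical app name
        .master("local[*]")
        .getOrCreate()
      import spark.implicits._

      // DataFrame API
      val df = Seq((1.0, 2.0), (3.0, 4.0)).toDF("x", "y")

      // SQL over the same data, in the same session
      df.createOrReplaceTempView("points")
      spark.sql("SELECT x, y FROM points WHERE x > 1").show()

      // MLlib feature transformer applied to the same DataFrame
      val features = new VectorAssembler()
        .setInputCols(Array("x", "y"))
        .setOutputCol("features")
        .transform(df)
      features.show()

      spark.stop()
    }
  }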

In addition to the comments above, this is where we hope Spark never ends
up:

- tonnes of configuration properties and "go faster" type flags. For
example, Hadoop and HBase users will know that there is a whole catalogue
of properties for regions, caches, network settings, block sizes, etc.
Please don't end up somewhere like
https://hadoop.apache.org/docs/r1.0.4/mapred-default.html; it is painful
having to configure all of this, then create a set of properties for each
environment, then tie this into CI and deployment tools.
- more daemons and processes to have to monitor, manipulate, restart and
recover when they crash.
- a project that penalises developers (who will ultimately help promote
Spark to their managers and budget holders) with expensive training,
certification, books and accreditation. Ideally this open-source stack
should be free to learn: free training = more users = more commercial
uptake.

Anyway, those are our thoughts, for what they are worth; keep up the good
work, we just had to mention it. Again, sorry if this is not the right
place or if there is another forum for this stuff.

Cheers
