Thanks for sharing the feedback about what works well for you! It's nice to get that; as we all probably know, people generally reach out only when they have problems.
On Wed, Feb 25, 2015 at 5:38 PM Reynold Xin <r...@databricks.com> wrote:

> Thanks for the email and encouragement, Devl. Responses to the 3 requests:
>
> -tonnes of configuration properties and "go faster" type flags. For example
> Hadoop and Hbase users will know that there are a whole catalogue of
> properties for regions, caches, network properties, block sizes, etc etc.
> Please don't end up here for example:
> https://hadoop.apache.org/docs/r1.0.4/mapred-default.html, it is painful
> having to configure all of this and then create a set of properties for
> each environment and then tie this into CI and deployment tools.
>
> As the project grows, it is unavoidable that more config options get
> introduced; in particular, we often use config options to test new modules
> that are still experimental before making them the default (e.g. sort-based
> shuffle).
>
> The philosophy here is to set a very high bar for introducing new config
> options, to make the default values sensible for most deployments, and
> then, whenever possible, to figure out the right setting automatically.
> This is hard in general, but we expect that 99% of users will only need to
> know a very small number of options (e.g. setting the serializer).
>
> -no more daemons and processes to have to monitor and manipulate and
> restart and crash.
>
> At the very least you'd need the cluster manager itself to be a daemon
> process, because we can't defy the laws of physics. But I don't think we
> want to introduce anything beyond that.
>
> -a project that penalises developers (that will ultimately help promote
> Spark to their managers and budget holders) with expensive training,
> certification, books and accreditation. Ideally this open source should be
> free, free training = more users = more commercial uptake.
>
> I definitely agree with you on making it easier to learn Spark.
> We are making a lot of materials freely available, including two free
> MOOCs on edX:
> https://databricks.com/blog/2014/12/02/announcing-two-spark-based-moocs.html
>
> On Wed, Feb 25, 2015 at 2:13 PM, Devl Devel <devl.developm...@gmail.com> wrote:
>
> > Hi Spark Developers,
> >
> > First, apologies if this doesn't belong on this list, but the
> > comments/praise are relevant to all developers. This is just a small note
> > about what we really like about Spark; we don't mean to start a long
> > discussion thread in this forum, just to share our positive experiences
> > with Spark thus far.
> >
> > To start, as you can tell, we think the Spark project is amazing and we
> > love it! Having put nearly half a decade's worth of sweat and tears into
> > production Hadoop and MapReduce clusters and application development,
> > it's so refreshing to see something arguably simpler and more elegant
> > supersede it.
> >
> > These are the things we love about Spark, and we hope these principles
> > continue:
> >
> > -the one-command build, make-distribution.sh: simple, clean, and ideal
> > for deployment, devops, and rebuilding on different environments and
> > nodes.
> > -not having too much runtime and deploy config: as admins and developers
> > we are sick of setting props like io.sort and
> > mapred.job.shuffle.merge.percent and dfs file locations and temp
> > directories, over and over again, every time we deploy a job, a new
> > cluster, or an environment, or even change company.
> > -a fully built-in stack, one global project for SQL, dataframes, MLlib,
> > etc., so there is no need to bolt extra projects onto it as with Hive,
> > Hue, Hbase, etc. This makes life easier and keeps everything in one
> > place.
> > -single (global) user-based operation: no creation of hdfs/mapred unix
> > users makes life much simpler.
> > -single quick-start daemons, master and slaves: not having to worry
> > about JT, NN, DN, TT, RM, Hbase master, and running netstat and jps on
> > hundreds of clusters makes life much easier.
> > -proper code versioning, feature releases, and release management.
> > -good, well-organised documentation with good examples.
> >
> > In addition to the comments above, this is where we hope Spark never
> > ends up:
> >
> > -tonnes of configuration properties and "go faster" type flags. For
> > example, Hadoop and Hbase users will know that there are a whole
> > catalogue of properties for regions, caches, network properties, block
> > sizes, etc. Please don't end up here, for example:
> > https://hadoop.apache.org/docs/r1.0.4/mapred-default.html; it is painful
> > having to configure all of this, then create a set of properties for
> > each environment, then tie this into CI and deployment tools.
> > -no more daemons and processes to have to monitor, manipulate, restart,
> > and watch crash.
> > -a project that penalises developers (who will ultimately help promote
> > Spark to their managers and budget holders) with expensive training,
> > certification, books, and accreditation. Ideally this open source should
> > stay free: free training = more users = more commercial uptake.
> >
> > Anyway, those are our thoughts for what they're worth; keep up the good
> > work, we just had to mention it. Again, sorry if this is not the right
> > place or if there is another forum for this stuff.
> >
> > Cheers
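A small footnote on the "very small number of options" point Reynold makes: the serializer example he gives usually amounts to a one-line entry in conf/spark-defaults.conf. The property names below are from the standard Spark configuration docs; the master URL and memory value are purely illustrative, not recommendations:

```
# conf/spark-defaults.conf -- the handful of settings most deployments touch
spark.serializer        org.apache.spark.serializer.KryoSerializer
spark.master            spark://master:7077
spark.executor.memory   4g
```

Everything else is left at its default, which is the contrast with the hadoop mapred-default.html catalogue linked above.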