Some good things to look at, though hopefully #2 will be largely
addressed by https://github.com/apache/incubator-spark/pull/230
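
For concreteness, here's a hypothetical sketch of what single-place,
programmatic configuration could look like (Scala; the SparkConf class
and method names are illustrative guesses, not a description of the
actual patch):

    import org.apache.spark.{SparkConf, SparkContext}

    object UnifiedConfig {
      def main(args: Array[String]) {
        val conf = new SparkConf()
          .setMaster("spark://master:7077")    // illustrative master URL
          .setAppName("UnifiedConfig")
          .set("spark.executor.memory", "8g")  // no more JAVA_OPTS dance
        val sc = new SparkContext(conf)
        sc.stop()
      }
    }
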
Sent from Mailbox for iPhone

On Thu, Dec 19, 2013 at 8:57 PM, Andrew Ash <and...@andrewash.com> wrote:

> I think there are also some improvements that could be made to
> deployability in an enterprise setting.  From my experience:
> 1. Most places where I deploy Spark don't have internet access, so I
> can't build from source, compile against a different version of Hadoop,
> etc. without doing it locally and then getting the result onto my
> servers manually.  This is less of a problem with Spark now that there
> are binary distributions, but it's still a problem for using Mesos with
> Spark.
> 2. Configuration of Spark is confusing -- you can set configuration via
> Java system properties, environment variables, and command line
> parameters, and for the standalone cluster deployment mode you need to
> worry about whether these need to be set on the master, the worker, the
> executor, or the application/driver program.  Also, because spark-shell
> automatically instantiates a SparkContext, you have to set any system
> properties in the init scripts or on the command line with
> JAVA_OPTS="-Dspark.executor.memory=8g" etc. (see the sketch after this
> list).  I'm not sure what needs to be done, but it feels like there are
> gains to be made in configuration options here.  Ideally, I would have
> one configuration file that could be used in all four places, and that
> would be the only place to make configuration changes.
> 3. Standalone cluster mode could use improved resiliency for starting,
> stopping, and keeping a service alive -- there are custom init scripts
> that call each other in a mess of ways: spark-shell, spark-daemon.sh,
> spark-daemons.sh, spark-config.sh, spark-env.sh, compute-classpath.sh,
> spark-executor, spark-class, run-example, and several others in the bin/
> directory.  I would love it if Spark used the Tanuki Service Wrapper,
> which is widely used for Java service daemons and supports retries,
> installation as init scripts that can be chkconfig'd, etc.  Let's not
> re-solve the "how do I keep a service running?" problem when it's been
> done so well by Tanuki -- we use it at my day job for all our services,
> plus it's used by Elasticsearch.  This would help solve the problem
> where a quick bounce of the master causes all the workers to
> self-destruct.
> 4. Sensitivity to hostname vs. FQDN vs. IP address in the Spark URL --
> based on a previous mailing list discussion with Matei, this is entirely
> an Akka bug, but it'd be awesome if you could use the hostname, the
> FQDN, or the IP address in the Spark URL and not have Akka barf at you.
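>
> To make #2 concrete, here's a minimal sketch of the in-application
> route today (Scala; the master URL is illustrative).  System properties
> must be set before the SparkContext is created -- which is exactly why
> spark-shell forces you into JAVA_OPTS:
>
>     import org.apache.spark.SparkContext
>
>     object ConfigSketch {
>       def main(args: Array[String]) {
>         // Must be set before the context is constructed to take effect.
>         System.setProperty("spark.executor.memory", "8g")
>         val sc = new SparkContext("spark://master:7077", "ConfigSketch")
>         // ... job code ...
>         sc.stop()
>       }
>     }
>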
> I've been telling myself I'd look into these at some point but just
> haven't gotten around to them yet.  Some day!  I would prioritize these
> requests from most- to least-important as 3, 2, 4, 1.
> Andrew
> On Thu, Dec 19, 2013 at 1:38 PM, Nick Pentreath
> <nick.pentre...@gmail.com> wrote:
>> Or, if you're extremely ambitious, work on implementing Spark
>> Streaming in Python.
>> Sent from Mailbox for iPhone
>>
>> On Thu, Dec 19, 2013 at 8:30 PM, Matei Zaharia <matei.zaha...@gmail.com>
>> wrote:
>>
>> > Hi Matt,
>> > If you want to get started looking at Spark, I recommend the
>> > following resources:
>> > - Our issue tracker at http://spark-project.atlassian.net contains
>> > some issues marked “Starter” that are good places to jump into. You
>> > might be able to take one of those and extend it into a bigger
>> > project.
>> > - The “contributing to Spark” wiki page covers how to send patches
>> > and set up development:
>> > https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
>> > - This talk has an intro to Spark internals (video and slides are in
>> > the comments): http://www.meetup.com/spark-users/events/94101942/
>> > For a longer project, here are some possible ones:
>> > - Create a tool that automatically checks which Scala API methods
>> > are missing in Python (see the rough sketch after this list). We had
>> > a similar one for Java that was very useful. Even better would be to
>> > automatically create wrappers for the Scala ones.
>> > - Extend the Spark monitoring UI with profiling information, to
>> > sample the workers and say where they’re spending time, or what data
>> > structures consume the most memory (a sampling sketch follows this
>> > list).
>> > - Pick and implement a new machine learning algorithm for MLlib.
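>> >
>> > A rough sketch of the coverage-checker idea (Scala; the class and
>> > package names assume the current org.apache.spark layout): dump
>> > RDD's public methods via reflection and diff them against the method
>> > names PySpark exposes. The Python side is stubbed out to keep the
>> > sketch self-contained; a real tool would read dir(pyspark.rdd.RDD).
>> >
>> >     import org.apache.spark.rdd.RDD
>> >
>> >     object ApiCoverage {
>> >       def main(args: Array[String]) {
>> >         val scalaMethods =
>> >           classOf[RDD[_]].getMethods.map(_.getName).toSet
>> >         // Stub: in a real tool, collect these from PySpark itself.
>> >         val pythonMethods = Set.empty[String]
>> >         (scalaMethods -- pythonMethods).toSeq.sorted.foreach(println)
>> >       }
>> >     }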
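>> >
>> > And a crude sketch of the profiling idea: snapshot every thread's
>> > stack at a fixed interval and histogram the topmost frames. A real
>> > version would run inside each executor and report to the web UI;
>> > this one just prints the hottest frames of the local JVM.
>> >
>> >     import scala.collection.JavaConverters._
>> >     import scala.collection.mutable
>> >
>> >     object StackSampler {
>> >       def main(args: Array[String]) {
>> >         val counts = mutable.Map.empty[String, Int].withDefaultValue(0)
>> >         for (_ <- 1 to 100) {             // 100 samples, 100 ms apart
>> >           for ((_, frames) <- Thread.getAllStackTraces.asScala
>> >                if frames.nonEmpty) {
>> >             counts(frames.head.toString) += 1  // tally the top frame
>> >           }
>> >           Thread.sleep(100)
>> >         }
>> >         counts.toSeq.sortBy(-_._2).take(10).foreach(println)
>> >       }
>> >     }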
>> > Matei
>> > On Dec 17, 2013, at 10:43 AM, Matthew Cheah <mcch...@uwaterloo.ca>
>> > wrote:
>> >> Hi everyone,
>> >>
>> >> During my most recent internship, I worked extensively with Apache
>> >> Spark, integrating it into a company's data analytics platform. I've
>> >> now become interested in contributing to Apache Spark.
>> >>
>> >> I'm returning to undergraduate studies in January, and there is an
>> >> academic course which is simply a standalone software engineering
>> >> project. I was thinking that some contribution to Apache Spark would
>> >> satisfy my curiosity, help continue supporting the company I
>> >> interned at, and give me academic credits required to graduate, all
>> >> at the same time. It seems like too good an opportunity to pass up.
>> >>
>> >> With that in mind, I have the following questions:
>> >>
>> >>   1. At this point, is there any self-contained project that I
>> >>   could work on within Spark? Ideally, I would work on it
>> >>   independently, in about a three-month time frame. This time also
>> >>   needs to accommodate ramping up on the Spark codebase and
>> >>   adjusting to the Scala programming language and paradigms. The
>> >>   company I worked at primarily used the Java APIs. The output needs
>> >>   to be a technical report describing the project requirements and
>> >>   the design process I took to engineer a solution for them. In
>> >>   particular, it cannot just be a series of haphazard patches.
>> >>   2. How can I get started with contributing to Spark?
>> >>   3. Is there a high-level UML or some other design specification
>> >>   for the Spark architecture?
>> >>
>> >> Thanks! I hope to be of some help =)
>> >>
>> >> -Matt Cheah
>>
