Interesting. As long as Spark's dependencies don't change that often, the
same caches could save "from scratch" build time over many months of Spark
development. Is that right?

On Tue, Dec 8, 2015 at 12:33 PM Josh Rosen <joshro...@databricks.com> wrote:

> @Nick, on a fresh EC2 instance a significant chunk of the initial build
> time might be due to artifact resolution + downloading. Putting
> pre-populated Ivy and Maven caches onto your EC2 machine could shave a
> decent chunk of time off that first build.
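>
> A rough sketch of how one might stage that (the tarball name and S3
> bucket are hypothetical; ~/.ivy2 and ~/.m2/repository are the default
> cache locations):
>
> # on a machine that has already built Spark once
> tar czf build-caches.tar.gz -C ~ .ivy2 .m2/repository
> aws s3 cp build-caches.tar.gz s3://my-bucket/
>
> # on the fresh EC2 instance, before the first build
> aws s3 cp s3://my-bucket/build-caches.tar.gz .
> tar xzf build-caches.tar.gz -C ~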
>
> On Tue, Dec 8, 2015 at 9:16 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Thanks for the tips, Jakob and Steve.
>>
>> It looks like my original approach is the best for me since I'm
>> installing Spark on newly launched EC2 instances and can't take advantage
>> of incremental compilation.
>>
>> Nick
>>
>> On Tue, Dec 8, 2015 at 7:01 AM Steve Loughran <ste...@hortonworks.com>
>> wrote:
>>
>>> On 7 Dec 2015, at 19:07, Jakob Odersky <joder...@gmail.com> wrote:
>>>
>>> make-distribution and the second code snippet both create a distribution
>>> from a clean state. They therefore require that every source file be
>>> compiled, and that takes time (you can maybe tweak some settings or use a
>>> newer compiler to gain some speed).
>>>
>>> I'm inferring from your question that for your use case deployment speed
>>> is a critical issue, and furthermore that you'd like to build Spark for
>>> lots of (every?) commits in a systematic way. In that case I would
>>> suggest you try using the second code snippet without the `clean` task,
>>> and only resort to cleaning if the build fails.
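>>>
>>> A minimal sketch of that fallback (the `||` chaining is just one way to
>>> wire it up; `assembly` is the task from your second snippet):
>>>
>>> # try an incremental assembly first; only clean if the build breaks
>>> sbt/sbt assembly || sbt/sbt clean assembly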
>>>
>>> On my local machine, an assembly without a clean drops from 6 minutes to
>>> 2.
>>>
>>> regards,
>>> --Jakob
>>>
>>>
>>> 1. you can use zinc (where possible) to speed up scala compilations;
>>> see the sketch below this list
>>> 2. you might also consider setting up a local jenkins VM, hooked to
>>> whatever git repo & branch you are working off, and have it do the builds
>>> and tests for you. Not so great for interactive dev, though.
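>>>
>>> A sketch of the zinc route (assuming zinc is installed, e.g. via
>>> Homebrew; Spark's build/mvn wrapper will download and start zinc for
>>> you automatically):
>>>
>>> # start a long-running incremental-compile server
>>> zinc -start
>>> # builds via the scala-maven-plugin can then attach to it
>>> build/mvn -DskipTests -Phadoop-2.6 package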
>>>
>>> finally, on the mac, the "say" command is pretty handy at letting you
>>> know when some work in a terminal has finished, so you can kick off the
>>> first-thing-in-the-morning build of the SNAPSHOTs:
>>>
>>> mvn install -DskipTests -Pyarn,hadoop-2.6 -Dhadoop.version=2.7.1; say moo
>>>
>>> After that you can work on the modules you care about (via the -pl
>>> option). That doesn't work if you are running on an EC2 instance, though.
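>>>
>>> For instance (the module name here is illustrative; -am tells Maven to
>>> also rebuild the modules it depends on):
>>>
>>> # rebuild only core, reusing the SNAPSHOT artifacts installed above
>>> mvn install -DskipTests -Pyarn,hadoop-2.6 -pl core -am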
>>>
>>>
>>>
>>>
>>> On 23 November 2015 at 20:18, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
>>>> Say I want to build a complete Spark distribution against Hadoop 2.6+
>>>> as fast as possible from scratch.
>>>>
>>>> This is what I’m doing at the moment:
>>>>
>>>> ./make-distribution.sh -T 1C -Phadoop-2.6
>>>>
>>>> -T 1C instructs Maven to spin up 1 thread per available core. This
>>>> takes around 20 minutes on an m3.large instance.
>>>>
>>>> I see that spark-ec2, on the other hand, builds Spark as follows
>>>> <https://github.com/amplab/spark-ec2/blob/a990752575cd8b0ab25731d7820a55c714798ec3/spark/init.sh#L21-L22>
>>>> when you deploy Spark at a specific git commit:
>>>>
>>>> sbt/sbt clean assembly
>>>> sbt/sbt publish-local
>>>>
>>>> This seems slower than using make-distribution.sh, actually.
>>>>
>>>> Is there a faster way to do this?
>>>>
>>>> Nick
>>>>
