@Nick, on a fresh EC2 instance a significant chunk of the initial build time is likely spent on artifact resolution and downloading. Putting pre-populated Ivy and Maven caches onto your EC2 machine could shave a fair amount of time off that first build.
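Something like this is what I have in mind (an untested sketch; the user name and `<new-instance>` host are placeholders, adjust for your AMI):

    # On a machine that has already built Spark once, snapshot the caches:
    tar czf build-caches.tar.gz -C "$HOME" .ivy2 .m2/repository

    # Copy the tarball to the fresh instance and unpack it before building:
    scp build-caches.tar.gz ec2-user@<new-instance>:
    ssh ec2-user@<new-instance> 'tar xzf build-caches.tar.gz -C "$HOME"'

If you launch instances often, baking the unpacked caches into the AMI itself would avoid even the copy step.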
On Tue, Dec 8, 2015 at 9:16 AM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> Thanks for the tips, Jakob and Steve.
>
> It looks like my original approach is the best for me, since I'm installing
> Spark on newly launched EC2 instances and can't take advantage of
> incremental compilation.
>
> Nick
>
> On Tue, Dec 8, 2015 at 7:01 AM Steve Loughran <ste...@hortonworks.com> wrote:
>
>> On 7 Dec 2015, at 19:07, Jakob Odersky <joder...@gmail.com> wrote:
>>
>> make-distribution and the second code snippet both create a distribution
>> from a clean state. They therefore require that every source file be
>> compiled, and that takes time (you can maybe tweak some settings or use a
>> newer compiler to gain some speed).
>>
>> I'm inferring from your question that deployment speed is a critical
>> issue for your use case, and that you'd like to build Spark for many
>> (every?) commits in a systematic way. In that case I would suggest you
>> try the second code snippet without the `clean` task and only resort to
>> it if the build fails.
>>
>> On my local machine, an assembly without a clean drops from 6 minutes
>> to 2.
>>
>> regards,
>> --Jakob
>>
>> 1. You can use zinc, where possible, to speed up Scala compilations.
>> 2. You might also consider setting up a local Jenkins VM, hooked to
>> whatever git repo & branch you are working off, and have it do the builds
>> and tests for you. Not so great for interactive dev, though.
>>
>> Finally, on the Mac, the "say" command is pretty handy for letting you
>> know when some work in a terminal is done, so you can kick off the
>> first-thing-in-the-morning build of the SNAPSHOTs:
>>
>> mvn install -DskipTests -Pyarn,hadoop-2.6 -Dhadoop.version=2.7.1; say moo
>>
>> After that you can work on the modules you care about (via the -pl
>> option). That doesn't work if you are running on an EC2 instance, though.
>>
>> On 23 November 2015 at 20:18, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>
>>> Say I want to build a complete Spark distribution against Hadoop 2.6+
>>> as fast as possible from scratch.
>>>
>>> This is what I’m doing at the moment:
>>>
>>> ./make-distribution.sh -T 1C -Phadoop-2.6
>>>
>>> -T 1C instructs Maven to spin up 1 thread per available core. This
>>> takes around 20 minutes on an m3.large instance.
>>>
>>> I see that spark-ec2, on the other hand, builds Spark as follows
>>> <https://github.com/amplab/spark-ec2/blob/a990752575cd8b0ab25731d7820a55c714798ec3/spark/init.sh#L21-L22>
>>> when you deploy Spark at a specific git commit:
>>>
>>> sbt/sbt clean assembly
>>> sbt/sbt publish-local
>>>
>>> This actually seems slower than using make-distribution.sh.
>>>
>>> Is there a faster way to do this?
>>>
>>> Nick
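P.S. For the archives: the incremental workflow Jakob and Steve describe amounts to something like this (a rough sketch I haven't timed myself; "core" is just an example module):

    # First build: full install without clean, skipping tests.
    mvn install -DskipTests -Pyarn,hadoop-2.6 -Dhadoop.version=2.7.1

    # Later builds: recompile only the module you changed, plus anything
    # that depends on it (-amd = --also-make-dependents), instead of the
    # whole tree.
    mvn install -DskipTests -Pyarn,hadoop-2.6 -Dhadoop.version=2.7.1 -pl core -amd

Whether the second step is safe depends on how far-reaching your changes are; as Jakob says, fall back to a clean build if it fails.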