Re: [build system] IMPORTANT UPDATE

2020-11-24 Thread shane knapp ☠
all spark builds have been ported and triggered:
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/

not shown are the regular and k8s PRB, which are also running.

i think i've nailed down most of the stupid PATH and JAVA_HOME issues, but
i'm sure we'll have some stuff to work out.  i'm mostly keeping an eye on
the build history of research-jenkins-worker-01 and -02, as they're running
the latest OS + ansible (which will be moved into the spark repo asap).
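
for anyone chasing the same class of PATH/JAVA_HOME problems on a worker,
here is a minimal sketch of a sanity check.  this is an illustration only,
not what the ansible actually does: the function name check_build_env and
the list of tools it looks for (java, python3) are assumptions.

```python
import os
import shutil


def check_build_env(env=None, tools=("java", "python3")):
    """Return a list of human-readable problems found in the environment.

    `tools` is a hypothetical list; adjust it to whatever a build needs.
    """
    env = os.environ if env is None else env
    problems = []

    # JAVA_HOME should be set and should contain bin/java.
    java_home = env.get("JAVA_HOME")
    if not java_home:
        problems.append("JAVA_HOME is not set")
    elif not os.path.isfile(os.path.join(java_home, "bin", "java")):
        problems.append(f"JAVA_HOME has no bin/java: {java_home}")

    # Every required tool should be resolvable through PATH.
    path = env.get("PATH", "")
    for tool in tools:
        if shutil.which(tool, path=path) is None:
            problems.append(f"{tool} not found on PATH")

    return problems


if __name__ == "__main__":
    for problem in check_build_env():
        print(problem)
```

running it as the jenkins user on a worker prints one line per problem and
nothing at all if the environment looks sane.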

i'm still concerned about sbt failures, which includes the PRB.  we'll see
how things go, and just focus on getting things working on ubuntu 20 LTS.
if we need to drop the ubuntu 16 workers from the pool temporarily, i would
be more than happy to do that.  we'll lose some capacity, but it looks like
we have a solid template for getting these suckers redeployed so
turn-around should be pretty quick.

we also need to dedicate some time to clean up/fix our plugin configs.
there's been a lot of change over the past three years and things like PRB
triggers seem flaky (it took 28m instead of 5m for this job to trigger:
https://github.com/apache/spark/pull/29994)

this all being said, i'm really happy w/our progress so far and have
started leaning towards 'cautiously optimistic'...  we'll see how things go
and recalibrate accordingly.  i'll have a better idea of where we are
tomorrow and keep the list updated.

and finally:  a HUGE thanks goes out to jon for the work going on at the
colo this moment:  rack rearrangement, cleaning up networking, fixing
hardware, reimaging and generally kicking ass!

have a great holiday!

shane

On Tue, Nov 24, 2020 at 2:24 PM shane knapp ☠  wrote:

> our very first ubuntu-based PRB is running:
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131701/
>
> crossing my fingers!  :)
>
> On Tue, Nov 24, 2020 at 1:30 PM shane knapp ☠  wrote:
>
>> due to scheduling, upcoming holiday and in-the-colo work requirements,
>> all of the centos workers are being wiped NOW.
>>
>> this is great, as the sooner we can get started on fixing builds the
>> better.  i'm not going anywhere over the holiday, so i'll get a good
>> head-start on things.
>>
>> thank you jon!
>>
>> shane
>>
>> On Tue, Nov 24, 2020 at 11:24 AM shane knapp ☠ 
>> wrote:
>>
>>> this is a lengthy, but important read for everyone here.
>>>
>>> in the next few days, the remaining centos machines (PRB/SBT workers AND
>>> primary) will be reimaged from centos6.9 to ubuntu 20.04LTS.
>>>
>>> this means three important things on the very near horizon:
>>> 1 -- the PRB and SBT tests WILL BE BROKEN (by thanksgiving)
>>> 2 -- jenkins itself will be down for a while as we move the jenkins
>>> installation to its new home.
>>> 3 -- those of you with accounts here will temporarily lose access
>>>
>>> regarding (1), brian (cced) will be helping me debug and fix any
>>> system-level bugs (python envs, missing packages, etc).  jon (cced) will be
>>> doing the reimaging and cobbling together of hardware to keep us on our
>>> feet.  their help is going to be invaluable to getting us back on the
>>> ground.
>>>
>>> we already have two ubuntu 20 workers up and building
>>> (research-jenkins-worker-0[1,2]), and the SparkPullRequestBuilder-K8s build
>>> is already green.  i'll keep an eye on these workers to ensure i didn't
>>> miss anything.
>>>
>>> once we have a couple of more ubuntu 20 machines up, i'll move the PRB
>>> and SBT builds there and let them fail as often as possible so we can use
>>> the build logs during the migration of the primary.
>>>
>>> then we shut down jenkins and move to the new primary.
>>>
>>> this will all be happening in the next week to week-and-a-half.
>>>
>>> nearish on the horizon, we need to do three things:
>>> 1 -- reimage the ubuntu 16 workers
>>> 2 -- clean up all of the breakages within the jenkins plugin universe.
>>> there's a lot of stacktraces everywhere after the upgrade, but things are
>>> still building so i'm inclined to push this out.
>>> 3 -- fix the PRB/SBT builds.
>>>
>>> further off, once we're stable, we (the spark community) will need to
>>> have an honest conversation about where the build system lives.  we don't
>>> currently have enough resources here to manage the system in a way that it
>>> deserves, and i can't foresee getting the staffing for long-term support any
>>> time soon.
>>>
>>> however, with the ansible configs (which i plan on moving to the spark
>>> repo), it should be much easier to replicate the build system.
>>>
>>> by this time next year, i would like to have helped find the build
>>> system a new home, and sunset jenkins.  over the past 11 years (i think),
>>> this system has built spark.  it's getting a little tired and needs a well
>>> deserved break.  :)
>>>
>>> shane
>>> --
>>> Shane Knapp
>>> Computer Guy / Voice of Reason
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>
>>
>> --
>> Shane Knapp
>> Computer Guy / 

Re: [build system] IMPORTANT UPDATE

2020-11-24 Thread shane knapp ☠
our very first ubuntu-based PRB is running:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131701/

crossing my fingers!  :)

On Tue, Nov 24, 2020 at 1:30 PM shane knapp ☠  wrote:

> due to scheduling, upcoming holiday and in-the-colo work requirements, all
> of the centos workers are being wiped NOW.
>
> this is great, as the sooner we can get started on fixing builds the
> better.  i'm not going anywhere over the holiday, so i'll get a good
> head-start on things.
>
> thank you jon!
>
> shane
>
> On Tue, Nov 24, 2020 at 11:24 AM shane knapp ☠ 
> wrote:
>
>> this is a lengthy, but important read for everyone here.
>>
>> in the next few days, the remaining centos machines (PRB/SBT workers AND
>> primary) will be reimaged from centos6.9 to ubuntu 20.04LTS.
>>
>> this means three important things on the very near horizon:
>> 1 -- the PRB and SBT tests WILL BE BROKEN (by thanksgiving)
>> 2 -- jenkins itself will be down for a while as we move the jenkins
>> installation to its new home.
>> 3 -- those of you with accounts here will temporarily lose access
>>
>> regarding (1), brian (cced) will be helping me debug and fix any
>> system-level bugs (python envs, missing packages, etc).  jon (cced) will be
>> doing the reimaging and cobbling together of hardware to keep us on our
>> feet.  their help is going to be invaluable to getting us back on the
>> ground.
>>
>> we already have two ubuntu 20 workers up and building
>> (research-jenkins-worker-0[1,2]), and the SparkPullRequestBuilder-K8s build
>> is already green.  i'll keep an eye on these workers to ensure i didn't
>> miss anything.
>>
>> once we have a couple of more ubuntu 20 machines up, i'll move the PRB
>> and SBT builds there and let them fail as often as possible so we can use
>> the build logs during the migration of the primary.
>>
>> then we shut down jenkins and move to the new primary.
>>
>> this will all be happening in the next week to week-and-a-half.
>>
>> nearish on the horizon, we need to do three things:
>> 1 -- reimage the ubuntu 16 workers
>> 2 -- clean up all of the breakages within the jenkins plugin universe.
>> there's a lot of stacktraces everywhere after the upgrade, but things are
>> still building so i'm inclined to push this out.
>> 3 -- fix the PRB/SBT builds.
>>
>> further off, once we're stable, we (the spark community) will need to
>> have an honest conversation about where the build system lives.  we don't
>> currently have enough resources here to manage the system in a way that it
>> deserves, and i can't foresee getting the staffing for long-term support any
>> time soon.
>>
>> however, with the ansible configs (which i plan on moving to the spark
>> repo), it should be much easier to replicate the build system.
>>
>> by this time next year, i would like to have helped find the build system
>> a new home, and sunset jenkins.  over the past 11 years (i think), this
>> system has built spark.  it's getting a little tired and needs a well
>> deserved break.  :)
>>
>> shane
>> --
>> Shane Knapp
>> Computer Guy / Voice of Reason
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>
>
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [build system] IMPORTANT UPDATE

2020-11-24 Thread shane knapp ☠
due to scheduling, upcoming holiday and in-the-colo work requirements, all
of the centos workers are being wiped NOW.

this is great, as the sooner we can get started on fixing builds the
better.  i'm not going anywhere over the holiday, so i'll get a good
head-start on things.

thank you jon!

shane

On Tue, Nov 24, 2020 at 11:24 AM shane knapp ☠  wrote:

> this is a lengthy, but important read for everyone here.
>
> in the next few days, the remaining centos machines (PRB/SBT workers AND
> primary) will be reimaged from centos6.9 to ubuntu 20.04LTS.
>
> this means three important things on the very near horizon:
> 1 -- the PRB and SBT tests WILL BE BROKEN (by thanksgiving)
> 2 -- jenkins itself will be down for a while as we move the jenkins
> installation to its new home.
> 3 -- those of you with accounts here will temporarily lose access
>
> regarding (1), brian (cced) will be helping me debug and fix any
> system-level bugs (python envs, missing packages, etc).  jon (cced) will be
> doing the reimaging and cobbling together of hardware to keep us on our
> feet.  their help is going to be invaluable to getting us back on the
> ground.
>
> we already have two ubuntu 20 workers up and building
> (research-jenkins-worker-0[1,2]), and the SparkPullRequestBuilder-K8s build
> is already green.  i'll keep an eye on these workers to ensure i didn't
> miss anything.
>
> once we have a couple of more ubuntu 20 machines up, i'll move the PRB and
> SBT builds there and let them fail as often as possible so we can use the
> build logs during the migration of the primary.
>
> then we shut down jenkins and move to the new primary.
>
> this will all be happening in the next week to week-and-a-half.
>
> nearish on the horizon, we need to do three things:
> 1 -- reimage the ubuntu 16 workers
> 2 -- clean up all of the breakages within the jenkins plugin universe.
> there's a lot of stacktraces everywhere after the upgrade, but things are
> still building so i'm inclined to push this out.
> 3 -- fix the PRB/SBT builds.
>
> further off, once we're stable, we (the spark community) will need to have
> an honest conversation about where the build system lives.  we don't
> currently have enough resources here to manage the system in a way that it
> deserves, and i can't foresee getting the staffing for long-term support any
> time soon.
>
> however, with the ansible configs (which i plan on moving to the spark
> repo), it should be much easier to replicate the build system.
>
> by this time next year, i would like to have helped find the build system
> a new home, and sunset jenkins.  over the past 11 years (i think), this
> system has built spark.  it's getting a little tired and needs a well
> deserved break.  :)
>
> shane
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: jenkins downtime tomorrow evening/weekend

2020-11-24 Thread shane knapp ☠
i just added it to the PRB config.
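
for reference, a minimal sketch of how one could check what locale a build
step will actually see.  this is not the PRB config itself; the helper
names effective_ctype_locale and is_utf8_locale are made up for the
example.  it relies on the usual POSIX precedence, where LC_ALL overrides
LC_CTYPE, which overrides LANG.

```python
import os


def effective_ctype_locale(env=None):
    """Return the locale name that governs character encoding."""
    env = os.environ if env is None else env
    # POSIX precedence: LC_ALL, then LC_CTYPE, then LANG.
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:
            return value
    # Nothing set: systems commonly fall back to the C/POSIX locale.
    return "C"


def is_utf8_locale(locale_name):
    """True if the codeset part of e.g. 'en_US.UTF-8' is UTF-8."""
    # Split off the codeset after '.', dropping any '@modifier' suffix.
    codeset = locale_name.partition(".")[2].partition("@")[0]
    return codeset.lower().replace("-", "") == "utf8"


if __name__ == "__main__":
    name = effective_ctype_locale()
    print(name, "utf-8 ok" if is_utf8_locale(name) else "NOT utf-8")
```

note that setting LANG alone is not enough if something else in the
environment exports LC_ALL, since LC_ALL wins.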

On Tue, Nov 24, 2020 at 2:12 AM Yuming Wang  wrote:

> Hi Shane,
>
> Did you set: export LANG=en_US.UTF-8? Some tests seem to have failed because of
> this issue:
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131631/testReport/
>
> Please see https://issues.apache.org/jira/browse/SPARK-27177 for more
> details.
>
> On Tue, Nov 24, 2020 at 8:23 AM shane knapp ☠  wrote:
>
>> it seems that the plugin upgrade went as smoothly as it could have...  i
>> still have a bunch of stack traces to filter through and see if anything is
>> really broken but it's looking pretty good and things are building.
>>
>> if you see any bad behavior from jenkins, don't hesitate to file a jira
>> and ping me here.
>>
>> also, my backlog of things i need to install will be addressed this
>> week.  the ansible is coming along nicely!
>>
>> On Mon, Nov 23, 2020 at 2:11 PM shane knapp ☠ 
>> wrote:
>>
>>> the third most terrifying event in the world, a massive jenkins plugin
>>> update is happening in a couple of hours.  i'm going to restart jenkins and
>>> start working out any bugs/issues that pop up.
>>>
>>> this could be short, or quite long.  i'm guessing somewhere in the
>>> middle.  no new builds will be kicked off starting now.
>>>
>>> in parallel, i'm about to start porting my ansible to ubuntu 20 and
>>> testing that on two freshly reinstalled workers.  the ultimate goal is to
>>> get the PRB running on ubuntu 20...   the sbt tests will also likely be
>>> broken as i've never been able to get them to work on ubuntu 16, 18 or 20.
>>>
>>> shane
>>>
>>> On Sat, Nov 21, 2020 at 4:23 PM shane knapp ☠ 
>>> wrote:
>>>
>>>> somehow that went pretty smoothly, tho i've got a bunch of plugins to
>>>> deal with...  we're back up and building w/a shiny new UI.  :)
>>>>
>>>> On Sat, Nov 21, 2020 at 3:52 PM shane knapp ☠
>>>> wrote:
>>>>
>>>>> this is starting now
>>>>>
>>>>> On Thu, Nov 19, 2020 at 4:34 PM shane knapp ☠
>>>>> wrote:
>>>>>
>>>>>> i'm going to be upgrading jenkins to something more reasonable, and
>>>>>> there will definitely be some downtime as i get things sorted.
>>>>>>
>>>>>> we should be back up and building by monday.
>>>>>>
>>>>>> shane
>>>>>> --
>>>>>> Shane Knapp
>>>>>> Computer Guy / Voice of Reason
>>>>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>>>>> https://rise.cs.berkeley.edu
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Shane Knapp
>>>>> Computer Guy / Voice of Reason
>>>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>>>> https://rise.cs.berkeley.edu
>>>>>
>>>>
>>>>
>>>> --
>>>> Shane Knapp
>>>> Computer Guy / Voice of Reason
>>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>>> https://rise.cs.berkeley.edu
>>>>
>>>
>>>
>>> --
>>> Shane Knapp
>>> Computer Guy / Voice of Reason
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>
>>
>> --
>> Shane Knapp
>> Computer Guy / Voice of Reason
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>

-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


[build system] IMPORTANT UPDATE

2020-11-24 Thread shane knapp ☠
this is a lengthy, but important read for everyone here.

in the next few days, the remaining centos machines (PRB/SBT workers AND
primary) will be reimaged from centos6.9 to ubuntu 20.04LTS.

this means three important things on the very near horizon:
1 -- the PRB and SBT tests WILL BE BROKEN (by thanksgiving)
2 -- jenkins itself will be down for a while as we move the jenkins
installation to its new home.
3 -- those of you with accounts here will temporarily lose access

regarding (1), brian (cced) will be helping me debug and fix any
system-level bugs (python envs, missing packages, etc).  jon (cced) will be
doing the reimaging and cobbling together of hardware to keep us on our
feet.  their help is going to be invaluable to getting us back on the
ground.

we already have two ubuntu 20 workers up and building
(research-jenkins-worker-0[1,2]), and the SparkPullRequestBuilder-K8s build
is already green.  i'll keep an eye on these workers to ensure i didn't
miss anything.

once we have a couple of more ubuntu 20 machines up, i'll move the PRB and
SBT builds there and let them fail as often as possible so we can use the
build logs during the migration of the primary.

then we shut down jenkins and move to the new primary.

this will all be happening in the next week to week-and-a-half.

nearish on the horizon, we need to do three things:
1 -- reimage the ubuntu 16 workers
2 -- clean up all of the breakages within the jenkins plugin universe.
there's a lot of stacktraces everywhere after the upgrade, but things are
still building so i'm inclined to push this out.
3 -- fix the PRB/SBT builds.

further off, once we're stable, we (the spark community) will need to have
an honest conversation about where the build system lives.  we don't
currently have enough resources here to manage the system in a way that it
deserves, and i can't foresee getting the staffing for long-term support any
time soon.

however, with the ansible configs (which i plan on moving to the spark
repo), it should be much easier to replicate the build system.

by this time next year, i would like to have helped find the build system a
new home, and sunset jenkins.  over the past 11 years (i think), this
system has built spark.  it's getting a little tired and needs a well
deserved break.  :)

shane
-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: jenkins downtime tomorrow evening/weekend

2020-11-24 Thread Yuming Wang
Hi Shane,

Did you set: export LANG=en_US.UTF-8? Some tests seem to have failed because of
this issue:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131631/testReport/

Please see https://issues.apache.org/jira/browse/SPARK-27177 for more
details.

On Tue, Nov 24, 2020 at 8:23 AM shane knapp ☠  wrote:

> it seems that the plugin upgrade went as smoothly as it could have...  i
> still have a bunch of stack traces to filter through and see if anything is
> really broken but it's looking pretty good and things are building.
>
> if you see any bad behavior from jenkins, don't hesitate to file a jira
> and ping me here.
>
> also, my backlog of things i need to install will be addressed this week.
> the ansible is coming along nicely!
>
> On Mon, Nov 23, 2020 at 2:11 PM shane knapp ☠  wrote:
>
>> the third most terrifying event in the world, a massive jenkins plugin
>> update is happening in a couple of hours.  i'm going to restart jenkins and
>> start working out any bugs/issues that pop up.
>>
>> this could be short, or quite long.  i'm guessing somewhere in the
>> middle.  no new builds will be kicked off starting now.
>>
>> in parallel, i'm about to start porting my ansible to ubuntu 20 and
>> testing that on two freshly reinstalled workers.  the ultimate goal is to
>> get the PRB running on ubuntu 20...   the sbt tests will also likely be
>> broken as i've never been able to get them to work on ubuntu 16, 18 or 20.
>>
>> shane
>>
>> On Sat, Nov 21, 2020 at 4:23 PM shane knapp ☠ 
>> wrote:
>>
>>> somehow that went pretty smoothly, tho i've got a bunch of plugins to
>>> deal with...  we're back up and building w/a shiny new UI.  :)
>>>
>>> On Sat, Nov 21, 2020 at 3:52 PM shane knapp ☠ 
>>> wrote:
>>>
>>>> this is starting now
>>>>
>>>> On Thu, Nov 19, 2020 at 4:34 PM shane knapp ☠
>>>> wrote:
>>>>
>>>>> i'm going to be upgrading jenkins to something more reasonable, and
>>>>> there will definitely be some downtime as i get things sorted.
>>>>>
>>>>> we should be back up and building by monday.
>>>>>
>>>>> shane
>>>>> --
>>>>> Shane Knapp
>>>>> Computer Guy / Voice of Reason
>>>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>>>> https://rise.cs.berkeley.edu
>>>>>
>>>>
>>>>
>>>> --
>>>> Shane Knapp
>>>> Computer Guy / Voice of Reason
>>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>>> https://rise.cs.berkeley.edu
>>>>
>>>
>>>
>>> --
>>> Shane Knapp
>>> Computer Guy / Voice of Reason
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>
>>
>> --
>> Shane Knapp
>> Computer Guy / Voice of Reason
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>
>
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>