+1 to only having one way to do things. The Lite option seems liable to
cause more problems since it means its changes can be blown away if a new
image isn't prepared anyway.
I don't think we are changing the images often enough for it.  Perhaps call
it the option to test changes if anything?

On Tue, Oct 19, 2021, 11:55 AM Valentyn Tymofieiev <[email protected]>
wrote:

> All workers were updated to use jenkins-slave-boot-image-20211011, which
> should have had a go command, but it appears slightly misconfigured. I
> reopened BEAM-13037 [1] and added some details there.
>
> I also added instructions to the wiki [2] on how to perform an image swap,
> and it is actually very straightforward. I think a lesson here is that making
> 'lite' upgrades is brittle, as misconfigurations could resurface down the
> road when the context of the lite upgrade is no longer fresh in our memory.
>
> I suggest we revise the instructions to keep only image swap commands and
> remove the 'lite' update option. +Daniel Oliveira <[email protected]>,
> WDYT?  In the meantime, we should also prepare an image that fixes the
> misconfiguration. Would you be able to help with that? Thank you.
>
> [1] https://issues.apache.org/jira/browse/BEAM-13037
> [2]
> https://cwiki.apache.org/confluence/display/BEAM/Jenkins+Tips#JenkinsTips-HowtoinstallandupgradesoftwareonJenkinsworkers
>
>
> On Tue, Oct 19, 2021 at 8:46 AM Robert Burke <[email protected]> wrote:
>
>> FYI it looks like all the Go tests are now failing because it can't find
>> the Go command at all.
>> Did a Jenkins image without Go (v1.16+) pre-installed get pushed?
>>
>> On Mon, Oct 18, 2021, 1:45 PM Valentyn Tymofieiev <[email protected]>
>> wrote:
>>
>>> Thanks Daniel,
>>>
>>> I can recreate the VMs on new disks.
>>>
>>> We currently have a set of stopped jenkins workers (named:
>>> apache-beam-jenkins-##) and running workers (named:
>>> apache-ci-beam-jenkins-##)
>>>
>>> Are there any concerns about deleting the stopped group of workers?
>>>
>>>
>>>
>>> On Mon, Oct 18, 2021 at 11:19 AM Ahmet Altay <[email protected]> wrote:
>>>
>>>> Thank you Daniel, Valentyn!
>>>>
>>>> On Mon, Oct 18, 2021 at 8:02 AM Daniel Oliveira <[email protected]>
>>>> wrote:
>>>>
>>>>> I performed a light update of both Go and Python (from Valentyn's
>>>>> update) on each worker VM over the weekend. I also added additional
>>>>> instructions for the light update to Confluence (as an alternative to the
>>>>> current instructions).
>>>>>
>>>>> There is still reason to perform a full update at some point: Valentyn
>>>>> updated the VM image from 500 GB to 1000 GB of storage, which requires a
>>>>> full update to actually take effect.
>>>>>
>>>>> On Tue, Oct 12, 2021 at 10:32 AM Valentyn Tymofieiev <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> > 3. SSH into the agent and perform the update.
>>>>>> So, this would be a 'lite' version of the update, where we make
>>>>>> changes to the live worker without recreating worker VM with a new image?
>>>>>> We could perhaps document both options, and also make it clear that
>>>>>> producing a VM image that has necessary updates is mandatory even if we
>>>>>> perform 'lite' updates without recreating the worker.
>>>>>> Also, for a lite update, marking the Jenkins worker offline may be
>>>>>> optional, as some updates might not be disruptive (such as installing
>>>>>> some software that will not be used immediately).
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Oct 11, 2021 at 7:53 PM Robert Burke <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> SGTM. Thank you very much Daniel!
>>>>>>>
>>>>>>> On Mon, Oct 11, 2021, 7:51 PM Ahmet Altay <[email protected]> wrote:
>>>>>>>
>>>>>>>> Thank you Daniel. Could you please update the wiki once you are
>>>>>>>> done with the process?
>>>>>>>>
>>>>>>>> On Mon, Oct 11, 2021 at 6:22 PM Daniel Oliveira <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Took me a bit to get to this, sorry. I finally figured out an
>>>>>>>>> approach for updating Go and did so and will be updating the image
>>>>>>>>> momentarily.
>>>>>>>>>
>>>>>>>>> I think a more important note is that I tried what Valentyn was
>>>>>>>>> considering, which is SSHing into workers and updating the
>>>>>>>>> dependency. I'll describe the process below, but the summary is that
>>>>>>>>> I did it on one worker with Go so far, saw no problems over the
>>>>>>>>> weekend, and would like to continue updating the rest of the workers
>>>>>>>>> if there are no objections.
>>>>>>>>>
>>>>>>>>> Here's a step-by-step of what I did. If we decide to stick with
>>>>>>>>> this approach, these instructions can be added to Confluence:
>>>>>>>>>
>>>>>>>>> 1. Go to the page for the Jenkins agent you want to update [1] and
>>>>>>>>> click "Mark this node temporarily offline", leaving a reason such as
>>>>>>>>> "Updating X dependency."
>>>>>>>>> 2. Wait until there are no more tests running in that agent (under
>>>>>>>>> "Build Executor Status" on the left of the page).
>>>>>>>>> 3. SSH into the agent and perform the update.
>>>>>>>>> 4. Mark the node as online again.
>>>>>>>>> 5. Repeat for every worker.
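For illustration only, the five steps above could be scripted roughly as below. This is a minimal sketch, not something that exists in the Beam repo: `AgentOps`, `update_agent`, and `rolling_update` are hypothetical names, and the four hooks are placeholders that would still need to be wired to the Jenkins UI or API and to SSH by whoever runs it.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class AgentOps:
    """Hypothetical hooks; none of these talk to Jenkins on their own."""
    mark_offline: Callable[[str, str], None]  # (agent, reason), step 1
    is_idle: Callable[[str], bool]            # True when no builds are running
    run_update: Callable[[str], None]         # e.g. SSH in and upgrade Go
    mark_online: Callable[[str], None]        # step 4


def update_agent(agent: str, ops: AgentOps, reason: str,
                 poll_seconds: float = 60.0,
                 sleep: Callable[[float], None] = time.sleep) -> None:
    """Run steps 1-4 for a single agent."""
    ops.mark_offline(agent, reason)
    # Step 2: wait until the agent has drained its running builds.
    while not ops.is_idle(agent):
        sleep(poll_seconds)
    # Step 3: if this raises, the agent stays offline for investigation.
    ops.run_update(agent)
    ops.mark_online(agent)


def rolling_update(agents: Iterable[str], ops: AgentOps, reason: str,
                   poll_seconds: float = 60.0,
                   sleep: Callable[[float], None] = time.sleep) -> None:
    """Step 5: repeat for every worker, one at a time."""
    for agent in agents:
        update_agent(agent, ops, reason, poll_seconds, sleep)
```

Going one agent at a time keeps the rest of the pool available for tests, and leaving a failed agent offline is a deliberate choice here so that a bad update does not start picking up jobs.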
>>>>>>>>>
>>>>>>>>> And these are some additional steps if you want to immediately run
>>>>>>>>> a test suite to check that the update worked correctly. For example
>>>>>>>>> in my case, I wanted to check against the Go Postcommit, and it was a
>>>>>>>>> good thing I did, because it actually failed the first time and I had
>>>>>>>>> to go back in to fix a small oversight I made. So doing this after
>>>>>>>>> you update your first worker is probably a good idea before updating
>>>>>>>>> the rest:
>>>>>>>>>
>>>>>>>>> 1. Go to the page for the job you want to run (for example: [2]).
>>>>>>>>> 2. Click "Configure" on the left menu.
>>>>>>>>> 3. Find the checkmark "Restrict where this project can be run" and
>>>>>>>>> change the restriction from "beam" to the specific name of the agent
>>>>>>>>> (ex. "apache-beam-jenkins-1").
>>>>>>>>> 4. Save and apply that change.
>>>>>>>>> 5. Back on the page for the job, click "Build with Parameters" on
>>>>>>>>> the left menu.
>>>>>>>>> 6. Run the build on "master".
>>>>>>>>> 7. Once you're done checking the results, change the restriction
>>>>>>>>> for the job back to "beam". (This also gets reset once every 24 hours
>>>>>>>>> in case you forget.)
>>>>>>>>>
>>>>>>>>> I did that on one agent (apache-beam-jenkins-2) on Friday evening
>>>>>>>>> when it wasn't too busy, and got Go updated and working. I checked
>>>>>>>>> that agent's execution history again today just in case, and it was
>>>>>>>>> healthy over the weekend, with no Go-related problems as far as I
>>>>>>>>> could see. If there are no objections I'd like to go ahead and
>>>>>>>>> continue updating the rest of the workers (I'll do this late at night
>>>>>>>>> or over the weekend to avoid disrupting dev work).
>>>>>>>>>
>>>>>>>>> [1] https://ci-beam.apache.org/computer/apache-beam-jenkins-1/
>>>>>>>>> [2] https://ci-beam.apache.org/job/beam_PostCommit_Go/
>>>>>>>>>
>>>>>>>>> On Mon, Oct 4, 2021 at 6:14 PM Valentyn Tymofieiev <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> I updated the image in [1], but did not change the workers to pick
>>>>>>>>>> up the new image yet. We can do this once we add Go changes on top
>>>>>>>>>> of it.
>>>>>>>>>>
>>>>>>>>>> I am also considering SSHing into every worker and running a
>>>>>>>>>> one-line command that adds the dependency that was missing. It seems
>>>>>>>>>> to be low risk, and there is a fall-back plan to restart the worker
>>>>>>>>>> using the saved image: both new and old images are saved and
>>>>>>>>>> available in Cloud Console.
>>>>>>>>>>
>>>>>>>>>> Ideally, we should find a way to do a rolling upgrade that a PMC
>>>>>>>>>> or committer could trigger without logging into every machine.
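As a rough sketch of what a non-interactive version of that could look like, the snippet below only builds the `gcloud compute ssh ... --command` invocations for each worker and does not run anything. The helper names are hypothetical, the zone and remote command are placeholders, and contiguous worker numbering starting at 1 is an assumption; only the `apache-ci-beam-jenkins-` name prefix comes from this thread.

```python
from typing import List


def worker_names(count: int,
                 prefix: str = "apache-ci-beam-jenkins-") -> List[str]:
    """Assumes workers are numbered 1..count with no gaps."""
    return [f"{prefix}{i}" for i in range(1, count + 1)]


def ssh_command(instance: str, zone: str, remote_command: str) -> List[str]:
    """Build (but do not run) one gcloud invocation as an argv list."""
    return [
        "gcloud", "compute", "ssh", instance,
        "--zone", zone,
        "--command", remote_command,
    ]


def plan(count: int, zone: str, remote_command: str) -> List[List[str]]:
    """One argv list per worker, in worker order."""
    return [ssh_command(name, zone, remote_command)
            for name in worker_names(count)]
```

Each argv list could then be reviewed and passed to `subprocess.run(argv, check=True)` one worker at a time, ideally after the worker has been marked offline and has finished its running jobs.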
>>>>>>>>>>
>>>>>>>>>> [1]
>>>>>>>>>> https://issues.apache.org/jira/browse/BEAM-8152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424228#comment-17424228
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Sep 22, 2021 at 3:28 PM Daniel Oliveira <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> @Brian Hulette <[email protected]> That button seems like
>>>>>>>>>>> exactly what we'd need. Doing it manually would be a pain, but it's
>>>>>>>>>>> probably still preferable to causing a bunch of aborted tests.
>>>>>>>>>>>
>>>>>>>>>>> @Valentyn Tymofieiev <[email protected]> Collaborating to do
>>>>>>>>>>> both updates at once is a great idea! I'll message you directly
>>>>>>>>>>> about it.
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Sep 22, 2021 at 2:44 PM Valentyn Tymofieiev <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I am also interested in updating the version of Python on the VMs;
>>>>>>>>>>>> I need to install Python 3.9. Thanks for looking into this. We can
>>>>>>>>>>>> coordinate to make one update instead of two.
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Sep 22, 2021 at 2:40 PM Brian Hulette <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I'm not sure about best practices here. Out of curiosity I
>>>>>>>>>>>>> just poked around in the Jenkins UI (e.g. [1]) and it looks like
>>>>>>>>>>>>> you can manually "Mark node temporarily offline" when logged in (if
>>>>>>>>>>>>> you're a committer). According to [2] this will prevent it from
>>>>>>>>>>>>> picking up new jobs after it's finished the currently executing
>>>>>>>>>>>>> ones. Doing that manually for every worker could be a pain though.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Brian
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1]
>>>>>>>>>>>>> https://ci-beam.apache.org/computer/apache-beam-jenkins-13/
>>>>>>>>>>>>> [2]
>>>>>>>>>>>>> https://stackoverflow.com/questions/26553612/how-do-i-disable-a-node-in-jenkins-ui-after-it-has-completed-its-currently-runni
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 1:03 PM Daniel Oliveira <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hey everyone,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm aiming at upgrading the version of Go on our Jenkins VMs,
>>>>>>>>>>>>>> and I found these instructions on upgrading software on Jenkins
>>>>>>>>>>>>>> <https://cwiki.apache.org/confluence/display/BEAM/Jenkins+Tips#JenkinsTips-HowtoinstallandupgradesoftwareonJenkinsworkers>
>>>>>>>>>>>>>> on our cwiki.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I haven't started going through it yet, but I was wondering
>>>>>>>>>>>>>> about the last few steps that involve stopping VMs, deleting boot
>>>>>>>>>>>>>> disks, and restarting executors. Is there some best practice for
>>>>>>>>>>>>>> that section to avoid causing interruptions in our automated
>>>>>>>>>>>>>> testing? Should I be trying to do this outside of peak dev hours,
>>>>>>>>>>>>>> or going one VM at a time so others can pick up extra load, or
>>>>>>>>>>>>>> anything like that?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Daniel Oliveira
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
