It looks like .profile was edited to add a PATH entry pointing to
/snap/bin (where go is now installed), but .profile doesn't seem to be
executed by the Jenkins login shells.
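
A rough way to check this, offered only as a sketch: run something like the
following both from an interactive SSH session and from inside a Jenkins job,
then compare the output. The script and its helper names (path_contains,
find_command) are illustrative and not taken from the wiki or the Jira; the
only detail that comes from this thread is the /snap/bin location.

```python
import os
import shutil

# Assumption from this thread: go is installed under /snap/bin.
SNAP_BIN = "/snap/bin"


def path_contains(path_value, directory):
    """Return True if directory is one of the entries of a PATH-style string."""
    return directory in path_value.split(os.pathsep)


def find_command(cmd, path_value):
    """Return where cmd resolves under path_value, or None if it is not found."""
    return shutil.which(cmd, path=path_value)


if __name__ == "__main__":
    # Inspect the PATH of whatever process runs this script.
    current_path = os.environ.get("PATH", "")
    print("PATH includes", SNAP_BIN, ":", path_contains(current_path, SNAP_BIN))
    print("go resolves to:", find_command("go", current_path))
```

If the SSH session reports /snap/bin on PATH and the Jenkins job does not,
that would be consistent with .profile not being read for the agent's shell.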



On Fri, Oct 29, 2021, 6:23 PM Valentyn Tymofieiev <valen...@google.com>
wrote:

>
>
> On Wed, Oct 20, 2021 at 11:16 AM Valentyn Tymofieiev <valen...@google.com>
> wrote:
>
>>
>>
>> On Wed, Oct 20, 2021 at 11:12 AM Pablo Estrada <pabl...@google.com>
>> wrote:
>>
>>> Thanks everyone for investigating and documenting this. I'll use it
>>> today : )
>>>
>> Dan may also be in the middle of doing this, please coordinate.
>>
>>>
>>> ahem - maybe we should rename the image name/image family names
>>> to jenkins-worker-boot-image ? Does anyone foresee issues if we do that?
>>> Does jenkins depend on these names in some undocumented way?
>>>
>> +1. it should 'just work', need to update the wiki after the change.
>> Jenkins also did a terminology adjustment.
>>
> I had to reimage Jenkins workers again, took care of the rename and
> changed the instructions.
>
> I am not sure what the status of the Go Postcommit problem is, but I noticed
> that jenkins worker #1 had a different boot disk. I reimaged all workers,
> building on top of the latest image from the image family. If Go tests
> start failing, we may need to get help from Dan again.
>
>
>>
>>> On Tue, Oct 19, 2021 at 1:43 PM Daniel Oliveira <danolive...@google.com>
>>> wrote:
>>>
>>>> I'm ok with deciding to avoid the "lite" update option, feel free to
>>>> revise the instructions as it seems appropriate. As for the issue, I fixed
>>>> it with a workaround that should work until we need to add a new image to
>>>> the agents, and I'm currently investigating the root cause and preparing a
>>>> fixed image.
>>>>
>>>> That said, I think this issue would have still happened even if we
>>>> didn't perform the "lite" update. I'm still trying to figure out the exact
>>>> problem, but it looks to be a PATH issue that wasn't effectively caught by
>>>> the current process. I won't get into details too much in this thread (see
>>>> the Jira for that), but essentially everything works in my environment when
>>>> I SSH into the VMs, but because the location of the "go" command changed in
>>>> the PATH, it seems to have stopped working for every other user, including
>>>> the Jenkins agents. I actually did notice that would happen when I was
>>>> working on the image, but the solution seemed to be to reboot the machine,
>>>> which I assumed happened already since I shut down the VM to image it.
>>>>
>>>> On Tue, Oct 19, 2021 at 12:09 PM Robert Burke <rob...@frantil.com>
>>>> wrote:
>>>>
>>>>> +1 to only having one way to do things. The Lite option seems liable
>>>>> to cause more problems since it means its changes can be blown away if a
>>>>> new image isn't prepared anyway.
>>>>> I don't think we are changing the images often enough for it.  Perhaps
>>>>> call it the option to test changes if anything?
>>>>>
>>>>> On Tue, Oct 19, 2021, 11:55 AM Valentyn Tymofieiev <
>>>>> valen...@google.com> wrote:
>>>>>
>>>>>> All workers were updated to use jenkins-slave-boot-image-20211011,
>>>>>> which should have had a go command, but it appears slightly misconfigured.
>>>>>> I reopened BEAM-13037 [1] and added some details there.
>>>>>>
>>>>>> I also added instructions to wiki [2] on how to perform an image swap
>>>>>> and it is actually very straightforward. I think a lesson here is that
>>>>>> making 'lite' upgrades is brittle as misconfigurations could resurface
>>>>>> down the road when the context of the lite upgrade is no longer fresh in
>>>>>> our memory.
>>>>>>
>>>>>> I suggest we revise the instructions to keep only image swap commands
>>>>>> and remove the 'lite' update option. +Daniel Oliveira
>>>>>> <danolive...@google.com>, WDYT?  In the meantime, we should also
>>>>>> prepare an image that fixes the misconfiguration. Would you be able to
>>>>>> help with that? Thank you.
>>>>>>
>>>>>> [1] https://issues.apache.org/jira/browse/BEAM-13037
>>>>>> [2]
>>>>>> https://cwiki.apache.org/confluence/display/BEAM/Jenkins+Tips#JenkinsTips-HowtoinstallandupgradesoftwareonJenkinsworkers
>>>>>>
>>>>>>
>>>>>> On Tue, Oct 19, 2021 at 8:46 AM Robert Burke <rob...@frantil.com>
>>>>>> wrote:
>>>>>>
>>>>>>> FYI it looks like all the Go tests are now failing because it can't
>>>>>>> find the Go command at all.
>>>>>>> Did a Jenkins image without Go (v1.16+) pre-installed get pushed?
>>>>>>>
>>>>>>> On Mon, Oct 18, 2021, 1:45 PM Valentyn Tymofieiev <
>>>>>>> valen...@google.com> wrote:
>>>>>>>
>>>>>>>> Thanks Daniel,
>>>>>>>>
>>>>>>>> I can recreate the VMs on new disks.
>>>>>>>>
>>>>>>>> We currently have a set of stopped jenkins workers (named:
>>>>>>>> apache-beam-jenkins-##) and running workers (named:
>>>>>>>> apache-ci-beam-jenkins-##)
>>>>>>>>
>>>>>>>> Are there any concerns about deleting the stopped group of workers?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Oct 18, 2021 at 11:19 AM Ahmet Altay <al...@google.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thank you Daniel, Valentyn!
>>>>>>>>>
>>>>>>>>> On Mon, Oct 18, 2021 at 8:02 AM Daniel Oliveira <
>>>>>>>>> danolive...@google.com> wrote:
>>>>>>>>>
>>>>>>>>>> I performed a light update of both Go and Python (from Valentyn's
>>>>>>>>>> update) on each worker VM over the weekend. I also added additional
>>>>>>>>>> instructions for the light update to Confluence (as an alternative
>>>>>>>>>> to the current instructions).
>>>>>>>>>>
>>>>>>>>>> There is still reason to perform a full update at some point:
>>>>>>>>>> Valentyn updated the VM image from 500 GB to 1000 GB of storage,
>>>>>>>>>> which requires a full update to actually take effect.
>>>>>>>>>>
>>>>>>>>>> On Tue, Oct 12, 2021 at 10:32 AM Valentyn Tymofieiev <
>>>>>>>>>> valen...@google.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> > 3. SSH into the agent and perform the update.
>>>>>>>>>>> So, this would be a 'lite' version of the update, where we make
>>>>>>>>>>> changes to the live worker without recreating the worker VM with a
>>>>>>>>>>> new image?
>>>>>>>>>>> We could perhaps document both options, and also make it clear that
>>>>>>>>>>> producing a VM image that has necessary updates is mandatory even
>>>>>>>>>>> if we perform 'lite' updates without recreating the worker.
>>>>>>>>>>> Also, for a lite update, marking the Jenkins worker offline may
>>>>>>>>>>> be optional, as some updates might not be disruptive (such as
>>>>>>>>>>> installing some software that will not be used immediately).
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Oct 11, 2021 at 7:53 PM Robert Burke <rob...@frantil.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> SGTM. Thank you very much Daniel!
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Oct 11, 2021, 7:51 PM Ahmet Altay <al...@google.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you Daniel. Could you please update the wiki once you
>>>>>>>>>>>>> are done with the process?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Oct 11, 2021 at 6:22 PM Daniel Oliveira <
>>>>>>>>>>>>> danolive...@google.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Took me a bit to get to this, sorry. I finally figured out an
>>>>>>>>>>>>>> approach for updating Go and did so and will be updating the
>>>>>>>>>>>>>> image momentarily.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think a more important note is that I tried what Valentyn
>>>>>>>>>>>>>> was considering, which is SSHing into workers and updating the
>>>>>>>>>>>>>> dependency. I'll describe the process below, but the summary is that I
>>>>>>>>>>>>>> did it on one worker with Go so far, saw no problems over the weekend,
>>>>>>>>>>>>>> and would like to continue updating the rest of the workers if there
>>>>>>>>>>>>>> are no objections.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Here's a step-by-step of what I did. If we decide to stick
>>>>>>>>>>>>>> with this approach, these instructions can be added to Confluence:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. Go to the page for the Jenkins agent you want to update
>>>>>>>>>>>>>> [1] and click "Mark this node temporarily offline", leaving a reason
>>>>>>>>>>>>>> such as "Updating X dependency."
>>>>>>>>>>>>>> 2. Wait until there are no more tests running in that agent
>>>>>>>>>>>>>> (under "Build Executor Status" on the left of the page).
>>>>>>>>>>>>>> 3. SSH into the agent and perform the update.
>>>>>>>>>>>>>> 4. Mark the node as online again.
>>>>>>>>>>>>>> 5. Repeat for every worker.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> And these are some additional steps if you want to
>>>>>>>>>>>>>> immediately run a test suite to check that the update worked
>>>>>>>>>>>>>> correctly. For example in my case, I wanted to check against the Go
>>>>>>>>>>>>>> Postcommit, and it was a good thing I did, because it actually failed
>>>>>>>>>>>>>> the first time and I had to go back in to fix a small oversight I
>>>>>>>>>>>>>> made. So doing this after you update your first worker is probably a
>>>>>>>>>>>>>> good idea before updating the rest:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. Go to the page for the job you want to run (for example:
>>>>>>>>>>>>>> [2]).
>>>>>>>>>>>>>> 2. Click "Configure" on the left menu.
>>>>>>>>>>>>>> 3. Find the checkmark "Restrict where this project can be
>>>>>>>>>>>>>> run" and change the restriction from "beam" to the specific name of
>>>>>>>>>>>>>> the agent (ex. "apache-beam-jenkins-1").
>>>>>>>>>>>>>> 4. Save and apply that change.
>>>>>>>>>>>>>> 5. Back on the page for the job, click "Build with
>>>>>>>>>>>>>> Parameters" on the left menu.
>>>>>>>>>>>>>> 6. Run the build on "master".
>>>>>>>>>>>>>> 7. Once you're done checking the results, change
>>>>>>>>>>>>>> the restriction for the job back to "beam". (This also gets reset
>>>>>>>>>>>>>> once every 24 hours in case you forget.)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I did that on one agent (apache-beam-jenkins-2) on Friday
>>>>>>>>>>>>>> evening when it wasn't too busy, and got Go updated and working. I
>>>>>>>>>>>>>> checked that agent's execution history again today just in case, and
>>>>>>>>>>>>>> it was healthy over the weekend, with no Go-related problems as far as
>>>>>>>>>>>>>> I could see. If there are no objections I'd like to go ahead and
>>>>>>>>>>>>>> continue updating the rest of the workers (I'll do this late at night
>>>>>>>>>>>>>> or over the weekend to avoid disrupting dev work).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>> https://ci-beam.apache.org/computer/apache-beam-jenkins-1/
>>>>>>>>>>>>>> [2] https://ci-beam.apache.org/job/beam_PostCommit_Go/
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Oct 4, 2021 at 6:14 PM Valentyn Tymofieiev <
>>>>>>>>>>>>>> valen...@google.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I updated the image in [1], but did not change the workers
>>>>>>>>>>>>>>> to pick up the new image yet. We can do this once we add Go changes
>>>>>>>>>>>>>>> on top of it.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I am also considering SSHing into every worker and running a
>>>>>>>>>>>>>>> one-line command that adds the dependency that was missing. It seems
>>>>>>>>>>>>>>> to be low risk, and there is a fall-back plan to restart the worker
>>>>>>>>>>>>>>> using the saved image - both new and old images are saved and
>>>>>>>>>>>>>>> available in Cloud Console.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Ideally, we should find a way to do a rolling upgrade that a
>>>>>>>>>>>>>>> PMC or committer could trigger without logging into every machine.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/BEAM-8152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424228#comment-17424228
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 3:28 PM Daniel Oliveira <
>>>>>>>>>>>>>>> danolive...@google.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> @Brian Hulette <bhule...@google.com> That button seems
>>>>>>>>>>>>>>>> like exactly what we'd need. Doing it manually would be a pain, but
>>>>>>>>>>>>>>>> it's probably still preferable to causing a bunch of aborted tests.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> @Valentyn Tymofieiev <valen...@google.com> Collaborating
>>>>>>>>>>>>>>>> to do both updates at once is a great idea! I'll message you directly
>>>>>>>>>>>>>>>> about it.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 2:44 PM Valentyn Tymofieiev <
>>>>>>>>>>>>>>>> valen...@google.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I am also interested in updating the version of Python on the
>>>>>>>>>>>>>>>>> VMs; I need to install Python 3.9. Thanks for looking into this. We
>>>>>>>>>>>>>>>>> can coordinate to make one update instead of two.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 2:40 PM Brian Hulette <
>>>>>>>>>>>>>>>>> bhule...@google.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I'm not sure about best practices here. Out of curiosity
>>>>>>>>>>>>>>>>>> I just poked around in the Jenkins UI (e.g. [1]) and it looks like you
>>>>>>>>>>>>>>>>>> can manually "Mark node temporarily offline" when logged in (if you're
>>>>>>>>>>>>>>>>>> a committer). According to [2] this will prevent it from picking up
>>>>>>>>>>>>>>>>>> new jobs after it's finished the currently executing ones. Doing that
>>>>>>>>>>>>>>>>>> manually for every worker could be a pain though.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Brian
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>> https://ci-beam.apache.org/computer/apache-beam-jenkins-13/
>>>>>>>>>>>>>>>>>> [2]
>>>>>>>>>>>>>>>>>> https://stackoverflow.com/questions/26553612/how-do-i-disable-a-node-in-jenkins-ui-after-it-has-completed-its-currently-runni
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 1:03 PM Daniel Oliveira <
>>>>>>>>>>>>>>>>>> danolive...@google.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hey everyone,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I'm aiming at upgrading the version of Go on our Jenkins
>>>>>>>>>>>>>>>>>>> VMs, and I found these instructions on upgrading
>>>>>>>>>>>>>>>>>>> software on Jenkins
>>>>>>>>>>>>>>>>>>> <https://cwiki.apache.org/confluence/display/BEAM/Jenkins+Tips#JenkinsTips-HowtoinstallandupgradesoftwareonJenkinsworkers>
>>>>>>>>>>>>>>>>>>>  on
>>>>>>>>>>>>>>>>>>> our cwiki.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I haven't started going through it yet, but I was
>>>>>>>>>>>>>>>>>>> wondering about the last few steps that involve stopping VMs,
>>>>>>>>>>>>>>>>>>> deleting boot disks, and restarting executors. Is there some best
>>>>>>>>>>>>>>>>>>> practice for that section to avoid causing interruptions in our
>>>>>>>>>>>>>>>>>>> automated testing? Should I be trying to do this outside of peak dev
>>>>>>>>>>>>>>>>>>> hours, or going one VM at a time so others can pick up extra load, or
>>>>>>>>>>>>>>>>>>> anything like that?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> Daniel Oliveira
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
