It looks like .profile was edited to add /snap/bin (where go is now installed) to the PATH, but .profile doesn't seem to be executed by the Jenkins login shells.
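A quick way to check this from the agent's side is to run a small script as a job step and see whether go resolves with the PATH the job actually inherits. The following is only a sketch, not the fix applied on the workers: the helper name `diagnose` is hypothetical, and /snap/bin is taken from the note above.

```python
import os
import shutil


def diagnose(command, env_path, candidate_dir):
    """Report whether `command` resolves on `env_path`, with and without `candidate_dir`.

    `diagnose` is a hypothetical helper for illustration; it only wraps shutil.which.
    """
    entries = [p for p in env_path.split(os.pathsep) if p]
    extended = os.pathsep.join(entries + [candidate_dir])
    return {
        # Path to the command using the inherited PATH only (None if not found).
        "found_on_path": shutil.which(command, path=env_path),
        # Path to the command if candidate_dir were appended to PATH.
        "found_with_candidate": shutil.which(command, path=extended),
        # Whether candidate_dir is already one of the PATH entries.
        "candidate_on_path": candidate_dir in entries,
    }


if __name__ == "__main__":
    # Run this inside a Jenkins job step to see the PATH the agent really uses.
    report = diagnose("go", os.environ.get("PATH", ""), "/snap/bin")
    for key, value in report.items():
        print(f"{key}: {value}")
```

If `found_on_path` is empty but `found_with_candidate` is not, the binary is installed and the job's environment simply never picked up the PATH change.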
On Fri, Oct 29, 2021, 6:23 PM Valentyn Tymofieiev <valen...@google.com> wrote:
>
> On Wed, Oct 20, 2021 at 11:16 AM Valentyn Tymofieiev <valen...@google.com> wrote:
>
>> On Wed, Oct 20, 2021 at 11:12 AM Pablo Estrada <pabl...@google.com> wrote:
>>
>>> Thanks everyone for investigating and documenting this. I'll use it today : )
>>>
>> Dan may also be in the middle of doing this, please coordinate.
>>
>>> ahem - maybe we should rename the image name/image family names to jenkins-worker-boot-image? Does anyone foresee issues if we do that? Does Jenkins depend on these names in some undocumented way?
>>>
>> +1. It should 'just work'; we need to update the wiki after the change. Jenkins also did a terminology adjustment.
>>
> I had to reimage the Jenkins workers again, took care of the rename, and changed the instructions.
>
> I am not sure what the status of the Go Postcommit problem is, but I noticed that Jenkins worker #1 had a different boot disk. I reimaged all workers, building on top of the latest image from the image family. If Go tests start failing, we may need to get help from Dan again.
>
>>> On Tue, Oct 19, 2021 at 1:43 PM Daniel Oliveira <danolive...@google.com> wrote:
>>>
>>>> I'm OK with deciding to avoid the "lite" update option; feel free to revise the instructions as seems appropriate. As for the issue, I fixed it with a workaround that should work until we need to add a new image to the agents, and I'm currently investigating the root cause and preparing a fixed image.
>>>>
>>>> That said, I think this issue would still have happened even if we didn't perform the "lite" update. I'm still trying to figure out the exact problem, but it looks to be a PATH issue that wasn't effectively caught by the current process.
>>>> I won't get into details too much in this thread (see the Jira for that), but essentially everything works in my environment when I SSH into the VMs, but because the location of the "go" command changed in the PATH, it seems to have stopped working for every other user, including the Jenkins agents. I actually did notice that would happen when I was working on the image, but the solution seemed to be to reboot the machine, which I assumed had already happened since I shut down the VM to image it.
>>>>
>>>> On Tue, Oct 19, 2021 at 12:09 PM Robert Burke <rob...@frantil.com> wrote:
>>>>
>>>>> +1 to only having one way to do things. The Lite option seems liable to cause more problems since it means its changes can be blown away if a new image isn't prepared anyway.
>>>>> I don't think we are changing the images often enough for it. Perhaps call it the option to test changes, if anything?
>>>>>
>>>>> On Tue, Oct 19, 2021, 11:55 AM Valentyn Tymofieiev <valen...@google.com> wrote:
>>>>>
>>>>>> All workers were updated to use jenkins-slave-boot-image-20211011, which should have had a go command, but it appears slightly misconfigured.
>>>>>> I reopened BEAM-13037 [1] and added some details there.
>>>>>>
>>>>>> I also added instructions to the wiki [2] on how to perform an image swap, and it is actually very straightforward. I think a lesson here is that making 'lite' upgrades is brittle, as misconfigurations could resurface down the road when the context of the lite upgrade is no longer fresh in our memory.
>>>>>>
>>>>>> I suggest we revise the instructions to keep only image swap commands and remove the 'lite' update option. +Daniel Oliveira <danolive...@google.com>, WDYT? In the meantime, we should also prepare an image that fixes the misconfiguration. Would you be able to help with that? Thank you.
>>>>>>
>>>>>> [1] https://issues.apache.org/jira/browse/BEAM-13037
>>>>>> [2] https://cwiki.apache.org/confluence/display/BEAM/Jenkins+Tips#JenkinsTips-HowtoinstallandupgradesoftwareonJenkinsworkers
>>>>>>
>>>>>> On Tue, Oct 19, 2021 at 8:46 AM Robert Burke <rob...@frantil.com> wrote:
>>>>>>
>>>>>>> FYI it looks like all the Go tests are now failing because it can't find the Go command at all.
>>>>>>> Did a Jenkins image without Go (v1.16+) pre-installed get pushed?
>>>>>>>
>>>>>>> On Mon, Oct 18, 2021, 1:45 PM Valentyn Tymofieiev <valen...@google.com> wrote:
>>>>>>>
>>>>>>>> Thanks Daniel,
>>>>>>>>
>>>>>>>> I can recreate the VMs on new disks.
>>>>>>>>
>>>>>>>> We currently have a set of stopped jenkins workers (named: apache-beam-jenkins-##) and running workers (named: apache-ci-beam-jenkins-##)
>>>>>>>>
>>>>>>>> Are there any concerns about deleting the stopped group of workers?
>>>>>>>>
>>>>>>>> On Mon, Oct 18, 2021 at 11:19 AM Ahmet Altay <al...@google.com> wrote:
>>>>>>>>
>>>>>>>>> Thank you Daniel, Valentyn!
>>>>>>>>>
>>>>>>>>> On Mon, Oct 18, 2021 at 8:02 AM Daniel Oliveira <danolive...@google.com> wrote:
>>>>>>>>>
>>>>>>>>>> I performed a light update of both Go and Python (from Valentyn's update) on each worker VM over the weekend. I also added additional instructions for the light update to Confluence (as an alternative to the current instructions).
>>>>>>>>>>
>>>>>>>>>> There is still reason to perform a full update at some point: Valentyn updated the VM image from 500 GB to 1000 GB of storage, which requires a full update to actually take effect.
>>>>>>>>>>
>>>>>>>>>> On Tue, Oct 12, 2021 at 10:32 AM Valentyn Tymofieiev <valen...@google.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> > 3. SSH into the agent and perform the update.
>>>>>>>>>>> So, this would be a 'lite' version of the update, where we make changes to the live worker without recreating the worker VM with a new image?
>>>>>>>>>>> We could perhaps document both options, and also make it clear that producing a VM image that has the necessary updates is mandatory even if we perform 'lite' updates without recreating the worker.
>>>>>>>>>>> Also, for a lite update, marking the Jenkins worker offline may be optional, as some updates might not be disruptive (such as installing some software that will not be used immediately).
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Oct 11, 2021 at 7:53 PM Robert Burke <rob...@frantil.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> SGTM. Thank you very much Daniel!
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Oct 11, 2021, 7:51 PM Ahmet Altay <al...@google.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you Daniel. Could you please update the wiki once you are done with the process?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Oct 11, 2021 at 6:22 PM Daniel Oliveira <danolive...@google.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Took me a bit to get to this, sorry. I finally figured out an approach for updating Go and did so and will be updating the image momentarily.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think a more important note is that I tried what Valentyn was considering, which is SSHing into workers and updating the dependency. I'll describe the process below, but the summary is that I did it on one worker with Go so far, saw no problems over the weekend, and would like to continue updating the rest of the workers if there are no objections.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Here's a step-by-step of what I did.
>>>>>>>>>>>>>> If we decide to stick with this approach, these instructions can be added to Confluence:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. Go to the page for the Jenkins agent you want to update [1] and click "Mark this node temporarily offline", leaving a reason such as "Updating X dependency."
>>>>>>>>>>>>>> 2. Wait until there are no more tests running in that agent (under "Build Executor Status" on the left of the page).
>>>>>>>>>>>>>> 3. SSH into the agent and perform the update.
>>>>>>>>>>>>>> 4. Mark the node as online again.
>>>>>>>>>>>>>> 5. Repeat for every worker.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> And these are some additional steps if you want to immediately run a test suite to check that the update worked correctly. For example, in my case I wanted to check against the Go Postcommit, and it was a good thing I did, because it actually failed the first time and I had to go back in to fix a small oversight I made. So doing this after you update your first worker is probably a good idea before updating the rest:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. Go to the page for the job you want to run (for example: [2]).
>>>>>>>>>>>>>> 2. Click "Configure" on the left menu.
>>>>>>>>>>>>>> 3. Find the checkbox "Restrict where this project can be run" and change the restriction from "beam" to the specific name of the agent (ex. "apache-beam-jenkins-1").
>>>>>>>>>>>>>> 4. Save and apply that change.
>>>>>>>>>>>>>> 5. Back on the page for the job, click "Build with Parameters" on the left menu.
>>>>>>>>>>>>>> 6. Run the build on "master".
>>>>>>>>>>>>>> 7. Once you're done checking the results, change the restriction for the job back to "beam".
>>>>>>>>>>>>>>    (This also gets reset once every 24 hours in case you forget.)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I did that on one agent (apache-beam-jenkins-2) on Friday evening when it wasn't too busy, and got Go updated and working. I checked that agent's execution history again today just in case, and it was healthy over the weekend, with no Go-related problems as far as I could see. If there are no objections, I'd like to go ahead and continue updating the rest of the workers (I'll do this late at night or over the weekend to avoid disrupting dev work).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [1] https://ci-beam.apache.org/computer/apache-beam-jenkins-1/
>>>>>>>>>>>>>> [2] https://ci-beam.apache.org/job/beam_PostCommit_Go/
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Oct 4, 2021 at 6:14 PM Valentyn Tymofieiev <valen...@google.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I updated the image in [1], but have not yet changed the workers to pick up the new image. We can do this once we add Go changes on top of it.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I am also considering SSHing into every worker and running a one-line command that adds the dependency that was missing. It seems to be low risk, and there is a fall-back plan to restart the worker using the saved image - both new and old images are saved and available in Cloud Console.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Ideally, we should find a way to do a rolling upgrade that a PMC member or committer could trigger without logging into every machine.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [1] https://issues.apache.org/jira/browse/BEAM-8152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424228#comment-17424228
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 3:28 PM Daniel Oliveira <danolive...@google.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> @Brian Hulette <bhule...@google.com> That button seems like exactly what we'd need. Doing it manually would be a pain, but it's probably still preferable to causing a bunch of aborted tests.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> @Valentyn Tymofieiev <valen...@google.com> Collaborating to do both updates at once is a great idea! I'll message you directly about it.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 2:44 PM Valentyn Tymofieiev <valen...@google.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I am also interested in updating the version of Python on the VMs; I need to install Python 3.9. Thanks for looking into this. We can coordinate to make one update instead of two.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 2:40 PM Brian Hulette <bhule...@google.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I'm not sure about best practices here. Out of curiosity I just poked around in the Jenkins UI (e.g. [1]) and it looks like you can manually "Mark node temporarily offline" when logged in (if you're a committer). According to [2] this will prevent it from picking up new jobs after it's finished the currently executing ones. Doing that manually for every worker could be a pain though.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Brian
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> [1] https://ci-beam.apache.org/computer/apache-beam-jenkins-13/
>>>>>>>>>>>>>>>>>> [2] https://stackoverflow.com/questions/26553612/how-do-i-disable-a-node-in-jenkins-ui-after-it-has-completed-its-currently-runni
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 1:03 PM Daniel Oliveira <danolive...@google.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hey everyone,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I'm aiming to upgrade the version of Go on our Jenkins VMs, and I found these instructions on upgrading software on Jenkins <https://cwiki.apache.org/confluence/display/BEAM/Jenkins+Tips#JenkinsTips-HowtoinstallandupgradesoftwareonJenkinsworkers> on our cwiki.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I haven't started going through it yet, but I was wondering about the last few steps that involve stopping VMs, deleting boot disks, and restarting executors. Is there some best practice for that section to avoid causing interruptions in our automated testing? Should I be trying to do this outside of peak dev hours, or going one VM at a time so others can pick up extra load, or anything like that?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> Daniel Oliveira
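
For reference, the one-worker-at-a-time procedure discussed in this thread (mark the node offline, wait for it to go idle, update, bring it back online, and verify the first worker before touching the rest) could be sketched roughly as below. This is only an outline under assumptions: `rolling_update` and every callable passed to it are hypothetical placeholders, and nothing here calls a real Jenkins or Cloud API.

```python
import time


def rolling_update(agents, mark_offline, is_idle, update, mark_online,
                   verify=None, poll_seconds=30, sleep=time.sleep):
    """Update agents one at a time, verifying the first one before continuing.

    All callables are hypothetical hooks supplied by the caller; this function
    only fixes the order of the steps described in the thread.
    """
    updated = []
    for index, agent in enumerate(agents):
        # Step 1: stop the agent from picking up new jobs.
        mark_offline(agent, reason="Updating dependency")
        # Step 2: wait for jobs that are already running to finish.
        while not is_idle(agent):
            sleep(poll_seconds)
        # Steps 3 and 4: apply the update, then bring the agent back.
        update(agent)
        mark_online(agent)
        updated.append(agent)
        # Check the first updated agent (for example by pinning a job to it)
        # before repeating the procedure on the remaining workers.
        if index == 0 and verify is not None and not verify(agent):
            raise RuntimeError(
                f"verification failed on {agent}; stopping before updating the rest"
            )
    return updated
```

Passing the steps in as callables keeps the ordering testable without touching live workers; how each step is actually carried out (UI click, SSH command, image swap) is left open.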