TIL as well. Sounds like the right location. Thanks Valentyn!

On Tue, Nov 2, 2021, 11:00 AM Valentyn Tymofieiev <valen...@google.com> wrote:
> Yeah, .profile is only sourced by login shells. Adding the PATH in
> .bashrc can be a workaround, but since .bashrc is executed every time a new
> shell runs, PATH variable will be growing with every shell subprocess, so
> several sources recommend .profile instead, which does not always work.
> We should be able to fix this by updating /etc/environment instead (TIL).
>
> This is the current content:
> cat /etc/environment
>
> PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games"
>
> On Mon, Nov 1, 2021 at 10:50 AM Robert Burke <rob...@frantil.com> wrote:
>
>> Looks like while .profile was edited to add in a PATH section pointing to
>> /snap/bin (where go is now installed), it doesn't seem like .profile is
>> executed by the jenkins login shells.
>>
>> On Fri, Oct 29, 2021, 6:23 PM Valentyn Tymofieiev <valen...@google.com> wrote:
>>
>>> On Wed, Oct 20, 2021 at 11:16 AM Valentyn Tymofieiev <valen...@google.com> wrote:
>>>
>>>> On Wed, Oct 20, 2021 at 11:12 AM Pablo Estrada <pabl...@google.com> wrote:
>>>>
>>>>> Thanks everyone for investigating and documenting this. I'll use it
>>>>> today : )
>>>>>
>>>> Dan may be also in the middle of doing this, please coordinate.
>>>>
>>>>> ahem - maybe we should rename the image name/image family names
>>>>> to jenkins-worker-boot-image ? Does anyone foresee issues if we do that?
>>>>> Does jenkins depend on these names in some undocumented way?
>>>>>
>>>> +1. it should 'just work', need to update the wiki after the change.
>>>> Jenkins also did a terminology adjustment.
>>>>
>>> I had to reimage Jenkins workers again, took care of the rename and
>>> changed the instructions.
>>>
>>> I am not sure what is the status of Go Postcommit problem, but noticed
>>> that jenkins worker #1 had a different boot disk. I reimaged all workers
>>> building on top of the latest image from the image family.
>>> If Go tests start failing, we may need to get help from Dan again.
>>>
>>>>> On Tue, Oct 19, 2021 at 1:43 PM Daniel Oliveira <danolive...@google.com> wrote:
>>>>>
>>>>>> I'm ok with deciding to avoid the "lite" update option, feel free to
>>>>>> revise the instructions as it seems appropriate. As for the issue, I fixed
>>>>>> it with a workaround that should work until we need to add a new image to
>>>>>> the agents, and I'm currently investigating the root cause and preparing a
>>>>>> fixed image.
>>>>>>
>>>>>> That said, I think this issue would have still happened even if we
>>>>>> didn't perform the "lite" update. I'm still trying to figure out the exact
>>>>>> problem, but it looks to be a PATH issue that wasn't effectively caught by
>>>>>> the current process. I won't get into details too much in this thread (see
>>>>>> the Jira for that), but essentially everything works in my environment when
>>>>>> I SSH into the VMs, but because the location of the "go" command changed in
>>>>>> the PATH, it seems to have stopped working for every other user, including
>>>>>> the Jenkins agents. I actually did notice that would happen when I was
>>>>>> working on the image, but the solution seemed to be to reboot the machine,
>>>>>> which I assumed happened already since I shut down the VM to image it.
>>>>>>
>>>>>> On Tue, Oct 19, 2021 at 12:09 PM Robert Burke <rob...@frantil.com> wrote:
>>>>>>
>>>>>>> +1 to only having one way to do things. The Lite option seems liable
>>>>>>> to cause more problems since it means its changes can be blown away if a
>>>>>>> new image isn't prepared anyway.
>>>>>>> I don't think we are changing the images often enough for it.
>>>>>>> Perhaps call it the option to test changes if anything?
>>>>>>> On Tue, Oct 19, 2021, 11:55 AM Valentyn Tymofieiev <valen...@google.com> wrote:
>>>>>>>
>>>>>>>> All workers were updated to use jenkins-slave-boot-image-20211011,
>>>>>>>> which should have had a go command, but it appears slightly misconfigured.
>>>>>>>> I reopened BEAM-13037 [1] and added some details there.
>>>>>>>>
>>>>>>>> I also added instructions to wiki [2] on how to perform an image
>>>>>>>> swap and it is actually very straightforward. I think a lesson here is that
>>>>>>>> making 'lite' upgrades is brittle as misconfigurations could resurface down
>>>>>>>> the road when the context of the lite upgrade is no longer fresh in our
>>>>>>>> memory.
>>>>>>>>
>>>>>>>> I suggest we revise the instructions to keep only image swap
>>>>>>>> commands and remove the 'lite' update option. +Daniel Oliveira
>>>>>>>> <danolive...@google.com>, WDYT? In the meantime, we should also
>>>>>>>> prepare an image that fixes the misconfiguration. Would you be able to help
>>>>>>>> with that? Thank you.
>>>>>>>>
>>>>>>>> [1] https://issues.apache.org/jira/browse/BEAM-13037
>>>>>>>> [2] https://cwiki.apache.org/confluence/display/BEAM/Jenkins+Tips#JenkinsTips-HowtoinstallandupgradesoftwareonJenkinsworkers
>>>>>>>>
>>>>>>>> On Tue, Oct 19, 2021 at 8:46 AM Robert Burke <rob...@frantil.com> wrote:
>>>>>>>>
>>>>>>>>> FYI it looks like all the Go tests are now failing because it
>>>>>>>>> can't find the Go command at all.
>>>>>>>>> Did a Jenkins image without Go (v1.16+) pre-installed get pushed?
>>>>>>>>>
>>>>>>>>> On Mon, Oct 18, 2021, 1:45 PM Valentyn Tymofieiev <valen...@google.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks Daniel,
>>>>>>>>>>
>>>>>>>>>> I can recreate the VMs on new disks.
>>>>>>>>>>
>>>>>>>>>> We currently have a set of stopped jenkins workers (named:
>>>>>>>>>> apache-beam-jenkins-##) and running workers (named:
>>>>>>>>>> apache-ci-beam-jenkins-##)
>>>>>>>>>>
>>>>>>>>>> Are there any concerns about deleting the stopped group of workers?
>>>>>>>>>>
>>>>>>>>>> On Mon, Oct 18, 2021 at 11:19 AM Ahmet Altay <al...@google.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thank you Daniel, Valentyn!
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Oct 18, 2021 at 8:02 AM Daniel Oliveira <danolive...@google.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I performed a light update of both Go and Python (from
>>>>>>>>>>>> Valentyn's update) on each worker VM over the weekend. I also added
>>>>>>>>>>>> additional instructions for the light update to Confluence (as an
>>>>>>>>>>>> alternative to the current instructions).
>>>>>>>>>>>>
>>>>>>>>>>>> There is still reason to perform a full update at some point:
>>>>>>>>>>>> Valentyn updated the VM image from 500 GB to 1000 GB of storage, which
>>>>>>>>>>>> requires a full update to actually take effect.
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Oct 12, 2021 at 10:32 AM Valentyn Tymofieiev <valen...@google.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> > 3. SSH into the agent and perform the update.
>>>>>>>>>>>>> So, this would be a 'lite' version of the update, where we
>>>>>>>>>>>>> make changes to the live worker without recreating worker VM with a new
>>>>>>>>>>>>> image? We could perhaps document both options, and also make it clear that
>>>>>>>>>>>>> producing a VM image that has necessary updates is mandatory even if we
>>>>>>>>>>>>> perform 'lite' updates without recreating the worker.
>>>>>>>>>>>>> Also, for a lite update, marking the Jenkins worker offline may
>>>>>>>>>>>>> be optional, as some updates might not be disruptive (such as installing
>>>>>>>>>>>>> some software that will not be used immediately).
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Oct 11, 2021 at 7:53 PM Robert Burke <rob...@frantil.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> SGTM. Thank you very much Daniel!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Oct 11, 2021, 7:51 PM Ahmet Altay <al...@google.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thank you Daniel. Could you please update the wiki once you
>>>>>>>>>>>>>>> are done with the process?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Oct 11, 2021 at 6:22 PM Daniel Oliveira <danolive...@google.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Took me a bit to get to this, sorry. I finally figured out
>>>>>>>>>>>>>>>> an approach for updating Go and did so and will be updating the image
>>>>>>>>>>>>>>>> momentarily.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I think a more important note is that I tried what Valentyn
>>>>>>>>>>>>>>>> was considering, which is SSHing into workers and updating the dependency.
>>>>>>>>>>>>>>>> I'll describe the process below, but the summary is that I did it on one
>>>>>>>>>>>>>>>> worker with Go so far, saw no problems over the weekend, and would like to
>>>>>>>>>>>>>>>> continue updating the rest of the workers if there are no objections.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Here's a step-by-step of what I did. If we decide to stick
>>>>>>>>>>>>>>>> with this approach, these instructions can be added to Confluence:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1. Go to the page for the Jenkins agent you want to update
>>>>>>>>>>>>>>>> [1] and click "Mark this node temporarily offline", leaving a reason such
>>>>>>>>>>>>>>>> as "Updating X dependency."
>>>>>>>>>>>>>>>> 2. Wait until there are no more tests running in that agent
>>>>>>>>>>>>>>>> (under "Build Executor Status" on the left of the page).
>>>>>>>>>>>>>>>> 3. SSH into the agent and perform the update.
>>>>>>>>>>>>>>>> 4. Mark the node as online again.
>>>>>>>>>>>>>>>> 5. Repeat for every worker.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> And these are some additional steps if you want to
>>>>>>>>>>>>>>>> immediately run a test suite to check that the update worked correctly. For
>>>>>>>>>>>>>>>> example in my case, I wanted to check against the Go Postcommit, and it was
>>>>>>>>>>>>>>>> a good thing I did, because it actually failed the first time and I had to
>>>>>>>>>>>>>>>> go back in to fix a small oversight I made. So doing this after you update
>>>>>>>>>>>>>>>> your first worker is probably a good idea before updating the rest:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1. Go to the page for the job you want to run (for example: [2]).
>>>>>>>>>>>>>>>> 2. Click "Configure" on the left menu.
>>>>>>>>>>>>>>>> 3. Find the checkmark "Restrict where this project can be
>>>>>>>>>>>>>>>> run" and change the restriction from "beam" to the specific name of the
>>>>>>>>>>>>>>>> agent (ex. "apache-beam-jenkins-1").
>>>>>>>>>>>>>>>> 4. Save and apply that change.
>>>>>>>>>>>>>>>> 5. Back on the page for the job, click "Build with
>>>>>>>>>>>>>>>> Parameters" on the left menu.
>>>>>>>>>>>>>>>> 6. Run the build on "master".
>>>>>>>>>>>>>>>> 7. Once you're done checking the results, change
>>>>>>>>>>>>>>>> the restriction for the job back to "beam". (This also gets reset once
>>>>>>>>>>>>>>>> every 24 hours in case you forget.)
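
A rough sketch of steps 1 to 4 of the first list above, for anyone who would rather script the offline/online toggling than click through the UI. It assumes the Jenkins CLI jar (jenkins-cli.jar) is in the current directory, that credentials are already configured, and that the CLI's offline-node / online-node commands are permitted on this server; none of that is confirmed in this thread. The lite_update function, its update_cmd argument and the DRY_RUN switch are hypothetical names. offline-node only stops new builds from being scheduled, so waiting for running builds to finish (step 2) remains a manual check. By default the script only prints the commands it would run:

```shell
#!/bin/sh
# Sketch only: prints (or, with DRY_RUN=0, runs) the commands for a "lite"
# update of a single Jenkins agent. All names here are illustrative.
JENKINS_URL="${JENKINS_URL:-https://ci-beam.apache.org/}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# lite_update AGENT UPDATE_CMD
lite_update() {
  agent="$1"
  update_cmd="$2"
  # Step 1: mark the node temporarily offline, recording a reason.
  run java -jar jenkins-cli.jar -s "$JENKINS_URL" offline-node "$agent" -m "Updating dependency"
  # Step 2 is manual: check "Build Executor Status" until the agent is idle.
  # Step 3: SSH into the agent and perform the update.
  run ssh "$agent" "$update_cmd"
  # Step 4: mark the node as online again.
  run java -jar jenkins-cli.jar -s "$JENKINS_URL" online-node "$agent"
}
```

Example call (the update command is a made-up placeholder): lite_update apache-beam-jenkins-2 "sudo snap refresh go". Keeping dry-run as the default makes it harder to take an agent offline by accident.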
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I did that on one agent (apache-beam-jenkins-2) on Friday
>>>>>>>>>>>>>>>> evening when it wasn't too busy, and got Go updated and working. I checked
>>>>>>>>>>>>>>>> that agent's execution history again today just in case, and it was healthy
>>>>>>>>>>>>>>>> over the weekend, with no Go-related problems as far as I could see. If
>>>>>>>>>>>>>>>> there are no objections I'd like to go ahead and continue updating the rest
>>>>>>>>>>>>>>>> of the workers (I'll do this late at night or over the weekend to avoid
>>>>>>>>>>>>>>>> disrupting dev work).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [1] https://ci-beam.apache.org/computer/apache-beam-jenkins-1/
>>>>>>>>>>>>>>>> [2] https://ci-beam.apache.org/job/beam_PostCommit_Go/
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, Oct 4, 2021 at 6:14 PM Valentyn Tymofieiev <valen...@google.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I updated the image in [1], but did not change the workers
>>>>>>>>>>>>>>>>> to pick up the new image yet. We can do this once we add Go changes on
>>>>>>>>>>>>>>>>> top of it.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I am also considering SSHing into every worker and running a
>>>>>>>>>>>>>>>>> one-line command that adds the dependency that was missing. It seems to be
>>>>>>>>>>>>>>>>> low risk, and there is a fall-back plan to re-start the worker using the
>>>>>>>>>>>>>>>>> saved image - both new and old images are saved and available in Cloud
>>>>>>>>>>>>>>>>> Console.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Ideally, we should find a way to do a rolling upgrade that
>>>>>>>>>>>>>>>>> a PMC or committer could trigger without logging into every machine.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [1] https://issues.apache.org/jira/browse/BEAM-8152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424228#comment-17424228
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 3:28 PM Daniel Oliveira <danolive...@google.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> @Brian Hulette <bhule...@google.com> That button seems
>>>>>>>>>>>>>>>>>> like exactly what we'd need. Doing it manually would be a pain, but it's
>>>>>>>>>>>>>>>>>> probably still preferable to causing a bunch of aborted tests.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> @Valentyn Tymofieiev <valen...@google.com> Collaborating
>>>>>>>>>>>>>>>>>> to do both updates at once is a great idea! I'll message you directly about
>>>>>>>>>>>>>>>>>> it.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 2:44 PM Valentyn Tymofieiev <valen...@google.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I am also interested in updating the version of Python
>>>>>>>>>>>>>>>>>>> on the VMs; I need to install Python 3.9. Thanks for looking into this. We can
>>>>>>>>>>>>>>>>>>> coordinate together to make one update instead of two.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 2:40 PM Brian Hulette <bhule...@google.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I'm not sure about best practices here. Out of
>>>>>>>>>>>>>>>>>>>> curiosity I just poked around in the Jenkins UI (e.g. [1]) and it looks
>>>>>>>>>>>>>>>>>>>> like you can manually "Mark node temporarily offline" when logged in (if
>>>>>>>>>>>>>>>>>>>> you're a committer). According to [2] this will prevent it from picking up
>>>>>>>>>>>>>>>>>>>> new jobs after it's finished the currently executing ones.
>>>>>>>>>>>>>>>>>>>> Doing that manually for every worker could be a pain though.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Brian
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> [1] https://ci-beam.apache.org/computer/apache-beam-jenkins-13/
>>>>>>>>>>>>>>>>>>>> [2] https://stackoverflow.com/questions/26553612/how-do-i-disable-a-node-in-jenkins-ui-after-it-has-completed-its-currently-runni
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 1:03 PM Daniel Oliveira <danolive...@google.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hey everyone,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I'm aiming at upgrading the version of Go on our
>>>>>>>>>>>>>>>>>>>>> Jenkins VMs, and I found these instructions on upgrading software on Jenkins
>>>>>>>>>>>>>>>>>>>>> <https://cwiki.apache.org/confluence/display/BEAM/Jenkins+Tips#JenkinsTips-HowtoinstallandupgradesoftwareonJenkinsworkers>
>>>>>>>>>>>>>>>>>>>>> on our cwiki.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I haven't started going through it yet, but I was
>>>>>>>>>>>>>>>>>>>>> wondering about the last few steps that involve stopping VMs, deleting boot
>>>>>>>>>>>>>>>>>>>>> disks, and restarting executors. Is there some best practice for
>>>>>>>>>>>>>>>>>>>>> that section to avoid causing interruptions in our automated testing?
>>>>>>>>>>>>>>>>>>>>> Should I be trying to do this outside of peak dev hours, or going one VM at
>>>>>>>>>>>>>>>>>>>>> a time so others can pick up extra load, or anything like that?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>> Daniel Oliveira
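
On the PATH question raised in the newest messages of this thread (.bashrc re-running in every new shell and growing PATH each time): a minimal sketch of a guard that makes the .bashrc workaround idempotent, assuming a POSIX-compatible shell. The helper name path_append is hypothetical. This only covers the .bashrc workaround; /etc/environment is typically a plain list of NAME=value assignments read at login rather than a shell script, so a guard like this would not apply there.

```shell
# Append a directory to a PATH-style string only if it is not already present.
# Prints the resulting string; it does not modify the environment itself.
path_append() {
  dir="$1"
  current="$2"
  case ":${current}:" in
    # Already listed: return the string unchanged.
    *":${dir}:"*) printf '%s\n' "${current}" ;;
    # Not listed: append, adding a separator only when the string is non-empty.
    *) printf '%s\n' "${current:+${current}:}${dir}" ;;
  esac
}

# In ~/.bashrc: nested shells re-run this line, but PATH stops growing
# once /snap/bin is already on it.
PATH="$(path_append /snap/bin "$PATH")"
export PATH
```

Wrapping the search string in colons means the match is against whole PATH entries, so /snap/bin is not mistaken for a prefix of some longer directory name.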