TIL as well. Sounds like the right location. Thanks Valentyn!
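
For reference, a rough sketch of the PATH-growth concern from the quoted
message: a guard like the one below appends a directory only if it is not
already present, so re-sourcing .bashrc would not keep lengthening PATH. The
helper name append_once is made up for illustration, and /snap/bin is the Go
location mentioned further down the thread. This is just one possible
workaround, not what is deployed on the workers.

```shell
# Hypothetical helper (not from the Beam setup): print a PATH-style string
# with the directory appended only when it is not already an entry.
append_once() {
  # $1: current colon-separated path list, $2: directory to add
  case ":$1:" in
    *":$2:"*) printf '%s\n' "$1" ;;       # already present, leave unchanged
    *)        printf '%s\n' "$1:$2" ;;    # missing, append at the end
  esac
}

# Example use in a .bashrc-style file (assumed location of go: /snap/bin):
# PATH="$(append_once "$PATH" /snap/bin)"
```

As far as I know, /etc/environment is read as plain KEY=VALUE lines rather
than executed as a script, so it would not need such a guard; there the fix
would be a one-time edit of the PATH line.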

On Tue, Nov 2, 2021, 11:00 AM Valentyn Tymofieiev <valen...@google.com>
wrote:

> Yeah, .profile is only sourced by login shells. Adding the PATH in
> .bashrc can be a workaround, but since .bashrc is executed every time a new
> shell runs, the PATH variable will grow with every shell subprocess, so
> several sources recommend .profile instead, which does not always work.
> We should be able to fix this by updating /etc/environment instead (TIL).
>
> This is the current content:
> cat /etc/environment
>
> PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games"
>
>
>
>
> On Mon, Nov 1, 2021 at 10:50 AM Robert Burke <rob...@frantil.com> wrote:
>
>> Looks like while .profile was edited to add in a PATH section pointing to
>> /snap/bin (where go is now installed), it doesn't seem like .profile is
>> executed by the jenkins login shells.
>>
>>
>>
>> On Fri, Oct 29, 2021, 6:23 PM Valentyn Tymofieiev <valen...@google.com>
>> wrote:
>>
>>>
>>>
>>> On Wed, Oct 20, 2021 at 11:16 AM Valentyn Tymofieiev <
>>> valen...@google.com> wrote:
>>>
>>>>
>>>>
>>>> On Wed, Oct 20, 2021 at 11:12 AM Pablo Estrada <pabl...@google.com>
>>>> wrote:
>>>>
>>>>> Thanks everyone for investigating and documenting this. I'll use it
>>>>> today : )
>>>>>
>>>> Dan may also be in the middle of doing this, please coordinate.
>>>>
>>>>>
>>>>> ahem - maybe we should rename the image name/image family names
>>>>> to jenkins-worker-boot-image ? Does anyone foresee issues if we do that?
>>>>> Does jenkins depend on these names in some undocumented way?
>>>>>
>>>> +1. it should 'just work', need to update the wiki after the change.
>>>> Jenkins also did a terminology adjustment.
>>>>
>>> I had to reimage Jenkins workers again, took care of the rename and
>>> changed the instructions.
>>>
>>> I am not sure what the status of the Go Postcommit problem is, but noticed
>>> that jenkins worker #1 had a different boot disk. I reimaged all workers
>>> building on top of the latest image from the image family. If Go tests
>>> start failing, we may need to get help from Dan again.
>>>
>>>
>>>>
>>>>> On Tue, Oct 19, 2021 at 1:43 PM Daniel Oliveira <
>>>>> danolive...@google.com> wrote:
>>>>>
>>>>>> I'm ok with deciding to avoid the "lite" update option, feel free to
>>>>>> revise the instructions as seems appropriate. As for the issue, I
>>>>>> fixed it with a workaround that should work until we need to add a new
>>>>>> image to the agents, and I'm currently investigating the root cause and
>>>>>> preparing a fixed image.
>>>>>>
>>>>>> That said, I think this issue would have still happened even if we
>>>>>> didn't perform the "lite" update. I'm still trying to figure out the
>>>>>> exact problem, but it looks to be a PATH issue that wasn't effectively
>>>>>> caught by the current process. I won't get into details too much in this
>>>>>> thread (see the Jira for that), but essentially everything works in my
>>>>>> environment when I SSH into the VMs, but because the location of the
>>>>>> "go" command changed in the PATH, it seems to have stopped working for
>>>>>> every other user, including the Jenkins agents. I actually did notice
>>>>>> that would happen when I was working on the image, but the solution
>>>>>> seemed to be to reboot the machine, which I assumed happened already
>>>>>> since I shut down the VM to image it.
>>>>>>
>>>>>> On Tue, Oct 19, 2021 at 12:09 PM Robert Burke <rob...@frantil.com>
>>>>>> wrote:
>>>>>>
>>>>>>> +1 to only having one way to do things. The Lite option seems liable
>>>>>>> to cause more problems since it means its changes can be blown away if
>>>>>>> a new image isn't prepared anyway.
>>>>>>> I don't think we are changing the images often enough for it.
>>>>>>> Perhaps call it the option to test changes if anything?
>>>>>>>
>>>>>>> On Tue, Oct 19, 2021, 11:55 AM Valentyn Tymofieiev <
>>>>>>> valen...@google.com> wrote:
>>>>>>>
>>>>>>>> All workers were updated to use jenkins-slave-boot-image-20211011,
>>>>>>>> which should have had a go command, but it appears slightly
>>>>>>>> misconfigured. I reopened BEAM-13037 [1] and added some details there.
>>>>>>>>
>>>>>>>> I also added instructions to wiki [2] on how to perform an image
>>>>>>>> swap and it is actually very straightforward. I think a lesson here is
>>>>>>>> that making 'lite' upgrades is brittle as misconfigurations could
>>>>>>>> resurface down the road when the context of the lite upgrade is no
>>>>>>>> longer fresh in our memory.
>>>>>>>>
>>>>>>>> I suggest we revise the instructions to keep only image swap
>>>>>>>> commands and remove the 'lite' update option. +Daniel Oliveira
>>>>>>>> <danolive...@google.com>, WDYT? In the meantime, we should also
>>>>>>>> prepare an image that fixes the misconfiguration. Would you be able to
>>>>>>>> help with that? Thank you.
>>>>>>>>
>>>>>>>> [1] https://issues.apache.org/jira/browse/BEAM-13037
>>>>>>>> [2]
>>>>>>>> https://cwiki.apache.org/confluence/display/BEAM/Jenkins+Tips#JenkinsTips-HowtoinstallandupgradesoftwareonJenkinsworkers
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Oct 19, 2021 at 8:46 AM Robert Burke <rob...@frantil.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> FYI it looks like all the Go tests are now failing because it
>>>>>>>>> can't find the Go command at all.
>>>>>>>>> Did a Jenkins image without Go (v1.16+) pre-installed get pushed?
>>>>>>>>>
>>>>>>>>> On Mon, Oct 18, 2021, 1:45 PM Valentyn Tymofieiev <
>>>>>>>>> valen...@google.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks Daniel,
>>>>>>>>>>
>>>>>>>>>> I can recreate the VMs on new disks.
>>>>>>>>>>
>>>>>>>>>> We currently have a set of stopped jenkins workers (named:
>>>>>>>>>> apache-beam-jenkins-##) and running workers (named:
>>>>>>>>>> apache-ci-beam-jenkins-##)
>>>>>>>>>>
>>>>>>>>>> Are there any concerns about deleting the stopped group of
>>>>>>>>>> workers?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Oct 18, 2021 at 11:19 AM Ahmet Altay <al...@google.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thank you Daniel, Valentyn!
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Oct 18, 2021 at 8:02 AM Daniel Oliveira <
>>>>>>>>>>> danolive...@google.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I performed a light update of both Go and Python (from
>>>>>>>>>>>> Valentyn's update) on each worker VM over the weekend. I also added
>>>>>>>>>>>> additional instructions for the light update to Confluence (as an
>>>>>>>>>>>> alternative to the current instructions).
>>>>>>>>>>>>
>>>>>>>>>>>> There is still reason to perform a full update at some point:
>>>>>>>>>>>> Valentyn updated the VM image from 500 GB to 1000 GB of storage,
>>>>>>>>>>>> which requires a full update to actually take effect.
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Oct 12, 2021 at 10:32 AM Valentyn Tymofieiev <
>>>>>>>>>>>> valen...@google.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> > 3. SSH into the agent and perform the update.
>>>>>>>>>>>>> So, this would be a 'lite' version of the update, where we
>>>>>>>>>>>>> make changes to the live worker without recreating the worker VM
>>>>>>>>>>>>> with a new image? We could perhaps document both options, and also
>>>>>>>>>>>>> make it clear that producing a VM image that has the necessary
>>>>>>>>>>>>> updates is mandatory even if we perform 'lite' updates without
>>>>>>>>>>>>> recreating the worker.
>>>>>>>>>>>>> Also, for a lite update, marking the Jenkins worker offline may
>>>>>>>>>>>>> be optional, as some updates might not be disruptive (such as
>>>>>>>>>>>>> installing some software that will not be used immediately).
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Oct 11, 2021 at 7:53 PM Robert Burke <
>>>>>>>>>>>>> rob...@frantil.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> SGTM. Thank you very much Daniel!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Oct 11, 2021, 7:51 PM Ahmet Altay <al...@google.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thank you Daniel. Could you please update the wiki once you
>>>>>>>>>>>>>>> are done with the process?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Oct 11, 2021 at 6:22 PM Daniel Oliveira <
>>>>>>>>>>>>>>> danolive...@google.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Took me a bit to get to this, sorry. I finally figured out
>>>>>>>>>>>>>>>> an approach for updating Go and did so and will be updating
>>>>>>>>>>>>>>>> the image momentarily.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I think a more important note is that I tried what Valentyn
>>>>>>>>>>>>>>>> was considering, which is SSHing into workers and updating the
>>>>>>>>>>>>>>>> dependency. I'll describe the process below, but the summary is
>>>>>>>>>>>>>>>> that I did it on one worker with Go so far, saw no problems over
>>>>>>>>>>>>>>>> the weekend, and would like to continue updating the rest of the
>>>>>>>>>>>>>>>> workers if there are no objections.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Here's a step-by-step of what I did. If we decide to stick
>>>>>>>>>>>>>>>> with this approach, these instructions can be added to
>>>>>>>>>>>>>>>> Confluence:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1. Go to the page for the Jenkins agent you want to update
>>>>>>>>>>>>>>>> [1] and click "Mark this node temporarily offline", leaving a
>>>>>>>>>>>>>>>> reason such as "Updating X dependency."
>>>>>>>>>>>>>>>> 2. Wait until there are no more tests running in that agent
>>>>>>>>>>>>>>>> (under "Build Executor Status" on the left of the page).
>>>>>>>>>>>>>>>> 3. SSH into the agent and perform the update.
>>>>>>>>>>>>>>>> 4. Mark the node as online again.
>>>>>>>>>>>>>>>> 5. Repeat for every worker.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> And these are some additional steps if you want to
>>>>>>>>>>>>>>>> immediately run a test suite to check that the update worked
>>>>>>>>>>>>>>>> correctly. For example in my case, I wanted to check against the
>>>>>>>>>>>>>>>> Go Postcommit, and it was a good thing I did, because it actually
>>>>>>>>>>>>>>>> failed the first time and I had to go back in to fix a small
>>>>>>>>>>>>>>>> oversight I made. So doing this after you update your first
>>>>>>>>>>>>>>>> worker is probably a good idea before updating the rest:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1. Go to the page for the job you want to run (for example:
>>>>>>>>>>>>>>>> [2]).
>>>>>>>>>>>>>>>> 2. Click "Configure" on the left menu.
>>>>>>>>>>>>>>>> 3. Find the checkmark "Restrict where this project can be
>>>>>>>>>>>>>>>> run" and change the restriction from "beam" to the specific
>>>>>>>>>>>>>>>> name of the agent (ex. "apache-beam-jenkins-1").
>>>>>>>>>>>>>>>> 4. Save and apply that change.
>>>>>>>>>>>>>>>> 5. Back on the page for the job, click "Build with
>>>>>>>>>>>>>>>> Parameters" on the left menu.
>>>>>>>>>>>>>>>> 6. Run the build on "master".
>>>>>>>>>>>>>>>> 7. Once you're done checking the results, change
>>>>>>>>>>>>>>>> the restriction for the job back to "beam". (This also gets
>>>>>>>>>>>>>>>> reset once every 24 hours in case you forget.)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I did that on one agent (apache-beam-jenkins-2) on Friday
>>>>>>>>>>>>>>>> evening when it wasn't too busy, and got Go updated and
>>>>>>>>>>>>>>>> working. I checked that agent's execution history again today
>>>>>>>>>>>>>>>> just in case, and it was healthy over the weekend, with no
>>>>>>>>>>>>>>>> Go-related problems as far as I could see. If there are no
>>>>>>>>>>>>>>>> objections I'd like to go ahead and continue updating the rest
>>>>>>>>>>>>>>>> of the workers (I'll do this late at night or over the weekend
>>>>>>>>>>>>>>>> to avoid disrupting dev work).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>> https://ci-beam.apache.org/computer/apache-beam-jenkins-1/
>>>>>>>>>>>>>>>> [2] https://ci-beam.apache.org/job/beam_PostCommit_Go/
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, Oct 4, 2021 at 6:14 PM Valentyn Tymofieiev <
>>>>>>>>>>>>>>>> valen...@google.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I updated the image in [1], but did not change the workers
>>>>>>>>>>>>>>>>> to pick up the new image yet. We can do this once we add Go
>>>>>>>>>>>>>>>>> changes on top of it.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I am also considering SSHing into every worker and running a
>>>>>>>>>>>>>>>>> one-line command that adds the dependency that was missing.
>>>>>>>>>>>>>>>>> It seems to be low risk, and there is a fall-back plan to
>>>>>>>>>>>>>>>>> re-start the worker using the saved image - both new and old
>>>>>>>>>>>>>>>>> images are saved and available in Cloud Console.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Ideally, we should find a way to do a rolling upgrade that
>>>>>>>>>>>>>>>>> a PMC or committer could trigger without logging into every
>>>>>>>>>>>>>>>>> machine.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/BEAM-8152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424228#comment-17424228
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 3:28 PM Daniel Oliveira <
>>>>>>>>>>>>>>>>> danolive...@google.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> @Brian Hulette <bhule...@google.com> That button seems
>>>>>>>>>>>>>>>>>> like exactly what we'd need. Doing it manually would be a
>>>>>>>>>>>>>>>>>> pain, but it's probably still preferable to causing a bunch of
>>>>>>>>>>>>>>>>>> aborted tests.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> @Valentyn Tymofieiev <valen...@google.com> Collaborating
>>>>>>>>>>>>>>>>>> to do both updates at once is a great idea! I'll message you
>>>>>>>>>>>>>>>>>> directly about it.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 2:44 PM Valentyn Tymofieiev <
>>>>>>>>>>>>>>>>>> valen...@google.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I am also interested in updating the version of Python
>>>>>>>>>>>>>>>>>>> on the VMs, I need to install Python 3.9. Thanks for looking
>>>>>>>>>>>>>>>>>>> into this. We can coordinate to make one update instead of
>>>>>>>>>>>>>>>>>>> two.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 2:40 PM Brian Hulette <
>>>>>>>>>>>>>>>>>>> bhule...@google.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I'm not sure about best practices here. Out of
>>>>>>>>>>>>>>>>>>>> curiosity I just poked around in the Jenkins UI (e.g. [1])
>>>>>>>>>>>>>>>>>>>> and it looks like you can manually "Mark node temporarily
>>>>>>>>>>>>>>>>>>>> offline" when logged in (if you're a committer). According to
>>>>>>>>>>>>>>>>>>>> [2] this will prevent it from picking up new jobs after it's
>>>>>>>>>>>>>>>>>>>> finished the currently executing ones. Doing that manually for
>>>>>>>>>>>>>>>>>>>> every worker could be a pain though.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Brian
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>>> https://ci-beam.apache.org/computer/apache-beam-jenkins-13/
>>>>>>>>>>>>>>>>>>>> [2]
>>>>>>>>>>>>>>>>>>>> https://stackoverflow.com/questions/26553612/how-do-i-disable-a-node-in-jenkins-ui-after-it-has-completed-its-currently-runni
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 1:03 PM Daniel Oliveira <
>>>>>>>>>>>>>>>>>>>> danolive...@google.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hey everyone,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I'm aiming at upgrading the version of Go on our
>>>>>>>>>>>>>>>>>>>>> Jenkins VMs, and I found these instructions on
>>>>>>>>>>>>>>>>>>>>> upgrading software on Jenkins
>>>>>>>>>>>>>>>>>>>>> <https://cwiki.apache.org/confluence/display/BEAM/Jenkins+Tips#JenkinsTips-HowtoinstallandupgradesoftwareonJenkinsworkers>
>>>>>>>>>>>>>>>>>>>>> on our cwiki.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I haven't started going through it yet, but I was
>>>>>>>>>>>>>>>>>>>>> wondering about the last few steps that involve stopping
>>>>>>>>>>>>>>>>>>>>> VMs, deleting boot disks, and restarting executors. Is there
>>>>>>>>>>>>>>>>>>>>> some best practice for that section to avoid causing
>>>>>>>>>>>>>>>>>>>>> interruptions in our automated testing? Should I be trying to
>>>>>>>>>>>>>>>>>>>>> do this outside of peak dev hours, or going one VM at a time so
>>>>>>>>>>>>>>>>>>>>> others can pick up extra load, or anything like that?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>> Daniel Oliveira
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
