Re: SIGMOD System Award for Apache Spark
woot! :D On Thu, May 12, 2022 at 4:27 PM Hyukjin Kwon wrote: > Awesome! > > On Fri, May 13, 2022 at 5:29 AM Mosharaf Chowdhury > wrote: > >> Wow! Congratulations to everyone indeed. >> >> On Thu, May 12, 2022 at 3:44 PM Matei Zaharia >> wrote: >> >>> Hi all, >>> >>> We recently found out that Apache Spark received >>> <https://sigmod.org/2022-sigmod-awards/> the SIGMOD System Award this >>> year, given by SIGMOD (the ACM's data management research organization) to >>> impactful real-world and research systems. This puts Spark in good company >>> with some very impressive previous recipients >>> <https://sigmod.org/sigmod-awards/sigmod-systems-award/>. This award is >>> really an achievement by the whole community, so I wanted to say congrats >>> to everyone who contributes to Spark, whether through code, issue reports, >>> docs, or other means. >>> >>> Matei >>> >> -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu https://sky.cs.berkeley.edu/
Re: [Apache Spark Jenkins] build system shutting down Dec 23rd, 2021
# systemctl stop jenkins # goodbye jenkins! On Mon, Dec 6, 2021 at 12:02 PM shane knapp wrote: > hey everyone! > > after a marathon run of nearly a decade, we're finally going to be > shutting down {amp|rise}lab jenkins at the end of this month... > > the earliest snapshot i could find is from 2013 with builds for spark 0.7: > > https://web.archive.org/web/20130426155726/https://amplab.cs.berkeley.edu/jenkins/ > > it's been a hell of a run, and i'm gonna miss randomly tweaking the build > system, but technology has moved on and running a dedicated set of servers > for just one open source project is just too expensive for us here at uc > berkeley. > > if there's interest, i'll fire up a zoom session and all y'alls can watch > me type the final command: > > systemctl stop jenkins > > feeling bittersweet, > > shane > -- > Shane Knapp > Computer Guy / Voice of Reason > UC Berkeley EECS Research / RISELab Staff Technical Lead > https://rise.cs.berkeley.edu > -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
Re: [Apache Spark Jenkins] build system shutting down Dec 23rd, 2021
created an issue to track stuff: https://issues.apache.org/jira/browse/SPARK-37571 On Tue, Dec 7, 2021 at 8:25 AM shane knapp β wrote: > Will you be nuking all the Jenkins-related code in the repo after the 23rd? >> >> probably not right away... but soon after jenkins is shut down. bits of > the docs and spark website will need to be updated as well. > > shane > -- > Shane Knapp > Computer Guy / Voice of Reason > UC Berkeley EECS Research / RISELab Staff Technical Lead > https://rise.cs.berkeley.edu > -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
Re: [Apache Spark Jenkins] build system shutting down Dec 23rd, 2021
> > Will you be nuking all the Jenkins-related code in the repo after the 23rd? > > probably not right away... but soon after jenkins is shut down. bits of the docs and spark website will need to be updated as well. shane -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
[Apache Spark Jenkins] build system shutting down Dec 23rd, 2021
hey everyone! after a marathon run of nearly a decade, we're finally going to be shutting down {amp|rise}lab jenkins at the end of this month... the earliest snapshot i could find is from 2013 with builds for spark 0.7: https://web.archive.org/web/20130426155726/https://amplab.cs.berkeley.edu/jenkins/ it's been a hell of a run, and i'm gonna miss randomly tweaking the build system, but technology has moved on and running a dedicated set of servers for just one open source project is just too expensive for us here at uc berkeley. if there's interest, i'll fire up a zoom session and all y'alls can watch me type the final command: systemctl stop jenkins feeling bittersweet, shane -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
Re: [FYI] Build and run tests on Java 17 for Apache Spark 3.3
woot! nice work everyone! :) On Fri, Nov 12, 2021 at 11:37 AM Dongjoon Hyun wrote: > Hi, All. > > Apache Spark community has been working on Java 17 support under the > following JIRA. > > https://issues.apache.org/jira/browse/SPARK-33772 > > As of today, Apache Spark starts to have daily Java 17 test coverage via > GitHub Action jobs for Apache Spark 3.3. > > > https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L38-L39 > > Today's successful run is here. > > https://github.com/apache/spark/actions/runs/1453788012 > > Please note that we are still working on some new Java 17 features like > > JEP 391: macOS/AArch64 Port > https://bugs.openjdk.java.net/browse/JDK-8251280 > > For example, Oracle Java, Azul Zulu, and Eclipse Temurin Java 17 already > support Apple Silicon natively, but some 3rd party libraries like > RocksDB/LevelDB are not ready yet. Since Mac is one of the popular dev > environments, we are going to keep monitoring and improving gradually for > Apache Spark 3.3. > > Please test Java 17 and let us know your feedback. > > Thanks, > Dongjoon. > -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
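for anyone who wants to gate a local test run on the new JDK, here's a minimal sketch (the helper name is illustrative, not part of the Spark build scripts) that extracts the major version from both legacy ("1.8.0_292") and modern ("17.0.1") Java version strings:

```shell
# Extract the major Java version from a version string.
# Legacy numbering starts with "1." (1.8.0_292 -> 8);
# modern numbering puts the major version first (17.0.1 -> 17).
java_major_version() {
  case "$1" in
    1.*) echo "$1" | cut -d. -f2 ;;  # legacy scheme
    *)   echo "$1" | cut -d. -f1 ;;  # modern scheme
  esac
}

# On a live machine you might feed it the installed JDK, e.g.:
#   v=$(java -version 2>&1 | awk -F'"' '/version/ {print $2}')
#   [ "$(java_major_version "$v")" -ge 17 ] || echo "need JDK 17+"
```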
Re: [build system] quick jenkins reboot
we've been back for about an hour. :) On Fri, Oct 22, 2021 at 1:52 PM shane knapp β wrote: > system load on the primary is getting suspiciously high, and free ram has > mysteriously disappeared and we are rapidly approaching swap. whatever > could it be? > > java. > > i'm going to take this opportunity to reboot everything and start from a > clean-ish state. we'll be down for ~45m or so. > > shane > -- > Shane Knapp > Computer Guy / Voice of Reason > UC Berkeley EECS Research / RISELab Staff Technical Lead > https://rise.cs.berkeley.edu > -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
[build system] quick jenkins reboot
system load on the primary is getting suspiciously high, and free ram has mysteriously disappeared and we are rapidly approaching swap. whatever could it be? java. i'm going to take this opportunity to reboot everything and start from a clean-ish state. we'll be down for ~45m or so. shane -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
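the "free ram has mysteriously disappeared" check above can be spot-checked with a small helper; this is an illustrative sketch assuming a Linux-style /proc/meminfo, not the actual monitoring on the cluster:

```shell
# Report MemAvailable as a whole-number percentage of MemTotal,
# reading /proc/meminfo-style text from stdin.
mem_available_pct() {
  awk '/^MemTotal:/ {t=$2} /^MemAvailable:/ {a=$2} END {printf "%d", a*100/t}'
}

# On a live host (assumption: Linux, /proc mounted):
#   pct=$(mem_available_pct < /proc/meminfo)
#   [ "$pct" -lt 10 ] && echo "low memory -- swap is next"
```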
Re: [build system] DNS outage @ uc berkeley, jenkins not available
this was resolved by campus IT around 930pm last night. On Tue, Aug 31, 2021 at 12:54 PM shane knapp β wrote: > > we're having some DNS issues here in the EECS department, and our > crack team is working on getting it resolved asap. until then, > jenkins isn't visible to the outside world. > > shane > -- > Shane Knapp > Computer Guy / Voice of Reason > UC Berkeley EECS Research / RISELab Staff Technical Lead > https://rise.cs.berkeley.edu -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
[build system] DNS outage @ uc berkeley, jenkins not available
we're having some DNS issues here in the EECS department, and our crack team is working on getting it resolved asap. until then, jenkins isn't visible to the outside world. shane -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
Re: [build system] quick jenkins restart
aaand we're back! On Wed, Aug 25, 2021 at 9:24 AM shane knapp β wrote: > i'll be: > - upgrading jenkins to the latest LTS > - moving jenkins to java 11 (from java 8) > - rebooting everything > > sorry for the disruption... there aren't many builds running right now so > i'll just get this sorted. > > shane > -- > Shane Knapp > Computer Guy / Voice of Reason > UC Berkeley EECS Research / RISELab Staff Technical Lead > https://rise.cs.berkeley.edu > -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
[build system] quick jenkins restart
i'll be: - upgrading jenkins to the latest LTS - moving jenkins to java 11 (from java 8) - rebooting everything sorry for the disruption... there aren't many builds running right now so i'll just get this sorted. shane -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
Re: [build system] half of the jenkins workers are down
turns out that minikube/k8s and friends were being oom-killed and this was causing all sorts of weirdnesses. i've upped the ram limits on all of the k8s jobs to 8G (from 6G), and we'll keep an eye on things and see how they go. On Mon, Aug 9, 2021 at 12:02 PM shane knapp β wrote: > as workers are continuing to fail, i've stopped jenkins from accepting new > builds for the time being. > > more updates as they come. > > On Mon, Aug 9, 2021 at 9:17 AM shane knapp β wrote: > >> happy monday! >> >> the server gods did not smile upon us this weekend, and 4 of the workers >> are down. we'll most likely need to head to our colo some time today and >> give them an in-person kick and see what's going on. >> >> i'll send an update when they're back up. >> >> shane >> -- >> Shane Knapp >> Computer Guy / Voice of Reason >> UC Berkeley EECS Research / RISELab Staff Technical Lead >> https://rise.cs.berkeley.edu >> > > > -- > Shane Knapp > Computer Guy / Voice of Reason > UC Berkeley EECS Research / RISELab Staff Technical Lead > https://rise.cs.berkeley.edu > -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
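for the record, spotting oom-killed processes in the kernel log is a one-liner; a hedged sketch (the helper name is made up, but the "Killed process <pid> (<name>)" line is the oom-killer's usual format):

```shell
# Pull the names of OOM-killed processes out of kernel-log text on stdin.
# The oom-killer logs lines like "Killed process 1234 (minikube) ...".
oom_victims() {
  sed -n 's/.*Killed process [0-9]* (\([^)]*\)).*/\1/p'
}

# On a live worker (assumption: dmesg is readable by this user):
#   dmesg | oom_victims | sort | uniq -c | sort -rn
```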
Re: [build system] half of the jenkins workers are down
as workers are continuing to fail, i've stopped jenkins from accepting new builds for the time being. more updates as they come. On Mon, Aug 9, 2021 at 9:17 AM shane knapp β wrote: > happy monday! > > the server gods did not smile upon us this weekend, and 4 of the workers > are down. we'll most likely need to head to our colo some time today and > give them an in-person kick and see what's going on. > > i'll send an update when they're back up. > > shane > -- > Shane Knapp > Computer Guy / Voice of Reason > UC Berkeley EECS Research / RISELab Staff Technical Lead > https://rise.cs.berkeley.edu > -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
[build system] half of the jenkins workers are down
happy monday! the server gods did not smile upon us this weekend, and 4 of the workers are down. we'll most likely need to head to our colo some time today and give them an in-person kick and see what's going on. i'll send an update when they're back up. shane -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
[build system] jenkins "freeze" for remainder of 2021
since we're sunsetting jenkins by the end of 2021, i'd like to institute a general freeze on package/feature requests. this includes, but is not limited to things like python packages, new versions of python, and pretty much anything that requires changes to the bare-metal systems that run jenkins. exceptions to this rule include new branches (spark 3.3, i'm looking at you!), and any major security or critical fixes required for builds. please let us know if you have any questions! thanks in advance, brian & shane -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
Re: please read: current state and the future of the apache spark build system
3 months later, i have some updates! TLDR1: we're shutting jenkins down at the end of 2021. > > this is still the goal, exact shutdown date TBD. > long term (until EOY): > * decide what the future of spark builds and releases will look like > - do we need jenkins? > - if we do, who's responsible for hosting + ops? > this looks like github actions + some as-of-yet-tbd k8s solution for integration tests. > medium term (in 6 months): > * prepare jenkins worker ansible configs and stick in the spark repo > this is done: https://github.com/apache/spark/tree/master/dev/ansible-for-test-node > * train up brian shiratsuki (cced) to help w/ops tasks and upgrades over > the next ~6m > this is ongoing, and we now have reasonable monitoring! > * get to all of the python version, library installation, etc etc jira > requests > > i think i've knocked out most of these. > short term(weeks): > * bring up additional workers > - finish hardware/system level repairs on the bare metal > - see above, re k8s jira > * stabilize cluster > - recent jenkins LTS upgrade broke the web GUI > - finish deploying monitoring/alerting > - this hardware is OLD and literally falling over, so we have lots of > random disk and ram failures. it's literally whack-a-mole and each trip to > the colo to repair literally takes a full day > > we're generally doing alright w/all of these: the hardware has been pretty stable, the jenkins administrative GUI is still broken (but at least i can hack the xml on the bare metal), and we've got 8 workers up and running. i'll be sending out another email to this list soon regarding the impending jenkins 'freeze'. shane -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
Re: [build system] jenkins downtime today
that actually went much faster than anticipated, and we're already back up and building! On Thu, Jul 22, 2021 at 10:24 AM shane knapp β wrote: > i'll be taking jenkins down for a couple of hours today to reboot/clean up > the workers and finish up the python package installs covered in > https://github.com/apache/spark/pull/33469/files > > shane > -- > Shane Knapp > Computer Guy / Voice of Reason > UC Berkeley EECS Research / RISELab Staff Technical Lead > https://rise.cs.berkeley.edu > -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
[build system] jenkins downtime today
i'll be taking jenkins down for a couple of hours today to reboot/clean up the workers and finish up the python package installs covered in https://github.com/apache/spark/pull/33469/files shane -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
Re: quick jenkins restart
we're back up! On Fri, Jul 9, 2021 at 10:23 AM shane knapp β wrote: > the primary is running out of memory pretty quickly, and i'm going to > reboot the server quickly so that it doesn't crash over the weekend. > > we'll investigate a bit more next week. > > shane > -- > Shane Knapp > Computer Guy / Voice of Reason > UC Berkeley EECS Research / RISELab Staff Technical Lead > https://rise.cs.berkeley.edu > -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
quick jenkins restart
the primary is running out of memory pretty quickly, and i'm going to reboot the server quickly so that it doesn't crash over the weekend. we'll investigate a bit more next week. shane -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
Re: How to think about SparkPullRequestBuilder-K8s?
we're back. On Fri, Jun 11, 2021 at 2:30 PM shane knapp β wrote: > btw i just noticed jenkins was down, and i restarted the primary node. > > On Fri, Jun 11, 2021 at 12:09 PM Sean Owen wrote: > >> I find that somewhat often, the K8S PR builders will fail on a PR: >> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/ >> >> ... when the PR seems totally unrelated to K8S. I've kind of learned to >> ignore them in that case but that seems wrong. Are they just kind of flaky? >> am I imagining things? Just trying to figure out how much they're >> 'accurate' in catching real vs false failures. >> > > > -- > Shane Knapp > Computer Guy / Voice of Reason > UC Berkeley EECS Research / RISELab Staff Technical Lead > https://rise.cs.berkeley.edu > -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
Re: How to think about SparkPullRequestBuilder-K8s?
btw i just noticed jenkins was down, and i restarted the primary node. On Fri, Jun 11, 2021 at 12:09 PM Sean Owen wrote: > I find that somewhat often, the K8S PR builders will fail on a PR: > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/ > > ... when the PR seems totally unrelated to K8S. I've kind of learned to > ignore them in that case but that seems wrong. Are they just kind of flaky? > am I imagining things? Just trying to figure out how much they're > 'accurate' in catching real vs false failures. > -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
Re: [build system] jenkins down, working on it
we're back and building! On Tue, May 4, 2021 at 4:03 PM shane knapp β wrote: > jenkins went down some time in the past few days, and i'm currently > investigating. > > if it's been down a while, i apologize as i've been dealing w/some health > issues. > > shane > -- > Shane Knapp > Computer Guy / Voice of Reason > UC Berkeley EECS Research / RISELab Staff Technical Lead > https://rise.cs.berkeley.edu > -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
[build system] jenkins down, working on it
jenkins went down some time in the past few days, and i'm currently investigating. if it's been down a while, i apologize as i've been dealing w/some health issues. shane -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
Re: [SPARK-34738] issues w/k8s+minikube and PV tests
alright, my canary build w/skipping the PV integration test passed w/the docker driver: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-k8s-clone/20/ i'll put together a PR for this over the weekend (it's a one-liner) and once we merge i can get the remaining workers upgraded early next week. On Thu, Apr 15, 2021 at 3:05 PM shane knapp wrote: > i'm all for that... and once they're turned off, we can finish the > minikube/k8s/move-to-docker project in a couple of hours max. > > On Thu, Apr 15, 2021 at 3:00 PM Holden Karau wrote: > >> What about if we just turn off the PV tests for now? >> I'd be happy to help with the debugging/upgrading. >> >> On Thu, Apr 15, 2021 at 2:28 AM Rob Vesse wrote: >> > >> > There's at least one test (the persistent volumes one) that relies on >> some Minikube functionality because we run integration tests for our >> $dayjob Spark image builds using Docker for Desktop instead and that one >> test fails because it relies on some minikube specific functionality. That >> test could be refactored because I think it's just adding a minimal Ceph >> cluster to the K8S cluster which can be done to any K8S cluster in principle >> > >> > >> > >> > Rob >> > >> > >> > >> > From: shane knapp >> > Date: Wednesday, 14 April 2021 at 18:56 >> > To: Frank Luo >> > Cc: dev , Brian K Shiratsuki >> > Subject: Re: [SPARK-34738] issues w/k8s+minikube and PV tests >> > >> > >> > >> > On Wed, Apr 14, 2021 at 10:32 AM Frank Luo wrote: >> > >> > Is there any hard dependency on minikube? (i.e., GPU setting), kind ( https://kind.sigs.k8s.io/) is a stabler and simpler k8s cluster env on a >> single machine (only requires docker), it's been widely used by k8s projects >> for testing. >> > >> > >> > >> > there are no hard deps on minikube... it installs happily and >> successfully runs every integration test except for persistent volumes.
>> > >> > >> > >> > i haven't tried kind yet, but my time is super limited on this and i'd >> rather not venture down another rabbit hole unless we absolutely have to. >> > >> > >> >> >> >> -- >> Twitter: https://twitter.com/holdenkarau >> Books (Learning Spark, High Performance Spark, etc.): >> https://amzn.to/2MaRAG9 >> YouTube Live Streams: https://www.youtube.com/user/holdenkarau >> >> - >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >> >> > > -- > Shane Knapp > Computer Guy / Voice of Reason > UC Berkeley EECS Research / RISELab Staff Technical Lead > https://rise.cs.berkeley.edu > -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
Re: [SPARK-34738] issues w/k8s+minikube and PV tests
i'm all for that... and once they're turned off, we can finish the minikube/k8s/move-to-docker project in a couple of hours max. On Thu, Apr 15, 2021 at 3:00 PM Holden Karau wrote: > What about if we just turn off the PV tests for now? > I'd be happy to help with the debugging/upgrading. > > On Thu, Apr 15, 2021 at 2:28 AM Rob Vesse wrote: > > > > There's at least one test (the persistent volumes one) that relies on > some Minikube functionality because we run integration tests for our > $dayjob Spark image builds using Docker for Desktop instead and that one > test fails because it relies on some minikube specific functionality. That > test could be refactored because I think it's just adding a minimal Ceph > cluster to the K8S cluster which can be done to any K8S cluster in principle > > > > > > > > Rob > > > > > > > > From: shane knapp > > Date: Wednesday, 14 April 2021 at 18:56 > > To: Frank Luo > > Cc: dev , Brian K Shiratsuki > > Subject: Re: [SPARK-34738] issues w/k8s+minikube and PV tests > > > > > > > > On Wed, Apr 14, 2021 at 10:32 AM Frank Luo wrote: > > > > Is there any hard dependency on minikube? (i.e., GPU setting), kind ( > https://kind.sigs.k8s.io/) is a stabler and simpler k8s cluster env on a > single machine (only requires docker), it's been widely used by k8s projects > for testing. > > > > > > > > there are no hard deps on minikube... it installs happily and > successfully runs every integration test except for persistent volumes. > > > > i haven't tried kind yet, but my time is super limited on this and i'd > rather not venture down another rabbit hole unless we absolutely have to.
> > > > > > > > -- > Twitter: https://twitter.com/holdenkarau > Books (Learning Spark, High Performance Spark, etc.): > https://amzn.to/2MaRAG9 > YouTube Live Streams: https://www.youtube.com/user/holdenkarau > > - > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > > -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
Re: please read: current state and the future of the apache spark build system
> > medium term (in 6 months): > * prepare jenkins worker ansible configs and stick in the spark repo > - nothing fancy, but enough to config ubuntu workers > - could be used to create docker containers for testing in > THE CLOUD > > fwiw, i just decided to bang this out today: https://github.com/apache/spark/pull/32178 shane -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
Re: [SPARK-34738] issues w/k8s+minikube and PV tests
On Wed, Apr 14, 2021 at 10:32 AM Frank Luo wrote: > Is there any hard dependency on minikube? (i.e., GPU setting), kind ( > https://kind.sigs.k8s.io/) is a stabler and simpler k8s cluster env on a > single machine (only requires docker), it's been widely used by k8s projects > for testing. > > there are no hard deps on minikube... it installs happily and successfully runs every integration test except for persistent volumes. i haven't tried kind yet, but my time is super limited on this and i'd rather not venture down another rabbit hole unless we absolutely have to.
[SPARK-34738] issues w/k8s+minikube and PV tests
please see: https://issues.apache.org/jira/browse/SPARK-34738 i could really use a hand. all k8s integration tests are currently broken, and i'd rather spend the time fixing the latest version of minikube, k8s and the docker virtualization layer than debug the 'old' way which uses the kvm2/qemu virtualization layer. thanks in advance, shane -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
Re: Increase the number of parallel jobs in GitHub Actions at ASF organization level
On Wed, Apr 7, 2021 at 6:30 AM Hyukjin Kwon wrote: > Thanks Martin for your feedback. > > > What was your reason to migrate from Apache Jenkins to Github Actions ? > > I am sure there were more reasons for migrating from Amplab Jenkins > <https://amplab.cs.berkeley.edu/jenkins/> to GitHub Actions but as far as > I can remember: > - To reduce the maintenance cost of machines > - The Jenkins machines became unstable and slow causing CI jobs to fail or > be very flaky. > - Difficulty to manage the installed libraries. > - Intermittent unknown issues in the machines > > also: - uc berkeley has been hosting the build system for spark for ~10 years "free of charge" - funding for the build system is going away (amplab funded first, riselab second) - i have been managing the build system solo for 7 years and my job is much different now... - since there are no funds coming from research labs, i am unable to staff the build system past 2021 (tbh, even this year is a stretch) - the hardware is far past EOL and literally falling over - jenkins is, and always will be a PITA to run shane -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
please read: current state and the future of the apache spark build system
this will be a relatively big update, as there are many many moving pieces with short, medium and long term goals. TLDR1: we're shutting jenkins down at the end of 2021. TLDR2: i know we're way behind on pretty much everything. most of the hardware is at or beyond EOL, and random systemic build failures (like k8s/minikube) are randomly popping up. i've had to restrict access due to new campus policies, and i will be dealing with that shortly and only for a few contributors. long term (until EOY): * decide what the future of spark builds and releases will look like - do we need jenkins? - if we do, who's responsible for hosting + ops? * we will permanently shut down amplab jenkins by the end of 2021 - uc berkeley has funded this for over 10 years, and both the funds and staff (only me, for 7 years) are going away. i'm staying at cal, but have a much different job now. :) medium term (in 6 months): * prepare jenkins worker ansible configs and stick in the spark repo - nothing fancy, but enough to config ubuntu workers - could be used to create docker containers for testing in THE CLOUD * train up brian shiratsuki (cced) to help w/ops tasks and upgrades over the next ~6m * get to all of the python version, library installation, etc etc jira requests short term (weeks): * debug and figure out why minikube/k8s broke - https://issues.apache.org/jira/browse/SPARK-34738 - i really could use some help here... * bring up additional workers - finish hardware/system level repairs on the bare metal - see above, re k8s jira * stabilize cluster - recent jenkins LTS upgrade broke the web GUI - finish deploying monitoring/alerting - this hardware is OLD and literally falling over, so we have lots of random disk and ram failures. it's literally whack-a-mole and each trip to the colo to repair literally takes a full day i'm only able to spend a few hours a week on the build system, so expect random downtime, reboots, restarts, and testing.
we're testing new nodes as we deploy, and hoping to fix anything before releasing them into the wild, but some things might be flaky. but the biggest question is what you all need w/regards to build infrastructure... and who's going to be responsible for it. thanks for reading! :) shane -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
Re: [build system] short downtime today, new workers coming soon
we're back! On Tue, Mar 23, 2021 at 12:31 PM shane knapp β wrote: > jenkins is acting up, and i'm going to take the opportunity to reboot the > primary and all the workers. > > sorry for the short notice, but on the bright side we have a bunch of > shiny new workers coming soon! > > shane > -- > Shane Knapp > Computer Guy / Voice of Reason > UC Berkeley EECS Research / RISELab Staff Technical Lead > https://rise.cs.berkeley.edu > -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
[build system] short downtime today, new workers coming soon
jenkins is acting up, and i'm going to take the opportunity to reboot the primary and all the workers. sorry for the short notice, but on the bright side we have a bunch of shiny new workers coming soon! shane -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
Re: [build system] github fetches timing out
it's been happening a lot again recently... i'm investigating. On Wed, Mar 10, 2021 at 10:23 AM Liang-Chi Hsieh wrote: > Thanks Shane for looking at it! > > > shane knapp β wrote > > ...and just like that, overnight the builds started successfully git > > fetching! > > > > -- > > Shane Knapp > > Computer Guy / Voice of Reason > > UC Berkeley EECS Research / RISELab Staff Technical Lead > > https://rise.cs.berkeley.edu > > > > > > -- > Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/ > > ----- > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > > -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
Re: [build system] github fetches timing out
...and just like that, overnight the builds started successfully git fetching! On Tue, Mar 9, 2021 at 12:31 PM shane knapp β wrote: > it looks like over the past few days the master/branch builds have been > timing out... this hasn't happened in a few years, and honestly the last > times this happened there was nothing that either i, or github could do > about it. it cleared up after a number of weeks, and we were never able to > pinpoint the root cause. > > we're not hitting a github api ratelimit, and i'm able to successfully run > the git commands on worker nodes on the command line as the jenkins user. > > example: > > https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-3.0-test-maven-hadoop-2.7-hive-2.3-jdk-11/1014/console > > i wish i had a more concrete answer or solution for what's going on... > i'll continue to investigate as best i can today, and if this continues, > i'll re-open my issue w/github and see if they can shed any light on the > situation. > > shane > -- > Shane Knapp > Computer Guy / Voice of Reason > UC Berkeley EECS Research / RISELab Staff Technical Lead > https://rise.cs.berkeley.edu > -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
[build system] github fetches timing out
it looks like over the past few days the master/branch builds have been timing out... this hasn't happened in a few years, and honestly the last times this happened there was nothing that either i, or github could do about it. it cleared up after a number of weeks, and we were never able to pinpoint the root cause. we're not hitting a github api ratelimit, and i'm able to successfully run the git commands on worker nodes on the command line as the jenkins user. example: https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-3.0-test-maven-hadoop-2.7-hive-2.3-jdk-11/1014/console i wish i had a more concrete answer or solution for what's going on... i'll continue to investigate as best i can today, and if this continues, i'll re-open my issue w/github and see if they can shed any light on the situation. shane -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
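to reproduce the worker-side check by hand, a hedged sketch: put a hard time limit on a fetch-like probe and classify the result, so a hang is distinguishable from an outright failure. Helper names are illustrative; `timeout` exiting 124 on expiry is standard coreutils behavior:

```shell
# Turn the exit status of a time-limited command into a label.
# coreutils `timeout` exits 124 when the time limit was hit, so a
# hung fetch is distinguishable from a plain failure.
classify_probe() {
  case "$1" in
    0)   echo ok ;;
    124) echo timeout ;;
    *)   echo error ;;
  esac
}

# Probing a remote roughly the way a worker would
# (assumption: git and coreutils `timeout` are on the PATH):
#   timeout 30 git ls-remote --heads https://github.com/apache/spark.git >/dev/null 2>&1
#   classify_probe $?
```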
Re: minikube and kubernetes cluster versions for integration testing
fwiw, upgrading minikube and the associated VM drivers is potentially a PITA. your PR will absolutely be tested before merging. :) On Thu, Mar 4, 2021 at 10:13 AM attilapiros wrote: > Thanks Shane! > > I can do the documentation task and the Minikube version check can be > incorporated into my PR. > When my PR is finalized (probably next week) I will create a jira for you > and you can set up the test systems and you can even test my PR before > merging it. Is this possible / fine for you? > > > > > > -- > Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/ > > - > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > > -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
Re: minikube and kubernetes cluster versions for integration testing
rsion we should drop everything under v1.3.0. >> >> 2) I would suggest dropping v1.15.12 as a kubernetes >> version because of this issue >> <https://github.com/kubernetes/minikube/issues/10663> (I just found it >> by running my script). >> >> 3) On Minikube v1.7.2 there is this permission denied issue >> <https://github.com/kubernetes/minikube/issues/6583> so I suggest >> supporting Minikube version 1.7.3 and greater. >> >> My test script is check_minikube_versions.zsh >> <https://gist.github.com/attilapiros/8648a782e0b956b59f03f914c88c2df3#file-check_minikube_versions-zsh>. >> It >> was executed on Mac but with a simple sed expression it can be tailored to >> linux too. >> >> >> >> *After all of this my questions:* >> *A) What about changing the required versions and suggesting kubernetes >> v1.17.3 and Minikube v1.7.3 and greater for integration testing?* >> >> I would choose v1.17.3 for the k8s cluster as that is the newest supported >> k8s version for Minikube v1.7.3 (hoping it will be good for us for a long >> time). >> If you agree with this suggestion I will go ahead and update the relevant >> documentation. >> >> >> >> *B) How about extending the integration test to check whether the >> Minikube version is sufficient? *That way we can provide a meaningful >> error when it is violated. >> >> Bests, >> Attila >> > -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
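a minimum-version guard like the one proposed in this thread could look roughly like the following. it's a sketch under assumptions: the v1.7.3 floor comes from the thread, but `minikube version --short` is an assumption — older minikube releases may need the long output parsed instead:

```shell
# hedged sketch of a minimum-minikube-version check for the
# integration tests; v1.7.3 is the floor suggested in the thread.
MINIMUM_MINIKUBE_VERSION="1.7.3"

version_ge() {
  # true if $1 >= $2 under version ordering (GNU sort -V)
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# only run the real check when minikube is actually installed
if command -v minikube >/dev/null 2>&1; then
  # `--short` is assumed here; long-form output would need a grep/sed
  installed="$(minikube version --short 2>/dev/null | sed 's/^v//')"
  if ! version_ge "${installed:-0}" "$MINIMUM_MINIKUBE_VERSION"; then
    echo "minikube >= v$MINIMUM_MINIKUBE_VERSION required (found v${installed:-unknown})" >&2
    exit 1
  fi
fi
```

failing fast here gives the meaningful error attila asks for, instead of a confusing mid-test breakage on an old minikube.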
Re: [build system] jenkins wedged, going to restart after current builds finish
this was done about an hour ago... rebooted several of the workers to clear out lingering builds, and one worker had an SSD fail on boot and is currently offline. shane On Tue, Feb 23, 2021 at 10:13 AM shane knapp β wrote: > EOM > > -- > Shane Knapp > Computer Guy / Voice of Reason > UC Berkeley EECS Research / RISELab Staff Technical Lead > https://rise.cs.berkeley.edu > -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
[build system] jenkins wedged, going to restart after current builds finish
EOM -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
Re: K8s integration test failure ("credentials Jenkins is using is probably wrong...")
stupid bash variable assignment. i'm surprised this has lingered for as long as it had (3 years). it's fixed and shouldn't be an issue any more. On Tue, Feb 23, 2021 at 9:28 AM shane knapp β wrote: > the AmplabJenks bot's github creds are out of date, which is causing that > non-fatal error. however, if you scroll back you'll see that minikube > actually failed to start. that should have definitely failed the build, so > i'll look at the job's bash logic and see what we missed. > > also, that worker (research-jenkins-worker-07) had some lingering builds > running and i bet there was a collision w/a dangling minikube instance. > i'm rebooting that worker now. > > shane > > > > On Tue, Feb 23, 2021 at 6:47 AM Sean Owen wrote: > >> Shane would you know? May be a problem with a single worker. >> >> On Tue, Feb 23, 2021 at 8:46 AM Phillip Henry >> wrote: >> >>> >>> Hi, >>> >>> Silly question: the Jenkins build for my PR is failing but it seems >>> outside of my control. What must I do to remedy this? >>> >>> I've submitted >>> >>> https://github.com/apache/spark/pull/31535 >>> >>> but Spark QA is telling me "Kubernetes integration test status failure". >>> >>> The Jenkins job says "SUCCESS" but also barfs with: >>> >>> FileNotFoundException means that the credentials Jenkins is using is >>> probably wrong. Or the user account does not have write access to the repo. >>> >>> >>> See >>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39934/consoleFull >>> >>> Can anybody please advise? >>> >>> Thanks in advance. >>> >>> Phillip >>> >>> >>> > > -- > Shane Knapp > Computer Guy / Voice of Reason > UC Berkeley EECS Research / RISELab Staff Technical Lead > https://rise.cs.berkeley.edu > -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
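one common form of the "stupid bash variable assignment" bug: combining `local` with a command substitution discards the command's exit status, because `$?` ends up reflecting the `local` builtin (which succeeds) rather than the command. a failed `minikube start` captured that way can leave the build looking green. a minimal sketch with hypothetical commands, not the actual job script:

```shell
#!/usr/bin/env bash
# `local var=$(cmd)` returns the exit status of `local` itself
# (always 0), so cmd's failure is silently discarded.

broken() {
  local status=$(false)   # exit status of `false` is lost here
  echo "rc=$?"            # prints rc=0 -- the failure vanishes
}

# fix: declare first, assign separately, so $? reflects the command
fixed() {
  local status
  status=$(false)
  echo "rc=$?"            # prints rc=1 -- the failure is visible
}

broken
fixed
```

shellcheck flags the broken pattern as SC2155, which is one way a bug like this can hide in a job script for years.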
Re: K8s integration test failure ("credentials Jenkins is using is probably wrong...")
the AmplabJenks bot's github creds are out of date, which is causing that non-fatal error. however, if you scroll back you'll see that minikube actually failed to start. that should have definitely failed the build, so i'll look at the job's bash logic and see what we missed. also, that worker (research-jenkins-worker-07) had some lingering builds running and i bet there was a collision w/a dangling minikube instance. i'm rebooting that worker now. shane On Tue, Feb 23, 2021 at 6:47 AM Sean Owen wrote: > Shane would you know? May be a problem with a single worker. > > On Tue, Feb 23, 2021 at 8:46 AM Phillip Henry > wrote: > >> >> Hi, >> >> Silly question: the Jenkins build for my PR is failing but it seems >> outside of my control. What must I do to remedy this? >> >> I've submitted >> >> https://github.com/apache/spark/pull/31535 >> >> but Spark QA is telling me "Kubernetes integration test status failure". >> >> The Jenkins job says "SUCCESS" but also barfs with: >> >> FileNotFoundException means that the credentials Jenkins is using is >> probably wrong. Or the user account does not have write access to the repo. >> >> >> See >> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39934/consoleFull >> >> Can anybody please advise? >> >> Thanks in advance. >> >> Phillip >> >> >> -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
Re: [FYI] CI Infra issues (in both GitHub Action and Jenkins)
no, i don't think that'd be a good idea... adding additional dependencies to our cluster won't scale one bit. On Fri, Jan 8, 2021 at 2:16 PM Dongjoon Hyun wrote: > BTW, Shane, do you think we can utilize some of UCB machines as GitHub > Action runners? > > Bests, > Dongjoon. > > On Fri, Jan 8, 2021 at 2:14 PM Dongjoon Hyun > wrote: > >> The followings? >> >> >> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-3.2/1836/console >> >> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/1887/console >> >> On Fri, Jan 8, 2021 at 2:13 PM shane knapp β wrote: >> >>> 1. Jenkins machines start to fail with the following recently. >>>> (master branch) >>>> >>>> Python versions prior to 3.6 are not supported. >>>> Build step 'Execute shell' marked build as failure >>>> >>>> examples please? >>> >>> -- >>> Shane Knapp >>> Computer Guy / Voice of Reason >>> UC Berkeley EECS Research / RISELab Staff Technical Lead >>> https://rise.cs.berkeley.edu >>> >> -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
Re: [FYI] CI Infra issues (in both GitHub Action and Jenkins)
hmm, the ubuntu16 machines are acting up. i pinned the sbt master builds to ubuntu20 and they're happily building while i investigate wtf is up. On Fri, Jan 8, 2021 at 2:15 PM Dongjoon Hyun wrote: > The followings? > > > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-3.2/1836/console > > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/1887/console > > On Fri, Jan 8, 2021 at 2:13 PM shane knapp β wrote: > >> 1. Jenkins machines start to fail with the following recently. >>> (master branch) >>> >>> Python versions prior to 3.6 are not supported. >>> Build step 'Execute shell' marked build as failure >>> >>> examples please? >> >> -- >> Shane Knapp >> Computer Guy / Voice of Reason >> UC Berkeley EECS Research / RISELab Staff Technical Lead >> https://rise.cs.berkeley.edu >> > -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
Re: [FYI] CI Infra issues (in both GitHub Action and Jenkins)
> > 1. Jenkins machines start to fail with the following recently. > (master branch) > > Python versions prior to 3.6 are not supported. > Build step 'Execute shell' marked build as failure > > examples please? -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
[build system] jenkins downtime 01/02/2021 - 01/03/2021
the colo facility where jenkins is hosted is going down for roughly a day for some (more) power upgrades. once the colo is powered back up, we'll make sure that all the jenkins workers and primary nodes are up and happily building. if anyone notices any issues w/jenkins before, during or after this event, please send an email to research-supp...@cs.berkeley.edu and we'll get to it as quickly as we can[1]. wishing everyone here a happy holiday season, shane [1] -- these are for issues w/the build system itself, not for things like package installs and updates. keep those on the apache spark jira. :) -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
Re: [build system] WE'RE LIVE!
ok, it's broken on the new nodes, so i tied the project to ubuntu16. i'll create a jira and investigate further at a later date. On Fri, Dec 4, 2020 at 8:58 AM shane knapp β wrote: > no, it isn't but i'll try and take a look at this later today. > > On Fri, Dec 4, 2020 at 7:12 AM Tom Graves wrote: > >> thanks Shane and folks for great work. >> >> Not sure if this is at all related but I noticed the spark master deploy >> job hasn't been running and the last one Dec 2nd failed: >> >> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/3186/ >> >> Not sure if this is result of upgrade? >> >> Thanks, >> Tom >> On Tuesday, December 1, 2020, 06:55:27 PM CST, shane knapp β < >> skn...@berkeley.edu> wrote: >> >> >> https://amplab.cs.berkeley.edu/jenkins/ >> >> i cleared the build queue, so you'll need to retrigger your PRs. there >> will be occasional downtime over the next few days and weeks as we uncover >> system-level errors and more reimaging happens... but for now, we're >> building. >> >> a big thanks goes out to jon for his work on the project! we couldn't >> have done it w/o him. >> >> shane >> -- >> Shane Knapp >> Computer Guy / Voice of Reason >> UC Berkeley EECS Research / RISELab Staff Technical Lead >> https://rise.cs.berkeley.edu >> > > > -- > Shane Knapp > Computer Guy / Voice of Reason > UC Berkeley EECS Research / RISELab Staff Technical Lead > https://rise.cs.berkeley.edu > -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
Re: [build system] WE'RE LIVE!
no, it isn't but i'll try and take a look at this later today. On Fri, Dec 4, 2020 at 7:12 AM Tom Graves wrote: > thanks Shane and folks for great work. > > Not sure if this is at all related but I noticed the spark master deploy > job hasn't been running and the last one Dec 2nd failed: > > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/3186/ > > Not sure if this is result of upgrade? > > Thanks, > Tom > On Tuesday, December 1, 2020, 06:55:27 PM CST, shane knapp β < > skn...@berkeley.edu> wrote: > > > https://amplab.cs.berkeley.edu/jenkins/ > > i cleared the build queue, so you'll need to retrigger your PRs. there > will be occasional downtime over the next few days and weeks as we uncover > system-level errors and more reimaging happens... but for now, we're > building. > > a big thanks goes out to jon for his work on the project! we couldn't > have done it w/o him. > > shane > -- > Shane Knapp > Computer Guy / Voice of Reason > UC Berkeley EECS Research / RISELab Staff Technical Lead > https://rise.cs.berkeley.edu > -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
[build system] WE'RE LIVE!
https://amplab.cs.berkeley.edu/jenkins/ i cleared the build queue, so you'll need to retrigger your PRs. there will be occasional downtime over the next few days and weeks as we uncover system-level errors and more reimaging happens... but for now, we're building. a big thanks goes out to jon for his work on the project! we couldn't have done it w/o him. shane -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
Re: [build system] jenkins downtime today/tomorrow
quick update: the migration to the new primary node is complete and i can bring up jenkins and it's queueing builds and looks good to go. the final bits that need attention are SSL, apache2 and firewall configs, and i'm hoping to get this sorted ASAP. once that's done, we'll start building and move on to fixing any lingering environment/system issues that pop up. shane On Mon, Nov 30, 2020 at 4:01 PM shane knapp β wrote: > amplab jenkins is down. > > On Mon, Nov 30, 2020 at 3:25 PM shane knapp β wrote: > >> old jenkins is getting shut down Real Soon Now[tm]! crossing my >> fingers! :) >> >> On Mon, Nov 30, 2020 at 10:05 AM shane knapp β >> wrote: >> >>> hey all! >>> >>> the Great Jenkins Migration[tm] is well under way, and we will be >>> sunsetting the old amp-jenkins-master server and moving to a new one. >>> >>> i've put jenkins in to quiet mode so that it won't accept new builds and >>> we'll let the ones currently running finish. once that's done, i will be >>> rysncing the entire jenkins installation to the new server and bringing >>> that up. we most definitely will have a bunch of minor bugs to knock out, >>> but i'm expecting us to be back up and building by EOD tomorrow (12/1/2020). >>> >>> thanks for your patience, and i'll be sure to send out updates as they >>> come. >>> >>> shane/brian/jon >>> -- >>> Shane Knapp >>> Computer Guy / Voice of Reason >>> UC Berkeley EECS Research / RISELab Staff Technical Lead >>> https://rise.cs.berkeley.edu >>> >> >> >> -- >> Shane Knapp >> Computer Guy / Voice of Reason >> UC Berkeley EECS Research / RISELab Staff Technical Lead >> https://rise.cs.berkeley.edu >> > > > -- > Shane Knapp > Computer Guy / Voice of Reason > UC Berkeley EECS Research / RISELab Staff Technical Lead > https://rise.cs.berkeley.edu > -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
Re: [build system] jenkins downtime today/tomorrow
amplab jenkins is down. On Mon, Nov 30, 2020 at 3:25 PM shane knapp β wrote: > old jenkins is getting shut down Real Soon Now[tm]! crossing my fingers! > :) > > On Mon, Nov 30, 2020 at 10:05 AM shane knapp β > wrote: > >> hey all! >> >> the Great Jenkins Migration[tm] is well under way, and we will be >> sunsetting the old amp-jenkins-master server and moving to a new one. >> >> i've put jenkins in to quiet mode so that it won't accept new builds and >> we'll let the ones currently running finish. once that's done, i will be >> rysncing the entire jenkins installation to the new server and bringing >> that up. we most definitely will have a bunch of minor bugs to knock out, >> but i'm expecting us to be back up and building by EOD tomorrow (12/1/2020). >> >> thanks for your patience, and i'll be sure to send out updates as they >> come. >> >> shane/brian/jon >> -- >> Shane Knapp >> Computer Guy / Voice of Reason >> UC Berkeley EECS Research / RISELab Staff Technical Lead >> https://rise.cs.berkeley.edu >> > > > -- > Shane Knapp > Computer Guy / Voice of Reason > UC Berkeley EECS Research / RISELab Staff Technical Lead > https://rise.cs.berkeley.edu > -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
Re: [build system] jenkins downtime today/tomorrow
old jenkins is getting shut down Real Soon Now[tm]! crossing my fingers! :) On Mon, Nov 30, 2020 at 10:05 AM shane knapp β wrote: > hey all! > > the Great Jenkins Migration[tm] is well under way, and we will be > sunsetting the old amp-jenkins-master server and moving to a new one. > > i've put jenkins in to quiet mode so that it won't accept new builds and > we'll let the ones currently running finish. once that's done, i will be > rysncing the entire jenkins installation to the new server and bringing > that up. we most definitely will have a bunch of minor bugs to knock out, > but i'm expecting us to be back up and building by EOD tomorrow (12/1/2020). > > thanks for your patience, and i'll be sure to send out updates as they > come. > > shane/brian/jon > -- > Shane Knapp > Computer Guy / Voice of Reason > UC Berkeley EECS Research / RISELab Staff Technical Lead > https://rise.cs.berkeley.edu > -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
[build system] jenkins downtime today/tomorrow
hey all! the Great Jenkins Migration[tm] is well under way, and we will be sunsetting the old amp-jenkins-master server and moving to a new one. i've put jenkins into quiet mode so that it won't accept new builds and we'll let the ones currently running finish. once that's done, i will be rsyncing the entire jenkins installation to the new server and bringing that up. we most definitely will have a bunch of minor bugs to knock out, but i'm expecting us to be back up and building by EOD tomorrow (12/1/2020). thanks for your patience, and i'll be sure to send out updates as they come. shane/brian/jon -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
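the migration step above, sketched in shell. hostname and paths are placeholders, not the real servers, and DRY_RUN defaults to on so this sketch only prints what it would do:

```shell
# hedged sketch of the jenkins-home migration described above.
# NEW_PRIMARY is a hypothetical hostname; DRY_RUN defaults to on.
DRY_RUN=${DRY_RUN:-1}
JENKINS_HOME=${JENKINS_HOME:-/var/lib/jenkins}
NEW_PRIMARY=${NEW_PRIMARY:-new-jenkins-primary}

run() { if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi; }

# stop jenkins so the home directory is quiescent during the copy
run systemctl stop jenkins
# -a preserves permissions/ownership/timestamps; --delete keeps
# repeated runs an exact mirror of the source
run rsync -a --delete "$JENKINS_HOME/" "$NEW_PRIMARY:$JENKINS_HOME/"
```

running rsync repeatedly before the final cutover (then once more with jenkins stopped) is a common way to keep the downtime window short.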
Re: [build system] IMPORTANT UPDATE
alright, builds are looking solid except for SBT... if someone here could take a look at those failures i'd be most appreciative. the important ones: PRB, PRB-K8s, k8s, snapshot and maven builds all green! i'm literally gobsmacked by how smoothly this went. :) we're all going to enjoy a mellow holiday and i'll check build statuses every now and then and see if i find anything else like this: https://issues.apache.org/jira/browse/SPARK-33565 have a great holiday everyone! we'll start getting the new primary set up on monday, and hopefully by tuesday be fully up and running. shane On Wed, Nov 25, 2020 at 1:35 PM shane knapp β wrote: > hey all, work is going quite well and smoothly for this project. > > today's update: > > we will experience significant downtime monday/tuesday as we spin up the > new primary jenkins node. until then, we'll be building over the next few > days so i'll have a chance to better track down and fix any system-level > build breaks. > > but most importantly, i just added 3 of the 4 new ubuntu 20.04 workers to > the pool: research-jenkins-worker-03, 04 and 06. -05 is being difficult, > so i'm going to let it pout in the corner for a while before hitting it > again w/the ansible cannon. > > shane > > On Tue, Nov 24, 2020 at 6:08 PM shane knapp β wrote: > >> all spark builds have been ported and triggered: >> >> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/ >> >> not shown are the regular and k8s PRB, which are also running. >> >> i think i've nailed down most of the stupid PATH and JAVA_HOME issues, >> but i'm sure we'll have some stuff to work out. i'm mostly keeping an eye >> on the build history of research-jenkins-worker-01 and -02, as they're >> running the latest OS + ansible (which will be moved in to the spark repo >> asap). >> >> i'm still concerned about sbt failures, which includes the PRB. we'll >> see how things go, and just focus on getting things working on ubuntu 20 >> LTS. 
if we need to drop the ubuntu 16 workers from the pool temporarily, i >> would be more than happy to do that. we'll lose some capacity, but it >> looks like we have a solid template for getting these suckers redeployed so >> turn-around should be pretty quick. >> >> we also need to dedicate some time to clean up/fix our plugin configs. >> there's been a lot of change over the past three years and things like PRB >> triggers seem flaky (it took 28m instead of 5m for this job to trigger: >> https://github.com/apache/spark/pull/29994) >> >> this all being said, i'm really happy w/our progress so far and have >> started leaning towards 'cautiously optimistic'... we'll see how things go >> and recalibrate accordingly. i'll have a better idea of where we are >> tomorrow and keep the list updated. >> >> and finally: a HUGE thanks goes out to jon for the work going on at the >> colo this moment: rack rearrangement, cleaning up networking, fixing >> hardware, reimaging and generally kicking ass! >> >> have a great holiday! >> >> shane >> >> On Tue, Nov 24, 2020 at 2:24 PM shane knapp β >> wrote: >> >>> our very first ubuntu-based PRB is running: >>> >>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131701/ >>> >>> crossing my fingers! :) >>> >>> On Tue, Nov 24, 2020 at 1:30 PM shane knapp β >>> wrote: >>> >>>> due to scheduling, upcoming holiday and in-the-colo work requirements, >>>> all of the centos workers are being wiped NOW. >>>> >>>> this is great, as the sooner we can get started on fixing builds the >>>> better. i'm not going anywhere over the holiday, so i'll get a good >>>> head-start on things. >>>> >>>> thank you jon! >>>> >>>> shane >>>> >>>> On Tue, Nov 24, 2020 at 11:24 AM shane knapp β >>>> wrote: >>>> >>>>> this is a lengthy, but important read for everyone here. >>>>> >>>>> in the next few days, the remaining centos machines (PRB/SBT workers >>>>> AND primary) will have be reimaged from centos6.9 to ubuntu 20.04LTS. 
>>>>> >>>>> this means three important things on the very near horizon: >>>>> 1 -- the PRB and SBT tests WILL BE BROKEN (by thanksgiving) >>>>> 2 -- jenkins itself will be down for a while as we move the jenkins >>>>> installation to its new home. >>>>> 3 -- those of you with accounts here will temporarily lose access
Re: [build system] IMPORTANT UPDATE
hey all, work is going quite well and smoothly for this project. today's update: we will experience significant downtime monday/tuesday as we spin up the new primary jenkins node. until then, we'll be building over the next few days so i'll have a chance to better track down and fix any system-level build breaks. but most importantly, i just added 3 of the 4 new ubuntu 20.04 workers to the pool: research-jenkins-worker-03, 04 and 06. -05 is being difficult, so i'm going to let it pout in the corner for a while before hitting it again w/the ansible cannon. shane On Tue, Nov 24, 2020 at 6:08 PM shane knapp β wrote: > all spark builds have been ported and triggered: > > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/ > > not shown are the regular and k8s PRB, which are also running. > > i think i've nailed down most of the stupid PATH and JAVA_HOME issues, but > i'm sure we'll have some stuff to work out. i'm mostly keeping an eye on > the build history of research-jenkins-worker-01 and -02, as they're running > the latest OS + ansible (which will be moved in to the spark repo asap). > > i'm still concerned about sbt failures, which includes the PRB. we'll see > how things go, and just focus on getting things working on ubuntu 20 LTS. > if we need to drop the ubuntu 16 workers from the pool temporarily, i would > be more than happy to do that. we'll lose some capacity, but it looks like > we have a solid template for getting these suckers redeployed so > turn-around should be pretty quick. > > we also need to dedicate some time to clean up/fix our plugin configs. > there's been a lot of change over the past three years and things like PRB > triggers seem flaky (it took 28m instead of 5m for this job to trigger: > https://github.com/apache/spark/pull/29994) > > this all being said, i'm really happy w/our progress so far and have > started leaning towards 'cautiously optimistic'... we'll see how things go > and recalibrate accordingly. 
i'll have a better idea of where we are > tomorrow and keep the list updated. > > and finally: a HUGE thanks goes out to jon for the work going on at the > colo this moment: rack rearrangement, cleaning up networking, fixing > hardware, reimaging and generally kicking ass! > > have a great holiday! > > shane > > On Tue, Nov 24, 2020 at 2:24 PM shane knapp β wrote: > >> our very first ubuntu-based PRB is running: >> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131701/ >> >> crossing my fingers! :) >> >> On Tue, Nov 24, 2020 at 1:30 PM shane knapp β >> wrote: >> >>> due to scheduling, upcoming holiday and in-the-colo work requirements, >>> all of the centos workers are being wiped NOW. >>> >>> this is great, as the sooner we can get started on fixing builds the >>> better. i'm not going anywhere over the holiday, so i'll get a good >>> head-start on things. >>> >>> thank you jon! >>> >>> shane >>> >>> On Tue, Nov 24, 2020 at 11:24 AM shane knapp β >>> wrote: >>> >>>> this is a lengthy, but important read for everyone here. >>>> >>>> in the next few days, the remaining centos machines (PRB/SBT workers >>>> AND primary) will have be reimaged from centos6.9 to ubuntu 20.04LTS. >>>> >>>> this means three important things on the very near horizon: >>>> 1 -- the PRB and SBT tests WILL BE BROKEN (by thanksgiving) >>>> 2 -- jenkins itself will be down for a while as we move the jenkins >>>> installation to it's new home. >>>> 3 -- those of you with accounts here will temporarily lose access >>>> >>>> regarding (1), brian (cced) will be helping me debug and fix any >>>> system-level bugs (python envs, missing packages, etc). jon (cced) will be >>>> doing the reimaging and cobbling together of hardware to keep us on our >>>> feet. their help is going to be invaluable to getting us back on the >>>> ground. 
>>>> >>>> we already have two ubuntu 20 workers up and building >>>> (research-jenkins-worker-0[1,2]), and the SparkPullRequestBuilder-K8s build >>>> is already green. i'll keep an eye on these workers to ensure i didn't >>>> miss anything. >>>> >>>> once we have a couple of more ubuntu 20 machines up, i'll move the PRB >>>> and SBT builds there and let them fail as often as possible so we can use >>>> the build logs during the migration of the primary.
Re: [build system] IMPORTANT UPDATE
all spark builds have been ported and triggered: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/ not shown are the regular and k8s PRB, which are also running. i think i've nailed down most of the stupid PATH and JAVA_HOME issues, but i'm sure we'll have some stuff to work out. i'm mostly keeping an eye on the build history of research-jenkins-worker-01 and -02, as they're running the latest OS + ansible (which will be moved in to the spark repo asap). i'm still concerned about sbt failures, which includes the PRB. we'll see how things go, and just focus on getting things working on ubuntu 20 LTS. if we need to drop the ubuntu 16 workers from the pool temporarily, i would be more than happy to do that. we'll lose some capacity, but it looks like we have a solid template for getting these suckers redeployed so turn-around should be pretty quick. we also need to dedicate some time to clean up/fix our plugin configs. there's been a lot of change over the past three years and things like PRB triggers seem flaky (it took 28m instead of 5m for this job to trigger: https://github.com/apache/spark/pull/29994) this all being said, i'm really happy w/our progress so far and have started leaning towards 'cautiously optimistic'... we'll see how things go and recalibrate accordingly. i'll have a better idea of where we are tomorrow and keep the list updated. and finally: a HUGE thanks goes out to jon for the work going on at the colo this moment: rack rearrangement, cleaning up networking, fixing hardware, reimaging and generally kicking ass! have a great holiday! shane On Tue, Nov 24, 2020 at 2:24 PM shane knapp β wrote: > our very first ubuntu-based PRB is running: > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131701/ > > crossing my fingers! :) > > On Tue, Nov 24, 2020 at 1:30 PM shane knapp β wrote: > >> due to scheduling, upcoming holiday and in-the-colo work requirements, >> all of the centos workers are being wiped NOW. 
>> >> this is great, as the sooner we can get started on fixing builds the >> better. i'm not going anywhere over the holiday, so i'll get a good >> head-start on things. >> >> thank you jon! >> >> shane >> >> On Tue, Nov 24, 2020 at 11:24 AM shane knapp β >> wrote: >> >>> this is a lengthy, but important read for everyone here. >>> >>> in the next few days, the remaining centos machines (PRB/SBT workers AND >>> primary) will have be reimaged from centos6.9 to ubuntu 20.04LTS. >>> >>> this means three important things on the very near horizon: >>> 1 -- the PRB and SBT tests WILL BE BROKEN (by thanksgiving) >>> 2 -- jenkins itself will be down for a while as we move the jenkins >>> installation to it's new home. >>> 3 -- those of you with accounts here will temporarily lose access >>> >>> regarding (1), brian (cced) will be helping me debug and fix any >>> system-level bugs (python envs, missing packages, etc). jon (cced) will be >>> doing the reimaging and cobbling together of hardware to keep us on our >>> feet. their help is going to be invaluable to getting us back on the >>> ground. >>> >>> we already have two ubuntu 20 workers up and building >>> (research-jenkins-worker-0[1,2]), and the SparkPullRequestBuilder-K8s build >>> is already green. i'll keep an eye on these workers to ensure i didn't >>> miss anything. >>> >>> once we have a couple of more ubuntu 20 machines up, i'll move the PRB >>> and SBT builds there and let them fail as often as possible so we can use >>> the build logs during the migration of the primary. >>> >>> then we shut down jenkins and move to the new primary. >>> >>> this will all be happening in the next week to week-and-a-half. >>> >>> nearish on the horizon, we need to do two things: >>> 1 -- reimage the ubuntu 16 workers >>> 2 -- clean up the all of the breakages within jenkins plugin universe. >>> there's a lot of stacktraces everywhere after the upgrade, but things are >>> still building so i'm inclined to push this out. 
>>> 3 -- fix the PRB/SBT builds. >>> >>> further off, once we're stable, we (the spark community) will need to >>> have an honest conversation about where the build system lives. we don't >>> currently have enough resources here to manage the system in a way that it >>> deserves, and i can't foresee getting the staffing for long-term support any >>> time soon.
Re: [build system] IMPORTANT UPDATE
our very first ubuntu-based PRB is running: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131701/ crossing my fingers! :) On Tue, Nov 24, 2020 at 1:30 PM shane knapp β wrote: > due to scheduling, upcoming holiday and in-the-colo work requirements, all > of the centos workers are being wiped NOW. > > this is great, as the sooner we can get started on fixing builds the > better. i'm not going anywhere over the holiday, so i'll get a good > head-start on things. > > thank you jon! > > shane > > On Tue, Nov 24, 2020 at 11:24 AM shane knapp β > wrote: > >> this is a lengthy, but important read for everyone here. >> >> in the next few days, the remaining centos machines (PRB/SBT workers AND >> primary) will have be reimaged from centos6.9 to ubuntu 20.04LTS. >> >> this means three important things on the very near horizon: >> 1 -- the PRB and SBT tests WILL BE BROKEN (by thanksgiving) >> 2 -- jenkins itself will be down for a while as we move the jenkins >> installation to it's new home. >> 3 -- those of you with accounts here will temporarily lose access >> >> regarding (1), brian (cced) will be helping me debug and fix any >> system-level bugs (python envs, missing packages, etc). jon (cced) will be >> doing the reimaging and cobbling together of hardware to keep us on our >> feet. their help is going to be invaluable to getting us back on the >> ground. >> >> we already have two ubuntu 20 workers up and building >> (research-jenkins-worker-0[1,2]), and the SparkPullRequestBuilder-K8s build >> is already green. i'll keep an eye on these workers to ensure i didn't >> miss anything. >> >> once we have a couple of more ubuntu 20 machines up, i'll move the PRB >> and SBT builds there and let them fail as often as possible so we can use >> the build logs during the migration of the primary. >> >> then we shut down jenkins and move to the new primary. >> >> this will all be happening in the next week to week-and-a-half. 
>> >> nearish on the horizon, we need to do two things: >> 1 -- reimage the ubuntu 16 workers >> 2 -- clean up the all of the breakages within jenkins plugin universe. >> there's a lot of stacktraces everywhere after the upgrade, but things are >> still building so i'm inclined to push this out. >> 3 -- fix the PRB/SBT builds. >> >> further off, once we're stable, we (the spark community) will need to >> have an honest conversation about where the build system lives. we don't >> currently have enough resources here to manage the system in a way that it >> deserves, and i can't forsee getting the staffing for long-term support any >> time soon. >> >> however, with the ansible configs (which i plan on moving to the spark >> repo), it should be much easier to replicate the build system. >> >> by this time next year, i would like to have helped find the build system >> a new home, and sunset jenkins. over the past 11 years (i think), this >> system has built spark. it's getting a little tired and needs a well >> deserved break. :) >> >> shane >> -- >> Shane Knapp >> Computer Guy / Voice of Reason >> UC Berkeley EECS Research / RISELab Staff Technical Lead >> https://rise.cs.berkeley.edu >> > > > -- > Shane Knapp > Computer Guy / Voice of Reason > UC Berkeley EECS Research / RISELab Staff Technical Lead > https://rise.cs.berkeley.edu > -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
Re: [build system] IMPORTANT UPDATE
due to scheduling, upcoming holiday and in-the-colo work requirements, all of the centos workers are being wiped NOW. this is great, as the sooner we can get started on fixing builds the better. i'm not going anywhere over the holiday, so i'll get a good head-start on things. thank you jon! shane On Tue, Nov 24, 2020 at 11:24 AM shane knapp β wrote: > this is a lengthy, but important read for everyone here. > > in the next few days, the remaining centos machines (PRB/SBT workers AND > primary) will have be reimaged from centos6.9 to ubuntu 20.04LTS. > > this means three important things on the very near horizon: > 1 -- the PRB and SBT tests WILL BE BROKEN (by thanksgiving) > 2 -- jenkins itself will be down for a while as we move the jenkins > installation to it's new home. > 3 -- those of you with accounts here will temporarily lose access > > regarding (1), brian (cced) will be helping me debug and fix any > system-level bugs (python envs, missing packages, etc). jon (cced) will be > doing the reimaging and cobbling together of hardware to keep us on our > feet. their help is going to be invaluable to getting us back on the > ground. > > we already have two ubuntu 20 workers up and building > (research-jenkins-worker-0[1,2]), and the SparkPullRequestBuilder-K8s build > is already green. i'll keep an eye on these workers to ensure i didn't > miss anything. > > once we have a couple of more ubuntu 20 machines up, i'll move the PRB and > SBT builds there and let them fail as often as possible so we can use the > build logs during the migration of the primary. > > then we shut down jenkins and move to the new primary. > > this will all be happening in the next week to week-and-a-half. > > nearish on the horizon, we need to do two things: > 1 -- reimage the ubuntu 16 workers > 2 -- clean up the all of the breakages within jenkins plugin universe. 
> there's a lot of stacktraces everywhere after the upgrade, but things are > still building so i'm inclined to push this out. > 3 -- fix the PRB/SBT builds. > > further off, once we're stable, we (the spark community) will need to have > an honest conversation about where the build system lives. we don't > currently have enough resources here to manage the system in a way that it > deserves, and i can't forsee getting the staffing for long-term support any > time soon. > > however, with the ansible configs (which i plan on moving to the spark > repo), it should be much easier to replicate the build system. > > by this time next year, i would like to have helped find the build system > a new home, and sunset jenkins. over the past 11 years (i think), this > system has built spark. it's getting a little tired and needs a well > deserved break. :) > > shane > -- > Shane Knapp > Computer Guy / Voice of Reason > UC Berkeley EECS Research / RISELab Staff Technical Lead > https://rise.cs.berkeley.edu > -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
Re: jenkins downtime tomorrow evening/weekend
i just added it to the PRB config. On Tue, Nov 24, 2020 at 2:12 AM Yuming Wang wrote: > Hi Shane, > > Did you set :export LANG=en_US.UTF-8? Some test seems failed because of > this issue: > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131631/testReport/ > > Please see https://issues.apache.org/jira/browse/SPARK-27177 for more > details. > > On Tue, Nov 24, 2020 at 8:23 AM shane knapp β wrote: > >> it seems that the plugin upgrade went as smoothly as it could have... i >> still have a bunch of stack traces to filter through and see if anything is >> really broken but it's looking pretty good and things are building. >> >> if you see any bad behavior from jenkins, don't hesitate to file a jira >> and ping me here. >> >> also, my backlog of things i need to install will be addressed this >> week. the ansible is coming along nicely! >> >> On Mon, Nov 23, 2020 at 2:11 PM shane knapp β >> wrote: >> >>> the third most terrifying event in the world, a massive jenkins plugin >>> update is happening in a couple of hours. i'm going to restart jenkins and >>> start working out any bugs/issues that pop up. >>> >>> this could be short, or quite long. i'm guessing somewhere in the >>> middle. no new builds will be kicked off starting now. >>> >>> in parallel, i'm about to start porting my ansible to ubuntu 20 and >>> testing that on two freshly reinstalled workers. the ultimate goal is to >>> get the PRB running on ubuntu 20... the sbt tests will also likely be >>> broken as i've never been able to work on ubuntu 16, 18 or 20. >>> >>> shane >>> >>> On Sat, Nov 21, 2020 at 4:23 PM shane knapp β >>> wrote: >>> >>>> somehow that went pretty smoothly, tho i've got a bunch of plugins to >>>> deal with... we're back up and building w/a shiny new UI. 
:) >>>> >>>> On Sat, Nov 21, 2020 at 3:52 PM shane knapp β >>>> wrote: >>>> >>>>> this is starting now >>>>> >>>>> On Thu, Nov 19, 2020 at 4:34 PM shane knapp β >>>>> wrote: >>>>> >>>>>> i'm going to be upgrading jenkins to something more reasonable, and >>>>>> there will definitely be some downtime as i get things sorted. >>>>>> >>>>>> we should be back up and building by monday. >>>>>> >>>>>> shane >>>>>> -- >>>>>> Shane Knapp >>>>>> Computer Guy / Voice of Reason >>>>>> UC Berkeley EECS Research / RISELab Staff Technical Lead >>>>>> https://rise.cs.berkeley.edu >>>>>> >>>>> >>>>> >>>>> -- >>>>> Shane Knapp >>>>> Computer Guy / Voice of Reason >>>>> UC Berkeley EECS Research / RISELab Staff Technical Lead >>>>> https://rise.cs.berkeley.edu >>>>> >>>> >>>> >>>> -- >>>> Shane Knapp >>>> Computer Guy / Voice of Reason >>>> UC Berkeley EECS Research / RISELab Staff Technical Lead >>>> https://rise.cs.berkeley.edu >>>> >>> >>> >>> -- >>> Shane Knapp >>> Computer Guy / Voice of Reason >>> UC Berkeley EECS Research / RISELab Staff Technical Lead >>> https://rise.cs.berkeley.edu >>> >> >> >> -- >> Shane Knapp >> Computer Guy / Voice of Reason >> UC Berkeley EECS Research / RISELab Staff Technical Lead >> https://rise.cs.berkeley.edu >> > -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
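the locale fix discussed above (SPARK-27177) can be sketched as a plain environment export; where exactly it lands in the PRB job config is build-system-specific, so this is just a minimal, self-contained illustration of the export itself:

```shell
# sketch: export a UTF-8 locale before the test run so tests that depend
# on UTF-8 string handling behave the same on every worker (SPARK-27177).
# in jenkins this would go in the job's environment/shell step.
export LANG=en_US.UTF-8
export LC_ALL=en_US.UTF-8
echo "LANG=$LANG"
```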
[build system] IMPORTANT UPDATE
this is a lengthy, but important read for everyone here.

in the next few days, the remaining centos machines (PRB/SBT workers AND primary) will have to be reimaged from centos6.9 to ubuntu 20.04LTS.

this means three important things on the very near horizon:
1 -- the PRB and SBT tests WILL BE BROKEN (by thanksgiving)
2 -- jenkins itself will be down for a while as we move the jenkins installation to its new home.
3 -- those of you with accounts here will temporarily lose access

regarding (1), brian (cced) will be helping me debug and fix any system-level bugs (python envs, missing packages, etc). jon (cced) will be doing the reimaging and cobbling together of hardware to keep us on our feet. their help is going to be invaluable in getting us back up and running.

we already have two ubuntu 20 workers up and building (research-jenkins-worker-0[1,2]), and the SparkPullRequestBuilder-K8s build is already green. i'll keep an eye on these workers to ensure i didn't miss anything.

once we have a couple more ubuntu 20 machines up, i'll move the PRB and SBT builds there and let them fail as often as possible so we can use the build logs during the migration of the primary.

then we shut down jenkins and move to the new primary. this will all be happening in the next week to week-and-a-half.

nearish on the horizon, we need to do three things:
1 -- reimage the ubuntu 16 workers
2 -- clean up all of the breakages within the jenkins plugin universe. there's a lot of stacktraces everywhere after the upgrade, but things are still building so i'm inclined to push this out.
3 -- fix the PRB/SBT builds.

further off, once we're stable, we (the spark community) will need to have an honest conversation about where the build system lives. we don't currently have enough resources here to manage the system in a way that it deserves, and i can't foresee getting the staffing for long-term support any time soon.

however, with the ansible configs (which i plan on moving to the spark repo), it should be much easier to replicate the build system.

by this time next year, i would like to have helped find the build system a new home, and sunset jenkins. over the past 11 years (i think), this system has built spark. it's getting a little tired and needs a well deserved break. :)

shane
--
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu
Re: jenkins downtime tomorrow evening/weekend
it seems that the plugin upgrade went as smoothly as it could have... i still have a bunch of stack traces to filter through and see if anything is really broken but it's looking pretty good and things are building. if you see any bad behavior from jenkins, don't hesitate to file a jira and ping me here. also, my backlog of things i need to install will be addressed this week. the ansible is coming along nicely! On Mon, Nov 23, 2020 at 2:11 PM shane knapp β wrote: > the third most terrifying event in the world, a massive jenkins plugin > update is happening in a couple of hours. i'm going to restart jenkins and > start working out any bugs/issues that pop up. > > this could be short, or quite long. i'm guessing somewhere in the > middle. no new builds will be kicked off starting now. > > in parallel, i'm about to start porting my ansible to ubuntu 20 and > testing that on two freshly reinstalled workers. the ultimate goal is to > get the PRB running on ubuntu 20... the sbt tests will also likely be > broken as i've never been able to work on ubuntu 16, 18 or 20. > > shane > > On Sat, Nov 21, 2020 at 4:23 PM shane knapp β wrote: > >> somehow that went pretty smoothly, tho i've got a bunch of plugins to >> deal with... we're back up and building w/a shiny new UI. :) >> >> On Sat, Nov 21, 2020 at 3:52 PM shane knapp β >> wrote: >> >>> this is starting now >>> >>> On Thu, Nov 19, 2020 at 4:34 PM shane knapp β >>> wrote: >>> >>>> i'm going to be upgrading jenkins to something more reasonable, and >>>> there will definitely be some downtime as i get things sorted. >>>> >>>> we should be back up and building by monday. 
>>>> >>>> shane >>>> -- >>>> Shane Knapp >>>> Computer Guy / Voice of Reason >>>> UC Berkeley EECS Research / RISELab Staff Technical Lead >>>> https://rise.cs.berkeley.edu >>>> >>> >>> >>> -- >>> Shane Knapp >>> Computer Guy / Voice of Reason >>> UC Berkeley EECS Research / RISELab Staff Technical Lead >>> https://rise.cs.berkeley.edu >>> >> >> >> -- >> Shane Knapp >> Computer Guy / Voice of Reason >> UC Berkeley EECS Research / RISELab Staff Technical Lead >> https://rise.cs.berkeley.edu >> > > > -- > Shane Knapp > Computer Guy / Voice of Reason > UC Berkeley EECS Research / RISELab Staff Technical Lead > https://rise.cs.berkeley.edu > -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
Re: jenkins downtime tomorrow evening/weekend
the third most terrifying event in the world, a massive jenkins plugin update is happening in a couple of hours. i'm going to restart jenkins and start working out any bugs/issues that pop up. this could be short, or quite long. i'm guessing somewhere in the middle. no new builds will be kicked off starting now. in parallel, i'm about to start porting my ansible to ubuntu 20 and testing that on two freshly reinstalled workers. the ultimate goal is to get the PRB running on ubuntu 20... the sbt tests will also likely be broken as i've never been able to work on ubuntu 16, 18 or 20. shane On Sat, Nov 21, 2020 at 4:23 PM shane knapp β wrote: > somehow that went pretty smoothly, tho i've got a bunch of plugins to deal > with... we're back up and building w/a shiny new UI. :) > > On Sat, Nov 21, 2020 at 3:52 PM shane knapp β wrote: > >> this is starting now >> >> On Thu, Nov 19, 2020 at 4:34 PM shane knapp β >> wrote: >> >>> i'm going to be upgrading jenkins to something more reasonable, and >>> there will definitely be some downtime as i get things sorted. >>> >>> we should be back up and building by monday. >>> >>> shane >>> -- >>> Shane Knapp >>> Computer Guy / Voice of Reason >>> UC Berkeley EECS Research / RISELab Staff Technical Lead >>> https://rise.cs.berkeley.edu >>> >> >> >> -- >> Shane Knapp >> Computer Guy / Voice of Reason >> UC Berkeley EECS Research / RISELab Staff Technical Lead >> https://rise.cs.berkeley.edu >> > > > -- > Shane Knapp > Computer Guy / Voice of Reason > UC Berkeley EECS Research / RISELab Staff Technical Lead > https://rise.cs.berkeley.edu > -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
Re: jenkins downtime tomorrow evening/weekend
somehow that went pretty smoothly, tho i've got a bunch of plugins to deal with... we're back up and building w/a shiny new UI. :) On Sat, Nov 21, 2020 at 3:52 PM shane knapp wrote: > this is starting now > > On Thu, Nov 19, 2020 at 4:34 PM shane knapp wrote: > >> i'm going to be upgrading jenkins to something more reasonable, and there >> will definitely be some downtime as i get things sorted. >> >> we should be back up and building by monday. >> >> shane >> -- >> Shane Knapp >> Computer Guy / Voice of Reason >> UC Berkeley EECS Research / RISELab Staff Technical Lead >> https://rise.cs.berkeley.edu >> > > > -- > Shane Knapp > Computer Guy / Voice of Reason > UC Berkeley EECS Research / RISELab Staff Technical Lead > https://rise.cs.berkeley.edu > -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
Re: jenkins downtime tomorrow evening/weekend
this is starting now On Thu, Nov 19, 2020 at 4:34 PM shane knapp wrote: > i'm going to be upgrading jenkins to something more reasonable, and there > will definitely be some downtime as i get things sorted. > > we should be back up and building by monday. > > shane > -- > Shane Knapp > Computer Guy / Voice of Reason > UC Berkeley EECS Research / RISELab Staff Technical Lead > https://rise.cs.berkeley.edu > -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
jenkins downtime tomorrow evening/weekend
i'm going to be upgrading jenkins to something more reasonable, and there will definitely be some downtime as i get things sorted. we should be back up and building by monday. shane -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
[build system] IMPORTANT: builds will be impacted this month
TL;DR: our build system is ancient, EOLed and about to get hit hard w/a secops hammer. we need to literally reinstall the entire cluster from scratch and get things working.

here are the high level bullet points about what's coming up in the next month:

** all amp-jenkins-worker-* nodes are running centos 6, and the remainder ubuntu 16. these will be upgraded to ubuntu 20. i will be doing this in stages so as to minimize downtime.

** ALL BUILDS NEED TO BE PORTED TO UBUNTU 20. i can ensure that the environments on the nodes are identical, but i have not yet been able to successfully build any SBT jobs on any version of ubuntu, and the MVN builds won't run on ubuntu 18 (tho they work fine on 16). i also have had difficulty getting the PRB job to successfully finish on ubuntu. for this, i will definitely need help from the dev community to get things working... and the speed at which things are fixed will be directly proportional to how much help i get. :)

** the amplab jenkins primary node will need two major upgrades: the OS from centos 6 to ubuntu 20, and jenkins from 1.6 to 2.X LTS... i'm most concerned about this, as it is literally the exact same jenkins installation that patrick wendell set up over 10 years ago. there are many publishing secrets entered into the jenkins config, and i really hope that we don't lose them. my plan here is to upgrade the current jenkins and fix anything that breaks. then we'll rsync jenkins' homedir to the new primary node and hope that works. :)

** user audits: UC berkeley's new security standards require quarterly audits of non-affiliated accounts... this will only impact a few people on this list, but i'll need to work w/campus and our department on solutions other than local accounts on the servers.

a LOT is going to happen, and i'm meeting w/my team today and will come up w/a basic plan. we will definitely experience downtime during this, but i cannot guess as to what that will look like.

this might also be a good time to talk about the future of the build system, auditing our builds (do we need SBT?), or even finally getting around to dockerizing everything so i don't need such a fragile and non-atomic set of worker nodes specifically for spark.

thoughts? comments?

shane

ps -- this is one of the reasons why i haven't been around much lately... it's been really tough keeping things up to date while trying to remotely train up one of my sysadmins to take over some of my build system duties.
--
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu
Re: [build system] jenkins wedged again
everything's up and jenkins is slowly chewing through the queue! :) On Wed, Oct 14, 2020 at 12:00 PM Xiao Li wrote: > Thank you, Shane! > > Xiao > > On Wed, Oct 14, 2020 at 12:00 PM shane knapp > wrote: > >> we're mostly back up, and just waiting for a couple of ubuntu boxes to >> finish booting... prb seem to be building now! >> >> On Wed, Oct 14, 2020 at 11:48 AM shane knapp >> wrote: >> >>> i'm going to reboot the primary and worker nodes, so it'll be a few >>> minutes before everything is back up. >>> >>> shane >>> -- >>> Shane Knapp >>> Computer Guy / Voice of Reason >>> UC Berkeley EECS Research / RISELab Staff Technical Lead >>> https://rise.cs.berkeley.edu >>> >> >> >> -- >> Shane Knapp >> Computer Guy / Voice of Reason >> UC Berkeley EECS Research / RISELab Staff Technical Lead >> https://rise.cs.berkeley.edu >> > > > -- > > -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
Re: [build system] jenkins wedged again
we're mostly back up, and just waiting for a couple of ubuntu boxes to finish booting... prbs seem to be building now! On Wed, Oct 14, 2020 at 11:48 AM shane knapp wrote: > i'm going to reboot the primary and worker nodes, so it'll be a few > minutes before everything is back up. > > shane > -- > Shane Knapp > Computer Guy / Voice of Reason > UC Berkeley EECS Research / RISELab Staff Technical Lead > https://rise.cs.berkeley.edu > -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
[build system] jenkins wedged again
i'm going to reboot the primary and worker nodes, so it'll be a few minutes before everything is back up. shane -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
Re: Running K8s integration tests for changes in core?
just revisiting this thread... re presubmit strategy: i don't think this would be easy to set up... and i'm not sure what benefit it will give us. re inadvertent errors: since we're checking out the same hash from the PR for both builds, and they'll run simultaneously, i don't think it'll be an issue. re overloading the workers: nah. the regular PRB takes ~4hr, and the k8s PRB takes ~30m and runs in parallel. i'll set this up right now and keep an eye on the queue/build results today. shane On Thu, Aug 20, 2020 at 2:28 PM Holden Karau wrote: > Sounds good, thanks for the heads up. I hope you get some time to relax :) > > On Thu, Aug 20, 2020 at 2:26 PM shane knapp β wrote: > >> fyi, i won't be making this change until the 1st week of september. i'll >> be out, off the grid all next week! :) >> >> i will send an announcement out tomorrow on how to contact my team here @ >> uc berkeley if jenkins goes down. >> >> shane >> >> On Thu, Aug 20, 2020 at 4:40 AM Prashant Sharma >> wrote: >> >>> Another option is, if we could have something like "presubmit" PR build. >>> In other words, running the entire 4 H + K8s integration on each commit >>> pushed is too much at the same time and there are chances that one thing >>> can inadvertently affect other components(as you just said). >>> >>> A presubmit(which includes K8s integration tests) build will be run, >>> once the PR receives LGTM from "Approved reviewers". This is one criteria >>> that comes to my mind, others may have better suggestions. >>> >>> On Thu, Aug 20, 2020 at 12:25 AM shane knapp β >>> wrote: >>> >>>> we'll be gated by the number of ubuntu workers w/minikube and docker, >>>> but it shouldn't be too bad as the full integration test takes ~45m, vs 4+ >>>> hrs for the regular PRB. >>>> >>>> i can enable this in about 1m of time if the consensus is for us to >>>> want this. >>>> >>>> On Wed, Aug 19, 2020 at 11:37 AM Holden Karau >>>> wrote: >>>> >>>>> Sounds good. 
In the meantime would folks committing things in core run >>>>> the K8s PRB or run it locally? A second change this morning was committed >>>>> that broke the K8s PR tests. >>>>> >>>>> On Tue, Aug 18, 2020 at 9:53 PM Prashant Sharma >>>>> wrote: >>>>> >>>>>> +1, we should enable. >>>>>> >>>>>> On Wed, Aug 19, 2020 at 9:18 AM Holden Karau >>>>>> wrote: >>>>>> >>>>>>> Hi Dev Folks, >>>>>>> >>>>>>> I was wondering how people feel about enabling the K8s PRB >>>>>>> automatically for all core changes? Sometimes I forget that a change >>>>>>> might >>>>>>> impact one of the K8s integration tests since a bunch of them look at >>>>>>> log >>>>>>> messages. Would folks be OK with turning on the K8s integration PRB for >>>>>>> all >>>>>>> core changes as well as K8s changes? >>>>>>> >>>>>>> Cheers, >>>>>>> >>>>>>> Holden :) >>>>>>> >>>>>>> -- >>>>>>> Twitter: https://twitter.com/holdenkarau >>>>>>> Books (Learning Spark, High Performance Spark, etc.): >>>>>>> https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9> >>>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau >>>>>>> >>>>>> >>>>> >>>>> -- >>>>> Twitter: https://twitter.com/holdenkarau >>>>> Books (Learning Spark, High Performance Spark, etc.): >>>>> https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9> >>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau >>>>> >>>> >>>> >>>> -- >>>> Shane Knapp >>>> Computer Guy / Voice of Reason >>>> UC Berkeley EECS Research / RISELab Staff Technical Lead >>>> https://rise.cs.berkeley.edu >>>> >>> >> >> -- >> Shane Knapp >> Computer Guy / Voice of Reason >> UC Berkeley EECS Research / RISELab Staff Technical Lead >> https://rise.cs.berkeley.edu >> > > > -- > Twitter: https://twitter.com/holdenkarau > Books (Learning Spark, High Performance Spark, etc.): > https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau > -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff 
Technical Lead https://rise.cs.berkeley.edu
Re: [build system] downtime due to SSL cert errors
certs delivered and installed... we're back! On Wed, Sep 23, 2020 at 6:07 PM shane knapp wrote: > jenkins is up and building, but not reachable via https at the moment. > i'm working on getting this sorted ASAP. > > shane > -- > Shane Knapp > Computer Guy / Voice of Reason > UC Berkeley EECS Research / RISELab Staff Technical Lead > https://rise.cs.berkeley.edu > -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
[build system] downtime due to SSL cert errors
jenkins is up and building, but not reachable via https at the moment. i'm working on getting this sorted ASAP. shane -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
[build system] shane out all next week (aug 22-29), support instructions
i will be disappearing off in to the wilderness for a few days of backpacking, and am handing off basic support duties to my team. if, and only if, jenkins goes down, please email research-supp...@cs.berkeley.edu and open a ticket. if you open a ticket, please let dev@ know to minimize the number of tickets opened. :) if there are any other problems, file a JIRA and assign to me. i will look at it in early september. shane -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
Re: Running K8s integration tests for changes in core?
fyi, i won't be making this change until the 1st week of september. i'll be out, off the grid all next week! :) i will send an announcement out tomorrow on how to contact my team here @ uc berkeley if jenkins goes down. shane On Thu, Aug 20, 2020 at 4:40 AM Prashant Sharma wrote: > Another option is, if we could have something like "presubmit" PR build. > In other words, running the entire 4 H + K8s integration on each commit > pushed is too much at the same time and there are chances that one thing > can inadvertently affect other components(as you just said). > > A presubmit(which includes K8s integration tests) build will be run, once > the PR receives LGTM from "Approved reviewers". This is one criteria that > comes to my mind, others may have better suggestions. > > On Thu, Aug 20, 2020 at 12:25 AM shane knapp β > wrote: > >> we'll be gated by the number of ubuntu workers w/minikube and docker, but >> it shouldn't be too bad as the full integration test takes ~45m, vs 4+ hrs >> for the regular PRB. >> >> i can enable this in about 1m of time if the consensus is for us to want >> this. >> >> On Wed, Aug 19, 2020 at 11:37 AM Holden Karau >> wrote: >> >>> Sounds good. In the meantime would folks committing things in core run >>> the K8s PRB or run it locally? A second change this morning was committed >>> that broke the K8s PR tests. >>> >>> On Tue, Aug 18, 2020 at 9:53 PM Prashant Sharma >>> wrote: >>> >>>> +1, we should enable. >>>> >>>> On Wed, Aug 19, 2020 at 9:18 AM Holden Karau >>>> wrote: >>>> >>>>> Hi Dev Folks, >>>>> >>>>> I was wondering how people feel about enabling the K8s PRB >>>>> automatically for all core changes? Sometimes I forget that a change might >>>>> impact one of the K8s integration tests since a bunch of them look at log >>>>> messages. Would folks be OK with turning on the K8s integration PRB for >>>>> all >>>>> core changes as well as K8s changes? 
>>>>> >>>>> Cheers, >>>>> >>>>> Holden :) >>>>> >>>>> -- >>>>> Twitter: https://twitter.com/holdenkarau >>>>> Books (Learning Spark, High Performance Spark, etc.): >>>>> https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9> >>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau >>>>> >>>> >>> >>> -- >>> Twitter: https://twitter.com/holdenkarau >>> Books (Learning Spark, High Performance Spark, etc.): >>> https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9> >>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau >>> >> >> >> -- >> Shane Knapp >> Computer Guy / Voice of Reason >> UC Berkeley EECS Research / RISELab Staff Technical Lead >> https://rise.cs.berkeley.edu >> > -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
Re: Running K8s integration tests for changes in core?
we'll be gated by the number of ubuntu workers w/minikube and docker, but it shouldn't be too bad as the full integration test takes ~45m, vs 4+ hrs for the regular PRB. i can enable this in about 1m of time if the consensus is for us to want this. On Wed, Aug 19, 2020 at 11:37 AM Holden Karau wrote: > Sounds good. In the meantime would folks committing things in core run the > K8s PRB or run it locally? A second change this morning was committed that > broke the K8s PR tests. > > On Tue, Aug 18, 2020 at 9:53 PM Prashant Sharma > wrote: > >> +1, we should enable. >> >> On Wed, Aug 19, 2020 at 9:18 AM Holden Karau >> wrote: >> >>> Hi Dev Folks, >>> >>> I was wondering how people feel about enabling the K8s PRB automatically >>> for all core changes? Sometimes I forget that a change might impact one of >>> the K8s integration tests since a bunch of them look at log messages. Would >>> folks be OK with turning on the K8s integration PRB for all core changes as >>> well as K8s changes? >>> >>> Cheers, >>> >>> Holden :) >>> >>> -- >>> Twitter: https://twitter.com/holdenkarau >>> Books (Learning Spark, High Performance Spark, etc.): >>> https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9> >>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau >>> >> > > -- > Twitter: https://twitter.com/holdenkarau > Books (Learning Spark, High Performance Spark, etc.): > https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau > -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
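as a minimal sketch of what "gated by the number of ubuntu workers w/minikube and docker" implies above, a presence check for the k8s-test prerequisites on a host might look like this. purely illustrative; the actual worker provisioning is done via the ansible configs, and the tool list is taken from the thread:

```shell
# sketch: check that a worker has the tools the k8s integration
# builds need. prints one "found"/"missing" line per tool.
for tool in docker minikube; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: missing"
  fi
done
```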
Re: Running K8s integration tests for changes in core?
yes, i think this is fine. the k8s prb runs concurrently to the regular prb and takes ~20m. On Tue, Aug 18, 2020 at 8:47 PM Holden Karau wrote: > Hi Dev Folks, > > I was wondering how people feel about enabling the K8s PRB automatically > for all core changes? Sometimes I forget that a change might impact one of > the K8s integration tests since a bunch of them look at log messages. Would > folks be OK with turning on the K8s integration PRB for all core changes as > well as K8s changes? > > Cheers, > > Holden :) > > -- > Twitter: https://twitter.com/holdenkarau > Books (Learning Spark, High Performance Spark, etc.): > https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau > -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
[build system] restarting jenkins now
there isn't much activity right now, and i'd like to restart jenkins quickly as it's consuming a lot of memory on the head node. shouldn't be more than a couple of minutes downtime... if something goes awry i'll send an email here. if you don't hear from me again, please carry on. :) -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
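a minimal sketch of the kind of check behind "consuming a lot of memory on the head node": inspect a process's resident set size before deciding a restart is warranted. it reads this shell's own pid so it is self-contained; for jenkins you'd first look the pid up (e.g. with pgrep), which is an assumption about the setup:

```shell
# sketch: report a process's resident memory in KB via ps.
# $$ (this shell) is used only so the example runs anywhere.
pid=$$
rss_kb=$(ps -o rss= -p "$pid" | tr -d ' ')
echo "pid $pid is using ${rss_kb} KB resident"
```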
Re: R installation broken on ubuntu workers, impacts K8s PRB builds
this is done, except for amp-jenkins-staging-worker-02 which is refusing to allow me to reinstall R... i marked that worker offline and will beat on it later today. On Fri, Jul 17, 2020 at 11:36 AM shane knapp β wrote: > starting now... pausing jenkins so no new builds are launched. > > On Thu, Jul 16, 2020 at 3:09 PM Holden Karau wrote: > >> Sounds good, thanks. No rush :) >> >> On Thu, Jul 16, 2020 at 3:03 PM shane knapp β >> wrote: >> >>> i'll get to this tomorrow afternoon, and there will be a short >>> downtime. more details to come. >>> >>> On Wed, Jul 15, 2020 at 12:17 PM Holden Karau >>> wrote: >>> >>>> Oh cool, I filed a JIRA for this already and assigned it to you >>>> (noticed in one of my PRs)- >>>> https://issues.apache.org/jira/browse/SPARK-32326 >>>> >>>> On Wed, Jul 15, 2020 at 12:09 PM shane knapp β >>>> wrote: >>>> >>>>> i'm not entirely sure when the dep for R got bumped to 3.5+, but it's >>>>> breaking the k8s builds. >>>>> >>>>> i'll need to purge these workers of all previous versions of R + >>>>> packages, then reinstall from scratch. this isn't a horrible task as i >>>>> have most of it automated but it will still require a ~few hours of >>>>> downtime. >>>>> >>>>> i'll file a JIRA, and figure out when i will be able to get to >>>>> this... possibly this afternoon. 
>>>>> -- >>>>> Shane Knapp >>>>> Computer Guy / Voice of Reason >>>>> UC Berkeley EECS Research / RISELab Staff Technical Lead >>>>> https://rise.cs.berkeley.edu >>>>> >>>> >>>> >>>> -- >>>> Twitter: https://twitter.com/holdenkarau >>>> Books (Learning Spark, High Performance Spark, etc.): >>>> https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9> >>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau >>>> >>> >>> >>> -- >>> Shane Knapp >>> Computer Guy / Voice of Reason >>> UC Berkeley EECS Research / RISELab Staff Technical Lead >>> https://rise.cs.berkeley.edu >>> >> -- >> Twitter: https://twitter.com/holdenkarau >> Books (Learning Spark, High Performance Spark, etc.): >> https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9> >> YouTube Live Streams: https://www.youtube.com/user/holdenkarau >> > > > -- > Shane Knapp > Computer Guy / Voice of Reason > UC Berkeley EECS Research / RISELab Staff Technical Lead > https://rise.cs.berkeley.edu > -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
Re: R installation broken on ubuntu workers, impacts K8s PRB builds
starting now... pausing jenkins so no new builds are launched. On Thu, Jul 16, 2020 at 3:09 PM Holden Karau wrote: > Sounds good, thanks. No rush :) > > On Thu, Jul 16, 2020 at 3:03 PM shane knapp β wrote: > >> i'll get to this tomorrow afternoon, and there will be a short downtime. >> more details to come. >> >> On Wed, Jul 15, 2020 at 12:17 PM Holden Karau >> wrote: >> >>> Oh cool, I filed a JIRA for this already and assigned it to you (noticed >>> in one of my PRs)- https://issues.apache.org/jira/browse/SPARK-32326 >>> >>> On Wed, Jul 15, 2020 at 12:09 PM shane knapp β >>> wrote: >>> >>>> i'm not entirely sure when the dep for R got bumped to 3.5+, but it's >>>> breaking the k8s builds. >>>> >>>> i'll need to purge these workers of all previous versions of R + >>>> packages, then reinstall from scratch. this isn't a horrible task as i >>>> have most of it automated but it will still require a ~few hours of >>>> downtime. >>>> >>>> i'll file a JIRA, and figure out when i will be able to get to this... >>>> possibly this afternoon. >>>> -- >>>> Shane Knapp >>>> Computer Guy / Voice of Reason >>>> UC Berkeley EECS Research / RISELab Staff Technical Lead >>>> https://rise.cs.berkeley.edu >>>> >>> >>> >>> -- >>> Twitter: https://twitter.com/holdenkarau >>> Books (Learning Spark, High Performance Spark, etc.): >>> https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9> >>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau >>> >> >> >> -- >> Shane Knapp >> Computer Guy / Voice of Reason >> UC Berkeley EECS Research / RISELab Staff Technical Lead >> https://rise.cs.berkeley.edu >> > -- > Twitter: https://twitter.com/holdenkarau > Books (Learning Spark, High Performance Spark, etc.): > https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau > -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
Re: R installation broken on ubuntu workers, impacts K8s PRB builds
i'll get to this tomorrow afternoon, and there will be a short downtime. more details to come. On Wed, Jul 15, 2020 at 12:17 PM Holden Karau wrote: > Oh cool, I filed a JIRA for this already and assigned it to you (noticed > in one of my PRs)- https://issues.apache.org/jira/browse/SPARK-32326 > > On Wed, Jul 15, 2020 at 12:09 PM shane knapp β > wrote: > >> i'm not entirely sure when the dep for R got bumped to 3.5+, but it's >> breaking the k8s builds. >> >> i'll need to purge these workers of all previous versions of R + >> packages, then reinstall from scratch. this isn't a horrible task as i >> have most of it automated but it will still require a ~few hours of >> downtime. >> >> i'll file a JIRA, and figure out when i will be able to get to this... >> possibly this afternoon. >> -- >> Shane Knapp >> Computer Guy / Voice of Reason >> UC Berkeley EECS Research / RISELab Staff Technical Lead >> https://rise.cs.berkeley.edu >> > > > -- > Twitter: https://twitter.com/holdenkarau > Books (Learning Spark, High Performance Spark, etc.): > https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau > -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
R installation broken on ubuntu workers, impacts K8s PRB builds
i'm not entirely sure when the R dependency got bumped to 3.5+, but it's breaking the k8s builds. i'll need to purge these workers of all previous versions of R + packages, then reinstall from scratch. this isn't a horrible task as i have most of it automated, but it will still require a few hours of downtime. i'll file a JIRA and figure out when i will be able to get to this... possibly this afternoon. -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
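As an illustration of the 3.5+ floor discussed in this thread, a build script can guard against a too-old R before the k8s jobs ever run. This is a hypothetical sketch, not the actual Jenkins tooling: the `version_ge` helper and the `Rscript` probe are assumptions.

```shell
# version_ge A B: succeed if version string A >= B (relies on GNU `sort -V`).
version_ge() {
  [ "$(printf '%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical guard: the 3.5+ requirement comes from the thread; the
# probe command and messages are illustrative.
R_REQUIRED=3.5.0
R_CURRENT=$(Rscript -e 'cat(R.version$major, R.version$minor, sep=".")' 2>/dev/null || echo 0.0)
if version_ge "$R_CURRENT" "$R_REQUIRED"; then
  echo "R $R_CURRENT satisfies >= $R_REQUIRED"
else
  echo "R $R_CURRENT is too old or missing; reinstall >= $R_REQUIRED"
fi
```

Failing fast like this turns a mid-build k8s failure into an immediate, readable error on the worker.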
Re: [DISCUSS] Drop Python 2, 3.4 and 3.5
this is seriously great news! let's all take a moment and welcome apache spark's python support to the present. ;) On Mon, Jul 13, 2020 at 7:26 PM Holden Karau wrote: > Awesome, thank you for driving this forward :) > > On Mon, Jul 13, 2020 at 7:25 PM Hyukjin Kwon wrote: > >> Thank you all. Python 2, 3.4 and 3.5 are dropped now in the master branch >> at https://github.com/apache/spark/pull/28957 >> >> On Fri, Jul 3, 2020 at 10:01 AM, Hyukjin Kwon wrote: >> >>> Thanks Dongjoon. That makes much more sense now! >>> >>> On Fri, Jul 3, 2020 at 12:11 AM, Dongjoon Hyun wrote: >>> >>>> Thank you, Hyukjin. >>>> >>>> According to the Python community, Python 3.5 also reaches EOL on 2020-09-13 >>>> (only two months left). >>>> >>>> - https://www.python.org/downloads/ >>>> >>>> So, targeting live Python versions at Apache Spark 3.1.0 (December >>>> 2020) looks reasonable to me. >>>> >>>> For old Python versions, we still have Apache Spark 2.4 LTS, and Apache >>>> Spark 3.0.x will also work. >>>> >>>> Bests, >>>> Dongjoon. >>>> >>>> >>>> On Wed, Jul 1, 2020 at 10:50 PM Yuanjian Li >>>> wrote: >>>> >>>>> +1, especially Python 2 >>>>> >>>>> On Thu, Jul 2, 2020 at 10:20 PM, Holden Karau wrote: >>>>> >>>>>> I'm ok with us dropping Python 2, 3.4, and 3.5 from Spark 3.1 forward. >>>>>> It will be exciting to get to use more recent Python features. The most >>>>>> recent Ubuntu LTS ships with 3.7, and while the previous LTS ships with >>>>>> 3.5, if folks really can't upgrade there's conda. >>>>>> >>>>>> Is there anyone with a large Python 3.5 fleet who can't use conda? >>>>>> >>>>>> On Wed, Jul 1, 2020 at 7:15 PM Hyukjin Kwon >>>>>> wrote: >>>>>> >>>>>>> Yeah, sure. It will be dropped from Spark 3.1 onwards. I don't think >>>>>>> we should make such changes in maintenance releases. >>>>>>> >>>>>>> On Thu, Jul 2, 2020 at 11:13 AM, Holden Karau wrote: >>>>>>> >>>>>>>> To be clear, the plan is to drop them from Spark 3.1 onwards, yes? >>>>>>>> >>>>>>>> On Wed, Jul 1, 2020 at 7:11 PM Hyukjin Kwon >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Hi all, >>>>>>>>> >>>>>>>>> I would like to discuss dropping the deprecated Python versions 2, 3.4 >>>>>>>>> and 3.5 at https://github.com/apache/spark/pull/28957. I assume >>>>>>>>> people support it in general, >>>>>>>>> but I am writing this to make sure everybody is happy. >>>>>>>>> >>>>>>>>> Fokko made a very good investigation on it, see >>>>>>>>> https://github.com/apache/spark/pull/28957#issuecomment-652022449. >>>>>>>>> Judging from the statistics, I think we're pretty safe to drop >>>>>>>>> them. >>>>>>>>> Also note that dropping Python 2 was actually declared at >>>>>>>>> https://python3statement.org/ >>>>>>>>> >>>>>>>>> Roughly speaking, dropping them brings several major advantages: >>>>>>>>> 1. It removes a bunch of hacks, around 700 lines in >>>>>>>>> PySpark. >>>>>>>>> 2. PyPy2 has a critical bug that causes a flaky test, >>>>>>>>> https://issues.apache.org/jira/browse/SPARK-28358, given my >>>>>>>>> testing and investigation. >>>>>>>>> 3. Users can use Python type hints with Pandas UDFs without >>>>>>>>> thinking about the Python version. >>>>>>>>> 4. Users can leverage the latest cloudpickle, >>>>>>>>> https://github.com/apache/spark/pull/28950. With Python 3.8+ it >>>>>>>>> can also leverage the C pickle. >>>>>>>>> 5. ... >>>>>>>>> >>>>>>>>> So it benefits both users and devs. WDYT? >>>>>>>>> >>>>>>>>> >>>>>>>>> -- >>>>>>>> Twitter: https://twitter.com/holdenkarau >>>>>>>> Books (Learning Spark, High Performance Spark, etc.): >>>>>>>> https://amzn.to/2MaRAG9 >>>>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau >>>>>>>> -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
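The version floor that falls out of this change (3.6+) is simple to enforce up front. A minimal sketch, assuming a `python3` on the PATH; the `python_ok` helper name is made up and this is not Spark's actual check.

```shell
# python_ok INTERPRETER: succeed if the interpreter exists and is >= 3.6,
# i.e. none of the dropped versions (2.x, 3.4, 3.5).
python_ok() {
  "$1" -c 'import sys; sys.exit(0 if sys.version_info >= (3, 6) else 1)' 2>/dev/null
}

if python_ok python3; then
  echo "python3 is new enough for Spark 3.1"
else
  echo "python3 is missing or older than 3.6"
fi
```

A guard like this replaces the per-version branching that the PR removes: one check up front instead of hacks scattered through the codebase.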
Re: Welcoming some new Apache Spark committers
welcome, all! On Tue, Jul 14, 2020 at 10:37 AM Matei Zaharia wrote: > Hi all, > > The Spark PMC recently voted to add several new committers. Please join me > in welcoming them to their new roles! The new committers are: > > - Huaxin Gao > - Jungtaek Lim > - Dilip Biswal > > All three of them contributed to Spark 3.0 and weβre excited to have them > join the project. > > Matei and the Spark PMC > - > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > > -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
Re: restarting jenkins build system tomorrow (7/8) ~930am PDT
alright, the system load graphs show that we've had a generally decreasing load since friday, and have burned through ~3k builds/day since the reboot last week! i don't see many timeouts, and the PRB builds have been generally green for a couple of days. again, i will keep an eye on things but i feel we're out of the woods right now. :) shane On Fri, Jul 10, 2020 at 3:43 PM Frank Yin wrote: > Great. Thanks. > > On Fri, Jul 10, 2020 at 3:39 PM shane knapp β wrote: > >> no, 8 hours is plenty. things will speed up soon once the backlog of >> builds works through i limited the number of PRB builds to 4 per >> worker, and things are looking better. let's see how we look next week. >> >> On Fri, Jul 10, 2020 at 3:31 PM Frank Yin wrote: >> >>> Can we also increase the build timeout? >>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125617 >>> This one fails because it times out, not because of test failures. >>> >>> On Fri, Jul 10, 2020 at 2:16 PM Frank Yin wrote: >>> >>>> Yeah, that's what I figured -- those workers are under load. Thanks. >>>> >>>> On Fri, Jul 10, 2020 at 12:43 PM shane knapp β >>>> wrote: >>>> >>>>> only 125561, 125562 and 125564 were impacted by -9. >>>>> >>>>> 125565 exited w/a code of 15 (143 - 128), which means the process was >>>>> terminated for unknown reasons. >>>>> >>>>> 125563 looks like mima failed due to a bunch of errors. >>>>> >>>>> i just spot checked a bunch of recent failed PRB builds from today and >>>>> they all seemed to be legit. >>>>> >>>>> another thing that might be happening is an overload of PRB builds on >>>>> the workers due to the backlog... the workers are under a LOT of load >>>>> right now, and i can put some rate limiting in to see if that helps out. >>>>> >>>>> shane >>>>> >>>>> On Fri, Jul 10, 2020 at 11:31 AM Frank Yin >>>>> wrote: >>>>> >>>>>> Like from build number 125565 to 125561, all impacted by kill -9. 
>>>>>> >>>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125565/console >>>>>> >>>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125564/console >>>>>> >>>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125563/console >>>>>> >>>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125562/console >>>>>> >>>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125561/console >>>>>> >>>>>> On Fri, Jul 10, 2020 at 9:35 AM shane knapp β >>>>>> wrote: >>>>>> >>>>>>> define "a lot" and provide some links to those builds, please. >>>>>>> there are roughly 2000 builds per day, and i can't do more than keep a >>>>>>> cursory eye on things. >>>>>>> >>>>>>> the infrastructure that the tests run on hasn't changed one bit on >>>>>>> any of the workers, and 'kill -9' could be a timeout, flakiness caused >>>>>>> by >>>>>>> old build processes remaining on the workers after the master went >>>>>>> down, or >>>>>>> me trying to clean things up w/o a reboot. or, perhaps, something wrong >>>>>>> w/the infra. :) >>>>>>> >>>>>>> On Fri, Jul 10, 2020 at 9:28 AM Frank Yin >>>>>>> wrote: >>>>>>> >>>>>>>> Agree, but Iβve seen a lot of kill by signal 9, assuming that >>>>>>>> infrastructure? >>>>>>>> >>>>>>>> On Fri, Jul 10, 2020 at 8:19 AM shane knapp β >>>>>>>> wrote: >>>>>>>> >>>>>>>>> yeah, i can't do much for flaky tests... just flaky >>>>>>>>> infrastructure. >>>>>>>>> >>>>>>>>> >>>>>>>>> On Fri, Jul 10, 2020 at 12:41 AM Hyukjin Kwon >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Couple of flaky tests c
Re: restarting jenkins build system tomorrow (7/8) ~930am PDT
no, 8 hours is plenty. things will speed up soon once the backlog of builds works through i limited the number of PRB builds to 4 per worker, and things are looking better. let's see how we look next week. On Fri, Jul 10, 2020 at 3:31 PM Frank Yin wrote: > Can we also increase the build timeout? > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125617 > This one fails because it times out, not because of test failures. > > On Fri, Jul 10, 2020 at 2:16 PM Frank Yin wrote: > >> Yeah, that's what I figured -- those workers are under load. Thanks. >> >> On Fri, Jul 10, 2020 at 12:43 PM shane knapp β >> wrote: >> >>> only 125561, 125562 and 125564 were impacted by -9. >>> >>> 125565 exited w/a code of 15 (143 - 128), which means the process was >>> terminated for unknown reasons. >>> >>> 125563 looks like mima failed due to a bunch of errors. >>> >>> i just spot checked a bunch of recent failed PRB builds from today and >>> they all seemed to be legit. >>> >>> another thing that might be happening is an overload of PRB builds on >>> the workers due to the backlog... the workers are under a LOT of load >>> right now, and i can put some rate limiting in to see if that helps out. >>> >>> shane >>> >>> On Fri, Jul 10, 2020 at 11:31 AM Frank Yin wrote: >>> >>>> Like from build number 125565 to 125561, all impacted by kill -9. >>>> >>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125565/console >>>> >>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125564/console >>>> >>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125563/console >>>> >>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125562/console >>>> >>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125561/console >>>> >>>> On Fri, Jul 10, 2020 at 9:35 AM shane knapp β >>>> wrote: >>>> >>>>> define "a lot" and provide some links to those builds, please. 
there >>>>> are roughly 2000 builds per day, and i can't do more than keep a cursory >>>>> eye on things. >>>>> >>>>> the infrastructure that the tests run on hasn't changed one bit on any >>>>> of the workers, and 'kill -9' could be a timeout, flakiness caused by old >>>>> build processes remaining on the workers after the master went down, or me >>>>> trying to clean things up w/o a reboot. or, perhaps, something wrong >>>>> w/the >>>>> infra. :) >>>>> >>>>> On Fri, Jul 10, 2020 at 9:28 AM Frank Yin wrote: >>>>> >>>>>> Agree, but Iβve seen a lot of kill by signal 9, assuming that >>>>>> infrastructure? >>>>>> >>>>>> On Fri, Jul 10, 2020 at 8:19 AM shane knapp β >>>>>> wrote: >>>>>> >>>>>>> yeah, i can't do much for flaky tests... just flaky infrastructure. >>>>>>> >>>>>>> >>>>>>> On Fri, Jul 10, 2020 at 12:41 AM Hyukjin Kwon >>>>>>> wrote: >>>>>>> >>>>>>>> Couple of flaky tests can happen. It's usual. Seems it got better >>>>>>>> now at least. I will keep monitoring the builds. >>>>>>>> >>>>>>>> 2020λ 7μ 10μΌ (κΈ) μ€ν 4:33, ukby1234 λμ΄ μμ±: >>>>>>>> >>>>>>>>> Looks like Jenkins isn't stable still. 
My PR fails two times in a >>>>>>>>> row: >>>>>>>>> >>>>>>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125565/console >>>>>>>>> >>>>>>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125536/testReport >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> -- >>>>>>>>> Sent from: >>>>>>>>> http://apache-spark-developers-list.1001551.n3.nabble.com/ >>>>>>>>> >>>>>>>>> >>>>>>>>> - >>>>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >>>>>>>>> >>>>>>>>> >>>>>>> >>>>>>> -- >>>>>>> Shane Knapp >>>>>>> Computer Guy / Voice of Reason >>>>>>> UC Berkeley EECS Research / RISELab Staff Technical Lead >>>>>>> https://rise.cs.berkeley.edu >>>>>>> >>>>>> >>>>> >>>>> -- >>>>> Shane Knapp >>>>> Computer Guy / Voice of Reason >>>>> UC Berkeley EECS Research / RISELab Staff Technical Lead >>>>> https://rise.cs.berkeley.edu >>>>> >>>> >>> >>> -- >>> Shane Knapp >>> Computer Guy / Voice of Reason >>> UC Berkeley EECS Research / RISELab Staff Technical Lead >>> https://rise.cs.berkeley.edu >>> >> -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
Re: restarting jenkins build system tomorrow (7/8) ~930am PDT
only 125561, 125562 and 125564 were impacted by -9. 125565 exited with status 143 (128 + 15), which means the process was terminated by SIGTERM (signal 15). 125563 looks like mima failed due to a bunch of errors. i just spot checked a bunch of recent failed PRB builds from today and they all seemed to be legit. another thing that might be happening is an overload of PRB builds on the workers due to the backlog... the workers are under a LOT of load right now, and i can put some rate limiting in to see if that helps out. shane On Fri, Jul 10, 2020 at 11:31 AM Frank Yin wrote: > Like from build number 125565 to 125561, all impacted by kill -9. > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125565/console > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125564/console > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125563/console > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125562/console > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125561/console > > On Fri, Jul 10, 2020 at 9:35 AM shane knapp wrote: > >> define "a lot" and provide some links to those builds, please. there are >> roughly 2000 builds per day, and i can't do more than keep a cursory eye on >> things. >> >> the infrastructure that the tests run on hasn't changed one bit on any of >> the workers, and 'kill -9' could be a timeout, flakiness caused by old >> build processes remaining on the workers after the master went down, or me >> trying to clean things up w/o a reboot. or, perhaps, something wrong w/the >> infra. :) >> >> On Fri, Jul 10, 2020 at 9:28 AM Frank Yin wrote: >> >>> Agree, but I've seen a lot of kills by signal 9; assuming that's >>> infrastructure? >>> >>> On Fri, Jul 10, 2020 at 8:19 AM shane knapp >>> wrote: >>> >>>> yeah, i can't do much for flaky tests... just flaky infrastructure.
>>>> >>>> >>>> On Fri, Jul 10, 2020 at 12:41 AM Hyukjin Kwon >>>> wrote: >>>> >>>>> Couple of flaky tests can happen. It's usual. Seems it got better now >>>>> at least. I will keep monitoring the builds. >>>>> >>>>> 2020λ 7μ 10μΌ (κΈ) μ€ν 4:33, ukby1234 λμ΄ μμ±: >>>>> >>>>>> Looks like Jenkins isn't stable still. My PR fails two times in a row: >>>>>> >>>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125565/console >>>>>> >>>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125536/testReport >>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/ >>>>>> >>>>>> - >>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >>>>>> >>>>>> >>>> >>>> -- >>>> Shane Knapp >>>> Computer Guy / Voice of Reason >>>> UC Berkeley EECS Research / RISELab Staff Technical Lead >>>> https://rise.cs.berkeley.edu >>>> >>> >> >> -- >> Shane Knapp >> Computer Guy / Voice of Reason >> UC Berkeley EECS Research / RISELab Staff Technical Lead >> https://rise.cs.berkeley.edu >> > -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
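For readers unfamiliar with the status codes discussed in this thread: a POSIX shell encodes death-by-signal as 128 plus the signal number, so 143 is SIGTERM (15) and 137 is SIGKILL (9, i.e. `kill -9`). A quick demonstration, independent of Jenkins:

```shell
# A shell reports a child killed by signal N as exit status 128 + N.

# SIGTERM is signal 15: status 128 + 15 = 143.
sh -c 'kill -TERM $$' 2>/dev/null && term_status=$? || term_status=$?
echo "SIGTERM exit status: $term_status"   # 143

# SIGKILL is signal 9: status 128 + 9 = 137, the classic signature of
# 'kill -9' (or the kernel oom-killer reaping a runaway build).
sh -c 'kill -KILL $$' 2>/dev/null && kill_status=$? || kill_status=$?
echo "SIGKILL exit status: $kill_status"   # 137
```

The `cmd && s=$? || s=$?` pattern captures the status without tripping `set -e` in a build script.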
Re: restarting jenkins build system tomorrow (7/8) ~930am PDT
define "a lot" and provide some links to those builds, please. there are roughly 2000 builds per day, and i can't do more than keep a cursory eye on things. the infrastructure that the tests run on hasn't changed one bit on any of the workers, and 'kill -9' could be a timeout, flakiness caused by old build processes remaining on the workers after the master went down, or me trying to clean things up w/o a reboot. or, perhaps, something wrong w/the infra. :) On Fri, Jul 10, 2020 at 9:28 AM Frank Yin wrote: > Agree, but I've seen a lot of kills by signal 9; assuming that's > infrastructure? > > On Fri, Jul 10, 2020 at 8:19 AM shane knapp wrote: > >> yeah, i can't do much for flaky tests... just flaky infrastructure. >> >> >> On Fri, Jul 10, 2020 at 12:41 AM Hyukjin Kwon >> wrote: >> >>> Couple of flaky tests can happen. It's usual. Seems it got better now at >>> least. I will keep monitoring the builds. >>> >>> On Fri, Jul 10, 2020 at 4:33 PM, ukby1234 wrote: >>> >>>> Looks like Jenkins isn't stable still. My PR fails two times in a row: >>>> >>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125565/console >>>> >>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125536/testReport >>>> >>>> >>>> >>>> -- >>>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/ >>>> >>>> ----- >>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >>>> >>>> >> >> -- >> Shane Knapp >> Computer Guy / Voice of Reason >> UC Berkeley EECS Research / RISELab Staff Technical Lead >> https://rise.cs.berkeley.edu >> > -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
Re: restarting jenkins build system tomorrow (7/8) ~930am PDT
yeah, i can't do much for flaky tests... just flaky infrastructure. On Fri, Jul 10, 2020 at 12:41 AM Hyukjin Kwon wrote: > Couple of flaky tests can happen. It's usual. Seems it got better now at > least. I will keep monitoring the builds. > > On Fri, Jul 10, 2020 at 4:33 PM, ukby1234 wrote: > >> Looks like Jenkins isn't stable still. My PR fails two times in a row: >> >> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125565/console >> >> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125536/testReport >> >> >> >> -- >> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/ >> >> ----- >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >> >> -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
Re: restarting jenkins build system tomorrow (7/8) ~930am PDT
i'm seeing green PRB builds now, so i feel that we've gotten things building again! :) On Thu, Jul 9, 2020 at 5:33 PM Hyukjin Kwon wrote: > Thank you Shane. > > On Fri, Jul 10, 2020 at 2:35 AM, shane knapp wrote: > >> and -06 is back! i'll keep an eye on things today, but suffice to >> say on each worker i: >> >> 1) rebooted >> 2) cleaned ~/.ivy2, ~/.m2, and other associated caches >> >> we should be g2g! please reply here if you continue to see weirdness. >> >> On Thu, Jul 9, 2020 at 10:08 AM shane knapp >> wrote: >> >>> ok, we're back up and building (just waiting for one worker, -06 to >>> finish cleaning itself up). >>> >>> On Thu, Jul 9, 2020 at 9:30 AM shane knapp >>> wrote: >>> >>>> this is happening now. >>>> >>>> On Wed, Jul 8, 2020 at 9:07 AM shane knapp >>>> wrote: >>>> >>>>> this will be happening tomorrow... today is Meeting Hell Day[tm]. >>>>> >>>>> On Tue, Jul 7, 2020 at 1:59 PM shane knapp >>>>> wrote: >>>>> >>>>>> i wasn't able to get to it today, so i'm hoping to squeeze in a quick >>>>>> trip to the colo tomorrow morning. if not, then first thing thursday. >>>>>> >>>>>> -- >>>>>> Shane Knapp >>>>>> Computer Guy / Voice of Reason >>>>>> UC Berkeley EECS Research / RISELab Staff Technical Lead >>>>>> https://rise.cs.berkeley.edu >>>>>> -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
Re: restarting jenkins build system tomorrow (7/8) ~930am PDT
and -06 is back! i'll keep an eye on things today, but suffice to say on each worker i: 1) rebooted 2) cleaned ~/.ivy2, ~/.m2, and other associated caches we should be g2g! please reply here if you continue to see weirdness. On Thu, Jul 9, 2020 at 10:08 AM shane knapp β wrote: > ok, we're back up and building (just waiting for one worker, -06 to finish > cleaning itself up). > > On Thu, Jul 9, 2020 at 9:30 AM shane knapp β wrote: > >> this is happening now. >> >> On Wed, Jul 8, 2020 at 9:07 AM shane knapp β wrote: >> >>> this will be happening tomorrow... today is Meeting Hell Day[tm]. >>> >>> On Tue, Jul 7, 2020 at 1:59 PM shane knapp β >>> wrote: >>> >>>> i wasn't able to get to it today, so i'm hoping to squeeze in a quick >>>> trip to the colo tomorrow morning. if not, then first thing thursday. >>>> >>>> -- >>>> Shane Knapp >>>> Computer Guy / Voice of Reason >>>> UC Berkeley EECS Research / RISELab Staff Technical Lead >>>> https://rise.cs.berkeley.edu >>>> >>> >>> >>> -- >>> Shane Knapp >>> Computer Guy / Voice of Reason >>> UC Berkeley EECS Research / RISELab Staff Technical Lead >>> https://rise.cs.berkeley.edu >>> >> >> >> -- >> Shane Knapp >> Computer Guy / Voice of Reason >> UC Berkeley EECS Research / RISELab Staff Technical Lead >> https://rise.cs.berkeley.edu >> > > > -- > Shane Knapp > Computer Guy / Voice of Reason > UC Berkeley EECS Research / RISELab Staff Technical Lead > https://rise.cs.berkeley.edu > -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
Re: restarting jenkins build system tomorrow (7/8) ~930am PDT
ok, we're back up and building (just waiting for one worker, -06 to finish cleaning itself up). On Thu, Jul 9, 2020 at 9:30 AM shane knapp β wrote: > this is happening now. > > On Wed, Jul 8, 2020 at 9:07 AM shane knapp β wrote: > >> this will be happening tomorrow... today is Meeting Hell Day[tm]. >> >> On Tue, Jul 7, 2020 at 1:59 PM shane knapp β wrote: >> >>> i wasn't able to get to it today, so i'm hoping to squeeze in a quick >>> trip to the colo tomorrow morning. if not, then first thing thursday. >>> >>> -- >>> Shane Knapp >>> Computer Guy / Voice of Reason >>> UC Berkeley EECS Research / RISELab Staff Technical Lead >>> https://rise.cs.berkeley.edu >>> >> >> >> -- >> Shane Knapp >> Computer Guy / Voice of Reason >> UC Berkeley EECS Research / RISELab Staff Technical Lead >> https://rise.cs.berkeley.edu >> > > > -- > Shane Knapp > Computer Guy / Voice of Reason > UC Berkeley EECS Research / RISELab Staff Technical Lead > https://rise.cs.berkeley.edu > -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
Re: restarting jenkins build system tomorrow (7/8) ~930am PDT
this is happening now. On Wed, Jul 8, 2020 at 9:07 AM shane knapp β wrote: > this will be happening tomorrow... today is Meeting Hell Day[tm]. > > On Tue, Jul 7, 2020 at 1:59 PM shane knapp β wrote: > >> i wasn't able to get to it today, so i'm hoping to squeeze in a quick >> trip to the colo tomorrow morning. if not, then first thing thursday. >> >> -- >> Shane Knapp >> Computer Guy / Voice of Reason >> UC Berkeley EECS Research / RISELab Staff Technical Lead >> https://rise.cs.berkeley.edu >> > > > -- > Shane Knapp > Computer Guy / Voice of Reason > UC Berkeley EECS Research / RISELab Staff Technical Lead > https://rise.cs.berkeley.edu > -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
Re: restarting jenkins build system tomorrow (7/8) ~930am PDT
this will be happening tomorrow... today is Meeting Hell Day[tm]. On Tue, Jul 7, 2020 at 1:59 PM shane knapp β wrote: > i wasn't able to get to it today, so i'm hoping to squeeze in a quick trip > to the colo tomorrow morning. if not, then first thing thursday. > > -- > Shane Knapp > Computer Guy / Voice of Reason > UC Berkeley EECS Research / RISELab Staff Technical Lead > https://rise.cs.berkeley.edu > -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
restarting jenkins build system tomorrow (7/8) ~930am PDT
i wasn't able to get to it today, so i'm hoping to squeeze in a quick trip to the colo tomorrow morning. if not, then first thing thursday. -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
Re: m2 cache issues in Jenkins?
ok, i'm gonna have to reboot all the workers tomorrow and wipe the m2 caches. it looks like zombie builds were lingering post-jenkins-wedging and corrupting the repos. fixed on -05. On Mon, Jul 6, 2020 at 2:17 PM Jungtaek Lim wrote: > Just encountered the same and it's worker-05 again. (You can find [error] > in the console to see what's the problem. I guess jetty artifacts in the > worker might be messed up.) > > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125127/consoleFull > > > On Tue, Jul 7, 2020 at 5:35 AM Jungtaek Lim > wrote: > >> Could this be a flaky or persistent issue? It failed with Scala gendoc >> but it didn't fail with the part the PR modified. It ran from worker-05. >> >> >> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125121/consoleFull >> >> On Tue, Jul 7, 2020 at 2:10 AM shane knapp β wrote: >> >>> i killed and retriggered the PRB jobs on 04, and wiped that workers' m2 >>> cache. >>> >>> On Mon, Jul 6, 2020 at 9:24 AM shane knapp β >>> wrote: >>> >>>> once the jobs running on that worker are finished, yes. >>>> >>>> On Sun, Jul 5, 2020 at 7:41 PM Hyukjin Kwon >>>> wrote: >>>> >>>>> Shane, can we remove .m2 in worker machine 4? >>>>> >>>>> 2020λ 7μ 3μΌ (κΈ) μ€μ 8:18, Jungtaek Lim λμ΄ >>>>> μμ±: >>>>> >>>>>> Looks like Jenkins service itself becomes unstable. It took >>>>>> considerable time to just open the test report for a specific build, and >>>>>> Jenkins doesn't pick the request on rebuild (retest this, please) in >>>>>> Github >>>>>> comment. >>>>>> >>>>>> On Thu, Jul 2, 2020 at 2:12 PM Hyukjin Kwon >>>>>> wrote: >>>>>> >>>>>>> Ah, okay. Actually there already is - >>>>>>> https://issues.apache.org/jira/browse/SPARK-31693. I am reopening. >>>>>>> >>>>>>> 2020λ 7μ 2μΌ (λͺ©) μ€ν 2:06, Holden Karau λμ΄ μμ±: >>>>>>> >>>>>>>> We don't I didn't file one originally, but Shane reminded me to in >>>>>>>> the future. 
>>>>>>>> >>>>>>>> On Wed, Jul 1, 2020 at 9:44 PM Hyukjin Kwon >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Nope, do we have an existing ticket? I think we can reopen it if >>>>>>>>> there is one. >>>>>>>>> >>>>>>>>> On Thu, Jul 2, 2020 at 1:43 PM Holden Karau wrote: >>>>>>>>> >>>>>>>>>> Huh, interesting that it's the same worker. Have you filed a >>>>>>>>>> ticket with Shane? >>>>>>>>>> >>>>>>>>>> On Wed, Jul 1, 2020 at 8:50 PM Hyukjin Kwon >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> Hm .. seems this is happening again on amp-jenkins-worker-04 ;(. >>>>>>>>>>> >>>>>>>>>>> On Thu, Jun 25, 2020 at 3:15 AM shane knapp >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>>> done: >>>>>>>>>>>> -bash-4.1$ cd .m2 >>>>>>>>>>>> -bash-4.1$ ls >>>>>>>>>>>> repository >>>>>>>>>>>> -bash-4.1$ time rm -rf * >>>>>>>>>>>> >>>>>>>>>>>> real 17m4.607s >>>>>>>>>>>> user 0m0.950s >>>>>>>>>>>> sys 0m18.816s >>>>>>>>>>>> -bash-4.1$ >>>>>>>>>>>> >>>>>>>>>>>> On Wed, Jun 24, 2020 at 10:50 AM shane knapp < >>>>>>>>>>>> skn...@berkeley.edu> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> ok, i've taken that worker offline and once the job running on >>>>>>>>>>>>> it finishes, i'll wipe the cache. >>>>>>>>>>>>> >>>>>>>>>>>>> in the future, plea
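The full-cache wipe discussed in the thread (cd into .m2 and `rm -rf` the repository, as in shane's transcript) can be sketched as below. The `M2_REPO` path is an assumption: on a real worker it would be the build user's `~/.m2/repository`, but to keep the sketch harmless to run anywhere it operates on a throwaway temp directory instead.

```shell
# Sketch of the full m2-cache wipe from the thread above.
# ASSUMPTION: on a real Jenkins worker M2_REPO would be the build
# user's ~/.m2/repository; here we populate a throwaway directory
# so this is safe to run anywhere.
M2_REPO="$(mktemp -d)/repository"
mkdir -p "$M2_REPO/org/eclipse/jetty"
touch "$M2_REPO/org/eclipse/jetty/stale-artifact.jar"

# Wipe the whole repository; Maven re-downloads artifacts lazily on
# the next build, so the only cost is re-download time.
time rm -rf "$M2_REPO"

[ -d "$M2_REPO" ] && echo "wipe failed" || echo "cache wiped"
```

Since Maven rebuilds the local repository on demand, a full wipe never loses anything permanently; the trade-off (noted later in the thread) is the volume of re-downloads it triggers.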
Re: m2 cache issues in Jenkins?
i killed and retriggered the PRB jobs on 04, and wiped that worker's m2 cache. On Mon, Jul 6, 2020 at 9:24 AM shane knapp wrote: > once the jobs running on that worker are finished, yes. > > On Sun, Jul 5, 2020 at 7:41 PM Hyukjin Kwon wrote: >> Shane, can we remove .m2 on worker machine 4? >> >> On Fri, Jul 3, 2020 at 8:18 AM Jungtaek Lim >> wrote: >>> Looks like the Jenkins service itself has become unstable. It took considerable >>> time just to open the test report for a specific build, and Jenkins doesn't >>> pick up the rebuild request ("retest this, please") in GitHub comments. >>> >>> On Thu, Jul 2, 2020 at 2:12 PM Hyukjin Kwon wrote: >>> >>>> Ah, okay. Actually there already is one - >>>> https://issues.apache.org/jira/browse/SPARK-31693. I am reopening it. >>>> >>>> On Thu, Jul 2, 2020 at 2:06 PM Holden Karau wrote: >>>> >>>>> We don't. I didn't file one originally, but Shane reminded me to in the >>>>> future. >>>>> >>>>> On Wed, Jul 1, 2020 at 9:44 PM Hyukjin Kwon >>>>> wrote: >>>>> >>>>>> Nope, do we have an existing ticket? I think we can reopen it if there >>>>>> is one. >>>>>> >>>>>> On Thu, Jul 2, 2020 at 1:43 PM Holden Karau wrote: >>>>>> >>>>>>> Huh, interesting that it's the same worker. Have you filed a ticket >>>>>>> with Shane? >>>>>>> >>>>>>> On Wed, Jul 1, 2020 at 8:50 PM Hyukjin Kwon >>>>>>> wrote: >>>>>>> >>>>>>>> Hm .. seems this is happening again on amp-jenkins-worker-04 ;(. >>>>>>>> >>>>>>>> On Thu, Jun 25, 2020 at 3:15 AM shane knapp wrote: >>>>>>>> >>>>>>>>> done: >>>>>>>>> -bash-4.1$ cd .m2 >>>>>>>>> -bash-4.1$ ls >>>>>>>>> repository >>>>>>>>> -bash-4.1$ time rm -rf * >>>>>>>>> >>>>>>>>> real 17m4.607s >>>>>>>>> user 0m0.950s >>>>>>>>> sys 0m18.816s >>>>>>>>> -bash-4.1$ >>>>>>>>> >>>>>>>>> On Wed, Jun 24, 2020 at 10:50 AM shane knapp < >>>>>>>>> skn...@berkeley.edu> wrote: >>>>>>>>> >>>>>>>>>> ok, i've taken that worker offline and once the job running on it >>>>>>>>>> finishes, i'll wipe the cache. 
>>>>>>>>>> >>>>>>>>>> in the future, please file a JIRA and assign it to me so i don't >>>>>>>>>> have to track my work through emails to the dev@ list. ;) >>>>>>>>>> >>>>>>>>>> thanks! >>>>>>>>>> >>>>>>>>>> shane >>>>>>>>>> >>>>>>>>>> On Wed, Jun 24, 2020 at 10:48 AM Holden Karau < >>>>>>>>>> hol...@pigscanfly.ca> wrote: >>>>>>>>>> >>>>>>>>>>> The most recent one I noticed was >>>>>>>>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124437/console >>>>>>>>>>> which >>>>>>>>>>> was run on amp-jenkins-worker-04. >>>>>>>>>>> >>>>>>>>>>> On Wed, Jun 24, 2020 at 10:44 AM shane knapp < >>>>>>>>>>> skn...@berkeley.edu> wrote: >>>>>>>>>>> >>>>>>>>>>>> for those weird failures, it's super helpful to provide which >>>>>>>>>>>> workers are showing these issues. :) >>>>>>>>>>>> >>>>>>>>>>>> i'd rather not wipe all of the m2 caches on all of the workers, >>>>>>>>>>>> as we'll then potentially get blacklisted again if we download too >>>>>>>>>>>> many >>>>>>>>>>>> packages from apache.org. >>>>>>>>>>>> >>>>>>>>>>>> On Tue, Jun 23, 2020
Re: m2 cache issues in Jenkins?
once the jobs running on that worker are finished, yes. On Sun, Jul 5, 2020 at 7:41 PM Hyukjin Kwon wrote: > Shane, can we remove .m2 on worker machine 4? > > On Fri, Jul 3, 2020 at 8:18 AM Jungtaek Lim wrote: > >> Looks like the Jenkins service itself has become unstable. It took considerable >> time just to open the test report for a specific build, and Jenkins doesn't >> pick up the rebuild request ("retest this, please") in GitHub comments. >> >> On Thu, Jul 2, 2020 at 2:12 PM Hyukjin Kwon wrote: >> >>> Ah, okay. Actually there already is one - >>> https://issues.apache.org/jira/browse/SPARK-31693. I am reopening it. >>> >>> On Thu, Jul 2, 2020 at 2:06 PM Holden Karau wrote: >>> >>>> We don't. I didn't file one originally, but Shane reminded me to in the >>>> future. >>>> >>>> On Wed, Jul 1, 2020 at 9:44 PM Hyukjin Kwon >>>> wrote: >>>> >>>>> Nope, do we have an existing ticket? I think we can reopen it if there is one. >>>>> >>>>> On Thu, Jul 2, 2020 at 1:43 PM Holden Karau wrote: >>>>> >>>>>> Huh, interesting that it's the same worker. Have you filed a ticket >>>>>> with Shane? >>>>>> >>>>>> On Wed, Jul 1, 2020 at 8:50 PM Hyukjin Kwon >>>>>> wrote: >>>>>> >>>>>>> Hm .. seems this is happening again on amp-jenkins-worker-04 ;(. >>>>>>> >>>>>>> On Thu, Jun 25, 2020 at 3:15 AM shane knapp wrote: >>>>>>> >>>>>>>> done: >>>>>>>> -bash-4.1$ cd .m2 >>>>>>>> -bash-4.1$ ls >>>>>>>> repository >>>>>>>> -bash-4.1$ time rm -rf * >>>>>>>> >>>>>>>> real 17m4.607s >>>>>>>> user 0m0.950s >>>>>>>> sys 0m18.816s >>>>>>>> -bash-4.1$ >>>>>>>> >>>>>>>> On Wed, Jun 24, 2020 at 10:50 AM shane knapp >>>>>>>> wrote: >>>>>>>> >>>>>>>>> ok, i've taken that worker offline and once the job running on it >>>>>>>>> finishes, i'll wipe the cache. >>>>>>>>> >>>>>>>>> in the future, please file a JIRA and assign it to me so i don't >>>>>>>>> have to track my work through emails to the dev@ list. ;) >>>>>>>>> >>>>>>>>> thanks! 
>>>>>>>>> >>>>>>>>> shane >>>>>>>>> >>>>>>>>> On Wed, Jun 24, 2020 at 10:48 AM Holden Karau < >>>>>>>>> hol...@pigscanfly.ca> wrote: >>>>>>>>> >>>>>>>>>> The most recent one I noticed was >>>>>>>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124437/console >>>>>>>>>> which >>>>>>>>>> was run on amp-jenkins-worker-04. >>>>>>>>>> >>>>>>>>>> On Wed, Jun 24, 2020 at 10:44 AM shane knapp < >>>>>>>>>> skn...@berkeley.edu> wrote: >>>>>>>>>> >>>>>>>>>>> for those weird failures, it's super helpful to provide which >>>>>>>>>>> workers are showing these issues. :) >>>>>>>>>>> >>>>>>>>>>> i'd rather not wipe all of the m2 caches on all of the workers, >>>>>>>>>>> as we'll then potentially get blacklisted again if we download too >>>>>>>>>>> many >>>>>>>>>>> packages from apache.org. >>>>>>>>>>> >>>>>>>>>>> On Tue, Jun 23, 2020 at 5:58 PM Holden Karau < >>>>>>>>>>> hol...@pigscanfly.ca> wrote: >>>>>>>>>>> >>>>>>>>>>>> Hi Folks, >>>>>>>>>>>> >>>>>>>>>>>> I've been seeing some weird failures on Jenkins and it looks like >>>>>>>>>>>> it might be from the m2 cache. Would it be OK to clean it out? Or >>>>>>>>>>
Re: Jenkins is down
hey all, i was out of town for the weekend and noticed it was down this morning and restarted the service. it's been pretty flaky recently, so i'll take a much closer look at things this coming week. On Sun, Jul 5, 2020 at 1:14 PM Dongjoon Hyun wrote: > Hi, All. > > Now, AmpLab Jenkins farm came back online. > > > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/ > > Also, many PRBuilder jobs were re-started 10 minutes ago. > > Bests, > Dongjoon. > > > On Fri, Jul 3, 2020 at 4:43 AM Hyukjin Kwon wrote: > >> Hi all and Shane, >> >> Is there something wrong with the Jenkins machines? Seems they are down. >> > -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
Re: m2 cache issues in Jenkins?
done: -bash-4.1$ cd .m2 -bash-4.1$ ls repository -bash-4.1$ time rm -rf * real 17m4.607s user 0m0.950s sys 0m18.816s -bash-4.1$ On Wed, Jun 24, 2020 at 10:50 AM shane knapp wrote: > ok, i've taken that worker offline and once the job running on it > finishes, i'll wipe the cache. > > in the future, please file a JIRA and assign it to me so i don't have to > track my work through emails to the dev@ list. ;) > > thanks! > > shane > > On Wed, Jun 24, 2020 at 10:48 AM Holden Karau > wrote: >> The most recent one I noticed was >> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124437/console >> which >> was run on amp-jenkins-worker-04. >> >> On Wed, Jun 24, 2020 at 10:44 AM shane knapp >> wrote: >>> for those weird failures, it's super helpful to provide which workers >>> are showing these issues. :) >>> >>> i'd rather not wipe all of the m2 caches on all of the workers, as we'll >>> then potentially get blacklisted again if we download too many packages >>> from apache.org. >>> >>> On Tue, Jun 23, 2020 at 5:58 PM Holden Karau >>> wrote: >>>> Hi Folks, >>>> >>>> I've been seeing some weird failures on Jenkins and it looks like it might >>>> be from the m2 cache. Would it be OK to clean it out? Or is it important? 
>>>> >>>> Cheers, >>>> >>>> Holden >>>> >>>> -- >>>> Twitter: https://twitter.com/holdenkarau >>>> Books (Learning Spark, High Performance Spark, etc.): >>>> https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9> >>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau >>>> >>> >>> >>> -- >>> Shane Knapp >>> Computer Guy / Voice of Reason >>> UC Berkeley EECS Research / RISELab Staff Technical Lead >>> https://rise.cs.berkeley.edu >>> >> >> >> -- >> Twitter: https://twitter.com/holdenkarau >> Books (Learning Spark, High Performance Spark, etc.): >> https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9> >> YouTube Live Streams: https://www.youtube.com/user/holdenkarau >> > > > -- > Shane Knapp > Computer Guy / Voice of Reason > UC Berkeley EECS Research / RISELab Staff Technical Lead > https://rise.cs.berkeley.edu > -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu
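Since a full wipe forces every package to be re-downloaded (and, per shane's note in the thread, risks getting the workers blacklisted by apache.org), a narrower option is to delete only the group suspected of corruption — jetty, going by Jungtaek's guess earlier in the thread. This is a hedged sketch only: the group path `org/eclipse/jetty` is illustrative, and the sketch runs against a throwaway directory rather than a worker's real `~/.m2/repository`.

```shell
# Hedged sketch: wipe only the suspect artifact group instead of the
# whole repository, so most of the cache survives and fewer packages
# are re-fetched. Uses a throwaway directory for safety; on a real
# worker M2_REPO would be ~/.m2/repository.
M2_REPO="$(mktemp -d)/repository"
mkdir -p "$M2_REPO/org/eclipse/jetty" "$M2_REPO/org/apache/spark"
touch "$M2_REPO/org/eclipse/jetty/corrupt.jar" "$M2_REPO/org/apache/spark/fine.jar"

# Targeted wipe: only the jetty group is removed...
rm -rf "$M2_REPO/org/eclipse/jetty"

# ...so the rest of the cache stays put and won't be re-downloaded.
[ -f "$M2_REPO/org/apache/spark/fine.jar" ] && echo "spark artifacts kept"
[ ! -d "$M2_REPO/org/eclipse/jetty" ] && echo "jetty group removed"
```

Maven's local repository mirrors group IDs as directory paths (`org/eclipse/jetty`, `org/apache/spark`, …), which is what makes a per-group wipe this simple.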