ok, i dug through the logs and noticed that rsyslogd was dropping messages
due to imuxsock being spammed by postfix...  which i then tracked down to
our installation of fail2ban being incorrectly configured and attempting to
send IP ban/unban status emails to 'em...@example.com'.

since we're a university, and especially one w/a reputation like ours, we
are constantly under attack.  the logs of the attempted dictionary attacks
would astound you in their size and scope.  since we have so many ban/unban
actions happening for all of these unique IP addresses, each of which
generates an email that was directed to an invalid address, we ended up
w/well over 100M of plain-text messages waiting in the mail queue.  postfix
was continually trying to send these messages, which was causing the system
to behave strangely, including breaking rsyslogd.
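
for anyone wanting to check their own box for the same symptom, here's a
rough python sketch (not the commands actually used here) that tallies how
many messages rsyslogd reports as lost per pid.  it assumes the usual
"imuxsock lost N messages from pid P due to rate-limiting" notice format;
the function name and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# assumed notice format for rsyslog's imuxsock rate-limiting messages
LOST_RE = re.compile(
    r"imuxsock lost (\d+) messages from pid (\d+) due to rate-limiting"
)


def tally_dropped(lines):
    """return a Counter mapping pid -> total messages reported as lost."""
    dropped = Counter()
    for line in lines:
        match = LOST_RE.search(line)
        if match:
            count, pid = match.groups()
            dropped[int(pid)] += int(count)
    return dropped


# hypothetical sample lines, not taken from the real logs
sample = [
    "Mar 15 20:01:02 host rsyslogd-2177: imuxsock lost 412 messages from pid 3021 due to rate-limiting",
    "Mar 15 20:01:08 host rsyslogd-2177: imuxsock begins to drop messages from pid 3021 due to rate-limiting",
    "Mar 15 20:01:14 host rsyslogd-2177: imuxsock lost 388 messages from pid 3021 due to rate-limiting",
    "Mar 15 20:02:00 host sshd[911]: unrelated line that should be ignored",
]

if __name__ == "__main__":
    for pid, total in tally_dropped(sample).most_common():
        print(f"pid {pid}: {total} messages lost")
```

the pid with the biggest total is the process to look at first (in this
thread's case that would have pointed at postfix).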

so, i disabled email reports in fail2ban, restarted the impacted services,
picked my sysadmin's brain and then purged the mail queue (when was the
last time anyone actually used postfix?).  jenkins now seems to be behaving
(maybe?).

i'm not entirely sure that this will fix the strange GUI hangs, but all
reports i found on stackoverflow and other sites detail strange system
behavior across the board when rsyslogd starts dropping messages.  at the
very least we won't be (potentially) losing system-level log messages
anymore, which might actually help me track down what's happening if
jenkins gets wedged again.

and finally, the obligatory IT Crowd clip:
https://www.youtube.com/watch?v=5UT8RkSmN4k

shane (who expects jenkins to crash within 5 minutes of this email going
out)

On Fri, Mar 15, 2019 at 8:22 PM Sean Owen <sro...@gmail.com> wrote:

> It's not responding again. Is there any way to kick it harder? I know
> it's well understood but this means not much can be merged in Spark
>
> On Fri, Mar 15, 2019 at 12:08 PM shane knapp <skn...@berkeley.edu> wrote:
> >
> > well, that box rebooted in record time!  we're back up and building.
> >
> > and as always, i'll keep a close eye on things today...  jenkins usually
> > works great, until it doesn't.  :\
> >
> > On Fri, Mar 15, 2019 at 9:52 AM shane knapp <skn...@berkeley.edu> wrote:
> >>
> >> as some of you may have noticed, jenkins got itself in a bad state
> >> multiple times over the past couple of weeks.  usually restarting the
> >> service is sufficient, but it appears that i need to hit it w/the reboot
> >> hammer.
> >>
> >> jenkins will be down for the next 20-30 minutes as the node reboots and
> >> jenkins spins back up.  i'll reply here w/any updates.
> >>
> >> shane
> >> --
> >> Shane Knapp
> >> UC Berkeley EECS Research / RISELab Staff Technical Lead
> >> https://rise.cs.berkeley.edu
> >
> >
> >
> > --
> > Shane Knapp
> > UC Berkeley EECS Research / RISELab Staff Technical Lead
> > https://rise.cs.berkeley.edu
>


-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu
