Even with the rate limiting on, I'm seeing 100% of 1 CPU dedicated to rsyslogd, I'm *guessing* that is TLS connection overhead, but I really don't have any insight there. I can say that I'm still seeing some agents get "could not connect" or "connection reset by peer" errors.
I certainly think we should land your changes so far, as it does fix a real bug, and we can always continue to iterate on fixing other things. John =:-> On Tue, Sep 16, 2014 at 6:28 PM, Wayne Witzel <[email protected]> wrote: > John, > > Thank you taking the time to run the logging through these tests. > > I found that there is also the $SystemLogRateLimitInterval setting. > Setting that to 0 will disable the default rate limiting in rsyslog. > http://www.rsyslog.com/changing-the-settings/ > > I would be curious to see what the performance and logging looks like with > the rate limiting completely disabled. > > Should I submit the changes at a PR? Since even though they aren't > perfect, it does resolve a real bug (size mismatch). > > Not sure why the second file is hanging around, I wasn't able to replicate > that locally and I wasn't able to find anything that suggested there was a > minimum age. There is a maxage, which ensures old files get deleted, but > not the inverse. I will continue to research optimizations and fixes for > our rsyslog configuration. > > Wayne > > > On Tue, Sep 16, 2014 at 4:13 AM, John Meinel <[email protected]> > wrote: > >> Oh, and after all this has been running for a while, I did see: >> -rw------- 1 syslog adm 178M Sep 16 08:11 all-machines.log >> -rw------- 1 syslog adm 513M Sep 16 07:53 all-machines.log.1 >> -rw------- 1 syslog adm 513M Sep 16 07:26 all-machines.log.2 >> >> So it is successfully rotating the log file, but it seems to be keeping >> too many of them? >> >> Maybe it is keeping more because they are happening on the same day and >> there is some sort of "minimum age before deleting" rule? >> >> John >> =:-> >> >> On Tue, Sep 16, 2014 at 12:11 PM, John Meinel <[email protected]> >> wrote: >> >>> So poking around and filing >>> https://bugs.launchpad.net/juju-core/+bug/1369909 >>> I found lines about >>> Sep 16 07:48:14 ip-10-39-60-247 rsyslogd-2079: message repeated 498 >>> times: [too many tcp sessions - dropping incoming request [try >>> http://www.rsyslog.com/e/2079 ]] >>> >>> That is 500 times rsyslog had to deny someone. (So probably 500 Rsyslog >>> workers that got restarted, triggering 500*5 lines of machine-0.log >>> information.) >>> >>> By putting this line in our /etc/rsyslog.d/25-juju.conf >>> >>> $InputTCPMaxSessions 20000 # Maximum TCP sessions (default 200) >>> >>> In the "ModLoad imtcp" section (before the InputTCPServerRun) and then >>> restarting (reloading is not enough), it seemed to make the system a lot >>> happier. >>> >>> It isn't 100% happy, as still the #1 thing in machine-0.log is >>> information about WatchForRsyslogChanges being stopped. But it is at a >>> lower rate. >>> >>> Again filtering for a single connection it seems that the watcher is >>> being stopped every 2 minutes, and then restarted 3s later. >>> >>> Given that this is are logging infrastructure, failures like this mean >>> that it is going to be hard to debug, since we can't aggregate logs to the >>> central point that say "I could not connect to the logging infrastructure". >>> >>> John >>> =:-> >>> >>> >>> >>> On Tue, Sep 16, 2014 at 11:39 AM, John Meinel <[email protected]> >>> wrote: >>> >>>> I tested your branch in the same configuration. First I had to change >>>> "logging-config: <root>=DEBUG" again. I thought that if you did "juju >>>> bootstrap --debug" that it always left DEBUG level on. >>>> >>>> Anyway, something doesn't quite seem right, as I saw: >>>> -rw------- 1 syslog adm 500M Sep 16 07:25 all-machines.log >>>> ... >>>> -rw------- 1 syslog syslog 300M Sep 16 07:21 >>>> machine-0-2014-09-16T07-21-02.193.log >>>> -rw------- 1 syslog syslog 300M Sep 16 07:25 >>>> machine-0-2014-09-16T07-25-14.840.log >>>> -rw------- 1 syslog syslog 31M Sep 16 07:25 machine-0.log >>>> >>>> >>>> Which means that machine-0.log has gotten more than 600MB of data for >>>> sure (I saw it rotate at least once above this). But there is only a single >>>> all-machines.log of only 500MB. >>>> >>>> As I sat there watching, I did manage to see: >>>> -rw------- 1 syslog adm 7.0M Sep 16 07:26 all-machines.log >>>> -rw------- 1 syslog adm 513M Sep 16 07:26 all-machines.log.1 >>>> ... >>>> >>>> Which looks correct. But it would seem that something (load?) was >>>> causing it to not get all of the machine-0 messages. >>>> Maybe it is getting rate limited? >>>> I can watch the machine-0.log file fill at about 1MB per second. And >>>> then after about 5s I see all-machines.log only go up by a couple of MB. >>>> (machine-0 log seemed to go up by about 12MB, and then all-machines.log >>>> only went up by 3MB.) >>>> >>>> Now, I'm also seeing a rediculous amount of logging, and lots of lines >>>> like: >>>> http://paste.ubuntu.com/8356019/ >>>> >>>> Which seems to indicate that every 3 seconds the Rsyslog Watcher is >>>> dying and causing everything to restart and reconnect and ask for the new >>>> credentials. >>>> >>>> I'm also a bit surprised that we're telling them to connect to a DNS >>>> name, and not an IP address, but the resetting is the important bit. I >>>> filed https://bugs.launchpad.net/juju-core/+bug/1369900 to track that >>>> bug. >>>> >>>> Anyone care to try to figure out why I'm seeing 3-4x the amount of data >>>> in machine-0.log than seems to be ending up in all-machines.log ? >>>> >>>> John >>>> =:-> >>>> >>>> On Mon, Sep 15, 2014 at 7:13 PM, Wayne Witzel < >>>> [email protected]> wrote: >>>> >>>>> Here is a branch that addresses the concerns and bug. I've tested it >>>>> locally and with digitalocean. I would love for it to be tested under the >>>>> scaling scenario where you first encountered the issues. >>>>> >>>>> https://github.com/wwitzel3/juju/tree/ww3-rsyslogd-logrotate >>>>> >>>>> I've changed the size in logrotate to an arbitrarily small size, since >>>>> the rotation is driven by rsyslog. I've also updated the "style" of the >>>>> rotation to create a new log file and use postrotate to HUP rsyslog. >>>>> >>>>> Also I removed compression. I don't think we need it given that we are >>>>> only keeping one rotation around and it is size restricted. This will just >>>>> make it easier to get at information across both of the log files. >>>>> >>>>> Wayne >>>>> >>>>> On Mon, Sep 15, 2014 at 10:30 AM, Nate Finch <[email protected] >>>>> > wrote: >>>>> >>>>>> tl;dr: rsyslog sees 512,000,000 bytes and tells logrotate "time to >>>>>> rotate!" and logrotate sees less than 512MB and say "nah, not big >>>>>> enough" >>>>>> and rsyslog never writes the logs anymore because the file is too big. >>>>>> >>>>>> On Mon, Sep 15, 2014 at 10:07 AM, Wayne Witzel < >>>>>> [email protected]> wrote: >>>>>> >>>>>>> When this was being implemented and during the review process the >>>>>>> actual size of the rotated files got adjusted. I tested very thoroughly >>>>>>> the >>>>>>> actual rotation script and also tested the triggering of the rotation >>>>>>> script. Enough that I was happy landing the changes. >>>>>>> >>>>>>> When the sizes were adjusted the same amount of diligence was not >>>>>>> applied in testing. This is what introduced the bug. >>>>>>> >>>>>>> The size mismatch is problem. rsyslogd will stop logging to the file >>>>>>> and attempted to execute the script in the outchannel every time it >>>>>>> gets a >>>>>>> new log message, which runs logrotate, which does nothing because the >>>>>>> file >>>>>>> isn't big enough. >>>>>>> >>>>>>> Here is the rsyslog docs for implementation of sized based rotation. >>>>>>> http://www.rsyslog.com/doc/log_rotation_fix_size.html >>>>>>> >>>>>>> I first attempted use the mv command, but with our rsyslogd >>>>>>> configuration, when I performed a mv, rsyslogd would stop logging to the >>>>>>> file until I actually did a reload on the process. Also this didn't >>>>>>> easily >>>>>>> support things like compression or adding more archives if we decided to >>>>>>> keep more than one rotation around. So the choice to use logrotate is >>>>>>> was >>>>>>> made. >>>>>>> >>>>>>> As for the actual act of reloading rsyslog. You can use copytruncate >>>>>>> to avoid it all together or you can use postrotate to reload rsyslogd. >>>>>>> In >>>>>>> both cases there is a small window of possible data loss. Though with >>>>>>> copyandtruncate it is more likley to happen when the system is under >>>>>>> load. >>>>>>> So changing to a postrotate that reloads rsyslogd is probably a good >>>>>>> idea. >>>>>>> >>>>>>> The rotate setting is just the number of files to keep around. I >>>>>>> asked some people after the size update if 1 rotation, a total of 1GB >>>>>>> was >>>>>>> enough to keep around. Most people thought it was fine, since the >>>>>>> original >>>>>>> purpose of the sized based long rotation was to address a ticket that >>>>>>> was >>>>>>> complaining about the juju log folder taking up 3GB+ of space. >>>>>>> >>>>>>> I will get a ticket created to address the size mismatch issue and >>>>>>> switch the copyandtruncate approach to a postrotate of rsyslogd. >>>>>>> >>>>>>> Wayne >>>>>>> >>>>>>> On Mon, Sep 15, 2014 at 5:30 AM, John Meinel <[email protected] >>>>>>> > wrote: >>>>>>> >>>>>>>> So we are using rsyslog.conf to have it figure out when rotation >>>>>>>> needs to be done with: >>>>>>>> # Maximum size for the log on this outchannel is 512MB >>>>>>>> # The command to execute when an outchannel as reached its size >>>>>>>> limit cannot accept any arguments >>>>>>>> # that is why we have created the helper script for executing >>>>>>>> logrotate. >>>>>>>> $outchannel >>>>>>>> logRotation,{{logDir}}/all-machines.log,512000000,{{logrotateHelperPath}} >>>>>>>> >>>>>>>> I would think that would not require SIGHUP along with having it >>>>>>>> run the script at our request. >>>>>>>> >>>>>>>> John >>>>>>>> =:-> >>>>>>>> >>>>>>>> On Mon, Sep 15, 2014 at 12:13 PM, Stuart Bishop < >>>>>>>> [email protected]> wrote: >>>>>>>> >>>>>>>>> On 15 September 2014 12:38, John Meinel <[email protected]> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>> > 7) "copytruncate" seems the wrong setting for interactive with >>>>>>>>> rsyslog. I >>>>>>>>> > believe rsyslog is already aware that the file needs to be >>>>>>>>> rotated, and thus >>>>>>>>> >>>>>>>>> It is only aware if you sent it a HUP signal. >>>>>>>>> >>>>>>>>> > it shouldn't be trying to write to the same file handle (and >>>>>>>>> thus we don't >>>>>>>>> > need to truncate in place). I'm not 100% sure on the >>>>>>>>> interactions here, but >>>>>>>>> > "copytruncate" seems to have an inherent likelyhood of dropping >>>>>>>>> data (while >>>>>>>>> > you are copying, if any data gets written then you'll miss those >>>>>>>>> last few >>>>>>>>> > bytes when you go to truncate, right?) >>>>>>>>> >>>>>>>>> -- >>>>>>>>> Stuart Bishop <[email protected]> >>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> -- >>>>>>>> Juju-dev mailing list >>>>>>>> [email protected] >>>>>>>> Modify settings or unsubscribe at: >>>>>>>> https://lists.ubuntu.com/mailman/listinfo/juju-dev >>>>>>>> >>>>>>>> >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> Wayne Witzel III >>>>>>> [email protected] >>>>>>> >>>>>>> -- >>>>>>> Juju-dev mailing list >>>>>>> [email protected] >>>>>>> Modify settings or unsubscribe at: >>>>>>> https://lists.ubuntu.com/mailman/listinfo/juju-dev >>>>>>> >>>>>>> >>>>>> >>>>> >>>>> >>>>> -- >>>>> Wayne Witzel III >>>>> [email protected] >>>>> >>>> >>>> >>> >> > > > -- > Wayne Witzel III > [email protected] >
-- Juju-dev mailing list [email protected] Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/juju-dev
