did you have a chance to look at debug log?

  did more testing in the meantime, figured out that:

  6 senders, 1 collector version 7.5.4, 300 kB/s  always works

  6 senders, 1 collector versions 7.5.5, 7.5.6 300 kB/s do not work

6 senders, 2 collectors versions 7.5.5, 7.5.6 300 kB/s mostly work but not always (periods of silence are much more predicable if the re is no reload/HUP signal)

1 sender, one collector any version always works (this has less files open plusthe total traffic is around 100lB/s, not sure if that's siginificant)

Looked at strace on both sides (sender and receiver) during time when traffic went from 300kB/s down to virtually zero but see nothing suspicious, essentially there is lot of messages sent, then one message confirming all of them. Then at some point there is timeout, then few messages are confirmed one by one then nothing, then it starts again.

As far as I can tell it's the rename of the files that causes the problem (we rename athen remove them but if I comment out the remove part the problem is still there). We never rename the files that rsyslogd is actually writing to (even though they are still open).

Given the behaviour in the scenarios above it seems something got broken between 7.5.4 and 7.5.5. Did the code that handles renamed files change in any way? I looked at changelog (http://www.rsyslog.com/changelog-for-7-5-5-v7-devel/) before but didn't see anything that looks related.

  Any ideas how to troubleshoot this further?

  thanks!

        erik

On 11/05/2013 10:04 AM, Rainer Gerhards wrote:
On Tue, Nov 5, 2013 at 3:41 AM, Erik Steffl <[email protected]> wrote:

   See the debug log attached, the first line is the last message in one of
the files (maybe not last message overall, but all communication stops at
that second, there's number of files so it's hard to figure out which
message is the last one).

   Also noticed that number of messages in log with same number at the
beginning drops right after this message, I assume the beginning of the
line has something to do with time, these are the counts and values at the
beginning of the debug log line:


It's the last 5 digits of a Unix timestamp - so you don't get an absolute
time, but an idea how things progress. After the dot I think there are ms
(or so) and after the colon there is a thread ID (same ID = same thread).

Hope that helps a bit, will try to look at the debug log tomorrow morning.

Rainer



   21264 9547.

   22378 9548.
    2725 9549.
    1877 9550.
...9551 to 9969 left out, low counts around 2000
...9970 to 0519 left out, low counts around  6
       6 0520.
       6 0521.
   39302 0522.
   51997 0523.

   thanks!

         erik


On 11/03/2013 01:05 PM, David Lang wrote:

I would post everything in the log file starting with the last log that
it successfully sent and continuing a while after that message (a while
being a couple hundred lines or so to be safe)

once we look at the log we may ask for more.

David Lang


   On Sun, 3 Nov 2013, Erik Steffl wrote:

  Date: Sun, 03 Nov 2013 11:45:03 -0800
From: Erik Steffl <[email protected]>
Reply-To: rsyslog-users <[email protected]>
To: rsyslog-users <[email protected]>
Subject: Re: [rsyslog] Rsyslog with RELP not sending/receiving
messages for
     long intervals

  I have debug log where I was able to find last message in one of the
logs files (right before the silence period) but it's 3 GB :) Any hint
how much to pick the relevant part of the debug log? Is the timestamp
part of the log message? There is something in the beginning of each
debug log line but I don't know how to translate it to time (part of
it is transaction id, the rest not sure about)

  thanks! (and of course I unserstand you can't drop everything and
work on my problem, thanks for answers!)

     erik

On 11/02/2013 08:51 AM, Rainer Gerhards wrote:

On Fri, Nov 1, 2013 at 11:31 PM, Erik Steffl <[email protected]> wrote:

     running reload every 43 seconds, behaviour is exactly the same
(30 min
traffic, 15 min silence) so I think it's not related to reload.

    will test the latest version and I guess re-test the older
version and
go from there...


  It would probably be a good idea to try to capture a debug log so
that we
can see what rsyslog itself thinks what happens. Note that you can also
just enable it shortly before you expect the problem to happen
(google for
"debug on demand").

Right now, I am totally busy with complex work and cannot look at it. If
this is for a business, I would strongly suggest to purchase one of the
low-priced support options. That would enable an engineer to begin
working
on the case within half a day or so. I would *really love* to drop
this to
someone else, as I really don't like the smell of this bug -- but if
I do,
I am pretty sure other bugs will creep up, and that would again be the
death of the new work I am doing (and it took me 3 years or so to get to
this point!).

So it looks like non-paid work needs to stay focused on that work, which
will be useful for all users in the hopefully not so distant future. All
non-emergency bugs (usually security related things) or request thus
need
to be on hold for a couple of days or maybe till the end of month.

If you get a debug log, I will have time to at least look quickly at it,
but I can't say if that will be sufficient to provide a solution.

sorry for that and I hope for your understanding.

Rainer

    thanks!


          erik


On 11/01/2013 02:29 PM, David Lang wrote:

  Ok, this is definantly sounding like a bug in RELP, it worked on an
earlier version, check if it works on the latest version, and then I
would suggest fileing a bug on the www.rsyslog.com bugtracker

This sounds like something Rainer is going to need to look at. He's in
Germany, so I believe his weekend has started, he may see it over the
weekend or it may wait until his work week starts.

David Lang

    On Fri, 1 Nov 2013, Erik Steffl wrote:

    pretty sure reload is sending HUP (that's upstart job as
inlcluded in

adiscon packages), the rsyslog keeps the same PID, I even tried to
strace it while running reload, here's syslog message:

Nov  1 10:47:25 ip-10-238-198-103 rsyslogd: [origin
software="rsyslogd" swVersion="7.5.5" x-pid="3865"
x-info="http://www.rsyslog.com";] rsyslogd was HUPed

   but I think it must be related to something else cause when I
removed
the reload (so the log mover was doing only rename, upload,
remove) it
was still exhibiting the same works for 30 min, silent for 15 min
etc.

   running reload every 43 seconds now and will see what happens...

      erik

On 11/01/2013 12:57 PM, David Lang wrote:

  a HUP of the server should not cause any interruption in processing
messages.

are you sure upstart is seing HUP not doing a full stop/start?

could you try sending HUP to rsyslog manually a few times to see
if you
can duplicate the problem?

David Lang

On Fri, 1 Nov 2013, Erik Steffl wrote:

    sorry, it wasn't entirely clear in my previous email but we
are doing

ubuntu upstart reload which is sending HUP signal, pretty sure it's
doing the right thing since logrotate scripts use it, syslog
message
says rsyslog received HUP signal, rsyslogd keeps the same PID
and all
files are closed (as verified using lsof).

   On top of that if, during one of these quiet periods, I try to
send
messages from another client (on which I just restarted rsyslog
so it
has no error counters etc.) I think it's also not sending anything,
when stracing it (the sender) I see it recieved messages from
logger
but is not sending them anywhere. So I think that the collector
rsyslog continues to send some kind of signal that it's not
ready but
not sure why or how to figure out whether it actually does it.

   Any ideas how to troubleshoot this? I think I'll record more of
strace output and try to reconstruct RELP messages, I guess
there is
some RELP response that has something else than RSP-CODE = 200
(looking at http://www.rsyslog.com/doc/relp.html)

   I was just about to publish the S3 upload scripts but since they
cause rsyslog to misbehave this way I didn't do it yet, maybe I'll
post them as they are and fix later...

      erik

On 11/01/2013 01:02 AM, David Lang wrote:

  the purpose of RELP is to tell the senders if the receiver was
able to
accept the message or not, so that's working as designed.

When you do a full restart of rsyslog, it stops processing new
messages
and disconnects all senders (you are doing a full stop of
rsyslog and
then starting it from scratch, this takes time). My guess is
that the
senders are detecting 'too many failures' when trying to send
messages,
so they are backing off and sleeping for a while rather tthan
performing
a mini DOS attack on the network and server. Every third
restart is
probably triggering a threshold.

Instead of doing a full restart, just send the rsyslog daemon a
HUP
signal instead. That will tell rsyslog to flush and close all
files so
that you can rotate them (If you are doing compression on the
files, you
may need to sleep for a few seconds to let the fluch complete).
Rsyslog
can continue to receive new messages during this time, so your
senders
will not see an outage.

By the way, I'm needing a script to upload rsyslog archives to S3,
could
you send me a copy of yours? (remove any passwords first please
:-)

David Lang


    On Thu, 31 Oct 2013, Erik Steffl wrote:

   Date: Thu, 31 Oct 2013 18:44:23 -0700

From: Erik Steffl <[email protected]>
Reply-To: rsyslog-users <[email protected]>
To: rsyslog-users <[email protected]>
Subject: [rsyslog] Rsyslog with RELP not sending/receiving
messages
for long
      intervals

We have a fairly simple setup of 6 hosts sending syslog
messages to
one collector host, all of these run rsyslog 7.5.5-0adiscon2 from
adiscon repo and use RELP to transfer messages. There is also
load
balancer in front of the collector machine but I dont' think it
matters in this case.

Rsyslog on collector machine is configured to write to files,
switching to a new file every 15 minute, using config like this
(abbreviated a bit):

template(name="jsonFilename" type="list") {
   constant(value="/path/")
   property(name="$now")
   constant(value="/")
   property(name="$hour")
   constant(value="/")
   property(name="$qhour")
   constant(value="/")
   constant(value="log.json")
}

action(type="omfile" DynaFile="jsonFilename"
Template="jsonFormat")

   We run a script at every 2, 17, 32, 47 minute of the hour
and upload
the just finished file to S3. The uploading works like this:

   - let's say it's 3:02:00, rsyslog is writing to
/path/2013-10-10/03/00/log.json

   - get the filename log.json (anything that's not current,
usually
just one previous file which in the example would be
/path/2013-10-10/02/03/log.json)

   - rename /path/2013-10-10/02/03/log.json to
/path/2013-10-10/02/03/log.json.uploading.0

   - reload rsyslog (to make sure that even if for some reason
it was
writing to just renamed file it would close it and open a new
file)

   - upload /path/2013-10-10/02/03/log.json.uploading.0 to S3

   - remove /path/2013-10-10/02/03/log.json.uploading.0

Here's what happens every third run (yes, regularly EVERY
THIRD RUN)
of this script:

   - rsyslog stops writing to the CURRENT file
(/path/2013-10-10/03/00/log.json, the one that is NOT being
renamed)
few seconds into the run of the script (e.g. 3:02:04)

   - 6 hosts that were sending syslog messages to the log
collector
STOP
sending anything (as verified by stracing rsyslogd, tcpdump
and in
amazon AWS console metric for network in)

   - after this nothing is ever written into
/path/2013-10-10/03/00/log.json

   - the 6 clients start sending sysog messages again when the
next
file
is created (in this example it would be
/path/2013-10-10/03/01/log.json)

   I checked and double check the files, dates, verified that the
current file is not touched but can't figure out what's going
on. I
tried the script without reload rsyslog but it didn't make any
difference. If I don't run this script rsyslog works flawlessly.

   any ideas how to troubleshoot this? What could be causing the
rsyslog
to stop writing to the file and for the senders to stop sending
syslog
messages? I assume the rsyslog on the collector host somehow
signals
to the 6 hosts that send messages that it's not ready or
something...

   thanks!

      erik
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a
myriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT
POST
if you DON'T LIKE THAT.

   _______________________________________________

rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a
myriad
of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST
if you
DON'T LIKE THAT.


_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a
myriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO
NOT POST
if you DON'T LIKE THAT.

   _______________________________________________

rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a
myriad
of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST
if you
DON'T LIKE THAT.


_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a
myriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT
POST
if you DON'T LIKE THAT.

   _______________________________________________

rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a
myriad
of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you
DON'T LIKE THAT.


_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a
myriad
of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you
DON'T LIKE THAT.

  _______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a
myriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT
POST if you DON'T LIKE THAT.


_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a
myriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST
if you DON'T LIKE THAT.

  _______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad
of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you
DON'T LIKE THAT.



_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad
of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you
DON'T LIKE THAT.

_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of 
sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE 
THAT.


_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of 
sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE 
THAT.

Reply via email to