On 10/7/2017 10:44 AM, deoren wrote:
On 10/7/2017 5:25 AM, Rainer Gerhards wrote:
2017-10-07 7:57 GMT+02:00 deoren
<rsyslog-users-lists.adiscon....@whyaskwhy.org>:

As I dig more into this, I'm beginning to think the only thing the Qualys
scan did was aggravate an existing problem and cause rsyslog to tip over
more quickly.

While I didn't observe the problem at the time, I recently made some configuration changes to introduce a lookup table that maps known IPs to known hostnames (something I plan to build programmatically from a central db at some point in the future). I opted to try this in an effort to stop using $hostname directly for dynamic directory/file names, since it gives odd results with invalid messages.
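
For reference, the lookup is wired up with a lookup_table() statement along these lines (the file path here is a placeholder):

    lookup_table(name="ip_hostname_mappings"
                 file="/etc/rsyslog.d/tables/ip_hostname_mappings.json"
                 reloadOnHUP="on")

and the JSON file follows the usual rsyslog lookup table layout, roughly like this (entries trimmed and obscured; the mappings shown are illustrative stand-ins, though "unknown" is the real nomatch value I use):

    {
      "version": 1,
      "nomatch": "unknown",
      "type": "string",
      "table": [
        { "index": "192.168.5.142", "value": "vms2.example.com" },
        { "index": "192.168.5.143", "value": "vms3.example.com" }
      ]
    }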

Long story short, I found that systemd is now (without a Qualys scan actively running) restarting rsyslog after it dies within a minute or so of starting. My guess is that when a remote client sees the rsyslog receiver is back up it retries delivery, and when the receiver is handed the message it performs the lookup. At that point the real cause of the segfault is triggered and rsyslog goes down.

Looking at the output of the dmesg command I found many instances of
messages like this one:

[364581.440049] rs:main Q:Reg[82570]: segfault at 0 ip 00007faeb1fa55f4 sp 00007faeaac431d0 error 4 in libfastjson.so.4.1.0[7faeb1fa1000+a000]

I stopped rsyslog and then manually started it up like so:

rsyslogd -dn > rsyslog-debug.log 2>&1

This was printed to the console:

Segmentation fault (core dumped)

Here are the last 20 lines from the debug log file:

7587.842410212:main Q:Reg/w0  : ../action.c: action 'action 21': set suspended state to 0
7587.842412422:main Q:Reg/w0  : rainerscript.c:     STOP
7587.842418787:main Q:Reg/w0  : ruleset.c: processBATCH: next msg 44: <167>2017-10-07T01:15:37.701Z vms2.example.com Rhttpproxy: verbose rhttpproxy[79FC5B70] [Originator@6876 sub=Proxy Req 829
7587.842421471:main Q:Reg/w0  : rainerscript.c:     CALL [receiver-origin-hostname-lookup, queue:0]
7587.842427317:main Q:Reg/w0  : rainerscript.c:     IF
7587.842433062:main Q:Reg/w0  : rainerscript.c:         var 'fromhost-ip'
7587.842442260:main Q:Reg/w0  : rainerscript.c:       ==
7587.842449663:main Q:Reg/w0  : rainerscript.c:         string '127.0.0.1'
7587.842460890:main Q:Reg/w0  : rainerscript.c: eval expr 0x22147a0, type 'CMP_EQ'
7587.842463187:main Q:Reg/w0  : rainerscript.c: eval expr 0x22138e0, type 'V'
7587.842466060:main Q:Reg/w0  : rainerscript.c: rainerscript: (string) var 8: '192.168.5.142'
7587.842468406:main Q:Reg/w0  : rainerscript.c: eval expr 0x22138e0, return datatype 'S':0
7587.842470812:main Q:Reg/w0  : rainerscript.c: eval expr 0x22147a0, return datatype 'N':0
7587.842473056:main Q:Reg/w0  : ruleset.c: if condition result is 0
7587.842475315:main Q:Reg/w0  : rainerscript.c:     IF
7587.842480945:main Q:Reg/w0  : rainerscript.c:         var '!origin!hostname'
7587.842490085:main Q:Reg/w0  : rainerscript.c:       ==
7587.842497393:main Q:Reg/w0  : rainerscript.c:         string ''
7587.842508499:main Q:Reg/w0  : rainerscript.c: eval expr 0x22148f0, type 'CMP_EQ'
7587.842511262:main Q:Reg/w0  : rainerscript.c: eval expr 0x22146f0, type 'V'

I checked my JSON lookup table file and found a duplicate key (but with
different values). This was unintentional and was a result of a mistake when
building out the file (copy, paste, modify, oops!).
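
(To illustrate, the mistake amounted to two table entries sharing the same index with different values, something like this, with the IP and hostnames here being made-up stand-ins for the real pair:

    { "index": "192.168.5.150", "value": "web1.example.com" },
    { "index": "192.168.5.150", "value": "web2.example.com" }

)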

I was hopeful that fixing that would resolve the issue, but no such luck.

Just in case it isn't readable above, here is the same output plus the
ruleset I'm using:

https://gist.github.com/deoren/7e8e799428d0d87c4e467fd4add51d09

Anything stand out as obviously wrong?

I've gone back over the Expressions, Control Structures and Variable
(Property) types docs and it doesn't appear that I'm attempting to
access/set the variables improperly, but I could be missing something.

The lookup table feature was added in 8.17.0 as an experimental
feature. There seems to be at least one hard-to-find bug inside it,
see here:

https://github.com/rsyslog/rsyslog/issues/1071

Over the past few days, I have looked at the issue, as the original contributor seems to have lost interest in it. It sounds like what you are experiencing is related to this problem. It would be great if you could run rsyslog under valgrind and see if the problem persists and, if so, what valgrind reports. You can use valgrind simply by installing the distro-provided valgrind package and prefixing the command with "valgrind", like so:

$ valgrind rsyslogd -dn > rsyslog-debug.log 2>&1

This would be very helpful. It might also be very, very useful if you could add your findings to the above-mentioned GitHub issue. I am 99.999% sure it is the same root cause.

Rainer


Thanks in advance for your help.

Sure, I can do that.

I eventually got my configuration to a "stable" point where rsyslog was no longer crashing, though I don't yet fully understand why the changes worked.

In my current "stable" configuration I have the same ruleset linked to imudp (514/udp), imptcp (514/tcp) and imrelp (2514/tcp). This ruleset essentially does the following (by calling other rulesets as needed):

#1) Runs mmjsonparse

#2) Sets $.tag to $syslogtag and $.msg to $msg, but only if $!tag and $!msg (respectively) were not set; otherwise the local variables are set to their $! counterparts (e.g., $.msg to $!msg, and so on)

#3) Checks to see if '$!origin!hostname' is equal to "" and if so, performs the lookup. If it is not empty, then $.hostname is set to that value.

#4) A standard set of actions runs which records local copies of forwarded messages and then forwards the received messages on to others in the chain (test rsyslog instances and a Graylog instance).

This works fine now, even though the input on 514/udp and 514/tcp is always a standard syslog message (no embedded CEE JSON).
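
To make the flow above concrete, here is a rough, trimmed-down sketch of how the pieces fit together. The ruleset and input names are placeholders and the real config is more involved, but the ordering matches steps #1 through #4:

    module(load="imudp")
    module(load="imptcp")
    module(load="imrelp")
    module(load="mmjsonparse")

    ruleset(name="incoming-messages") {
        # 1) parse any @cee: payload into the $! tree
        action(type="mmjsonparse")

        # 2) prefer structured fields, fall back to the message properties
        if ($!tag == "") then { set $.tag = $syslogtag; } else { set $.tag = $!tag; }
        if ($!msg == "") then { set $.msg = $msg; }      else { set $.msg = $!msg; }

        # 3) resolve the originating hostname (ruleset shown further down)
        call receiver-origin-hostname-lookup

        # 4) the standard actions (local copies, forwarding to other
        #    receivers and Graylog) run after this point
    }

    input(type="imudp"  port="514"  ruleset="incoming-messages")
    input(type="imptcp" port="514"  ruleset="incoming-messages")
    input(type="imrelp" port="2514" ruleset="incoming-messages")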

Before binding the same ruleset to all three inputs, I had the 514/udp and 514/tcp inputs use a ruleset which immediately applied steps #3 and #4. Even applying steps #2 through #4 to those inputs wasn't enough; I had to have step #1 run as well.

I didn't have debug logging turned on anymore at that point, but it "appears" that there was an issue checking whether '$!origin!hostname' was blank when mmjsonparse was not called against the incoming message (which did not have a CEE payload included).

Perhaps this is in some way related to this issue?

https://github.com/rsyslog/rsyslog/issues/1283

I will try modifying my configuration to not even use the lookup table, but instead just perform this check (which appears to trigger the segfault):

   if ($!origin!hostname == "") then {

and then set a static value in place of the lookup table.
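
In other words, something along these lines, with a made-up static hostname standing in for the lookup result:

    if ($!origin!hostname == "") then {
        set $.hostname = "static-test-hostname";
        set $!origin!hostname = "static-test-hostname";
    }
    else {
        set $.hostname = $!origin!hostname;
    }

If that still segfaults, the lookup table itself should be off the hook and the suspect becomes the check on $!origin!hostname for messages that never went through mmjsonparse.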

On a likely unrelated note, one of our rsyslog "client" nodes is now unable to send messages to our main receiver (discussed above). I checked the forward queue and it is filling up (both in-memory and disk-assisted). I temporarily worked around the issue by forcing a restart of the daemon on the client, but I did not check to see if the messages were lost or eventually sent.
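
For context, the forwarding action on that client is essentially an omrelp action with a disk-assisted queue along these lines (the target name and parameter values here are approximations, not the exact config):

    action(type="omrelp"
           name="ForwardToReceiver1"
           target="receiver1.example.com"
           port="2514"
           queue.type="LinkedList"
           queue.filename="fwd_receiver1"
           queue.saveOnShutdown="on"
           action.resumeRetryCount="-1")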

Later on (as I noted a moment ago), the queue began filling up again. I checked netstat and lsof and found no active connections to the receiver.

I toggled debug mode (debug on demand) and found messages like these:

9085.173631038:ForwardToReceiver1 queue[DA]:Reg/w0: omrelp.c: relpTcpSend: sock 18, lenbuf 245, send returned 245 [errno 11]
9085.173634149:ForwardToReceiver1 queue[DA]:Reg/w0: omrelp.c: sendbuf NOT added to unacked list
9085.173638247:ForwardToReceiver1 queue[DA]:Reg/w0: omrelp.c: resending frame '16 syslog 273 <189>2017-09-27T07:01:53.077199-05:00 redmine1 redmine-prod @cee:{ "metadata": { "filename": "\/opt\/redmine\/log\/production.log" }, "tag": "redmine-prod", "msg": "REDMINE LOG CONTENT HERE" }
'
9085.173649270:ForwardToReceiver1 queue[DA]:Reg/w0: omrelp.c: relpTcpSend: sock 18, lenbuf 288, send returned 288 [errno 11]
9085.173681929:ForwardToReceiver1 queue[DA]:Reg/w0: omrelp.c: sendbuf NOT added to unacked list
9085.173686927:ForwardToReceiver1 queue[DA]:Reg/w0: omrelp.c: resending frame '17 syslog 560 <189>2017-09-27T07:01:53.274558-05:00 redmine1 redmine-prod @cee:{ "metadata": { "filename": "\/opt\/redmine\/log\/production.log" }, "tag": "redmine-prod", "msg": "REDMINE LOG CONTENT HERE" }
'
9085.173693759:ForwardToReceiver1 queue[DA]:Reg/w0: omrelp.c: relpTcpSend: sock 18, lenbuf 575, send returned -1 [errno 32]
9085.173697323:ForwardToReceiver1 queue[DA]:Reg/w0: omrelp.c: after TryReestablish, sess state 4

We experienced this earlier in the week when the segfaults were first triggered (though we didn't notice right away, because systemd was aggressively restarting the primary receiver whenever it crashed and the messages appeared to eventually sync up).

I suspect that this is a separate issue, but figured it was worth mentioning due to the context.

I managed to reproduce the segfault in a specific test environment. I will work to simplify the configuration used (the same, somewhat modular configuration is used on a large number of systems) so that I can share it for review.

The same end result occurs when running rsyslog in normal vs. debug mode; here is the valgrind output:

==18767== Thread 7 rs:main Q:Reg:
==18767== Invalid read of size 4
==18767== at 0x5885904: fjson_object_iter_begin (in /usr/lib/libfastjson.so.4.2.0)
==18767==    by 0x58835D2: ??? (in /usr/lib/libfastjson.so.4.2.0)
==18767== by 0x5883E48: fjson_object_object_add_ex (in /usr/lib/libfastjson.so.4.2.0)
==18767==    by 0x432BB7: ??? (in /usr/sbin/rsyslogd)
==18767==    by 0x432C23: ??? (in /usr/sbin/rsyslogd)
==18767==    by 0x4372FF: msgGetJSONPropJSONorString (in /usr/sbin/rsyslogd)
==18767==    by 0x42162D: cnfexprEval (in /usr/sbin/rsyslogd)
==18767==    by 0x4212AB: cnfexprEval (in /usr/sbin/rsyslogd)
==18767==    by 0x424105: cnfexprEvalBool (in /usr/sbin/rsyslogd)
==18767==    by 0x453A34: ??? (in /usr/sbin/rsyslogd)
==18767==    by 0x453A1A: ??? (in /usr/sbin/rsyslogd)
==18767==    by 0x453A1A: ??? (in /usr/sbin/rsyslogd)
==18767==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==18767==
==18767==
==18767== Process terminating with default action of signal 11 (SIGSEGV)
==18767==  Access not within mapped region at address 0x0
==18767==    at 0x5885904: fjson_object_iter_begin (in /usr/lib/libfastjson.so.4.2.0)
==18767==    by 0x58835D2: ??? (in /usr/lib/libfastjson.so.4.2.0)
==18767==    by 0x5883E48: fjson_object_object_add_ex (in /usr/lib/libfastjson.so.4.2.0)
==18767==    by 0x432BB7: ??? (in /usr/sbin/rsyslogd)
==18767==    by 0x432C23: ??? (in /usr/sbin/rsyslogd)
==18767==    by 0x4372FF: msgGetJSONPropJSONorString (in /usr/sbin/rsyslogd)
==18767==    by 0x42162D: cnfexprEval (in /usr/sbin/rsyslogd)
==18767==    by 0x4212AB: cnfexprEval (in /usr/sbin/rsyslogd)
==18767==    by 0x424105: cnfexprEvalBool (in /usr/sbin/rsyslogd)
==18767==    by 0x453A34: ??? (in /usr/sbin/rsyslogd)
==18767==    by 0x453A1A: ??? (in /usr/sbin/rsyslogd)
==18767==    by 0x453A1A: ??? (in /usr/sbin/rsyslogd)
==18767==  If you believe this happened as a result of a stack
==18767==  overflow in your program's main thread (unlikely but
==18767==  possible), you can try to increase the size of the
==18767==  main thread stack using the --main-stacksize= flag.
==18767==  The main thread stack size used in this run was 8388608.
==18767==
==18767== HEAP SUMMARY:
==18767==     in use at exit: 6,027,403 bytes in 5,445 blocks
==18767==   total heap usage: 12,202 allocs, 6,757 frees, 7,550,747 bytes allocated
==18767==
==18767== LEAK SUMMARY:
==18767==    definitely lost: 131,073 bytes in 1 blocks
==18767==    indirectly lost: 0 bytes in 0 blocks
==18767==      possibly lost: 3,456 bytes in 12 blocks
==18767==    still reachable: 5,892,874 bytes in 5,432 blocks
==18767==         suppressed: 0 bytes in 0 blocks
==18767== Rerun with --leak-check=full to see details of leaked memory
==18767==
==18767== For counts of detected and suppressed errors, rerun with: -v
==18767== Use --track-origins=yes to see where uninitialised values come from
==18767== ERROR SUMMARY: 3 errors from 2 contexts (suppressed: 0 from 0)
==18767== could not unlink /tmp/vgdb-pipe-from-vgdb-to-18767-by-root-on-???
==18767== could not unlink /tmp/vgdb-pipe-to-vgdb-from-18767-by-root-on-???
==18767== could not unlink /tmp/vgdb-pipe-shared-mem-vgdb-18767-by-root-on-???

As I may have mentioned already, making sure to run mmjsonparse before calling this ruleset works around the problem:

ruleset(name="receiver-origin-hostname-lookup") {

    ###########################################################################
    # Use lookup table to map client syslog IP to hard-coded hostnames.
    #
    # This is done in an effort to avoid spoofing attempts by misconfigured
    # systems or security scans that place invalid data in the hostname
    # field of incoming syslog messages. At some future date the plan is to
    # generate the lookup table programmatically from a central database
    # instead of maintaining the file by hand (where it is more likely to
    # contain outdated information).
    #
    # Once we obtain this value, it needs to be used on the original remote
    # receiver and any later in the chain.
    ###########################################################################

    # If this ruleset is called on one of our receiver nodes for messages
    # generated locally, we will use the $$myhostname value instead of trying
    # to perform a lookup against the mapping table.
    if ($fromhost-ip == "127.0.0.1") then {
        set $.hostname = $$myhostname;
    }

    else {
        # Try to determine if this receiver node is the first recipient in the
        # chain. If so (empty value), then perform a lookup to get the hostname
        # value from the mapping table.
        if ($!origin!hostname == "") then {

            set $.search_result = lookup("ip_hostname_mappings", $fromhost-ip);

            if ($.search_result == "unknown") then {

                # If there isn't a match against the syslog client IP then use
                # the syslog client IP as the hostname value.
                set $.hostname = $fromhost-ip;
            }
            else {

                # Use the hostname value returned when searching by the syslog
                # client IP. This value is set for a local variable that will
                # be used in this and any called rulesets and then in the $!
                # object for use by remote receivers (JSON payload).
                set $.hostname = $.search_result;
                set $!origin!hostname = $.search_result;

                # Save this value for later reference in Graylog or other
                # structured output targets (e.g., database).
                set $!origin!ipaddr = $fromhost-ip;

            }
        }

        # If set, this node is likely a downstream receiver from the primary
        # and should reference the value set on the primary receiver instead
        # of attempting another lookup, which would credit the syslog client
        # IP as the original receiver.
        else {

            set $.hostname = $!origin!hostname;

        }
    }
}

I've posted all of this to the GitHub issue tracker here:
https://github.com/rsyslog/rsyslog/issues/1071#issuecomment-338828757

My next step is to review the debug log output created by running the following command and remove any sensitive information from it:

valgrind rsyslogd -dn > rsyslog-debug.log 2>&1

One thing I've not yet mentioned on the GitHub issue is that I reproduced the issue by restarting the syslog daemon with either ESXi 5.5.0 or 6.0.0 (latest patch levels of each). I'll try to run tcpdump to capture the incoming syslog message and will provide that if I can.
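
For the capture I'm planning something simple along these lines (interface and output filename are whatever fits the box):

tcpdump -i eth0 -s 0 -w esxi-syslog.pcap '(udp port 514) or (tcp port 514)'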