Hello Rainer,

I'm seeing the same behavior with the omrelp module. I've configured
omrelp to point at a non-routable IP address:
module(load="omrelp")
local0.* action(type="omrelp"
    target="10.0.0.1"
    port="2514"
    name="debug_relp"
    queue.filename="relp_queue"
    queue.maxdiskspace="1G"
    queue.saveonshutdown="on"
    queue.spoolDirectory="/var/spool/rsyslog"
    queue.type="Disk"
)
Then I try to log a line:
[root@314959ac8cb0 prism]# logger -p local0.info test
[root@314959ac8cb0 prism]# ls /var/spool/rsyslog/
relp_queue.00000001
# I shut down rsyslog at this point
[root@314959ac8cb0 prism]# ls /var/spool/rsyslog/
# after shutdown the queue file is deleted and the message is lost
[root@314959ac8cb0 prism]#

Here is the rsyslog debug log: omrelp starts a transaction, then I shut
down rsyslog, and it seems the queue file is deleted at some point during
shutdown:

8184.798028612:debug_relp queue:Reg/w0: omrelp: beginTransaction
^C8194.245122305:main thread    : DoDie called.
8194.245301525:main thread    : exiting on signal 2
8194.245532709:main thread    : Terminating input threads...
8194.245873683:main thread    : request term via SIGTTIN for input thread
'imuxsock' 0xcb767700
8194.246079748:imuxsock.c     : thrdStarter: usrThrdMain imuxsock -
0x7f9bcb767700 returned with iRet 0, exiting now.
8194.247556671:main thread    : non-cancel input thread termination
succeeded for thread imuxsock 0xcb767700
8194.249357547:imuxsock.c     : destructor for debug call stack
0x7f9bbc0008c0 called
8194.249784271:main thread    : imuxsock: unlinking unix socket file[0]
/dev/log
8194.250043368:main thread    : main Q: qqueueAdd: entry added, size now
log 1, phys 1 entries
8194.250321193:main thread    : main Q: EnqueueMsg advised worker start
8194.250493505:main Q:Reg/w0  : wti 0x1c13bb0: worker awoke from idle
processing
8194.250735988:main Q:Reg/w0  : DeleteProcessedBatch: we deleted 0 objects
and enqueued 0 objects
8194.250959294:main Q:Reg/w0  : doDeleteBatch: delete batch from store, new
sizes: log 1, phys 1
8194.251913535:main Q:Reg/w0  : processBATCH: batch of 1 elements must be
processed
8194.252193683:main Q:Reg/w0  : processBATCH: next msg 0:  [origin
software="rsyslogd" swVersion="8.18.0.master" x-pid="30298"
x-info="http://www.rsyslog.com"] exiting on signal 2.
8194.252377044:main Q:Reg/w0  :     PRIFILT 'local0.*'
8194.252669683:main Q:Reg/w0  :     pmask:  X  X  X  X  X  X  X  X  X  X  X
 X  X  X  X  X FF  X  X  X  X  X  X  X  X  X
8194.252898063:main Q:Reg/w0  : PRIFILT condition result is 0
8194.252902083:main Q:Reg/w0  :     PRIFILT
'*.info;mail.none;authpriv.none;cron.none'
8194.252908847:main Q:Reg/w0  :     pmask: 7F 7F  X 7F 7F 7F 7F 7F 7F  X  X
7F 7F 7F 7F 7F 7F 7F 7F 7F 7F 7F 7F 7F 7F 7F
8194.252956209:main Q:Reg/w0  : PRIFILT condition result is 1
8194.252959849:main Q:Reg/w0  :     ACTION 1
[builtin:omfile:/var/log/messages]
8194.252967510:main Q:Reg/w0  : executing action 1
8194.252971541:main Q:Reg/w0  : action 'action 1': called, logging to
builtin:omfile (susp 0/0, direct q 1)
8194.252977583:main Q:Reg/w0  : action 'action 1': is transactional -
executing in commit phase
8194.252982638:main Q:Reg/w0  : Action 1 transitioned to state: itx
8194.252986300:main Q:Reg/w0  : action 'action 1': set suspended state to 0
8194.252989729:main Q:Reg/w0  :     PRIFILT 'authpriv.*'
8194.252996951:main Q:Reg/w0  :     pmask:  X  X  X  X  X  X  X  X  X  X FF
 X  X  X  X  X  X  X  X  X  X  X  X  X  X  X
8194.253051858:main Q:Reg/w0  : PRIFILT condition result is 0
8194.253056228:main Q:Reg/w0  :     PRIFILT 'mail.*'
8194.253065299:main Q:Reg/w0  :     pmask:  X  X FF  X  X  X  X  X  X  X  X
 X  X  X  X  X  X  X  X  X  X  X  X  X  X  X
8194.253407472:main Q:Reg/w0  : PRIFILT condition result is 0
8194.253412679:main Q:Reg/w0  :     PRIFILT 'cron.*'
8194.253422055:main Q:Reg/w0  :     pmask:  X  X  X  X  X  X  X  X  X FF  X
 X  X  X  X  X  X  X  X  X  X  X  X  X  X  X
8194.253493551:main Q:Reg/w0  : PRIFILT condition result is 0
8194.253498038:main Q:Reg/w0  :     PRIFILT '*.emerg'
8194.253507065:main Q:Reg/w0  :     pmask:  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
8194.253580307:main Q:Reg/w0  : PRIFILT condition result is 0
8194.253584400:main Q:Reg/w0  :     PRIFILT 'uucp,news.crit'
8194.253593442:main Q:Reg/w0  :     pmask:  X  X  X  X  X  X  X  7  7  X  X
 X  X  X  X  X  X  X  X  X  X  X  X  X  X  X
8194.253664984:main Q:Reg/w0  : PRIFILT condition result is 0
8194.253669184:main Q:Reg/w0  :     PRIFILT 'local7.*'
8194.253678130:main Q:Reg/w0  :     pmask:  X  X  X  X  X  X  X  X  X  X  X
 X  X  X  X  X  X  X  X  X  X  X  X FF  X  X
8194.254445021:main Q:Reg/w0  : PRIFILT condition result is 0
8194.254450651:main Q:Reg/w0  : END batch execution phase, entering to
commit phase [processed 1 of 1 messages]
8194.254456257:main Q:Reg/w0  : actionCommitAll: action 1, state 1, nbr to
commit 0 isTransactional 1
8194.254461646:main Q:Reg/w0  : doTransaction: have commitTransaction IF,
using that, pWrkrInfo 0x1c13cf0
8194.254466238:main Q:Reg/w0  : entering actionCallCommitTransaction(),
state: itx, actionNbr 1, nMsgs 1
8194.254471876:main Q:Reg/w0  : omfile: write to stream, pData->pStrm
0x7f9bc4002480, lenBuf 161, strt data Mar 28 23:36:34 314959ac8cb0
rsyslogd: [origin software="rsyslogd" swVersion="8.18.0.master"
x-pid="30298" x-info="http://www.rs
8194.254478237:main Q:Reg/w0  : strm 0x7f9bc4002480: file 4(messages)
flush, buflen 161
8194.254483210:main Q:Reg/w0  : strmPhysWrite, stream 0x7f9bc4002480, len
161
8194.254500329:main Q:Reg/w0  : strm 0x7f9bc4002480: file 4 write wrote 161
bytes
8194.254506254:main Q:Reg/w0  : Action 1 transitioned to state: rdy
8194.254511506:main Q:Reg/w0  : Action 1 transitioned to state: itx
8194.254515961:main Q:Reg/w0  : Action 1 transitioned to state: rdy
8194.254520243:main Q:Reg/w0  : actionCommit, in retry loop, iRet 0
8194.254524679:main Q:Reg/w0  : processBATCH: batch of 1 elements has been
processed
8194.254529731:main Q:Reg/w0  : regular consumer finished, iret=0, szlog 0
sz phys 1
8194.254536104:main Q:Reg/w0  : DeleteProcessedBatch: we deleted 1 objects
and enqueued 0 objects
8194.254540861:main Q:Reg/w0  : doDeleteBatch: delete batch from store, new
sizes: log 0, phys 0
8194.254545792:main Q:Reg/w0  : regular consumer finished, iret=4, szlog 0
sz phys 0
8194.254550314:main Q:Reg/w0  : main Q:Reg/w0: worker IDLE, waiting for
work.
8194.254556418:main thread    : Terminating main queue...
8194.254560776:main thread    : main Q: initiating worker thread shutdown
sequence
8194.254564653:main thread    : main Q: trying shutdown of regular workers
8194.254572961:main thread    : sent SIGTTIN to worker thread 0xcaf66700
8194.254577260:main thread    : main Q:Reg: waiting 1499ms on worker thread
termination, 1 still running
8194.254585507:main Q:Reg/w0  : wti 0x1c13bb0: worker awoke from idle
processing
8194.254589465:main Q:Reg/w0  : DeleteProcessedBatch: we deleted 0 objects
and enqueued 0 objects
8194.254593070:main Q:Reg/w0  : doDeleteBatch: delete batch from store, new
sizes: log 0, phys 0
8194.254596813:main Q:Reg/w0  : regular consumer finished, iret=4, szlog 0
sz phys 0
8194.254600412:main Q:Reg/w0  : wti 0x1c13bb0: terminating worker
terminateRet=5, bInactivityTOOccured=0
8194.254604469:main Q:Reg/w0  : DDDD: wti 0x1c13bb0: worker cleanup action
instances
8194.254608268:main Q:Reg/w0  : wti 0x1c13bb0, action 0, ptr (nil)
8194.254611752:main Q:Reg/w0  : wti 0x1c13bb0, action 1, ptr 0x7f9bc40022a0
8194.254616114:main Q:Reg/w0  : wti 0x1c13bb0, action 2, ptr (nil)
8194.254619450:main Q:Reg/w0  : wti 0x1c13bb0, action 3, ptr (nil)
8194.254622767:main Q:Reg/w0  : wti 0x1c13bb0, action 4, ptr (nil)
8194.254626075:main Q:Reg/w0  : wti 0x1c13bb0, action 5, ptr (nil)
8194.254629392:main Q:Reg/w0  : wti 0x1c13bb0, action 6, ptr (nil)
8194.254632707:main Q:Reg/w0  : wti 0x1c13bb0, action 7, ptr (nil)
8194.254636124:main Q:Reg/w0  : wti 0x1c13bb0: worker exiting
8194.254639920:main Q:Reg/w0  : main Q:Reg: Worker thread 1c13bb0,
terminated, num workers now 0
8194.254656279:main thread    : main Q: regular queue workers shut down.
8194.254656402:main thread    : main Q: checking to see if we need to
cancel any worker threads of the primary queue
8194.254656524:main thread    : main Q: worker threads terminated,
remaining queue size log 0, phys 0.
8194.254656647:main thread    : main Q: queue (type 0) will lose 0
messages, destroying...
8194.255021761:main Q:Reg/w0  : destructor for debug call stack
0x7f9bc40008c0 called
8194.255069521:main thread    : Terminating outputs...
8194.255084818:main thread    : destructing ruleset 0x1bf6150, name
0x1bf6190
8194.255089525:main thread    : debug_relp queue: initiating worker thread
shutdown sequence
8194.255093230:main thread    : debug_relp queue: trying shutdown of
regular workers
8194.255099210:main thread    : sent SIGTTIN to worker thread 0xca765700
8194.255119598:main thread    : debug_relp queue:Reg: waiting 0ms on worker
thread termination, 1 still running
8194.255126357:debug_relp queue:Reg/w0: end relpSessConnect, iRet 10014
8194.255132022:debug_relp queue:Reg/w0: Action 0 transitioned to state: rtry
8194.255135551:debug_relp queue:Reg/w0: doTransaction: action 0, currIParam
1
8194.255139603:debug_relp queue:Reg/w0: actionDoRetry: debug_relp enter
loop, iRetries=0
8194.255210939:main thread    : debug_relp queue:Reg: timeout waiting on
worker thread termination
8194.255211086:main thread    : sent SIGTTIN to worker thread 0xca765700
8194.255211257:main thread    : debug_relp queue: regular shutdown timed
out on primary queue (this is OK)
8194.255211429:main thread    : debug_relp queue: trying to shutdown
workers within Action Timeoutsetting EnqOnly mode
8194.255215492:main thread    : debug_relp queue: trying immediate shutdown
of regular workers (if any)
8194.255221059:main thread    : sent SIGTTIN to worker thread 0xca765700
8194.255226792:main thread    : debug_relp queue:Reg: waiting 1000ms on
worker thread termination, 1 still running
8194.255232713:debug_relp queue:Reg/w0: end relpSessConnect, iRet 10014
8194.255237080:debug_relp queue:Reg/w0: actionDoRetry: debug_relp
action->tryResume returned -2007
8194.255240684:debug_relp queue:Reg/w0: actionDoRetry: debug_relp check for
max retries, iResumeRetryCount 0, iRetries 0
8194.255244459:debug_relp queue:Reg/w0: Action 0 transitioned to state: susp
8194.255257975:debug_relp queue:Reg/w0: Called LogMsg, msg: action
'debug_relp' suspended, next retry is Mon Mar 28 23:37:04 2016
8194.255266857:debug_relp queue:Reg/w0: submitMsg2() could not submit
message - queue does (no longer?) exist - ignored
8194.255271065:debug_relp queue:Reg/w0: action 'debug_relp' suspended,
earliest retry=1459208224 (now 1459208194), iNbrResRtry 0, duration 30
8194.255275244:debug_relp queue:Reg/w0: actionTryResume: action 0x1c09af0
state: susp, next retry (if applicable): 1459208224 [now 1459208194]
8194.255278825:debug_relp queue:Reg/w0: actionCommit, in retry loop, iRet
-2123
8194.255282409:debug_relp queue:Reg/w0: regular consumer finished,
iret=-2123, szlog 0 sz phys 1
8194.255287286:debug_relp queue:Reg/w0: DeleteProcessedBatch: we deleted 1
objects and enqueued 0 objects
8194.255290952:debug_relp queue:Reg/w0: doDeleteBatch: delete batch from
store, new sizes: log 0, phys 0
8194.255295204:debug_relp queue:Reg/w0: wti 0x1c12880: terminating worker
because of TERMINATE_NOW mode, del iRet 0
8194.255298822:debug_relp queue:Reg/w0: DDDD: wti 0x1c12880: worker cleanup
action instances
8194.255302306:debug_relp queue:Reg/w0: wti 0x1c12880, action 0, ptr
0x7f9bc00020d0
8194.255305929:debug_relp queue:Reg/w0: relp engine destructing client
0x7f9bc00027e0
8194.255309843:debug_relp queue:Reg/w0: wti 0x1c12880, action 1, ptr (nil)
8194.255313226:debug_relp queue:Reg/w0: wti 0x1c12880, action 2, ptr (nil)
8194.255316479:debug_relp queue:Reg/w0: wti 0x1c12880, action 3, ptr (nil)
8194.255319704:debug_relp queue:Reg/w0: wti 0x1c12880, action 4, ptr (nil)
8194.255322934:debug_relp queue:Reg/w0: wti 0x1c12880, action 5, ptr (nil)
8194.255326175:debug_relp queue:Reg/w0: wti 0x1c12880, action 6, ptr (nil)
8194.255329409:debug_relp queue:Reg/w0: wti 0x1c12880, action 7, ptr (nil)
8194.255332655:debug_relp queue:Reg/w0: wti 0x1c12880: worker exiting
8194.255336231:debug_relp queue:Reg/w0: debug_relp queue:Reg: Worker thread
1c12880, terminated, num workers now 0
8194.255344540:debug_relp queue:Reg/w0: destructor for debug call stack
0x7f9bc00008c0 called
8194.255371534:main thread    : debug_relp queue: checking to see if we
need to cancel any worker threads of the primary queue
8194.255376798:main thread    : debug_relp queue: worker threads
terminated, remaining queue size log 0, phys 0.
8194.255381256:main thread    : debug_relp queue: persisting queue to disk,
0 entries...
8194.255385525:main thread    : strm 0x1c024c0: file 5(relp_queue) closing
8194.255389222:main thread    : strm 0x1c024c0: file 5(relp_queue) flush,
buflen 0 (no need to flush)
8194.255398716:main thread    : strm 0x1c02760: file 6(relp_queue) closing
8194.255404785:main thread    : strm 0x1c02a00: file -1(relp_queue) closing
8194.255409145:main thread    : strmCloseFile: deleting
'/var/spool/rsyslog/relp_queue.00000001'
8194.255460812:main thread    : strm 0x7f9bc4002480: file 4(messages)
closing
8194.255466309:main thread    : strm 0x7f9bc4002480: file 4(messages)
flush, buflen 0 (no need to flush)
8194.255479325:main thread    : all primary multi-thread sources have been
terminated - now doing aux cleanup...
8194.255482809:main thread    : destructing current config...
8194.255486662:main thread    : calling freeCnf(0x1bf41e0) for module
'builtin:omfile'
8194.255490098:main thread    : calling freeCnf(0x1bf43a0) for module
'builtin:ompipe'
8194.255493809:main thread    : calling freeCnf(0x1bf53a0) for module
'builtin:omfwd'
8194.255497793:main thread    : calling freeCnf(0x1c04b00) for module
'imuxsock'
8194.255501565:main thread    : calling freeCnf(0x1c05c70) for module
'imklog'
8194.255543579:main thread    : destructing parser 'rsyslog.rfc5424'
8194.255547468:main thread    : destructing parser 'rsyslog.rfc3164'
8194.255550784:main thread    : pmrfc3164: free parser instance 0x1bf58b0
8194.255559659:main thread    : file rsyslogd.c released module 'lmnet',
reference count now 2
8194.255563470:main thread    : module lmnet NOT unloaded because it still
has a refcount of 2
8194.255566676:main thread    : Unloading module builtin:omfile
8194.255570420:main thread    : module lmnet NOT unloaded because it still
has a refcount of 2
8194.255573674:main thread    : Unloading module builtin:ompipe
8194.255576857:main thread    : module lmnet NOT unloaded because it still
has a refcount of 2
8194.255579938:main thread    : Unloading module builtin-shell
8194.255583060:main thread    : module lmnet NOT unloaded because it still
has a refcount of 2
8194.255586144:main thread    : Unloading module builtin:omdiscard
8194.255589509:main thread    : module lmnet NOT unloaded because it still
has a refcount of 2
8194.255592658:main thread    : Unloading module builtin:omfwd
8194.255596441:main thread    : file omfwd.c released module 'lmnet',
reference count now 1
8194.255599981:main thread    : module lmnet NOT unloaded because it still
has a refcount of 1
8194.255603068:main thread    : Unloading module builtin:omusrmsg
8194.255606240:main thread    : module lmnet NOT unloaded because it still
has a refcount of 1
8194.255609287:main thread    : Unloading module builtin:pmrfc5424
8194.255612755:main thread    : module lmnet NOT unloaded because it still
has a refcount of 1
8194.255615826:main thread    : Unloading module builtin:pmrfc3164
8194.255619252:main thread    : module lmnet NOT unloaded because it still
has a refcount of 1
8194.255622315:main thread    : Unloading module builtin:smfile
8194.255625406:main thread    : module lmnet NOT unloaded because it still
has a refcount of 1
8194.255628486:main thread    : Unloading module builtin:smtradfile
8194.255631753:main thread    : module lmnet NOT unloaded because it still
has a refcount of 1
8194.255634819:main thread    : Unloading module builtin:smfwd
8194.255637916:main thread    : module lmnet NOT unloaded because it still
has a refcount of 1
8194.255640987:main thread    : Unloading module builtin:smtradfwd
8194.255644187:main thread    : module lmnet NOT unloaded because it still
has a refcount of 1
8194.255647267:main thread    : Unloading module imuxsock
8194.255676232:main thread    : module lmnet NOT unloaded because it still
has a refcount of 1
8194.255680751:main thread    : Unloading module imklog
8194.255694428:main thread    : module lmnet NOT unloaded because it still
has a refcount of 1
8194.255698085:main thread    : Unloading module omrelp
8194.255750953:main thread    : module lmnet NOT unloaded because it still
has a refcount of 1
8194.255755236:main thread    : module lmregexp NOT unloaded because it
still has a refcount of 1
8194.255759705:main thread    : file conf.c released module 'lmnet',
reference count now 0
8194.255762875:main thread    : module 'lmnet' has zero reference count,
unloading...
8194.255765901:main thread    : Unloading module lmnet
8194.255780970:main thread    : rsyslog runtime de-initialized, current
users 0
8194.255784704:main thread    : Clean shutdown completed, bye
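
For reference, here is the config variant that I understand (from the
rsyslog queue parameter docs; untested by me so far) would force batches
of one on this action, per the batches-of-one suggestion in the thread
below:

```
local0.* action(type="omrelp"
    target="10.0.0.1"
    port="2514"
    name="debug_relp"
    queue.type="Disk"
    queue.filename="relp_queue"
    queue.spoolDirectory="/var/spool/rsyslog"
    queue.maxDiskSpace="1G"
    queue.saveOnShutdown="on"
    # force batches of one, per the discussion below
    queue.dequeueBatchSize="1"
)
```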



On Sun, Mar 13, 2016 at 9:30 PM, David Lang <[email protected]> wrote:

> On Sun, 13 Mar 2016, Kane Kim wrote:
>
>> +1 to David. Having the module handle the errors is not 100% reliable:
>> what if the module dies before having a chance to save the error log?
>> Those messages will be lost.
>>
>
> well, the module is a thread of rsyslog, so if it dies, rsyslog has died.
>
> But if you are doing large batches (hundreds or thousands), putting all of
> them in a 'bad messages' file because one has a problem seems like the
> wrong thing to do.
>
> given that the error is not detected until the batch is being flushed for
> the output, any outputs that have already flushed their batches can't be
> changed, so the question is whether it's worse to have messages resent to
> those outputs as the batch gets re-processed, or to have the messages go
> to a "couldn't deliver" file.
>
> It's bad enough if the problem is a corrupt message that can never be
> delivered and the error is a permanent one, but if the problem is instead a
> temporary error (server is down), you will have a LOT of messages going
> into this 'couldn't deliver' bucket, when what should really be happening
> is that rsyslog should be pausing in the processing of messages, retrying
> after the suspended timeout.
>
> David Lang
>
>
> On Mar 13, 2016 9:07 PM, "David Lang" <[email protected]> wrote:
>>
>> On Sun, 13 Mar 2016, Rainer Gerhards wrote:
>>>
>>> Rainer, can you explain what happens when a batch fails on an action?
>>>
>>>>
>>>>>
>>>> It depends on when exactly what happens, and on whether we have
>>>> batches of one or batches of many. For batches of one, the failure is
>>>> immediately detected and acted upon. For batches of many, all action
>>>> strings are stored temporarily. During that time, it is assumed that
>>>> the messages could be processed. There is no way around this
>>>> assumption, because we now run each message completely through the
>>>> rule engine before the next one begins to be processed, which means
>>>> that other parts of the config have already acted on a message, which
>>>> means we cannot undo that state. If it is necessary to act on the
>>>> "real" state, we need to process batches of one. There simply is no
>>>> other solution possible. So let's stay with batches of many. Once the
>>>> batch is completed, the temporary strings are passed to the output's
>>>> commitTransaction entry point. It is currently assumed that the
>>>> output handles this case intelligently, and as it looks most do this
>>>> well (no regression tests failed, no bad reports since we introduced
>>>> this with the initial v8 version, 15+ months ago). If the action
>>>> fails, status is reflected accordingly. Also, the core engine
>>>> *should* (but obviously does not support this right now) write an
>>>> error file with the messages in error (just like omelasticsearch
>>>> does, which has proven to be quite good).
>>>>
>>>>
>>> well, it's not in the specs that modules should have local error files
>>> the way omelasticsearch does.
>>>
>>> What I would expect if there is a problem when committing the batch is
>>> that the entire batch would be marked as not delivered and retried with
>>> half of it.
>>>
>>> This will cause messages to other outputs working from the same queue to
>>> get duplicate messages, but it's better than losing an entire batch of
>>> messages because one is corrupt in some way, or there is a temporary
>>> problem with the output.
>>>
>>> David Lang
>>>
>>>> I have managed to look at least briefly at the commitTransaction core
>>>> engine code. I think we can even improve it and may get some ability
>>>> to do retries if the action itself does not handle that. At the
>>>> minimum, it should have the ability (which was intended originally,
>>>> but pushed away by more urgent work) to write that error file.
>>>>
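
(A minimal sketch of that error-file idea, in the spirit of what
omelasticsearch does. The function name and file format here are my
assumptions for illustration, not the actual rsyslog interface.)

```c
/* Sketch: when a batch ultimately cannot be delivered, append the raw
 * messages to an error file so they are not silently lost. Name and
 * line-per-message format are illustrative assumptions. */
#include <stdio.h>

/* returns the number of messages written, or -1 if the file can't open */
int write_error_file(const char *path, const char **msgs, int n)
{
    FILE *f = fopen(path, "a");  /* append: keep earlier failures too */
    if (f == NULL)
        return -1;
    int written = 0;
    for (int i = 0; i < n; i++)
        if (fprintf(f, "%s\n", msgs[i]) >= 0)
            written++;
    fclose(f);
    return written;
}
```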
>>>> I am very grateful that Kane is working on a repro with the standard
>>>> modules. Once we have that, I can go deeper into the code and check
>>>> what I am thinking about. It might be possible that we cannot manage
>>>> to repro this with the standard modules. Then we need to think about a
>>>> way via the testing module to do the same thing.
>>>>
>>>> I will try to address this within the 8.18 timeframe, but cannot
>>>> totally commit to it. There has been a lot of work going on, and
>>>> especially all that CI testing has taken up lots of work.
>>>> Nevertheless, I've been able to iron out quite some nits, so it's
>>>> probably worth it. I would still like to continue to get the CI
>>>> environment into better shape, because now I know what I am doing. If
>>>> I stop again and let it go for a couple of weeks, I would lose a lot
>>>> of momentum. I also have to do some work for support customers, which
>>>> keeps me able to work on rsyslog in general, so this, too, has
>>>> priority (and validly so, because without support customers I would
>>>> be totally unable to work on rsyslog other than as a hobby).
>>>>
>>>> I hope this explains the current state.
>>>> Rainer
>>>>
>>>>
>>>>> Kane, can you try and duplicate your problem using standard modules?
>>>>> I'm thinking that if you define an omrelp action that doesn't go
>>>>> anywhere, it should return errors similar to what you are trying to
>>>>> do, so you should be able to duplicate the problem that way.
>>>>>
>>>>> David Lang
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, 23 Feb 2016, Kane Kim wrote:
>>>>>
>>>>>> Looking at the omkafka module source code, it seems that it relies
>>>>>> on rsyslog retries in DoAction, returning RS_RET_SUSPENDED:
>>>>>>
>>>>>> DBGPRINTF("omkafka: writeKafka returned %d\n", iRet);
>>>>>> if(iRet != RS_RET_OK) {
>>>>>>   iRet = RS_RET_SUSPENDED;
>>>>>> }
>>>>>>
>>>>>> I've tried similar code in DoAction; it seems that action
>>>>>> processing is suspended at that point and messages are not retried
>>>>>> to the module.
>>>>>>
>>>>>>
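
(To make the pattern concrete: a minimal, self-contained sketch of the
suspend-on-failure contract described above. The types and constants are
simplified stand-ins, not the real rsyslog.h definitions, and the numeric
value of RS_RET_SUSPENDED is illustrative.)

```c
/* Simplified stand-ins for rsyslog's return-code machinery */
typedef int rsRetVal;
enum { RS_RET_OK = 0, RS_RET_SUSPENDED = -2007 };

/* stand-in for the module's transport call (writeKafka, RELP send, ...);
 * returns a nonzero transport error when the backend is unreachable */
static rsRetVal write_to_backend(const char *msg, int backend_up)
{
    (void)msg;
    return backend_up ? RS_RET_OK : -1;
}

/* The omkafka-style doAction: any transport error is collapsed into
 * RS_RET_SUSPENDED, asking the core to suspend the action and retry the
 * message later instead of discarding it. */
rsRetVal do_action(const char *msg, int backend_up)
{
    rsRetVal iRet = write_to_backend(msg, backend_up);
    if (iRet != RS_RET_OK)
        iRet = RS_RET_SUSPENDED;
    return iRet;
}
```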
>>>>>> On Tue, Feb 23, 2016 at 9:06 AM, Kane Kim <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks for your help Rainer. I'll try to debug what's going on; so
>>>>>>> far it seems rsyslog doesn't retry even with batch size 1. It
>>>>>>> doesn't retry if I return RS_RET_SUSPENDED from DoAction either.
>>>>>>>
>>>>>>> On Tue, Feb 23, 2016 at 9:05 AM, Rainer Gerhards
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>> 2016-02-23 18:03 GMT+01:00 David Lang <[email protected]>:
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> On Tue, 23 Feb 2016, Rainer Gerhards wrote:
>>>>>>>>>
>>>>>>>>> 2016-02-23 17:38 GMT+01:00 Kane Kim <[email protected]>:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Hello Rainer, thanks for the prompt reply! To give you some
>>>>>>>>>>> context: I want to write a module that both uses batching and
>>>>>>>>>>> also can't lose messages under any circumstances. Are you
>>>>>>>>>>> saying it is by design that rsyslog can't do both together?
>>>>>>>>>>> According to the documentation, rsyslog will retry if the
>>>>>>>>>>> module returns any error. Do you plan to fix this in rsyslog,
>>>>>>>>>>> or update the documentation to say that batching and retries
>>>>>>>>>>> don't work together?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> It depends on many things. In almost all cases, the retry
>>>>>>>>>> should work well (and does so in practice). Unfortunately, I am
>>>>>>>>>> pretty swamped. I need to go to a conference tomorrow and have
>>>>>>>>>> had quite some unexpected work today. It would probably be good
>>>>>>>>>> if you could ping me next week to see if we can look in more
>>>>>>>>>> detail at what is causing you pain. But I can't guarantee that
>>>>>>>>>> I will be available early next week.
>>>>>>>>>>
>>>>>>>>>> In general, we cannot handle a fatal error here from an engine
>>>>>>>>>> PoV, because everything is already processed and we no longer
>>>>>>>>>> have the original messages. This is simply needed if you want
>>>>>>>>>> to process messages one after another through the full config
>>>>>>>>>> (a goal for v8 that was muuuuch requested). As I said, the
>>>>>>>>>> solution is to use batches of one, because otherwise we would
>>>>>>>>>> really need to turn back time and undo everything that was
>>>>>>>>>> already done on the messages in question by other modules
>>>>>>>>>> (including state advances).
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I thought that if a batch failed, it pushed all the messages
>>>>>>>>> back on the queue and retried with a half-size batch until it
>>>>>>>>> got to the one message that could not be processed, and only did
>>>>>>>>> a fatal fail on that message.
>>>>>>>>>
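
(To make that halving strategy concrete, a self-contained sketch follows;
commit_fn and toy_commit are illustrative stand-ins, not actual rsyslog
interfaces.)

```c
/* Sketch of the half-size-batch retry: on commit failure, split the
 * batch and retry each half, recursing until the single undeliverable
 * message is isolated. commit_fn stands in for an output's
 * commitTransaction entry point. */
#include <stddef.h>
#include <string.h>

typedef int (*commit_fn)(const char **msgs, size_t n); /* 0 = delivered */

/* records the indices of messages that can never be delivered */
static void commit_with_split(const char **msgs, size_t lo, size_t hi,
                              commit_fn commit, size_t *bad, size_t *nbad)
{
    if (lo >= hi)
        return;
    if (commit(&msgs[lo], hi - lo) == 0)
        return;                      /* whole (sub-)batch went through */
    if (hi - lo == 1) {              /* isolated the poison message */
        bad[(*nbad)++] = lo;
        return;
    }
    size_t mid = lo + (hi - lo) / 2; /* halve and retry each part */
    commit_with_split(msgs, lo, mid, commit, bad, nbad);
    commit_with_split(msgs, mid, hi, commit, bad, nbad);
}

/* toy commit: refuses any sub-batch containing the literal "BAD" message */
static int toy_commit(const char **msgs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(msgs[i], "BAD") == 0)
            return -1;
    return 0;
}

/* convenience wrapper: returns how many undeliverable messages were found */
size_t find_undeliverable(const char **msgs, size_t n, size_t *bad)
{
    size_t nbad = 0;
    commit_with_split(msgs, 0, n, toy_commit, bad, &nbad);
    return nbad;
}
```

Note that messages in a failing batch can be re-sent as the halves are
retried, which is exactly the duplicate-delivery trade-off discussed here.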
>>>>>>>>> Now, there is a big difference between a module giving a hard
>>>>>>>>> error "this message is never going to be able to be processed no
>>>>>>>>> matter how many times it's retried" vs a soft error "there is a
>>>>>>>>> problem delivering things to this destination right now, retry
>>>>>>>>> later". I thought the batch processing handled these differently.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> That's no longer possible with v8, at least generically. As I said
>>>>>>>> above, we would need to turn back time.
>>>>>>>>
>>>>>>>> But I have really run out of time now...
>>>>>>>>
>>>>>>>> Rainer
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> David Lang
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> rsyslog mailing list
>>>>>>>>> http://lists.adiscon.net/mailman/listinfo/rsyslog
>>>>>>>>> http://www.rsyslog.com/professional-services/
>>>>>>>>> What's up with rsyslog? Follow https://twitter.com/rgerhards
>>>>>>>>> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a
>>>>>>>>> myriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT
>>>>>>>>> POST if you DON'T LIKE THAT.
>>>>>>>>>