jmestwa-coder opened a new issue, #680:
URL: https://github.com/apache/logging-log4cxx/issues/680
`SyslogAppender` currently decides whether to split outgoing syslog packets
using `LogString::size()`, while the actual UDP payload is produced only after
transcoding the message into encoded bytes.
This creates a transport boundary mismatch for multibyte encodings such as
UTF-8:
* split decision → based on character count
* emitted datagram size → based on encoded byte length
As a result, messages containing multibyte characters can remain below the
configured `MaxMessageLength` threshold while still producing UDP datagrams
larger than the configured maximum.
### Runtime Reproduction
Configuration:
* `MaxMessageLength = 100`
* `PatternLayout("%m")`
* Syslog output to localhost UDP receiver
Test message:
* 40 Euro symbols (`€`)
* UTF-8 encoding (`€` = 3 bytes)
Observed behavior with current implementation:
* `msg.size()` = 40
* encoded payload length = 120 bytes
* no splitting occurred
* emitted UDP datagram size = 124 bytes including syslog prefix
This exceeds the configured maximum despite the existing split logic.
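The arithmetic in this reproduction can be checked without log4cxx. The standalone sketch below assumes the message is held as Unicode code points in a `std::u32string`, which is only a stand-in for `LogString` (its real character type depends on the build). It counts the bytes a UTF-8 encoder would emit. `utf8LengthOf` and `utf8ByteLength` are illustrative names, not log4cxx APIs.

```cpp
#include <cstddef>
#include <string>

// Bytes needed to encode a single Unicode code point as UTF-8.
// Assumes cp is a valid code point (no range or surrogate checks).
inline std::size_t utf8LengthOf(char32_t cp)
{
    if (cp < 0x80)    return 1;
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

// Total UTF-8 byte length of a sequence of code points.
inline std::size_t utf8ByteLength(const std::u32string& text)
{
    std::size_t bytes = 0;
    for (char32_t cp : text)
        bytes += utf8LengthOf(cp);
    return bytes;
}
```

For 40 copies of U+20AC, `size()` is 40 while `utf8ByteLength` returns 120, so a check against a limit of 100 passes on the first value and fails on the second.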
### Root Cause
Current split logic uses:
```cpp
if (msg.size() > _priv->maxMessageLength)
```
However:
* `LogString::size()` reflects internal character/code-unit count
* UDP transport size depends on encoded byte length after transcoding
For multibyte UTF-8 content, these values diverge.
### Why This Matters
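This is a transport-boundary reliability issue rather than a question of
validating trusted configuration: the configured limit itself is fine, but it
is not enforced on the bytes actually sent.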
Oversized syslog datagrams may:
* exceed expected relay or collector limits
* increase truncation/drop risk
* produce inconsistent packet chunking behavior across encodings
The issue is especially visible with:
* UTF-8 multibyte characters
* emoji
* CJK characters
* mixed-width log content
### Additional Investigation Notes
I prototyped a byte-aware splitting implementation to validate the issue.
That investigation confirmed:
* byte-aware splitting resolves the demonstrated overflow case
* however, a naive implementation introduces additional considerations:
  * prefix/suffix accounting must be dynamic rather than heuristic
  * repeated transcoding inside the split loop can introduce avoidable
    hot-path overhead
For example, enabling `facilityPrinting` while using a fixed suffix reserve
still allowed a packet to exceed `MaxMessageLength` by 1 byte in testing.
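One way to picture dynamic accounting is to derive the payload budget from the prefix and suffix that will actually be sent, instead of subtracting a constant. The sketch below is an assumption about how that could look, not the prototype or any log4cxx code. `payloadBudget` is a hypothetical helper, and the prefix and suffix strings used with it are made-up examples.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: bytes left for the message payload once the
// already-encoded prefix and suffix of this packet are subtracted from
// the configured maximum. Returns 0 if the overhead alone fills the limit.
inline std::size_t payloadBudget(std::size_t maxMessageLength,
                                 const std::string& encodedPrefix,
                                 const std::string& encodedSuffix)
{
    const std::size_t overhead = encodedPrefix.size() + encodedSuffix.size();
    return overhead < maxMessageLength ? maxMessageLength - overhead : 0;
}
```

With a limit of 100, an illustrative prefix `"<14>"` and suffix `"(1/2)"` leave 91 payload bytes, while a longer prefix such as `"<14>user: "` with suffix `"(10/12)"` leaves 83. A fixed reserve cannot track that difference.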
Because of this, I am opening this issue first rather than immediately
proposing the prototype patch.
### Suggested Direction
A more robust solution may involve:
* enforcing limits using encoded byte length
* dynamically accounting for syslog prefix/suffix overhead
* preserving valid UTF-8/codepoint boundaries during splitting
* avoiding repeated transcoding work in the logging hot path
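To make the first and third points concrete, here is a minimal sketch of byte-limited splitting that keeps UTF-8 sequences intact. It assumes the message has already been transcoded once into a `std::string` of UTF-8 bytes, so no transcoding happens inside the loop. `splitUtf8` and `isContinuation` are hypothetical names, and `maxBytes` stands for the payload budget after prefix/suffix overhead. It is not the prototype mentioned above.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// True if the byte is a UTF-8 continuation byte (bit pattern 10xxxxxx).
inline bool isContinuation(char b)
{
    return (static_cast<unsigned char>(b) & 0xC0) == 0x80;
}

// Split UTF-8 bytes into chunks of at most maxBytes bytes, cutting only
// at code point boundaries when the input is valid UTF-8.
// Returns an empty vector when maxBytes is 0.
inline std::vector<std::string> splitUtf8(const std::string& encoded,
                                          std::size_t maxBytes)
{
    std::vector<std::string> chunks;
    if (maxBytes == 0)
        return chunks;

    std::size_t pos = 0;
    while (pos < encoded.size())
    {
        const std::size_t remaining = encoded.size() - pos;
        const std::size_t hardEnd =
            remaining > maxBytes ? pos + maxBytes : encoded.size();

        // Move the cut backwards until it no longer lands on a
        // continuation byte, i.e. until it sits at the start of a sequence.
        std::size_t end = hardEnd;
        while (end > pos && end < encoded.size() && isContinuation(encoded[end]))
            --end;

        // Fallback for malformed input or a budget smaller than one
        // sequence: cut at the hard limit so the loop always advances.
        if (end == pos)
            end = hardEnd;

        chunks.push_back(encoded.substr(pos, end - pos));
        pos = end;
    }
    return chunks;
}
```

Applied to the 120-byte payload from the reproduction with `maxBytes = 100`, this produces two chunks of 99 and 21 bytes, because byte 100 falls in the middle of a `€` sequence and the cut moves back to byte 99.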
I can provide the reproduction test and prototype implementation details if
helpful.