[I] SyslogAppender may exceed MaxMessageLength for multibyte UTF-8 messages because splitting is based on LogString character count instead of encoded byte length [logging-log4cxx]

via GitHub Mon, 18 May 2026 10:29:35 -0700


jmestwa-coder opened a new issue, #680:
URL: https://github.com/apache/logging-log4cxx/issues/680


   `SyslogAppender` currently decides whether to split outgoing syslog packets 
using `LogString::size()`, while the actual UDP payload is produced only after 
transcoding the message into encoded bytes.
   
   This creates a transport boundary mismatch for multibyte encodings such as 
UTF-8:
   
   * split decision → based on character count
   * emitted datagram size → based on encoded byte length
   
   As a result, messages containing multibyte characters can remain below the 
configured `MaxMessageLength` threshold while still producing UDP datagrams 
larger than the configured maximum.
   
   ### Runtime Reproduction
   
   Configuration:
   
   * `MaxMessageLength = 100`
   * `PatternLayout("%m")`
   * Syslog output to localhost UDP receiver
   
   Test message:
   
   * 40 Euro symbols (`€`)
   * UTF-8 encoding (`€` = 3 bytes)
   
   Observed behavior with current implementation:
   
   * `msg.size()` = 40
   * encoded payload length = 120 bytes
   * no splitting occurred
   * emitted UDP datagram size = 124 bytes including syslog prefix
   
   This exceeds the configured maximum despite the existing split logic.
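
The arithmetic behind the reproduction can be checked without log4cxx. The following is a minimal standalone sketch, not the actual test: `toUtf8` and `reproMessage` are hypothetical helpers, and `std::u32string` stands in for a `LogString` that stores one unit per character.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a log4cxx API): UTF-8 encode Unicode code points.
// Assumes every input value is a valid scalar value (no surrogates, <= 0x10FFFF).
std::string toUtf8(const std::u32string& text)
{
    std::string out;
    for (char32_t cp : text)
    {
        if (cp < 0x80)
        {
            out += static_cast<char>(cp);
        }
        else if (cp < 0x800)
        {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else if (cp < 0x10000)
        {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else
        {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// The reproduction message: 40 Euro signs (U+20AC).
std::u32string reproMessage()
{
    return std::u32string(40, U'\u20AC');
}
```

Here `reproMessage().size()` is 40, which passes a `> 100` character check unsplit, while `toUtf8(reproMessage()).size()` is 120, which already exceeds the limit before any syslog prefix is added.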
   
   ### Root Cause
   
   Current split logic uses:
   
   ```cpp
   if (msg.size() > _priv->maxMessageLength)
   ```
   
   However:
   
   * `LogString::size()` returns the number of internal `logchar` code units, not encoded bytes
   * UDP transport size depends on encoded byte length after transcoding
   
   For multibyte UTF-8 content, these values diverge.
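
To make the divergence concrete, here is a rough sketch comparing the two checks side by side. It is only illustrative: `std::u32string` again stands in for `LogString` (the real type may hold `char`, `wchar_t`, or UTF-16 units depending on the build), and all function names are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Bytes UTF-8 needs for one code point (assumes a valid scalar value).
constexpr std::size_t utf8UnitLength(char32_t cp)
{
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

// Encoded UTF-8 length of the whole string, computed without allocating.
std::size_t utf8ByteLength(const std::u32string& text)
{
    std::size_t total = 0;
    for (char32_t cp : text)
        total += utf8UnitLength(cp);
    return total;
}

// Mirrors the shape of the current check: compares the unit count.
bool exceedsByCodeUnits(const std::u32string& msg, std::size_t maxLen)
{
    return msg.size() > maxLen;
}

// Byte-aware alternative: compares the length the wire will actually see.
bool exceedsByEncodedBytes(const std::u32string& msg, std::size_t maxLen)
{
    return utf8ByteLength(msg) > maxLen;
}
```

For 40 Euro signs and a limit of 100, `exceedsByCodeUnits` returns `false` while `exceedsByEncodedBytes` returns `true`. For pure ASCII content the two checks agree.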
   
   ### Why This Matters
   
   This is a reliability problem at the transport boundary, not a matter of 
validating trusted configuration values.
   
   Oversized syslog datagrams may:
   
   * exceed expected relay or collector limits
   * increase truncation/drop risk
   * produce inconsistent packet chunking behavior across encodings
   
   The issue is especially visible with:
   
   * UTF-8 multibyte characters
   * emoji
   * CJK characters
   * mixed-width log content
   
   ### Additional Investigation Notes
   
   I prototyped a byte-aware splitting implementation to validate the issue.
   
   That investigation confirmed:
   
   * byte-aware splitting resolves the demonstrated overflow case
   * however, a naive implementation introduces additional considerations:
   
     * prefix/suffix accounting must be dynamic rather than heuristic
     * repeated transcoding inside the split loop can introduce avoidable hot-path overhead
   
   For example, enabling `facilityPrinting` while using a fixed suffix reserve 
still allowed a packet to exceed `MaxMessageLength` by 1 byte in testing.
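
One way to see why a fixed reserve is fragile is to compute the overhead per packet instead. The sketch below assumes a purely hypothetical framing of `<PRI>`, an optional facility prefix, the payload, and an `(index/total)` suffix. It is not taken from the log4cxx sources, and `Framing`, `prefixBytes`, `splitSuffix`, and `payloadBudget` are made-up names.

```cpp
#include <cstddef>
#include <string>

// Hypothetical per-packet framing, NOT the log4cxx layout:
//   "<" priority ">" facilityPrefix payload "(" index "/" total ")"
struct Framing
{
    int priority;               // value printed inside "<...>", assumed non-negative
    std::string facilityPrefix; // empty when facility printing is disabled
};

// Bytes consumed in front of the payload; varies with priority digits and facility.
std::size_t prefixBytes(const Framing& f)
{
    return std::to_string(f.priority).size() + 2 + f.facilityPrefix.size();
}

// Suffix marking a chunk; its length grows with the number of digits.
std::string splitSuffix(std::size_t index, std::size_t total)
{
    return "(" + std::to_string(index) + "/" + std::to_string(total) + ")";
}

// Payload bytes left for one chunk once the real overhead is subtracted.
std::size_t payloadBudget(const Framing& f, std::size_t index,
                          std::size_t total, std::size_t maxLen)
{
    const std::size_t overhead = prefixBytes(f) + splitSuffix(index, total).size();
    return overhead >= maxLen ? 0 : maxLen - overhead;
}
```

Under these assumptions the budget shrinks when a facility prefix is present or when the chunk index gains a digit, which is the kind of variation a constant reserve cannot track.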
   
   Because of this, I am opening this issue first rather than immediately 
proposing the prototype patch.
   
   ### Suggested Direction
   
   A more robust solution may involve:
   
   * enforcing limits using encoded byte length
   * dynamically accounting for syslog prefix/suffix overhead
   * preserving valid UTF-8/codepoint boundaries during splitting
   * avoiding repeated transcoding work in the logging hot path
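
As a rough illustration of the boundary-preserving part, the sketch below splits a message that has already been transcoded to UTF-8 once, so the loop does no further transcoding. It is not the prototype mentioned above. `splitUtf8` and `isContinuation` are hypothetical names, the input is assumed to be valid UTF-8, and `maxBytes` is assumed to be the payload budget left after prefix and suffix overhead.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// True when the byte is a UTF-8 continuation byte (bit pattern 10xxxxxx).
inline bool isContinuation(char b)
{
    return (static_cast<unsigned char>(b) & 0xC0) == 0x80;
}

// Split valid UTF-8 into chunks of at most maxBytes bytes, never cutting
// inside a code point. If maxBytes is smaller than a single code point,
// that code point is emitted whole so the loop always makes progress.
std::vector<std::string> splitUtf8(const std::string& utf8, std::size_t maxBytes)
{
    std::vector<std::string> chunks;
    std::size_t start = 0;
    while (start < utf8.size())
    {
        std::size_t end;
        if (utf8.size() - start <= maxBytes)
        {
            end = utf8.size(); // the remainder fits in one chunk
        }
        else
        {
            end = start + maxBytes;
            // Back up until 'end' points at the first byte of a code point.
            while (end > start && isContinuation(utf8[end]))
                --end;
            if (end == start)
            {
                // Budget too small for this code point: take it whole.
                end = start + 1;
                while (end < utf8.size() && isContinuation(utf8[end]))
                    ++end;
            }
        }
        chunks.push_back(utf8.substr(start, end - start));
        start = end;
    }
    return chunks;
}
```

With a 96-byte budget, the 120-byte reproduction message would split into chunks of 96 and 24 bytes. With a 100-byte budget the first chunk would be 99 bytes, because byte 100 falls inside a Euro sign. A caller would want to keep the budget at 4 bytes or more so the oversize fallback never triggers.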
   
   I can provide the reproduction test and prototype implementation details if 
helpful.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

