Hi Brandon,

I happened to spot this description (in MDEV-38765):

-----------------------------------------------------------------------
MDEV-32570 (added to 12.3) added a new binlog event type,
Partial_rows_log_event, to support fragmenting large ROW log events into
groups of this smaller event type that, when their contents are put
together, recreate the original ROW log event. This happens when writing
the events to the binary log itself, i.e. if a transaction produces large
amounts of ROW data, then the server will write Partial_rows_log_event
into the binary log.

This doesn't work with the new in-engine binlog added in MDEV-34705 (also
added to 12.3). The in-engine binlog write mechanism spills row data changes
to the binary log as out-of-band chunks as pages become full. To support
replication (i.e. sending this data to slaves), the server still needs to
support ROW events as if nothing changed. To support this, the patch created
a new binlog_reader API to read these out-of-band chunks and use them to
create a regular ROW event. This ROW event still must conform to both
limitations tackled by MDEV-32570 (i.e. 1) it must not exceed
slave_max_allowed_packet to be transmitted to the slave, and 2) it must
not have more than 4GB of data to conform to the ROW event type header size
limitation).

This logic added by MDEV-34705 to create a ROW event from out-of-band chunks
of binlog data should be extended to support Partial_rows_log_event,
so the server can transmit large transactions to the slave when configured
with --binlog-storage-engine=innodb.
-----------------------------------------------------------------------

I don't understand this.

First, binlog-in-engine doesn't have any code that creates ROW events from
out-of-band chunks. In fact, one of the things I'm really happy about with
the binlog-in-engine design is that the engine binlog doesn't have any
knowledge of the internal format of replication events. It simply stores
opaque sequences of bytes. Whatever the server layer puts into the binlog
transaction cache during execution of the transaction will be what the dump
thread gets back when reading the binlog.
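To make concrete what I mean by "opaque sequences of bytes", here is a
minimal sketch (illustrative names only, not actual MariaDB source): the
engine-side store never parses event boundaries, so the dump thread reads
back exactly the byte sequence the server layer wrote.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

/*
  Hypothetical sketch of the in-engine binlog as an opaque byte store.
  Opaque_binlog_cache, append() and read_back() are made-up names for
  illustration. The engine stores whatever bytes the server layer
  produced, without any knowledge of the replication event format.
*/
struct Opaque_binlog_cache
{
  std::vector<uint8_t> bytes;

  // Server layer appends pre-formatted event data; stored verbatim.
  void append(const uint8_t *data, size_t len)
  {
    bytes.insert(bytes.end(), data, data + len);
  }

  // The dump thread gets the identical byte sequence back out.
  std::vector<uint8_t> read_back() const { return bytes; }
};
```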

Second, "this happens when writing the events to the binary log itself"
also does not match what I see in the code. In the legacy binlog, event data
gets written "to the binary log itself" in Event_log::write_cache():

  /*
    If possible, just copy the cache over byte-by-byte with pre-computed
    checksums.
  */
  if (likely(binlog_checksum_options == (ulong)cache_data->checksum_opt) &&
      likely(!crypto.scheme) &&
      likely(!opt_binlog_legacy_event_pos))
  {
    int res=
        my_b_copy_to_cache(cache, &log_file, cache_data->length_for_read());

This code copies the transaction cache directly into the binary log, without
any code that would fragment large ROW events? So just like with
binlog-in-engine, the contents of the transaction cache go unchanged into
the binlog and back out again to the dump thread?

I found in the code a place where the fragmenting of large ROW events seems
to happen, in Event_log::flush_and_set_pending_rows_event():

    if (pending->rows_data_size_exceeds(
            static_cast<ulonglong>(max_rows_ev_len)))
    {
      Rows_log_event_fragmenter fragmenter= Rows_log_event_fragmenter(
          thd, is_transactional, opt_binlog_row_event_fragment_threshold,
          pending);
      Rows_log_event_fragmenter::Fragmented_rows_log_event *frag_ev;
      if (!(frag_ev= fragmenter.fragment()))

This code path seems to be common between legacy binlog and binlog-in-engine
(as it should be).

So I don't understand what the problem is here - is there indeed any problem
at all? It looks to me like the fragmentation will happen as it should,
before the event data even gets to the binlog transaction cache, independent
of binlog in engine; what did I miss?
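The ordering I am describing above can be sketched as follows (again a
hypothetical illustration with made-up names, not MariaDB source): the
pending row event is split at flush time, so the transaction cache only
ever receives fragments no larger than the threshold, and either binlog
implementation then stores those fragments verbatim.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

/*
  Hypothetical sketch: split rows data into fragments of at most
  `threshold` bytes *before* anything is appended to the transaction
  cache. fragment_before_caching() is an illustrative name; the real
  splitting is done by Rows_log_event_fragmenter at flush time.
*/
static std::vector<std::vector<unsigned char>>
fragment_before_caching(const std::vector<unsigned char> &rows_data,
                        size_t threshold)
{
  std::vector<std::vector<unsigned char>> fragments;
  for (size_t pos= 0; pos < rows_data.size(); pos+= threshold)
  {
    size_t len= std::min(threshold, rows_data.size() - pos);
    // Each fragment is what later goes into the transaction cache,
    // identically for legacy binlog and binlog-in-engine.
    fragments.emplace_back(rows_data.begin() + pos,
                           rows_data.begin() + pos + len);
  }
  return fragments;
}
```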

 - Kristian.
_______________________________________________
developers mailing list -- [email protected]
To unsubscribe send an email to [email protected]