Hi Kristian,

Indeed, I think you are right. When writing the ticket, I was thinking the binlog-in-engine implementation would bypass the pending rows event and have its own cache that it would monitor to write out-of-band chunks live. I recall you mentioning getting rid of the pending rows event last September in Helsinki (though I think you meant that generally). But if it uses the pending rows event and the same binlog transaction cache, then this should already be supported (though we need to test it).
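For anyone following along, the basic contract behind Partial_rows_log_event as the ticket describes it -- split a payload larger than the threshold into fragments, and reassembling the fragments must recreate the original bytes -- can be sketched like this (a toy model with illustrative names, not the real server classes):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative only: split an opaque row-event payload into fragments of at
// most `threshold` bytes (threshold > 0), analogous in spirit to
// binlog_row_event_fragment_threshold.
std::vector<std::vector<uint8_t>>
fragment_payload(const std::vector<uint8_t> &payload, size_t threshold)
{
  std::vector<std::vector<uint8_t>> fragments;
  for (size_t off= 0; off < payload.size(); off+= threshold)
  {
    size_t len= std::min(threshold, payload.size() - off);
    fragments.emplace_back(payload.begin() + off, payload.begin() + off + len);
  }
  return fragments;
}

// Putting the fragments' contents back together must recreate the original
// payload byte-for-byte.
std::vector<uint8_t>
reassemble(const std::vector<std::vector<uint8_t>> &fragments)
{
  std::vector<uint8_t> out;
  for (const auto &f : fragments)
    out.insert(out.end(), f.begin(), f.end());
  return out;
}
```

The point of the sketch is only the round-trip property: the consumer must not be able to tell whether it got one big event or a fragmented one.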
So the thought behind the ticket was that the dump thread would have to transform the Rows_ events that the ha_innodb_binlog_reader outputs into Partial_rows_log_event types when they exceed binlog_row_event_fragment_threshold.

Thanks for checking in,
Brandon

On Tue, Mar 3, 2026 at 4:56 AM Kristian Nielsen <[email protected]> wrote:
> Hi Brandon,
>
> I accidentally spotted this description (in MDEV-38765):
>
> -----------------------------------------------------------------------
> MDEV-32570 (added to 12.3) added a new binlog event type,
> Partial_rows_log_event, to support fragmenting large ROW log events into
> groups of this smaller event type that, when their contents are put
> together, recreate the original ROW log event. This happens when writing
> the events to the binary log itself, i.e. if a transaction produces large
> amounts of ROW data, then the server will write Partial_rows_log_event
> into the binary log.
>
> This doesn't work with the new in-engine binlog added in MDEV-34705 (also
> added to 12.3). The in-engine binlog write mechanism spills row data
> changes to the binary log as out-of-band chunks as pages become full. To
> support replication (i.e. sending this data to slaves), the server still
> needs to support ROW events as if nothing changed. To support this, the
> patch created a new binlog_reader API to read these out-of-band chunks
> and use them to create a regular ROW event. This ROW event must still
> conform to both limitations tackled by MDEV-32570 (i.e. 1) it must not
> exceed slave_max_allowed_packet to be transmitted to the slave, and 2) it
> must not have more than 4GB of data to conform to the ROW event type
> header size limitation).
>
> This logic added by MDEV-34705 to create a ROW event from out-of-band
> chunks of binlog data should be extended to support
> Partial_rows_log_event, so the server can transmit large transactions to
> the slave when configured with --binlog-storage-engine=innodb.
> -----------------------------------------------------------------------
>
> I don't understand this.
>
> First, binlog-in-engine doesn't have any code that creates ROW events
> from out-of-band chunks. In fact, one of the things I'm really happy
> about with the binlog-in-engine design is that the engine binlog doesn't
> have any knowledge of the internal format of replication events. It
> simply stores opaque sequences of bytes. Whatever the server layer puts
> into the binlog transaction cache during execution of the transaction
> will be what the dump thread gets back when reading the binlog.
>
> Second, "this happens when writing the events to the binary log itself"
> also does not match what I see in the code. In the legacy binlog, event
> data gets written "to the binary log itself" in Event_log::write_cache():
>
>   /*
>     If possible, just copy the cache over byte-by-byte with pre-computed
>     checksums.
>   */
>   if (likely(binlog_checksum_options == (ulong)cache_data->checksum_opt) &&
>       likely(!crypto.scheme) &&
>       likely(!opt_binlog_legacy_event_pos))
>   {
>     int res=
>       my_b_copy_to_cache(cache, &log_file,
>                          cache_data->length_for_read());
>
> This code copies the transaction cache directly into the binary log,
> without any code that would fragment large ROW events? So just like with
> binlog-in-engine, the contents of the transaction cache go unchanged into
> the binlog and out again to the dump thread?
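To restate the property Kristian is pointing at in the write_cache() fast path: neither binlog implementation reinterprets the cache contents on the way to disk. A toy model of that path (made-up names and types, not the real IO_CACHE API) looks like:

```cpp
#include <cstdint>
#include <vector>

// Toy model of the fast path quoted from Event_log::write_cache(): if the
// checksum settings match and neither encryption nor legacy event positions
// are in play, the transaction cache can be copied byte-for-byte.
bool can_copy_byte_for_byte(uint32_t binlog_checksum_options,
                            uint32_t cache_checksum_opt,
                            bool encrypted, bool legacy_event_pos)
{
  return binlog_checksum_options == cache_checksum_opt &&
         !encrypted && !legacy_event_pos;
}

// The copy itself never parses event boundaries: the binlog ends up holding
// exactly the bytes the server layer put into the transaction cache, so any
// fragmentation must already have happened upstream.
std::vector<uint8_t> copy_cache_to_log(const std::vector<uint8_t> &cache)
{
  return std::vector<uint8_t>(cache.begin(), cache.end());
}
```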
>
> I found in the code a place where the fragmenting of large ROW events
> seems to happen, in Event_log::flush_and_set_pending_rows_event():
>
>   if (pending->rows_data_size_exceeds(
>           static_cast<ulonglong>(max_rows_ev_len)))
>   {
>     Rows_log_event_fragmenter fragmenter= Rows_log_event_fragmenter(
>         thd, is_transactional, opt_binlog_row_event_fragment_threshold,
>         pending);
>     Rows_log_event_fragmenter::Fragmented_rows_log_event *frag_ev;
>     if (!(frag_ev= fragmenter.fragment()))
>
> This code path seems to be common between legacy binlog and
> binlog-in-engine (as it should be).
>
> So I don't understand what the problem is here -- is there indeed any
> problem at all? It looks to me like the fragmentation will happen as it
> should, before the event data even gets to the binlog transaction cache,
> independent of binlog in engine; what did I miss?
>
>  - Kristian.
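To make sure we are talking about the same thing, here is how I read that code path -- a toy model, with made-up names and a simplified size-only view of events, of the decision in flush_and_set_pending_rows_event(). The key point is that the pending rows event is fragmented before anything reaches the binlog transaction cache, so both binlog backends only ever see in-bounds events:

```cpp
#include <cstdint>
#include <vector>

// Simplified stand-in for a rows event: only its payload size matters here.
struct Event { uint64_t size; };

// Toy model: when the pending event exceeds max_rows_ev_len, fragments are
// appended to the transaction cache instead of the oversized event; both
// binlog implementations then read the cache unchanged.
std::vector<Event> flush_pending(Event pending, uint64_t max_rows_ev_len)
{
  std::vector<Event> to_cache;
  if (pending.size > max_rows_ev_len)
  {
    uint64_t remaining= pending.size;
    while (remaining > 0)
    {
      uint64_t frag= remaining < max_rows_ev_len ? remaining
                                                 : max_rows_ev_len;
      to_cache.push_back(Event{frag});
      remaining-= frag;
    }
  }
  else
    to_cache.push_back(pending);
  return to_cache;
}
```

If this model is right, then as you say the fragmentation is independent of which binlog implementation consumes the cache, and the ticket may indeed be moot.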
_______________________________________________
developers mailing list -- [email protected]
To unsubscribe send an email to [email protected]
