Hello Samuel,

Thanks for the high-quality comments, I appreciate them a lot.
Here you can find the updated patch that addresses most of your comments.
This time it is a single patch, to avoid mistakes like last time.
1) Benchmark: I have not yet done a comprehensive benchmark; it will probably be more meaningful with all the hooks in place. More on that below.

2) Implemented escaping of the JBD2 magic.

3) V1 and V2: we now support both, in that we zero out the extra fields that V2 brings if we encounter V1. V1 is obsolete anyway, as far as I understand.

4) I have reordered the functions in journal.h as you suggested. I have also removed the lower-level read and write from the header (they are now static in journal.c).

5) journal_read_block now passes out_buf to store_read, as you suggested.

6) We stopped re-reading the journal superblock in journal_update_superblock; indeed it doesn't need to be done at all after journal init.

7) The yield() has been replaced by a condition variable, thanks for that.

8) diskfs_notify_change: no, it's not the only place for it. In fact it needs to be in many more places: some in ext2 (implemented in this version of the patch) and some in libdiskfs (discussion below).

9) 5-sec loop: periodic_sync_thread is what eventually calls diskfs_sync_everything, and it currently runs every 30 seconds. It is a heavy operation, and 30 seconds is quite a lot of time to lose in a crash. I don't think we want to commit a journal transaction that infrequently. What we do here is add a separate, shorter-interval kjournald thread (similar to what Linux does) that is much more lightweight: it just persists the running transaction, and by default it runs every 5 seconds. That caps the maximum work lost in a crash to 5 seconds, without the giant "stop the world" pause of syncing everything.

10) Ordering of journal writes vs normal writes: yes, we do follow the correct ordering. Take a look at the write_disknode_journaled function: we start the transaction, then call write_node(), which works purely in memory, then mark the dirty blocks, and then stop the transaction. The commit happens at the regular pace. If the lazy pager (sync_everything) is called in between, the commit will also happen there. I have refactored journal_commit to make it easier to see (hopefully) the order in which things are written to disk.

11) Removed the hardcoded 50 and other magic numbers; they are now computed at journal startup based on the size of the journal.

12) E2BIG doesn't happen any more.

13) ext2_panic doesn't happen any more.

14) The journal now gets flushed (and the tail and head moved) on every diskfs_sync_everything (so every 30 seconds), that is, whenever the disk is actually written. This should make it much less likely that the journal gets full (even if an extremely small journal is created with tune2fs).

For the discussion from 8): some of the file system operations happen completely inside libdiskfs (dir_rename, link, unlink, set trans, etc.). To support those properly we would need access to the journal inside libdiskfs, but the journal is very ext2-specific. To make this possible, my plan is to follow the existing pattern in Hurd and add journaling functions inside libdiskfs as weak symbols (so that they don't do anything for other filesystems). These 4 would probably suffice:

diskfs_journal_start_transaction()
diskfs_journal_stop_transaction()
diskfs_journal_commit_transaction()
diskfs_journal_is_running()

(diskfs_notify_change could be removed if we have these!!)

ext2fs (or other file systems, if there is a need) would then implement them to get correct behavior. I didn't want to bundle that in this patch, because it would add many more files (somewhere around 10) and it's already getting large.

Re-testing:

I re-ran some of the tests that I have. One of them is:

Turn on debug logging for the journal, run qemu with the patched ext2fs.
Login to Hurd.
chmod a file to 777 (it had 664 before):
> chmod 777 blah.txt
Wait a bit (a second or two) to see the tx getting committed.
Kill Qemu (simulate a power crash).
Inspect the journal with debugfs to see if it looks right.
> sudo debugfs -R "logdump -a" /dev/nbd0
debugfs 1.47.3 (8-Jul-2025)
Journal starts at block 1, transaction 3244
Found expected sequence 3244, type 1 (descriptor block) at block 1
Dumping descriptor block, sequence 3244, at block 1:
  FS block 65602 logged at journal block 2 (flags 0x2)
  FS block 65538 logged at journal block 3 (flags 0x2)
  FS block 393236 logged at journal block 4 (flags 0x2)
  FS block 393235 logged at journal block 5 (flags 0x2)
  FS block 917506 logged at journal block 6 (flags 0x2)

Yes, it does.

Check the mode of the file with debugfs to confirm that sync_everything didn't happen and that the old value is indeed still on disk:
> sudo debugfs -R "stat /home/loshmi/blah.txt" /dev/nbd0
debugfs 1.47.3 (8-Jul-2025)
Inode: 148424 Type: regular Mode: 0664

Yes, indeed it is.

Now run
> sudo e2fsck -f /dev/nbd0
and after that is done, check the value again with debugfs:
sudo debugfs -R "stat /home/loshmi/blah.txt" /dev/nbd0
debugfs 1.47.3 (8-Jul-2025)
Inode: 148424 Type: regular Mode: 0777

The journal has corrected the mode, as expected, and the journal log is consumed:
> sudo debugfs -R "logdump -a" /dev/nbd0
[sudo] password for loshmi:
debugfs 1.47.3 (8-Jul-2025)
Journal starts at block 0, transaction 3262

Let me know your thoughts,
and thank you for the review!

Milos

On Mon, Feb 9, 2026 at 12:10 PM Samuel Thibault <[email protected]> wrote:

> Hello,
>
> Thanks for your work which seems promising! I have a lot of comments and
> questions.
>
> - avoid splitting the patch in two when patch 2 cannot build without
> patch 3, you can as well just submit just one patch since it goes
> altogether.
>
> - have you benchmarked the overhead that is introduced?
> - It seems that it is the writeback mode that is implemented? It would be
> useful to say so in the header.
> - For data block, I have seen in the jbd2 documentation that
> some escaping is needed when the data block happens to start
> with the htobe32(JBD2_MAGIC_NUMBER) magic number.
>
> https://www.kernel.org/doc/html/latest/filesystems/ext4/journal.html#data-block
>
> - there are two versions of the journal superbloc (1 vs 2), don't we need
> to support that somehow?
>
> - It would be useful to reorder functions in the file, to have them in the
> order they happen:
> start, dirty, stop, commit, force order
>
> - journal_read_block should pass out_buf to store_read for in-place
> filling, and
> and only copy if store_read reallocated (which should very rarely happen
> in
> practice)
> - why re-reading the journal superblock in journal_update_superblock?
> there is no reason why it would change, you can read once at mount, and
> just write the new version
> - the yield() call will lead to performance troubles :) The scheduler is
> really not forced to actually yield, better use a condition variable,
> that will behave exactly like you need.
>
> - diskfs_notify_change: is that the only call place to be?
> it seems there are many more if (diskfs_synchronous)
> diskfs_node_update() calls
>
> - About the journal_force_checkpoint 5-sec loop, we would want to
> combine that with the periodic_sync_thread?
>
> - More generally, it is not clear that you are respecting the ordering
> of journal writes vs normal writes. AIUI the principle is that
> we write to the journal and flush that *before* writing anything
> normally, so that either we have the old version only, or we also have
> the journal entry which we can replay, or we have the new version, and
> no mixture of these.
>
> - The hardcoded 50 value in journal_commit_transaction seems a bit ugly :)
> We'd probably want to put a lower limit, but with a fast system, 50
> might
> be not enough for proper overlapping for performance, we'd probably
> rather compute a value based on the journal size.
> - journal_transaction_commit may return E2BIG, we should detect that
> earlier to split the transaction.
>
> - there is an ext2_panic call when the transaction is too huge for the
> journal, can that actually happen? We should probably check for that
> early in journal_dirty_block?
> - It's not clear to me how the journal is getting flushed. The only thing
> I
> have seen is that when the journal gets full, we call
> journal_sync_everything which syncs everything
> AIUI normally we should be progressively flushing the log, along the
> normal writes getting done?
>
> Samuel
>
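For reference, the magic-number escaping from item 2) and from the jbd2 documentation linked in the quoted review can be sketched roughly as below. This is a minimal standalone sketch, not the code from the patch: the function names sketch_escape_block and sketch_unescape_block are hypothetical, and it assumes the shadow copy of the block is a plain byte buffer. Only the magic value 0xC03B3998 and the escape flag value 1 are taken from jbd2_format.h in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same value as JBD2_FLAG_ESCAPE in jbd2_format.h.  */
#define JBD2_FLAG_ESCAPE 1

/* JBD2_MAGIC_NUMBER (0xC03B3998) as it appears on disk, i.e. big-endian.  */
static const unsigned char jbd2_magic_be[4] = { 0xC0, 0x3B, 0x39, 0x98 };

/* Hypothetical helper: call on the shadow copy just before it is written
   to the journal.  If the block happens to start with the JBD2 magic,
   zero those four bytes in the copy and return the flag that has to be
   OR-ed into the block tag; otherwise leave the copy alone and return 0.  */
uint32_t
sketch_escape_block (unsigned char *shadow, size_t block_size)
{
  assert (block_size >= sizeof jbd2_magic_be);

  if (memcmp (shadow, jbd2_magic_be, sizeof jbd2_magic_be) != 0)
    return 0;

  memset (shadow, 0, sizeof jbd2_magic_be);
  return JBD2_FLAG_ESCAPE;
}

/* Hypothetical inverse, as a replay tool would do it: if the tag carries
   the escape flag, put the magic back before writing the block home.  */
void
sketch_unescape_block (unsigned char *data, size_t block_size,
                       uint32_t tag_flags)
{
  assert (block_size >= sizeof jbd2_magic_be);

  if (tag_flags & JBD2_FLAG_ESCAPE)
    memcpy (data, jbd2_magic_be, sizeof jbd2_magic_be);
}
```

The point of working on the shadow copy is that the in-memory filesystem block keeps its real contents; only the journal copy is altered, and the tag flag tells replay to restore it.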
From 56c06651521983bc32d2b8133a4f51a0b3248c98 Mon Sep 17 00:00:00 2001
From: Milos Nikic <[email protected]>
Date: Thu, 5 Feb 2026 16:41:37 -0800
Subject: [PATCH] ext2fs: Add support for JBD2 journaling

This patch introduces a JBD2-compliant journaling driver to ext2fs,
enabling crash consistency for metadata operations. The implementation
is binary-compatible with Linux's JBD2, allowing standard tools like
e2fsck to replay the journal and recover the filesystem state.

Key Features:

1. JBD2 Compatibility:
   Implements the JBD2 on-disk format (v2), including descriptor blocks,
   commit records, and revocation tables (basic). The filesystem is
   recognized as "ext3/ext4" by Linux tools.

2. Writeback Mode:
   Implements "Writeback" journaling semantics. Metadata (inodes,
   bitmaps) is journaled and crash-consistent. File data is flushed
   lazily by the VM pager. This provides the best performance but allows
   for "stale data" in recently allocated blocks if a crash occurs before
   the data flush (similar to ext3/4 `data=writeback`). Synchronous
   mounts (`-o sync`) force data flushes before commit, providing full
   data safety.

3. Journal Thread (kjournald):
   A dedicated thread manages the commit lifecycle, waking up
   periodically (default 5s) to batch transactions. This decouples
   metadata updates from disk I/O, preventing the "stop-the-world"
   latency of synchronous metadata updates.

4. Locking Strategy:
   - Introduces `journal_t` with internal state locking.
--- ext2fs/Makefile | 2 +- ext2fs/ext2_fs.h | 3 +- ext2fs/ext2fs.c | 24 + ext2fs/ext2fs.h | 4 + ext2fs/getblk.c | 13 +- ext2fs/hyper.c | 9 + ext2fs/inode.c | 48 +- ext2fs/jbd2_format.h | 102 ++++ ext2fs/journal.c | 996 ++++++++++++++++++++++++++++++++++++++ ext2fs/journal.h | 97 ++++ ext2fs/pager.c | 38 ++ ext2fs/truncate.c | 19 +- libdiskfs/diskfs.h | 5 + libdiskfs/node-modified.c | 28 ++ libdiskfs/priv.h | 6 + 15 files changed, 1386 insertions(+), 8 deletions(-) create mode 100644 ext2fs/jbd2_format.h create mode 100644 ext2fs/journal.c create mode 100644 ext2fs/journal.h create mode 100644 libdiskfs/node-modified.c diff --git a/ext2fs/Makefile b/ext2fs/Makefile index 0c2f4a24..a2b0f1ee 100644 --- a/ext2fs/Makefile +++ b/ext2fs/Makefile @@ -22,7 +22,7 @@ makemode := server target = ext2fs SRCS = balloc.c dir.c ext2fs.c getblk.c hyper.c ialloc.c \ inode.c pager.c pokel.c truncate.c storeinfo.c msg.c xinl.c \ - xattr.c + xattr.c journal.c OBJS = $(SRCS:.c=.o) HURDLIBS = diskfs pager iohelp fshelp store ports ihash shouldbeinlibc LDLIBS = -lpthread $(and $(HAVE_LIBBZ2),-lbz2) $(and $(HAVE_LIBZ),-lz) diff --git a/ext2fs/ext2_fs.h b/ext2fs/ext2_fs.h index 195e9b6b..5712d5d0 100644 --- a/ext2fs/ext2_fs.h +++ b/ext2fs/ext2_fs.h @@ -492,7 +492,8 @@ struct ext2_super_block { #define EXT2_FEATURE_INCOMPAT_ANY 0xffffffff #define EXT2_FEATURE_COMPAT_SUPP EXT2_FEATURE_COMPAT_EXT_ATTR -#define EXT2_FEATURE_INCOMPAT_SUPP EXT2_FEATURE_INCOMPAT_FILETYPE +#define EXT2_FEATURE_INCOMPAT_SUPP (EXT2_FEATURE_INCOMPAT_FILETYPE | \ + EXT3_FEATURE_INCOMPAT_RECOVER) #define EXT2_FEATURE_RO_COMPAT_SUPP (EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER| \ EXT2_FEATURE_RO_COMPAT_LARGE_FILE| \ EXT2_FEATURE_RO_COMPAT_BTREE_DIR) diff --git a/ext2fs/ext2fs.c b/ext2fs/ext2fs.c index 11d1cdf4..ee35b771 100644 --- a/ext2fs/ext2fs.c +++ b/ext2fs/ext2fs.c @@ -32,6 +32,7 @@ #include <hurd/store.h> #include <version.h> #include "ext2fs.h" +#include "journal.h" /* 
---------------------------------------------------------------- */ @@ -81,6 +82,7 @@ unsigned long desc_per_block; unsigned long addr_per_block; unsigned long groups_count; +struct journal *ext2_journal = NULL; /* ---------------------------------------------------------------- */ @@ -252,6 +254,28 @@ main (int argc, char **argv) ext2_panic ("no root node!"); pthread_mutex_unlock (&diskfs_root_node->lock); + if (sblock->s_feature_compat & EXT3_FEATURE_COMPAT_HAS_JOURNAL) + { + JRNL_LOG_DEBUG ("\n[JOURNAL CHECK] >>> Inode 8 DETECTED! <<<"); + JRNL_LOG_DEBUG ("[JOURNAL CHECK] s_journal_inum: %u (Expected: 8)", + sblock->s_journal_inum); + JRNL_LOG_DEBUG ("[JOURNAL CHECK] s_journal_dev: %u", + sblock->s_journal_dev); + struct node *jnode = NULL; + error_t err = diskfs_cached_lookup (8, &jnode); + + if (!err && jnode) + { + ext2_journal = journal_create (jnode); + JRNL_LOG_DEBUG ("Global Journal Initialized at %p", ext2_journal); + diskfs_nput(jnode); + } + } + else + { + JRNL_LOG_DEBUG ("\n[JOURNAL CHECK] No Journal flag found."); + } + /* Now that we are all set up to handle requests, and diskfs_root_node is set properly, it is safe to export our fsys control port to the outside world. */ diff --git a/ext2fs/ext2fs.h b/ext2fs/ext2fs.h index 46d41e08..95aa6975 100644 --- a/ext2fs/ext2fs.h +++ b/ext2fs/ext2fs.h @@ -284,6 +284,10 @@ extern int sblock_dirty; /* Size of one inode. */ extern uint16_t global_inode_size; +/* Forward declaration prevents circular dependency with journal.h */ +struct journal; +extern struct journal *ext2_journal; + /* Where the super-block is located on disk (at min-block 1). */ #define SBLOCK_BLOCK 1 /* Default location, second 1k block. 
*/ #define SBLOCK_SIZE (sizeof (struct ext2_super_block)) diff --git a/ext2fs/getblk.c b/ext2fs/getblk.c index 49ab5740..6690222a 100644 --- a/ext2fs/getblk.c +++ b/ext2fs/getblk.c @@ -37,6 +37,7 @@ #include <string.h> #include "ext2fs.h" +#include "journal.h" /* * ext2_discard_prealloc and ext2_alloc_block are atomic wrt. the @@ -141,6 +142,9 @@ inode_getblk (struct node *node, int nr, int create, int zero, if (!create) return EINVAL; + if (ext2_journal) + journal_start_transaction (ext2_journal); + if (diskfs_node_disknode (node)->info.i_next_alloc_block == new_block) goal = diskfs_node_disknode (node)->info.i_next_alloc_goal; @@ -171,7 +175,11 @@ inode_getblk (struct node *node, int nr, int create, int zero, create ? "" : "no", hint, goal, *result); if (!*result) - return ENOSPC; + { + if (ext2_journal) + journal_stop_transaction (ext2_journal); + return ENOSPC; + } diskfs_node_disknode (node)->info.i_data[nr] = *result; @@ -181,6 +189,9 @@ inode_getblk (struct node *node, int nr, int create, int zero, node->dn_stat.st_blocks += 1 << log2_stat_blocks_per_fs_block; node->dn_stat_dirty = 1; + if (ext2_journal) + journal_stop_transaction (ext2_journal); + if (diskfs_synchronous || diskfs_node_disknode (node)->info.i_osync) diskfs_node_update (node, 1); diff --git a/ext2fs/hyper.c b/ext2fs/hyper.c index 847f9f2b..6919bbd1 100644 --- a/ext2fs/hyper.c +++ b/ext2fs/hyper.c @@ -196,11 +196,20 @@ diskfs_set_hypermetadata (int wait, int clean) /* The filesystem is clean, so we need to set the clean flag. */ { sblock->s_state |= htole16 (EXT2_VALID_FS); + if (ext2_journal) + { + sblock->s_feature_incompat &= htole32(~EXT3_FEATURE_INCOMPAT_RECOVER); + } sblock_dirty = 1; } else if (!clean && (sblock->s_state & htole16 (EXT2_VALID_FS))) /* The filesystem just became dirty, so clear the clean flag. 
*/ { + if (ext2_journal && + !(sblock->s_feature_incompat & htole32(EXT3_FEATURE_INCOMPAT_RECOVER))) + { + sblock->s_feature_incompat |= htole32(EXT3_FEATURE_INCOMPAT_RECOVER); + } sblock->s_state &= htole16 (~EXT2_VALID_FS); sblock_dirty = 1; wait = 1; diff --git a/ext2fs/inode.c b/ext2fs/inode.c index 8d10af01..d552e595 100644 --- a/ext2fs/inode.c +++ b/ext2fs/inode.c @@ -20,6 +20,7 @@ Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. */ #include "ext2fs.h" +#include "journal.h" #include <string.h> #include <unistd.h> #include <stdio.h> @@ -559,24 +560,55 @@ write_all_disknodes (void) diskfs_node_iterate (write_one_disknode); } +static void +write_disknode_journaled (struct node *np, int wait) +{ + journal_start_transaction(ext2_journal); + struct ext2_inode *di = write_node (np); + + if (di) + { + unsigned long ino = np->dn_stat.st_ino; + unsigned long group = inode_group_num(ino); + block_t table_start = le32toh (group_desc(group)->bg_inode_table); + unsigned long inodes_per_group = le32toh (sblock->s_inodes_per_group); + unsigned long inode_index = (ino - 1) % inodes_per_group; + unsigned long byte_offset = inode_index * le16toh (sblock->s_inode_size); + block_t block_num = table_start + (byte_offset / block_size); + void *block_ptr = bptr (block_num); + journal_dirty_block(ext2_journal, block_num, block_ptr); + } + journal_stop_transaction(ext2_journal); + if (wait && di) + journal_commit_transaction(ext2_journal, NULL); +} + /* Sync the info in NP->dn_stat and any associated format-specific information to disk. If WAIT is true, then return only after the physicial media has been completely updated. 
*/ void diskfs_write_disknode (struct node *np, int wait) { - struct ext2_inode *di = write_node (np); + struct ext2_inode *di; + + if (ext2_journal) + { + write_disknode_journaled (np, wait); + return; + } + di = write_node (np); if (di) { if (wait) { - sync_global_ptr (di, 1); + sync_global_ptr (di, 1); error_t err = store_sync (store); + /* Ignore EOPNOTSUPP (drivers), but warn on real I/O errors */ if (err && err != EOPNOTSUPP) - ext2_warning ("inode flush failed: %s", strerror (err)); + ext2_warning ("device flush failed: %s", strerror (err)); } else - record_global_poke (di); + record_global_poke (di); } } @@ -908,3 +940,11 @@ diskfs_shutdown_soft_ports (void) /* Should initiate termination of internally held pager ports (the only things that should be soft) XXX */ } + +void +diskfs_notify_change (struct node *np) +{ + /* If journaling is active, capture this metadata change immediately */ + if (ext2_journal) + diskfs_node_update (np, 0); +} diff --git a/ext2fs/jbd2_format.h b/ext2fs/jbd2_format.h new file mode 100644 index 00000000..9abd60a4 --- /dev/null +++ b/ext2fs/jbd2_format.h @@ -0,0 +1,102 @@ +/* JBD2 Standard On-Disk Layout + + Copyright (C) 2026 Free Software Foundation, Inc. + Written by Milos Nikic. + + This program is free software; you can redistribute it and/or + modify it under the terms of the GNU General Public License as + published by the Free Software Foundation; either version 2, or (at + your option) any later version. + + This program is distributed in the hope that it will be useful, but + WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program; if not, write to the Free Software + Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. 
*/ + +#ifndef _JBD2_FORMAT_H +#define _JBD2_FORMAT_H + +#include <stdint.h> + +/** + * JBD2 Magic Numbers + * If a block starts with this, it's a metadata block. + */ +#define JBD2_MAGIC_NUMBER 0xC03B3998U + +/* Block Types */ +#define JBD2_DESCRIPTOR_BLOCK 1 +#define JBD2_COMMIT_BLOCK 2 +#define JBD2_SUPERBLOCK_V1 3 +#define JBD2_SUPERBLOCK_V2 4 +#define JBD2_REVOKE_BLOCK 5 + +#define JBD2_PACKED __attribute__((packed)) + +/** + * The Journal Superblock (Version 2). + * Lives at the very start of the journal partition (typically Inode 8). + */ +typedef struct JBD2_PACKED journal_superblock_s +{ + uint32_t s_header[3]; /* Standard header (magic, type, etc) */ + uint32_t s_blocksize; /* Journal device blocksize */ + uint32_t s_maxlen; /* Total blocks in journal file */ + uint32_t s_first; /* First block of log information */ + + uint32_t s_sequence; /* First commit ID expected in log */ + uint32_t s_start; /* Block number of start of log */ + + uint32_t s_errno; /* Error value, if any */ + + uint32_t s_feature_compat; + uint32_t s_feature_incompat; + uint32_t s_feature_ro_compat; + + uint8_t s_uuid[16]; /* 128-bit uuid for journal */ + uint32_t s_nr_users; /* Nr of filesystems sharing log */ + uint32_t s_dynsuper; /* Blocknr of dynamic superblock copy */ + uint32_t s_max_transaction; /* Limit of handle size */ + uint32_t s_max_trans_data; /* Limit of data blocks per trans */ + uint32_t s_checksum_type; /* checksum type */ + uint8_t s_padding2[42 * 4]; + uint32_t s_checksum; /* crc32c(superblock) */ + uint8_t s_users[16 * 48]; /* ids of all filesystems sharing log */ +} journal_superblock_t; + +_Static_assert (sizeof (journal_superblock_t) == 1024, + "JBD2 Superblock size mismatch! Check padding."); + +/** + * The Standard Header + * Every metadata block (Descriptor, Commit) starts with this. + */ +typedef struct JBD2_PACKED journal_header_s +{ + uint32_t h_magic; /* 0xC03B3998 */ + uint32_t h_blocktype; /* Descriptor, Commit, etc. 
*/ + uint32_t h_sequence; /* The Transaction ID */ +} journal_header_t; + +/** + * The Block Tag + * Describes a data block that follows. + * Structure: [Descriptor Block] [Tag 1] [Tag 2] ... [Data 1] [Data 2] ... + */ +typedef struct JBD2_PACKED journal_block_tag_s +{ + uint32_t t_blocknr; /* The 32-bit physical block number */ + uint32_t t_flags; /* See flags below */ +} journal_block_tag_t; + +/* Flags for the Block Tag */ +#define JBD2_FLAG_ESCAPE 1 /* The data block starts with magic number (escaped) */ +#define JBD2_FLAG_SAME_UUID 2 /* (Not needed for us usually) */ +#define JBD2_FLAG_DELETED 4 /* Block was deleted */ +#define JBD2_FLAG_LAST_TAG 8 /* This is the last tag in this descriptor block */ + +#endif diff --git a/ext2fs/journal.c b/ext2fs/journal.c new file mode 100644 index 00000000..6564c4aa --- /dev/null +++ b/ext2fs/journal.c @@ -0,0 +1,996 @@ +/* JBD2 binary compliant journal driver. + + Implements "Writeback" journaling mode: + - Metadata (Inodes, Bitmaps, Superblock) is journaled and crash-consistent. + - File Data is written directly to disk (not journaled) and lacks + explicit ordering guarantees relative to the metadata commit. + This provides the best performance but allows for "stale data" in + recently allocated blocks after a crash. + + Copyright (C) 2026 Free Software Foundation, Inc. + Written by Milos Nikic. + + This program is free software; you can redistribute it and/or + modify it under the terms of the GNU General Public License as + published by the Free Software Foundation; either version 2, or (at + your option) any later version. + + This program is distributed in the hope that it will be useful, but + WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + General Public License for more details. 
+ + You should have received a copy of the GNU General Public License + along with this program; if not, write to the Free Software + Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. */ + +#include <stdint.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <unistd.h> + +#include <hurd/ihash.h> + +#include "ext2fs.h" +#include "jbd2_format.h" +#include "journal.h" + +/* Journal Tuning Params */ + +/** + * We limit a single transaction to 1/4 of the total journal size. + * This ensures we can pipeline: + * [ Committing Txn ] + [ Running Txn ] + [ Buffer/Wrap Space ] + */ +#define JRNL_MAX_TRANS_RATIO 4 + +/** + * The minimum size (in blocks) of a transaction. + * Below this, the overhead of commit records outweighs the data throughput. + * 256 blocks = 1MB (assuming 4k blocks). + */ +#define JRNL_MIN_BATCH_BLOCKS 256 + +/** + * Reserve space for metadata overhead (Descriptor blocks + Commit block). + * 32 blocks allows for ~8000 data blocks to be described (approx), + * which is plenty of safety margin for the size limits above. + */ +#define JRNL_METADATA_OVERHEAD 32 + +/** + * Ratio for estimating descriptor blocks. + * We estimate 1 descriptor block for every 32 data blocks. + * (Real capacity is ~250 tags/block, so 32 is a very conservative/safe estimate). + */ +#define JRNL_DESCRIPTOR_RATIO 32 + +/** + * Safety margin during commit. + * Reserves space for: 1 Commit Block + 1 Descriptor Block + 3 blocks slop + * to handle alignment/wrapping edge cases without hitting the tail. + */ +#define JRNL_COMMIT_MARGIN 5 + +/** + * Low Water Mark Ratio. + * If free space drops below 1/8th of the total journal, we force a checkpoint. + * This ensures the NEXT transaction has breathing room to start. + */ +#define JRNL_LOW_WATER_RATIO 8 + +static pthread_t kjournald_tid; + +/** + * This function exists to sync all AND avoid a deadlock with commit. + * It doesn't call journal_commit back yet it syncs everything. 
+ **/ +extern void journal_sync_everything (void); + +/** + * Represents one modified block (4KB) that needs to be written to the journal. + */ +typedef struct journal_buffer +{ + block_t jb_blocknr; /* The physical block number on the filesystem */ + void *jb_shadow_data; /* 4KB Copy of the data to be logged */ + struct journal_buffer *jb_next; /* Linked list next pointer */ + uint32_t jb_log_spot; +} journal_buffer_t; + +/* The state of a transaction in memory */ +typedef enum +{ + T_RUNNING, /* Accepting new handles/buffers */ + T_LOCKED, /* Locked, no new handles, waiting for updates to finish */ + T_FLUSHING, /* Writing to the journal ring buffer */ + T_COMMIT, /* Writing the commit block */ + T_FINISHED /* Done, waiting to be checkpointed */ +} transaction_state_t; + +/* The Transaction Object */ +struct journal_transaction +{ + uint32_t t_tid; /* Transaction ID (Sequence Number) */ + transaction_state_t t_state; + + /* The Log Position */ + uint32_t t_log_start; /* Where this transaction starts in the ring */ + uint32_t t_nr_blocks; /* How many blocks it consumes */ + + uint32_t t_updates; /* Refcount: How many threads are in this transaction? 
*/ + + /* The Payload (The Shadow Buffers) */ + journal_buffer_t *t_buffers; /* Linked List of dirty blocks */ + int t_buffer_count; + struct hurd_ihash t_buffer_map; /* The Map (for O(1) lookups) */ + + /* Timing/Debug */ + long t_start_time; +}; + +/* The Simple Mapper (Virtual -> Physical) */ +typedef struct journal_map +{ + block_t *phys_blocks; /* The 64KB array we malloc'd */ + uint32_t total_blocks; /* 16384 */ + struct node *inode; /* Inode 8 (for keeping ref) */ +} journal_map_t; + +/* The Grand Abstraction */ +typedef struct journal +{ + /* The Physics of it (The Map) */ + journal_map_t map; + + /* The Ring Buffer State (The Logic) */ + uint32_t j_head; /* Where we are writing next */ + uint32_t j_tail; /* The oldest live transaction (checkpoint) */ + uint32_t j_first; /* First block of data (usually 1, after SB) */ + uint32_t j_last; /* Last block of data */ + uint32_t j_free; /* How many blocks left? */ + + /* The Sequence Counter */ + uint32_t j_transaction_sequence; /* Monotonic ID (e.g. 500, 501...) */ + void *j_sb_buffer; /* Buffer holding the journal superblock */ + + pthread_mutex_t j_state_lock; /* Protects the pointers below */ + pthread_cond_t j_commit_wait; /* Conditional variable while waiting for the tx to be ready to commit. 
*/ + /* The Transactions */ + struct journal_transaction *j_running_transaction; /* Currently filling */ + struct journal_transaction *j_committing_transaction; /* Flushing to journal */ + + uint32_t j_max_transaction_buffers; /* Max size of a single transaction */ + uint32_t j_min_free; +} journal_t; + +static void +flush_to_disk (void) +{ + error_t err = store_sync (store); + /* Ignore EOPNOTSUPP (drivers), but warn on real I/O errors */ + if (err && err != EOPNOTSUPP) + ext2_warning ("device flush failed: %s", strerror (err)); +} + +static void +init_map (journal_t *journal, struct node *jnode) +{ + journal->map.total_blocks = jnode->allocsize / block_size; + journal->map.phys_blocks = + malloc (journal->map.total_blocks * sizeof (block_t)); + if (!journal->map.phys_blocks) + ext2_panic ("No RAM for journal map"); + + for (uint32_t i = 0; i < journal->map.total_blocks; i++) + { + block_t phys = 0; + + /* ext2_getblk handles the indirect blocks/fragmentation. */ + error_t err = ext2_getblk (jnode, i, 0, &phys); + + if (err || phys == 0) + { + ext2_panic ("[JOURNAL] Gap in journal file at logical %u!", i); + } + + journal->map.phys_blocks[i] = phys; + } + + journal->map.inode = jnode; +} + +static void +destroy_map (journal_t *journal) +{ + free (journal->map.phys_blocks); + journal->map.total_blocks = 0; + if (journal->map.inode) + diskfs_nput (journal->map.inode); +} + +static void * +kjournald_thread (void *arg) +{ + journal_t *journal = (journal_t *) arg; + while (1) + { + sleep (5); + + if (journal->j_running_transaction) + { + JRNL_LOG_DEBUG ("Woke the journal up:\n" + " - Sequence: %u\n" + " - Start (Head): %u\n" + " - First Data Block: %u\n" + " - Total Blocks: %u", + journal->j_transaction_sequence, journal->j_head, + journal->j_first, journal->j_last); + + // "Lightweight" commit - only writes the log + journal_commit_transaction (journal, NULL); + } + } + return NULL; +} + +static block_t +get_journal_phys_block (journal_t *journal, uint32_t idx) +{ + 
assert_backtrace (idx < journal->map.total_blocks); + return journal->map.phys_blocks[idx]; +} + +/* Centralized logic to map FS Block -> Store Offset */ +static store_offset_t +journal_map_offset (journal_t *journal, uint32_t logical_idx) +{ + block_t phys_block = get_journal_phys_block (journal, logical_idx); + return phys_block << (log2_block_size - store->log2_block_size); +} + +/** + * Writes a full filesystem block (4096 bytes) to the journal. + * Handles the Logical -> Physical -> Store Offset conversion. + */ +static error_t +journal_write_block (journal_t *journal, uint32_t logical_idx, void *data) +{ + store_offset_t offset; + size_t written_amount = 0; + error_t err; + + /* Safety Check */ + if (logical_idx >= journal->map.total_blocks) + { + ext2_warning ("[JOURNAL] Write out of bounds! Index: %u, Max: %u", + logical_idx, journal->map.total_blocks); + return EINVAL; + } + + offset = journal_map_offset (journal, logical_idx); + err = store_write (store, offset, data, block_size, &written_amount); + + if (err) + { + JRNL_LOG_DEBUG + ("[JOURNAL] Write failed at logical %u. Err: %s", + logical_idx, strerror (err)); + return err; + } + + if (written_amount != block_size) + { + JRNL_LOG_DEBUG ("[JOURNAL] Short write! Wanted %u, wrote %lu", + block_size, written_amount); + return EIO; + } + + return 0; +} + +/** + * Reads a full filesystem block (4096 bytes) from the journal into 'out_buf'. + * out_buf must be at least block_size bytes. + */ +static error_t +journal_read_block (journal_t *journal, uint32_t logical_idx, void *out_buf) +{ + store_offset_t offset; + size_t read_amount = 0; + error_t err; + void *read_buf = out_buf; + + if (!out_buf) + return EINVAL; + + if (logical_idx >= journal->map.total_blocks) + { + ext2_warning ("[JOURNAL] Read out of bounds! 
Index: %u, Max: %u", + logical_idx, journal->map.total_blocks); + return EINVAL; + } + + offset = journal_map_offset (journal, logical_idx); + err = store_read (store, offset, block_size, &read_buf, &read_amount); + + if (err) + { + return err; + } + + if (read_amount != block_size) + { + JRNL_LOG_DEBUG ("[JOURNAL] Short read! Wanted %u, got %lu", block_size, + read_amount); + if (read_buf != out_buf) + vm_deallocate (mach_task_self (), (vm_address_t) read_buf, + read_amount); + return EIO; + } + + if (read_buf != out_buf) + { + memcpy (out_buf, read_buf, block_size); + vm_deallocate (mach_task_self (), (vm_address_t) read_buf, read_amount); + } + return 0; +} + +/** + * Reads the JBD2 superblock (Block 0 of the journal file) + * and initializes the journal_t state. + */ +static error_t +journal_load_superblock (journal_t *journal) +{ + error_t err; + journal_superblock_t *jsb; + void *buf; + + buf = malloc (block_size); + if (!buf) + return ENOMEM; + + /* journal_read_block handles all the store_read/vm_deallocate logic internally */ + err = journal_read_block (journal, 0, buf); + + if (err) + { + JRNL_LOG_DEBUG ("[JOURNAL] Failed to read SB. Err: %s", strerror (err)); + free (buf); + return err; + } + + /* Interpret as JBD2 Superblock and verify */ + jsb = (journal_superblock_t *) buf; + uint32_t magic = be32toh (jsb->s_header[0]); + uint32_t type = be32toh (jsb->s_header[1]); + + if (magic != JBD2_MAGIC_NUMBER) + { + ext2_warning ("[JOURNAL] Invalid Magic: %x (Expected %x)", magic, + JBD2_MAGIC_NUMBER); + free (buf); + return EINVAL; + } + + /* Check versions */ + if (type == JBD2_SUPERBLOCK_V1) + { + ext2_warning + ("[JOURNAL] Mounting V1 journal. 64-bit features disabled."); + /* V1 ends at s_errno. Zero out all V2-specific fields. 
*/ + jsb->s_feature_compat = 0; + jsb->s_feature_incompat = 0; + jsb->s_feature_ro_compat = 0; + memset (jsb->s_uuid, 0, 16); + jsb->s_nr_users = 0; + jsb->s_dynsuper = 0; + jsb->s_max_transaction = 0; + jsb->s_max_trans_data = 0; + jsb->s_checksum_type = 0; + memset (jsb->s_padding2, 0, sizeof (jsb->s_padding2)); + jsb->s_checksum = 0; + memset (jsb->s_users, 0, sizeof (jsb->s_users)); + } + else if (type != JBD2_SUPERBLOCK_V2) + { + ext2_warning ("[JOURNAL] Invalid SB Type: %d", type); + free (buf); + return EINVAL; + } + + /* Populate Journal Struct */ + journal->j_first = be32toh (jsb->s_first); + journal->j_head = be32toh (jsb->s_start); + journal->j_tail = journal->j_head; + journal->j_transaction_sequence = be32toh (jsb->s_sequence); + + /* Validate blocksize */ + uint32_t j_bsize = be32toh (jsb->s_blocksize); + if (j_bsize != block_size) + { + ext2_warning ("[JOURNAL] Blocksize mismatch! Journal: %u, FS: %u", + j_bsize, block_size); + free (buf); + return EINVAL; + } + jsb->s_maxlen = htobe32 (journal->map.total_blocks); + + journal->j_sb_buffer = buf; + journal->j_last = journal->map.total_blocks - 1; + journal->j_free = journal->j_last - journal->j_first; + + JRNL_LOG_DEBUG ("Loaded JBD2 Superblock:\n" + " - Sequence: %u\n" + " - Start (Head): %u\n" + " - First Data Block: %u\n" + " - Total Blocks: %u", + journal->j_transaction_sequence, journal->j_head, + journal->j_first, journal->j_last); + return 0; +} + +/* Updates superblock and flushes it to disk */ +static error_t +journal_update_superblock (journal_t *journal, uint32_t sequence, + uint32_t start) +{ + error_t err; + journal_superblock_t *jsb = (journal_superblock_t *) journal->j_sb_buffer; + + /* Update Dynamic Fields */ + jsb->s_sequence = htobe32 (sequence); + jsb->s_start = htobe32 (start); + + JRNL_LOG_DEBUG ("[SB] Updating: Seq %u, Head %u", sequence, start); + err = journal_write_block (journal, 0, jsb); + if (err) + return err; + flush_to_disk (); + return 0; +} + +journal_t * 
+journal_create (struct node *journal_inode) +{ + journal_t *j = calloc (1, sizeof (struct journal)); + if (!j) + ext2_panic ("[JOURNAL] Cannot create journal struct."); + + init_map (j, journal_inode); + + /* Take ownership of the inode ref */ + diskfs_nref (journal_inode); + + /* Set generic defaults (Will be overwritten by Superblock read later) */ + j->j_first = 1; /* Skip SB block by default */ + j->j_last = j->map.total_blocks - 1; + uint32_t total_len = j->j_last - j->j_first; + j->j_free = total_len; + + j->j_max_transaction_buffers = total_len / JRNL_MAX_TRANS_RATIO; + if (j->j_max_transaction_buffers < JRNL_MIN_BATCH_BLOCKS) + j->j_max_transaction_buffers = JRNL_MIN_BATCH_BLOCKS; + j->j_min_free = j->j_max_transaction_buffers + JRNL_METADATA_OVERHEAD; + + if (journal_load_superblock (j) != 0) + { + ext2_panic ("[JOURNAL] Failed to load superblock!"); + } + pthread_mutex_init (&j->j_state_lock, NULL); + pthread_cond_init (&j->j_commit_wait, NULL); + if (pthread_create (&kjournald_tid, NULL, kjournald_thread, j) != 0) + { + JRNL_LOG_DEBUG ("Failed to create a flusher thread."); + } + else + { + JRNL_LOG_DEBUG ("Created flusher thread."); + } + return j; +} + +void +journal_destroy (journal_t *journal) +{ + destroy_map (journal); + pthread_mutex_destroy (&journal->j_state_lock); + pthread_cond_destroy (&journal->j_commit_wait); + + if (journal->j_sb_buffer) + free (journal->j_sb_buffer); + free (journal); +} + +/** + * Called when we are running out of space. + * Since we do a version of sync() on every commit, we can safely declare all + * previous transactions "checkpointed" and reset the log. + */ +static void +journal_force_checkpoint (journal_t *journal, uint32_t current_tid) +{ + JRNL_LOG_DEBUG + ("[CHECKPOINT] Journal Full! 
Forcing Global Sync & Reset..."); + + journal_sync_everything (); + + journal->j_tail = journal->j_head; + journal->j_free = journal->j_last - journal->j_first; + + journal_update_superblock (journal, current_tid, journal->j_head); + + JRNL_LOG_DEBUG + ("[CHECKPOINT] Reset complete. Tail moved to %u. Free space restored.", + journal->j_tail); +} + +static uint32_t +journal_next_log_block (journal_t *journal) +{ + journal->j_head++; + if (journal->j_head > journal->j_last) + { + journal->j_head = journal->j_first; + } + journal->j_free--; + return journal->j_head; +} + +/* Helper to calculate where the next block is, handling the ring buffer wrap. + Must match journal_next_log_block logic exactly! */ +static uint32_t +journal_next_after (journal_t *journal, uint32_t current_block) +{ + uint32_t next = current_block + 1; + /* Wrap around to the first usable block */ + if (next > journal->j_last) + next = journal->j_first; + return next; +} + +/* Writes the Descriptor Block + All Data Blocks (Escaped) */ +static error_t +journal_write_payload (journal_t *journal, + struct journal_transaction *txn, + uint32_t descriptor_loc) +{ + void *descriptor_buf = calloc (1, block_size); + if (!descriptor_buf) + return ENOMEM; + + journal_header_t *hdr = (journal_header_t *) descriptor_buf; + hdr->h_magic = htobe32 (JBD2_MAGIC_NUMBER); + hdr->h_blocktype = htobe32 (JBD2_DESCRIPTOR_BLOCK); + hdr->h_sequence = htobe32 (txn->t_tid); + + uint32_t tag_offset = sizeof (journal_header_t); + journal_buffer_t *jb = txn->t_buffers; + error_t err = 0; + + while (jb) + { + if (tag_offset + sizeof (journal_block_tag_t) > block_size) + { + ext2_warning ("[COMMIT] Descriptor overflow! 
Dropping tags."); + break; + } + + journal_block_tag_t *tag = + (journal_block_tag_t *) ((char *) descriptor_buf + tag_offset); + tag->t_blocknr = htobe32 (jb->jb_blocknr); + + uint32_t flags = JBD2_FLAG_SAME_UUID; + if (jb->jb_next == NULL) + flags |= JBD2_FLAG_LAST_TAG; + + /* Escaping Logic: If data looks like a header, mask it. */ + uint32_t *data_head = (uint32_t *) jb->jb_shadow_data; + if (*data_head == htobe32 (JBD2_MAGIC_NUMBER)) + { + flags |= JBD2_FLAG_ESCAPE; + *data_head = 0; + } + tag->t_flags = htobe32 (flags); + + jb = jb->jb_next; + tag_offset += sizeof (journal_block_tag_t); + } + + /* Write Descriptor */ + JRNL_LOG_DEBUG ("[COMMIT] Writing Descriptor to %u", descriptor_loc); + err = journal_write_block (journal, descriptor_loc, descriptor_buf); + free (descriptor_buf); + if (err) + return err; + + /* Write Data Blocks */ + jb = txn->t_buffers; + while (jb) + { + err = + journal_write_block (journal, jb->jb_log_spot, jb->jb_shadow_data); + if (err) + return err; + jb = jb->jb_next; + } + + return 0; +} + +/* Writes the Commit Block */ +static error_t +journal_write_commit_record (journal_t *journal, + struct journal_transaction *txn, + uint32_t commit_loc) +{ + void *commit_buf = calloc (1, block_size); + if (!commit_buf) + return ENOMEM; + + journal_header_t *hdr = (journal_header_t *) commit_buf; + hdr->h_magic = htobe32 (JBD2_MAGIC_NUMBER); + hdr->h_blocktype = htobe32 (JBD2_COMMIT_BLOCK); + hdr->h_sequence = htobe32 (txn->t_tid); + + error_t err = journal_write_block (journal, commit_loc, commit_buf); + free (commit_buf); + return err; +} + +/* Cleans up the transaction. 
*/ +static error_t +journal_cleanup_transaction (struct journal_transaction *txn, error_t err) +{ + journal_buffer_t *jb = txn->t_buffers; + while (jb) + { + journal_buffer_t *next = jb->jb_next; + free (jb->jb_shadow_data); + free (jb); + jb = next; + } + hurd_ihash_destroy (&txn->t_buffer_map); + free (txn); + return err; +} + +error_t +journal_commit_transaction (journal_t *journal, uint32_t *out_j_head) +{ + struct journal_transaction *txn; + error_t err = 0; + uint32_t descriptor_loc, commit_loc; + journal_buffer_t *jb; + + pthread_mutex_lock (&journal->j_state_lock); + txn = journal->j_running_transaction; + + if (!txn || txn->t_state != T_RUNNING) + { + pthread_mutex_unlock (&journal->j_state_lock); + return EINVAL; + } + + journal->j_running_transaction = NULL; + txn->t_state = T_LOCKED; + + while (txn->t_updates > 0) + { + pthread_cond_wait (&journal->j_commit_wait, &journal->j_state_lock); + } + txn->t_state = T_FLUSHING; + + uint32_t needed = + txn->t_nr_blocks + (txn->t_nr_blocks / JRNL_DESCRIPTOR_RATIO) + + JRNL_COMMIT_MARGIN; + uint32_t low_water = + (journal->j_last - journal->j_first) / JRNL_LOW_WATER_RATIO; + + if (journal->j_free < needed || journal->j_free < low_water) + journal_force_checkpoint (journal, txn->t_tid); + + /* Reserve Blocks */ + descriptor_loc = journal_next_log_block (journal); + jb = txn->t_buffers; + while (jb) + { + jb->jb_log_spot = journal_next_log_block (journal); + jb = jb->jb_next; + } + commit_loc = journal_next_log_block (journal); + + pthread_mutex_unlock (&journal->j_state_lock); + + /* Write Data (I/O) */ + err = journal_write_payload (journal, txn, descriptor_loc); + if (err) + return journal_cleanup_transaction (txn, err); + + /* Ensure Data is on disk */ + flush_to_disk (); + + /* Write Commit Record */ + err = journal_write_commit_record (journal, txn, commit_loc); + if (err) + return journal_cleanup_transaction (txn, err); + + /* Ensure Commit is persistent */ + flush_to_disk (); + + /* Finalize Metadata */ + 
pthread_mutex_lock (&journal->j_state_lock); + + if (journal->j_tail == 0) + { + journal->j_tail = journal->j_first; + journal_update_superblock (journal, txn->t_tid, journal->j_first); + } + if (out_j_head) + *out_j_head = journal->j_head; + pthread_mutex_unlock (&journal->j_state_lock); + + return journal_cleanup_transaction (txn, 0); +} + +/** + * Ensures there is a VALID running transaction to attach to. + * Returns 0 on success, or error code. + */ +error_t +journal_start_transaction (journal_t *journal) +{ + struct journal_transaction *txn; + + if (!journal) + return EINVAL; + + pthread_mutex_lock (&journal->j_state_lock); + txn = journal->j_running_transaction; + + if (txn) + { + /* If there is a transaction, it MUST be RUNNING. + If it's anything else (LOCKED, FLUSHING), it means commit hasn't detached it yet. + In other words: we have a logic bug! + */ + if (txn->t_state != T_RUNNING) + { + ext2_panic + ("[TRX] Logic Error: Running transaction is not T_RUNNING!"); + } + txn->t_updates++; + } + else + { + txn = calloc (1, sizeof (struct journal_transaction)); + if (!txn) + { + pthread_mutex_unlock (&journal->j_state_lock); + return ENOMEM; + } + + hurd_ihash_init (&txn->t_buffer_map, HURD_IHASH_NO_LOCP); + txn->t_tid = journal->j_transaction_sequence++; + txn->t_state = T_RUNNING; + txn->t_updates = 1; + + journal->j_running_transaction = txn; + JRNL_LOG_DEBUG ("[TRX] Created NEW TID %u", txn->t_tid); + } + + pthread_mutex_unlock (&journal->j_state_lock); + return 0; +} + +void +journal_stop_transaction (journal_t *journal) +{ + struct journal_transaction *txn; + + if (!journal) + return; + + pthread_mutex_lock (&journal->j_state_lock); + + txn = journal->j_running_transaction; + if (!txn) + { + ext2_warning + ("[TRX] stop_transaction called but no transaction running!"); + pthread_mutex_unlock (&journal->j_state_lock); + return; + } + + txn->t_updates--; + if (txn->t_updates == 0) + { + pthread_cond_broadcast (&journal->j_commit_wait); + } + 
pthread_mutex_unlock (&journal->j_state_lock); +} + +/** + * Adds a modified filesystem block to the current running transaction. + * Performs a "Shadow Copy" of the data immediately. + */ +error_t +journal_dirty_block (journal_t *journal, block_t fs_blocknr, const void *data) +{ + struct journal_transaction *txn; + journal_buffer_t *jb; + journal_buffer_t *new_jb; + error_t err; + + if (!journal || !data) + return EINVAL; + + pthread_mutex_lock (&journal->j_state_lock); + + txn = journal->j_running_transaction; + + if (!txn || txn->t_state != T_RUNNING) + { + JRNL_LOG_DEBUG + ("[ERROR] journal_dirty_block called outside of transaction!"); + pthread_mutex_unlock (&journal->j_state_lock); + return EPERM; + } + + if (txn->t_nr_blocks >= journal->j_max_transaction_buffers) + { + JRNL_LOG_DEBUG + ("[TRX] Transaction %u too big (%u blocks). Rolling over.", + txn->t_tid, txn->t_nr_blocks); + + txn->t_updates--; + + pthread_mutex_unlock (&journal->j_state_lock); + + /* Commit the old one */ + journal_commit_transaction (journal, NULL); + + /* Start the new one (Implicitly sets updates=1 for us) */ + journal_start_transaction (journal); + + pthread_mutex_lock (&journal->j_state_lock); + txn = journal->j_running_transaction; + + if (!txn || txn->t_state != T_RUNNING) + { + ext2_panic ("[TRX] Failed to roll over transaction!"); + } + } + + /* FAST PATH using Hurd's libihash */ + jb = (journal_buffer_t *) hurd_ihash_find (&txn->t_buffer_map, + (hurd_ihash_key_t) fs_blocknr); + + if (jb) + { + memcpy (jb->jb_shadow_data, data, block_size); + pthread_mutex_unlock (&journal->j_state_lock); + return 0; + } + + /* SLOW PATH: Allocate new buffer wrapper */ + new_jb = malloc (sizeof (journal_buffer_t)); + if (!new_jb) + { + pthread_mutex_unlock (&journal->j_state_lock); + return ENOMEM; + } + + new_jb->jb_shadow_data = malloc (block_size); + if (!new_jb->jb_shadow_data) + { + free (new_jb); + pthread_mutex_unlock (&journal->j_state_lock); + return ENOMEM; + } + + new_jb->jb_blocknr 
= fs_blocknr; + memcpy (new_jb->jb_shadow_data, data, block_size); + + /* Insert it into Hash Map */ + err = hurd_ihash_add (&txn->t_buffer_map, (hurd_ihash_key_t) fs_blocknr, + (hurd_ihash_value_t) new_jb); + if (err) + { + free (new_jb->jb_shadow_data); + free (new_jb); + pthread_mutex_unlock (&journal->j_state_lock); + return err; + } + + /* Link into the Transaction List */ + new_jb->jb_next = txn->t_buffers; + txn->t_buffers = new_jb; + + txn->t_buffer_count++; + txn->t_nr_blocks++; + + pthread_mutex_unlock (&journal->j_state_lock); + return 0; +} + +/** + * Check if a specific filesystem block is currently part of the Running + * Transaction. + * Returns: 1 if the block is "pinned" (must not be written to disk yet), + * 0 if it is safe to write. + */ +int +journal_block_is_active (journal_t *journal, block_t blocknr) +{ + struct journal_transaction *txn; + int is_active = 0; + + if (!journal) + return 0; + + pthread_mutex_lock (&journal->j_state_lock); + txn = journal->j_running_transaction; + + if (txn && txn->t_state == T_RUNNING) + { + if (hurd_ihash_find (&txn->t_buffer_map, (hurd_ihash_key_t) blocknr)) + { + is_active = 1; + } + } + + pthread_mutex_unlock (&journal->j_state_lock); + return is_active; +} + +/** + * Called after a full filesystem sync. + * Frees all journal space used by committed transactions, + * since we know their data is now safe on the permanent disk. 
+ */ +void +journal_reclaim_space (journal_t *journal, uint32_t barrier_limit) +{ + if (!journal) + return; + + pthread_mutex_lock (&journal->j_state_lock); + + /* If the tail is already there, do nothing */ + if (journal->j_tail == barrier_limit) + { + pthread_mutex_unlock (&journal->j_state_lock); + return; + } + + JRNL_LOG_DEBUG ("[CHECKPOINT] Advancing tail %u -> %u", journal->j_tail, + barrier_limit); + + journal->j_tail = barrier_limit; + uint32_t capacity = journal->j_last - journal->j_first + 1; + uint32_t used = 0; + + if (journal->j_head >= journal->j_tail) + { + /* Simple case: [ ... T ... H ... ] */ + used = journal->j_head - journal->j_tail; + } + else + { + /* Wrap case: [ ... H ... T ... ] */ + /* Space from Tail to End + Space from Start to Head */ + used = (journal->j_last - journal->j_tail + 1) + + (journal->j_head - journal->j_first); + } + journal->j_free = capacity - used; + + /* Update Superblock to persist the new tail */ + journal_update_superblock (journal, journal->j_transaction_sequence, + journal->j_tail); + + pthread_mutex_unlock (&journal->j_state_lock); +} diff --git a/ext2fs/journal.h b/ext2fs/journal.h new file mode 100644 index 00000000..947ef5db --- /dev/null +++ b/ext2fs/journal.h @@ -0,0 +1,97 @@ +/* JBD2 binary compliant journal driver. + + Implements "Writeback" journaling mode: + - Metadata (Inodes, Bitmaps, Superblock) is journaled and crash-consistent. + - File Data is written directly to disk (not journaled) and lacks + explicit ordering guarantees relative to the metadata commit. + This provides the best performance but allows for "stale data" in + recently allocated blocks after a crash. + + Copyright (C) 2026 Free Software Foundation, Inc. + Written by Milos Nikic. + + This program is free software; you can redistribute it and/or + modify it under the terms of the GNU General Public License as + published by the Free Software Foundation; either version 2, or (at + your option) any later version. 
+ + This program is distributed in the hope that it will be useful, but + WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program; if not, write to the Free Software + Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. */ + +#ifndef _JOURNAL_H +#define _JOURNAL_H + +#include <stdint.h> +#include <stdio.h> + +#include "ext2fs.h" + +#ifndef JOURNAL_DEBUG +#define JOURNAL_DEBUG 0 /* Set to enable (very chatty) debug messages. */ +#endif + +#if JOURNAL_DEBUG +#define JRNL_LOG_DEBUG(fmt, ...) \ + do { \ + fprintf(stderr, "[JRNL][DEBUG] " fmt "\n", ##__VA_ARGS__); \ + fflush(stderr); \ + } while (0) +#else +#define JRNL_LOG_DEBUG(fmt, ...) do { } while (0) +#endif + +/* Opaque handle for the journal object */ +typedef struct journal journal_t; + +/* Initialize the journal subsystem using the inode provided (usually Inode 8). */ +journal_t *journal_create (struct node *journal_inode); + +/* Clean up and free the journal resources. */ +void journal_destroy (journal_t * journal); + + +/** + * Start tx: Ensure a valid running transaction exists. + * Must be called before modifying any metadata. + * Increments the transaction update count. + */ +error_t journal_start_transaction (journal_t * journal); + +/** + * Mark dirty: Add a modified filesystem block to the current transaction. + * Performs a shadow copy of 'data' into the journal memory. + */ +error_t +journal_dirty_block (journal_t * journal, block_t fs_blocknr, + const void *data); + +/** + * Stop tx: Decrement the transaction update count. + * When the count reaches zero, the transaction is eligible for commit. + */ +void journal_stop_transaction (journal_t * journal); + +/** + * Commit: Force the current running transaction to the log. 
+ * This contains the write barriers (flush_to_disk) that guarantee durability. + */ +error_t journal_commit_transaction (journal_t *journal, uint32_t *out_j_head); + +/** + * Called after a full filesystem sync. + * Frees all journal space used by committed transactions, + * since we know their data is now safe on the permanent disk. + */ +void +journal_reclaim_space (journal_t *journal, uint32_t barrier_limit); + +/* Check if a block is currently pinned in a running transaction. */ +int journal_block_is_active (journal_t *journal, block_t blocknr); + +#endif /* _JOURNAL_H */ diff --git a/ext2fs/pager.c b/ext2fs/pager.c index 1c795784..b9ea175e 100644 --- a/ext2fs/pager.c +++ b/ext2fs/pager.c @@ -25,6 +25,7 @@ #include <inttypes.h> #include <hurd/store.h> #include "ext2fs.h" +#include "journal.h" /* XXX */ #include "../libpager/priv.h" @@ -648,6 +649,11 @@ disk_pager_write_page (vm_offset_t page, void *buf) while (length > 0 && !err) { block_t block = boffs_block (offset); + if (ext2_journal && journal_block_is_active (ext2_journal, block)) + { + JRNL_LOG_DEBUG ("Pageout conflict on Block %u -> Forcing Commit", block); + journal_commit_transaction (ext2_journal, NULL); + } /* We don't clear the block modified bit here because this paging write request may not be the same one that actually set the bit, @@ -1580,6 +1586,30 @@ diskfs_shutdown_pager (void) pager, just make sure it's synced. */ } +static error_t +journal_sync_one (void *v_p) +{ + struct pager *p = v_p; + pager_sync (p, 1); + return 0; +} + +/** + * Sync all the pagers synchronously, but don't call + * journal_commit here. It would deadlock. 
+ **/ +void +journal_sync_everything (void) +{ + write_all_disknodes (); + ports_bucket_iterate (file_pager_bucket, journal_sync_one); + sync_global (1); + error_t err = store_sync (store); + /* Ignore EOPNOTSUPP (drivers), but warn on real I/O errors */ + if (err && err != EOPNOTSUPP) + ext2_warning ("device flush failed: %s", strerror (err)); +} + /* Sync all the pagers. */ void diskfs_sync_everything (int wait) @@ -1591,6 +1621,12 @@ diskfs_sync_everything (int wait) return 0; } + uint32_t safe_journal_limit = 0; + if (ext2_journal) + { + /* We only commit if we have a running transaction */ + journal_commit_transaction (ext2_journal, &safe_journal_limit); + } write_all_disknodes (); ports_bucket_iterate (file_pager_bucket, sync_one); @@ -1602,6 +1638,8 @@ diskfs_sync_everything (int wait) /* Ignore EOPNOTSUPP (drivers), but warn on real I/O errors */ if (err && err != EOPNOTSUPP) ext2_warning ("device flush failed: %s", strerror (err)); + if (!err && ext2_journal) + journal_reclaim_space (ext2_journal, safe_journal_limit); } } diff --git a/ext2fs/truncate.c b/ext2fs/truncate.c index aa3a5a60..8f0572c4 100644 --- a/ext2fs/truncate.c +++ b/ext2fs/truncate.c @@ -19,6 +19,7 @@ Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. */ #include "ext2fs.h" +#include "journal.h" #ifdef DONT_CACHE_MEMORY_OBJECTS #define MAY_CACHE 0 @@ -339,12 +340,15 @@ diskfs_truncate (struct node *node, off_t length) pthread_rwlock_wrlock (&diskfs_node_disknode (node)->alloc_lock); + if (ext2_journal) + journal_start_transaction (ext2_journal); + /* Update the size on disk; fsck will finish freeing blocks if necessary should we crash. 
*/ node->dn_stat.st_size = length; node->dn_set_mtime = 1; node->dn_set_ctime = 1; - diskfs_node_update (node, diskfs_synchronous); + diskfs_node_update (node, 0); err = diskfs_catch_exception (); if (!err) @@ -380,10 +384,23 @@ diskfs_truncate (struct node *node, off_t length) node->dn_set_ctime = 1; node->dn_stat_dirty = 1; + if (ext2_journal) + journal_stop_transaction (ext2_journal); + /* Now we can permit delayed copies again. */ enable_delayed_copies (node); pthread_rwlock_unlock (&diskfs_node_disknode (node)->alloc_lock); + if (diskfs_synchronous) + { + diskfs_node_update (node, 1); + } + else + { + if (ext2_journal) + diskfs_node_update (node, 0); + } + return err; } diff --git a/libdiskfs/diskfs.h b/libdiskfs/diskfs.h index 5f832dd7..7f420725 100644 --- a/libdiskfs/diskfs.h +++ b/libdiskfs/diskfs.h @@ -507,6 +507,11 @@ error_t diskfs_validate_flags_change (struct node *np, int flags); changed to RDEV; otherwise return an error code. */ error_t diskfs_validate_rdev_change (struct node *np, dev_t rdev); +/* The user may define this function. It is called immediately when + a node's metadata (stat info) is modified in memory, even if + diskfs_synchronous is false. The default definition does nothing. */ +void diskfs_notify_change (struct node *np); + /* The user must define this function. Sync the info in NP->dn_stat and any associated format-specific information to disk. If WAIT is true, then return only after the physicial media has been completely updated. */ diff --git a/libdiskfs/node-modified.c b/libdiskfs/node-modified.c new file mode 100644 index 00000000..a29dc39f --- /dev/null +++ b/libdiskfs/node-modified.c @@ -0,0 +1,28 @@ +/* Default version of diskfs_notify_change + Copyright (C) 2026 Free Software Foundation, Inc. + Written by Milos Nikic. + + This file is part of the GNU Hurd. 
+ + The GNU Hurd is free software; you can redistribute it and/or + modify it under the terms of the GNU General Public License as + published by the Free Software Foundation; either version 2, or (at + your option) any later version. + + The GNU Hurd is distributed in the hope that it will be useful, but + WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program; if not, write to the Free Software + Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111, USA. */ + +#include "priv.h" +#include "diskfs.h" + +void __attribute__ ((weak)) +diskfs_notify_change (struct node *np) +{ + // default function does nothing. +} diff --git a/libdiskfs/priv.h b/libdiskfs/priv.h index ca3c23ca..e8186d49 100644 --- a/libdiskfs/priv.h +++ b/libdiskfs/priv.h @@ -140,7 +140,13 @@ extern fshelp_fetch_root_callback2_t _diskfs_translator_callback2; pthread_mutex_lock (&np->lock); \ (OPERATION); \ if (diskfs_synchronous) \ + { \ diskfs_node_update (np, 1); \ + } \ + else \ + { \ + diskfs_notify_change (np); \ + } \ pthread_mutex_unlock (&np->lock); \ return err; \ }) -- 2.52.0
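A note on the head/tail arithmetic in journal_next_log_block, journal_next_after and journal_reclaim_space: below is a minimal standalone sketch of the same ring-buffer logic, meant only as a way to test the wrap cases in isolation. The struct ring and the ring_* names are hypothetical and not part of the patch; the sketch assumes, as the patch appears to, that the usable log blocks are first..last inclusive. It also makes one detail easy to see: journal_reclaim_space computes capacity as j_last - j_first + 1, while journal_load_superblock and journal_force_checkpoint set j_free to j_last - j_first, so the two may differ by one block.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the j_first/j_last/j_head/j_tail fields.  */
struct ring
{
  uint32_t first;		/* first usable log block */
  uint32_t last;		/* last usable log block (inclusive) */
  uint32_t head;		/* most recently written block */
  uint32_t tail;		/* oldest block that is still needed */
};

/* Block following CUR, wrapping from LAST back to FIRST.  */
static uint32_t
ring_next (const struct ring *r, uint32_t cur)
{
  return cur >= r->last ? r->first : cur + 1;
}

/* Blocks between tail and head, mirroring journal_reclaim_space.  */
static uint32_t
ring_used (const struct ring *r)
{
  if (r->head >= r->tail)
    return r->head - r->tail;	/* simple case: [ ... T ... H ... ] */
  /* Wrap case: tail to end, plus start to head.  */
  return (r->last - r->tail + 1) + (r->head - r->first);
}

/* Free blocks, using the inclusive capacity (last - first + 1).  */
static uint32_t
ring_free (const struct ring *r)
{
  assert (r->first <= r->last);
  uint32_t capacity = r->last - r->first + 1;
  return capacity - ring_used (r);
}
```

With first = 1, last = 7, head = 2, tail = 6 (the wrap case), ring_used gives 3 and ring_free gives 4.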

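On the escaping in journal_write_payload: the round trip can be sketched as below. This is an illustration under stated assumptions, not code from the patch. The sketch_* names are hypothetical, the replay (unescape) side is not part of this patch and is shown only to make the round trip visible, and the constants are the JBD2 on-disk values as I understand them (magic 0xC03B3998 stored big-endian, escape flag 1). The comparison is done bytewise so the sketch does not depend on host endianness.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed JBD2 on-disk values; the patch's own headers define the real ones.  */
#define SKETCH_FLAG_ESCAPE 1U
static const unsigned char sketch_magic_be[4] = { 0xC0, 0x3B, 0x39, 0x98 };

/* Write side: if BLOCK starts with the journal magic, zero those four
   bytes and return the flag to OR into the block's descriptor tag.  */
static uint32_t
sketch_escape_block (unsigned char *block)
{
  assert (block != NULL);
  if (memcmp (block, sketch_magic_be, sizeof sketch_magic_be) != 0)
    return 0;
  memset (block, 0, sizeof sketch_magic_be);
  return SKETCH_FLAG_ESCAPE;
}

/* Replay side (hypothetical): restore the magic if the tag says the
   block was escaped when it was logged.  */
static void
sketch_unescape_block (unsigned char *block, uint32_t tag_flags)
{
  assert (block != NULL);
  if (tag_flags & SKETCH_FLAG_ESCAPE)
    memcpy (block, sketch_magic_be, sizeof sketch_magic_be);
}
```

The point of the flag is that a data block in the log can never be mistaken for a descriptor or commit header while scanning, yet the original bytes are fully recoverable.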