Journaling and the ext3 filesystem
Kedar Sovani
Contents
1 Journaling
2 ext3 and journaling
2.1 The handle
2.2 create system call
3 Mounting ext3
1 Journaling
Journaling is a term borrowed from databases. One of the characteristics
of a database is the atomic nature of transactions: a transaction either
goes through completely or does not go through at all, and its effects are
either seen completely or not seen at all. Hard disks provide atomicity at
the sector level, which means that a write started on a sector of the hard
disk will either go through completely or not at all. But when a
transaction spans multiple sectors on the hard disk, some higher-level
mechanism is needed to ensure that the modifications to the entire set of
sectors occur atomically. For example, if a transaction has to modify 3
sectors and the machine crashes after modifying only 2 of them, the
database is left inconsistent, since the 3rd sector still contains stale
data.
Databases use a technique called journaling to maintain the atomicity of
operations. This technique consists of writing all the modified sectors to
a separate portion of the storage called the journal, instead of
overwriting the actual locations on the hard disk. The actual locations on
the hard disk are overwritten with the contents of this journal _only_
after making sure that all the sectors associated with the transaction
have been flushed to the journal. In order to identify whether all the
sectors belonging to a transaction are present in the journal, a commit
record is flushed to the journal at the end of the transaction. Using the
above example, let us see what happens when the machine crashes at each of
the following points:
* After the commit record is flushed to the journal.
  In this case, when the machine comes back up again, it checks the
  journal. It finds a transaction with the commit record at the end of it.
  The commit record indicates that this is a completed transaction and can
  be written to the actual location. All the sectors belonging to this
  transaction are written to their actual locations on the disk,
  overwriting the previous contents (replaying the journal).
* After only two sectors are flushed to the journal.
  In this case, when the machine comes back up again and checks the
  journal, it finds a transaction with no commit record at the end of it.
  This indicates that the transaction may not be complete, and hence NO
  modifications are made to the disk.
* After the 3 sectors are flushed to the journal but the commit record is
  not yet flushed.
  Even in this case, because of the absence of the commit record, no
  modifications are made to the disk.
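The three crash scenarios above can be tried out with a small user-space
sketch in C. It is only a toy model and not how any real journal is laid
out: the names toy_disk, toy_journal, toy_log_sector(), toy_commit(),
toy_recover() and crash_and_recover() are made up for this illustration,
and a "sector" is just a tiny in-memory array.

```c
#include <assert.h>
#include <string.h>

#define SECTOR_SIZE  8   /* tiny sectors keep the example readable */
#define DISK_SECTORS 4
#define MAX_TXN      3   /* the example transaction modifies 3 sectors */

/* One journaled copy of a sector plus the location it belongs to. */
struct journal_entry {
    int  target;                  /* sector number on the toy disk */
    char data[SECTOR_SIZE];
};

struct toy_journal {
    struct journal_entry entries[MAX_TXN];
    int nr_entries;               /* sectors flushed to the journal so far */
    int committed;                /* 1 once the commit record is flushed */
};

struct toy_disk {
    char sectors[DISK_SECTORS][SECTOR_SIZE];
};

/* Flush one modified sector to the journal, not to its real location. */
static void toy_log_sector(struct toy_journal *j, int target, const char *data)
{
    assert(j->nr_entries < MAX_TXN && !j->committed);
    j->entries[j->nr_entries].target = target;
    strncpy(j->entries[j->nr_entries].data, data, SECTOR_SIZE - 1);
    j->entries[j->nr_entries].data[SECTOR_SIZE - 1] = '\0';
    j->nr_entries++;
}

/* Flush the commit record that closes the transaction. */
static void toy_commit(struct toy_journal *j)
{
    j->committed = 1;
}

/* Recovery after a crash: replay only if a commit record is present.
 * Returns the number of sectors written to their real locations. */
static int toy_recover(const struct toy_journal *j, struct toy_disk *d)
{
    int i;

    if (!j->committed)
        return 0;                 /* incomplete transaction: touch nothing */
    for (i = 0; i < j->nr_entries; i++)
        memcpy(d->sectors[j->entries[i].target], j->entries[i].data,
               SECTOR_SIZE);
    return j->nr_entries;
}

/* Run the 3-sector example, "crashing" once `logged` sectors have reached
 * the journal and, if `commit` is set, the commit record as well. */
static int crash_and_recover(int logged, int commit, struct toy_disk *d)
{
    static const char *new_data[MAX_TXN] = { "new0", "new1", "new2" };
    struct toy_journal j;
    int i;

    assert(logged >= 0 && logged <= MAX_TXN);
    memset(&j, 0, sizeof(j));
    memset(d, 0, sizeof(*d));
    for (i = 0; i < DISK_SECTORS; i++)
        strcpy(d->sectors[i], "old");
    for (i = 0; i < logged; i++)
        toy_log_sector(&j, i, new_data[i]);
    if (commit)
        toy_commit(&j);
    /* ... crash here; the machine comes back up and checks the journal ... */
    return toy_recover(&j, d);
}
```

Calling crash_and_recover(3, 1, &d) corresponds to the first scenario and
leaves the three new sectors on the toy disk. crash_and_recover(2, 0, &d)
and crash_and_recover(3, 0, &d) correspond to the other two and leave the
disk untouched.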
Going a step further, note that in the first case there could be another
crash while the journal is being replayed. No harm is done: the contents
of the journal have not been modified, so the next time around the same
information is still present in the journal and the situation is the same
as in the first case. The replay simply starts over from the beginning.
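The reason a crash during replay is harmless is that replaying is
idempotent: copying the same journaled sectors to the same locations a
second time gives the same result. The sketch below illustrates this under
simplifying assumptions. A single int stands in for a sector, and the
names logged_sector, replay() and replay_twice_matches() are invented for
the example.

```c
#include <assert.h>
#include <string.h>

#define NR_LOGGED 3

struct logged_sector {
    int target;   /* final location on the toy disk */
    int value;    /* new contents (one int stands in for a sector) */
};

/* Copy journaled sectors to their final locations, stopping early after
 * `crash_after` writes to imitate a crash (pass NR_LOGGED for a full run).
 * The journal is only read, never modified. Returns the writes done. */
static int replay(const struct logged_sector *journal, int *disk,
                  int crash_after)
{
    int i;

    assert(crash_after >= 0);
    for (i = 0; i < NR_LOGGED && i < crash_after; i++)
        disk[journal[i].target] = journal[i].value;
    return i;
}

/* Crash part-way through a first replay, replay again from the start, and
 * compare with a disk that was replayed once without interruption.
 * Returns 1 if both disks end up identical. */
static int replay_twice_matches(int crash_after)
{
    const struct logged_sector journal[NR_LOGGED] = {
        { 0, 10 }, { 1, 11 }, { 2, 12 }
    };
    int interrupted[4] = { 1, 2, 3, 4 };
    int clean[4]       = { 1, 2, 3, 4 };

    replay(journal, interrupted, crash_after); /* first attempt, cut short */
    replay(journal, interrupted, NR_LOGGED);   /* second attempt after reboot */
    replay(journal, clean, NR_LOGGED);         /* reference: one clean replay */
    return memcmp(interrupted, clean, sizeof(clean)) == 0;
}
```

Whatever point the first replay is cut off at, replay_twice_matches()
reports that the final disk contents are the same as after one clean
replay.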
Thus, with the help of journaling it can be assured that transactions are
committed atomically to the disk.
In the case of filesystems, all the metadata updates should be journaled
in order for the metadata to remain consistent. In the absence of
journaling, a crash may leave the filesystem metadata stored on the disk
inconsistent. Special programs (such as fsck) then need to check the
filesystem and fix any inconsistencies caused by the crash. This check is
a time-consuming process, and the time taken grows with the size of the
disk.
Journaling filesystems reduce boot time by replacing the usually
time-consuming filesystem checks with fast and efficient journal replays.
It should be noted that journaling guarantees the consistency of metadata.
It does not make any guarantees about the consistency of the data
associated with a file.
2 ext3 and journaling
The ext3 filesystem is a journaling filesystem. It uses the journaling
facilities provided by the jbd module in the Linux kernel.
The ext3 code base is essentially the ext2 code base with the additional
functionality of journaling. Other than the journaling support, it has no
changes from the ext2 filesystem. This will become clearer in the code
snippets discussed in this section.
The designers of ext3 split the design into two components. They designed
a journaling layer which does only journaling and is independent of the
ext3 filesystem on top of it. The objective was that this layer could
potentially serve other modules which require journaling support as well.
Hence, jbd provides an API which can be used by ext3 and other modules to
incorporate journaling.
2.1 The handle
ext3 needs some way of informing the journaling layer which set of updates
forms a single atomic update [1]. To address this, the journaling layer
provides the concept of a handle. When ext3 wants to perform an atomic
update, it tells the journaling layer the number of block updates that
constitute this single atomic update [2]. This is accomplished by the
journal_start() function call, which returns a handle to ext3. The handle
is an opaque structure for the ext3 layer. It is used by the ext3 layer,
while communicating with the journaling layer, to identify the update in
progress. Once the update is complete, the handle is released by the
journal_stop() function call.
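The life cycle of a handle can be pictured with a small user-space model.
This is a sketch under stated assumptions and not the jbd implementation:
toy_journal, toy_handle, toy_journal_start(), toy_journal_use_block() and
toy_journal_stop() are hypothetical names, and where this toy simply
refuses a request that does not fit, a real journaling layer has other
options (such as making the caller wait for space).

```c
#include <assert.h>
#include <stdlib.h>

struct toy_journal {
    int free_blocks;   /* journal space not yet promised to any handle */
};

/* The filesystem treats this as opaque and only passes it back. */
struct toy_handle {
    struct toy_journal *journal;
    int credits;       /* blocks this update may still journal */
};

/* Begin an atomic update that will modify at most `nblocks` blocks. */
static struct toy_handle *toy_journal_start(struct toy_journal *j, int nblocks)
{
    struct toy_handle *h;

    if (nblocks <= 0 || nblocks > j->free_blocks)
        return NULL;   /* toy simplification: fail instead of waiting */
    h = malloc(sizeof(*h));
    if (!h)
        return NULL;
    j->free_blocks -= nblocks;   /* reserve the space up front */
    h->journal = j;
    h->credits = nblocks;
    return h;
}

/* Account for one block journaled under this handle.
 * Returns 0 on success, -1 if the update exceeds what was declared. */
static int toy_journal_use_block(struct toy_handle *h)
{
    if (h->credits == 0)
        return -1;
    h->credits--;
    return 0;
}

/* End the update: reservations that were not used go back to the journal. */
static void toy_journal_stop(struct toy_handle *h)
{
    assert(h && h->journal);
    h->journal->free_blocks += h->credits;
    free(h);
}
```

Reserving the blocks when the handle is created is what lets the toy
guarantee that an update which has started can always finish, which is the
point made in footnote 2.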
2.2 create system call
To understand how ext3 uses journaling, let us first trace through the
code for creating a filesystem object (file/directory) in the ext3
filesystem. Comparing the ext3 object creation code with the corresponding
ext2 code will help in understanding the precise changes made for
journaling. For simplicity, let us focus on the creation of a file rather
than a directory.
Here is the code for the create system call in ext2 and ext3.
ext2 :
static int ext2_create (struct inode * dir, struct dentry * dentry,
                        int mode)
{
    struct inode * inode = ext2_new_inode (dir, mode);
    int err = PTR_ERR(inode);
    if (!IS_ERR(inode)) {
        inode->i_op = &ext2_file_inode_operations;
        inode->i_fop = &ext2_file_operations;
        inode->i_mapping->a_ops = &ext2_aops;
        mark_inode_dirty(inode);
        err = ext2_add_nondir(dentry, inode);
    }
    return err;
}
ext3 :
static int ext3_create (struct inode * dir, struct dentry * dentry,
                        int mode)
{
    handle_t *handle;
    struct inode * inode;
    int err;

    handle = ext3_journal_start(dir, EXT3_DATA_TRANS_BLOCKS + 3);
    if (IS_ERR(handle))
        return PTR_ERR(handle);
    if (IS_SYNC(dir))
        handle->h_sync = 1;
    inode = ext3_new_inode (handle, dir, mode);
    err = PTR_ERR(inode);
    if (!IS_ERR(inode)) {
        inode->i_op = &ext3_file_inode_operations;
        inode->i_fop = &ext3_file_operations;
        inode->i_mapping->a_ops = &ext3_aops;
        err = ext3_add_nondir(handle, dentry, inode);
    }
    ext3_journal_stop(handle, dir);
    return err;
}
As can be seen, the task of the create system call can be logically
divided into two sub-tasks:
* Allocate and initialise a new inode
* Add an entry for this inode to the directory
It can also be seen from the source code for ext2 and ext3 that the calls
to the ext3 functions are made with an extra parameter, the handle. This
is the same handle that is returned by the journal_start() function call
discussed earlier; ext3_journal_start() is ext3's wrapper around it.
2.2.1 Allocate and initialise new inode
ext2 :
struct inode * ext2_new_inode(const struct inode * dir, int mode)
This function accepts a directory inode (in which the file is to be
created) and the mode in which it has to be created and returns a newly
created and initialised in-memory inode structure. The inode is marked
dirty in the function itself.
sb = dir->i_sb;
inode = new_inode(sb);
This code allocates an in-core uninitialised inode belonging to the
directory's superblock. The new_inode function returns an inode with the
minimum number of fields required.
group=find_group_other(sb, dir->u.ext2_i.i_block_group);
The ext2 filesystem divides the available space into groups for better
management. This function finds the correct group to which this inode
should belong. The ext2 filesystem also keeps track of the number of free
inodes in each group, and this information is updated for the group that
is found. The function also arranges for the corresponding buffers to be
synced by marking them dirty.
bh = load_inode_bitmap (sb, group);
if (IS_ERR(bh))
    goto fail2;
i = ext2_find_first_zero_bit ((unsigned long *) bh->b_data,
                              EXT2_INODES_PER_GROUP(sb));
if (i >= EXT2_INODES_PER_GROUP(sb))
    goto bad_count;
ext2_set_bit(i, bh->b_data);
mark_buffer_dirty(bh);
This code loads the buffer head which contains the inode bitmap for the
chosen group. The bitmap stores the allocated/unallocated state of the
inodes. The first zero bit (an unallocated inode) is found. This bit is
set to mark the inode as allocated, and the buffer is marked dirty so that
the updated state is flushed to the disk at some later point in time.
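The bitmap search and update can be imitated in user space. The following
is a minimal sketch and not the kernel code: toy_find_first_zero_bit(),
toy_test_and_set_bit() and alloc_inode_slot() are illustrative names, the
group holds only 16 inodes, bits are assumed to be numbered byte by byte
starting from the least significant bit, and an int flag stands in for
mark_buffer_dirty().

```c
#include <assert.h>
#include <limits.h>

#define INODES_PER_GROUP 16   /* hypothetical, tiny group for the example */

/* Return the index of the first clear bit among the first `nbits` bits,
 * or `nbits` if every bit is set. Bit i lives in byte i / 8 at position
 * i % 8 (an assumption of this sketch). */
static unsigned toy_find_first_zero_bit(const unsigned char *map,
                                        unsigned nbits)
{
    unsigned i;

    for (i = 0; i < nbits; i++)
        if (!(map[i / CHAR_BIT] & (1u << (i % CHAR_BIT))))
            return i;
    return nbits;
}

/* Set bit i and return its previous value (0 or 1). */
static int toy_test_and_set_bit(unsigned i, unsigned char *map)
{
    unsigned char mask = (unsigned char)(1u << (i % CHAR_BIT));
    int old;

    assert(i < INODES_PER_GROUP);
    old = (map[i / CHAR_BIT] & mask) != 0;
    map[i / CHAR_BIT] |= mask;
    return old;
}

/* Allocate the first free inode slot in the group's bitmap.
 * Returns the slot index, or -1 if the group has no free inode. */
static int alloc_inode_slot(unsigned char *map, int *dirty)
{
    unsigned i = toy_find_first_zero_bit(map, INODES_PER_GROUP);

    if (i >= INODES_PER_GROUP)
        return -1;                 /* corresponds to "goto bad_count" */
    toy_test_and_set_bit(i, map);
    *dirty = 1;                    /* stands in for mark_buffer_dirty(bh) */
    return (int)i;
}
```

With a bitmap whose first byte is 0x07 (inodes 0 to 2 allocated),
alloc_inode_slot() picks slot 3 and leaves the byte as 0x0f.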
es->s_free_inodes_count =
    cpu_to_le32(le32_to_cpu(es->s_free_inodes_count) - 1);
mark_buffer_dirty(sb->u.ext2_sb.s_sbh);
The number of free inodes in the superblock is decremented and the
superblock buffer is marked dirty.
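The cpu_to_le32(le32_to_cpu(...) - 1) idiom is there because the on-disk
counter is stored in little-endian byte order, whatever the byte order of
the CPU. A user-space sketch of the same idea follows. The helpers
toy_le32_to_cpu(), toy_cpu_to_le32() and dec_free_inodes(), and the
le32_field type, are stand-ins written for this example and not the kernel
macros.

```c
#include <assert.h>
#include <stdint.h>

/* An on-disk 32-bit field: always four bytes, least significant first. */
struct le32_field {
    unsigned char b[4];
};

/* Convert the on-disk representation to a host integer. */
static uint32_t toy_le32_to_cpu(struct le32_field f)
{
    return (uint32_t)f.b[0]
         | ((uint32_t)f.b[1] << 8)
         | ((uint32_t)f.b[2] << 16)
         | ((uint32_t)f.b[3] << 24);
}

/* Convert a host integer to the on-disk representation. */
static struct le32_field toy_cpu_to_le32(uint32_t v)
{
    struct le32_field f;

    f.b[0] = (unsigned char)(v & 0xff);
    f.b[1] = (unsigned char)((v >> 8) & 0xff);
    f.b[2] = (unsigned char)((v >> 16) & 0xff);
    f.b[3] = (unsigned char)((v >> 24) & 0xff);
    return f;
}

/* Same shape as the s_free_inodes_count update: convert to host order,
 * do the arithmetic, convert back before storing. */
static void dec_free_inodes(struct le32_field *count)
{
    assert(toy_le32_to_cpu(*count) > 0);   /* never count below zero */
    *count = toy_cpu_to_le32(toy_le32_to_cpu(*count) - 1);
}
```

Because the conversion works on individual bytes, the stored result is the
same on little-endian and big-endian hosts.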
mark_inode_dirty(inode);
All the fields in the inode are initialised and then the inode is marked
dirty.
ext3:
Now let us take a look at the corresponding ext3 code :
struct inode * ext3_new_inode (handle_t *handle, const struct inode * dir,
int mode)
As can be seen, the function accepts the same arguments (the directory
inode and the mode in which the file has to be created) and, in addition,
the handle. This function, too, returns a newly created and initialised
in-memory inode structure. The inode is marked dirty in the function
itself.
sb = dir->i_sb;
inode = new_inode(sb);
This code is identical to the ext2 code. Now we have an inode structure
with a minimum of initialised fields.
The next change we see is that ext3 cannot simply use the
find_group_other() function the way the ext2 function does. Why? This
function updates metadata, and since all the metadata should first go to
the journal, special care needs to be taken. ext3 finds the correct group
to which the inode should belong and also keeps a pointer to the buffer
head for the group's metadata.
bitmap_nr = load_inode_bitmap (sb, i);
if (bitmap_nr < 0)
    goto fail;
bh = sb->u.ext3_sb.s_inode_bitmap[bitmap_nr];
if ((j = ext3_find_first_zero_bit ((unsigned long *) bh->b_data,
        EXT3_INODES_PER_GROUP(sb))) < EXT3_INODES_PER_GROUP(sb)) {
This code does much the same as the ext2 code. The inode bitmap is loaded
and the index of the first zero bit is stored in the variable j.
err = ext3_journal_get_write_access(handle, bh);
if (err)
    goto fail;
if (ext3_set_bit (j, bh->b_data)) {
    ext3_error (sb, "ext3_new_inode",
                "bit already set for inode %d", j);
    goto repeat;
}
BUFFER_TRACE(bh, "call ext3_journal_dirty_metadata");
err = ext3_journal_dirty_metadata(handle, bh);
if (err)
    goto fail;
The code presented here is clearly different from the code in
ext2_new_inode(). The difference is that the call to ext3_set_bit() is
enclosed between calls to ext3_journal_get_write_access() and
ext3_journal_dirty_metadata(). Both of these functions call their journal_
counterparts to do the actual work. journal_get_write_access() tells the
journaling layer that this buffer is about to be modified and is to be
written to the journal soon. journal_dirty_metadata() informs the
journaling layer that the changes to this buffer have been made and that
it can now journal the buffer as a part of the current atomic update.
As you can see, the modifications to the buffer are made after the
journal_get_write_access() call and before the journal_dirty_metadata()
call.
Most of the differences between ext2 and ext3 stem from this one concept.
The idea is to find the buffer head for the metadata which has to be
modified, and to carry out the modifications between the
ext3_journal_get_write_access() and ext3_journal_dirty_metadata() function
calls.
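The bracketing discipline can be modelled with a tiny state machine. The
sketch below is a toy and makes no claim about how jbd tracks buffers
internally: toy_buffer, toy_handle, toy_get_write_access(),
toy_dirty_metadata() and toy_set_byte() are hypothetical names, and the
"aborted" flag is a simplified stand-in for the error cases the real calls
can report.

```c
#include <assert.h>

enum toy_buf_state { TOY_CLEAN, TOY_WRITE_ACCESS, TOY_JOURNALED };

struct toy_buffer {
    enum toy_buf_state state;
    unsigned char data[8];
};

struct toy_handle {
    int aborted;        /* non-zero once the update can no longer proceed */
    int nr_journaled;   /* buffers handed to the journal so far */
};

/* Announce that `buf` is about to be modified as part of update `h`. */
static int toy_get_write_access(struct toy_handle *h, struct toy_buffer *buf)
{
    if (h->aborted)
        return -1;
    buf->state = TOY_WRITE_ACCESS;
    return 0;
}

/* Report that the modification is finished and `buf` may be journaled. */
static int toy_dirty_metadata(struct toy_handle *h, struct toy_buffer *buf)
{
    if (h->aborted || buf->state != TOY_WRITE_ACCESS)
        return -1;      /* caller skipped toy_get_write_access() */
    buf->state = TOY_JOURNALED;
    h->nr_journaled++;
    return 0;
}

/* The pattern used throughout ext3: get access, modify, mark dirty. */
static int toy_set_byte(struct toy_handle *h, struct toy_buffer *buf,
                        unsigned idx, unsigned char value)
{
    int err;

    assert(idx < sizeof(buf->data));
    err = toy_get_write_access(h, buf);
    if (err)
        return err;
    buf->data[idx] = value;      /* the actual metadata change */
    return toy_dirty_metadata(h, buf);
}
```

In this model, calling toy_dirty_metadata() on a buffer that never went
through toy_get_write_access() fails, which mirrors the ordering rule
described above.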
Continuing with our call trace,
err = ext3_journal_get_write_access(handle, bh2);
if (err)
    goto fail;
gdp->bg_free_inodes_count =
    cpu_to_le16(le16_to_cpu(gdp->bg_free_inodes_count) - 1);
if (S_ISDIR(mode))
    gdp->bg_used_dirs_count =
        cpu_to_le16(le16_to_cpu(gdp->bg_used_dirs_count) + 1);
BUFFER_TRACE(bh2, "call ext3_journal_dirty_metadata");
err = ext3_journal_dirty_metadata(handle, bh2);
if (err)
    goto fail;
BUFFER_TRACE(sb->u.ext3_sb.s_sbh, "get_write_access");
err = ext3_journal_get_write_access(handle, sb->u.ext3_sb.s_sbh);
if (err)
    goto fail;
es->s_free_inodes_count =
    cpu_to_le32(le32_to_cpu(es->s_free_inodes_count) - 1);
BUFFER_TRACE(sb->u.ext3_sb.s_sbh, "call ext3_journal_dirty_metadata");
err = ext3_journal_dirty_metadata(handle, sb->u.ext3_sb.s_sbh);
This section of the code updates the free inode count in the group
descriptor as well as in the superblock. As mentioned earlier, in the case
of ext2 the find_group_other() call returns a group with its free inode
count already decremented. In the case of ext3, we have to do it
explicitly. Again, this is done by enclosing the writes to the metadata
within the journaling calls.
err = ext3_mark_inode_dirty(handle, inode);
Eventually, the inode is marked dirty, as is the case with the ext2 code.
Again, the difference here is that an extra handle parameter is passed to
this function. The ext3_mark_inode_dirty() function will eventually find
the buffer head for the on-disk inode and copy the inode information to
the buffer containing the on-disk (raw) inode image. Again, this is done
by enclosing the changes within the journaling calls.
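The copy from the in-memory inode to the raw inode image can be sketched
as follows. The layout used here is invented to keep the example short: a
16-byte raw inode with three fields inside a 64-byte block. The real ext3
on-disk inode has many more fields, and toy_inode, raw_inode_offset() and
toy_write_raw_inode() are illustrative names only.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE     64   /* toy block size */
#define RAW_INODE_SIZE 16   /* toy on-disk inode size */

/* In-memory inode: host byte order, convenient types. */
struct toy_inode {
    uint32_t ino;     /* inode number, counted from 1 */
    uint16_t mode;
    uint16_t links;
    uint32_t size;
};

/* Store 16-bit and 32-bit values in little-endian order. */
static void put_le16(unsigned char *p, uint16_t v)
{
    p[0] = (unsigned char)(v & 0xff);
    p[1] = (unsigned char)(v >> 8);
}

static void put_le32(unsigned char *p, uint32_t v)
{
    put_le16(p, (uint16_t)(v & 0xffff));
    put_le16(p + 2, (uint16_t)(v >> 16));
}

/* Byte offset of an inode's raw image within the block that holds it. */
static size_t raw_inode_offset(uint32_t ino)
{
    uint32_t per_block = BLOCK_SIZE / RAW_INODE_SIZE;

    assert(ino >= 1);
    return (size_t)((ino - 1) % per_block) * RAW_INODE_SIZE;
}

/* Copy the in-memory fields into the raw image inside `block`. In ext3
 * this is the kind of change that is made between the get_write_access
 * and dirty_metadata calls on the buffer holding the block. */
static void toy_write_raw_inode(const struct toy_inode *inode,
                                unsigned char *block)
{
    unsigned char *raw = block + raw_inode_offset(inode->ino);

    memset(raw, 0, RAW_INODE_SIZE);
    put_le16(raw + 0, inode->mode);
    put_le16(raw + 2, inode->links);
    put_le32(raw + 4, inode->size);
}
```

Only the 16 bytes belonging to the inode in question are rewritten; the
raw images of the neighbouring inodes in the same block are left alone.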
2.2.2 Add an entry for this inode to the directory
ext2 :
To add the given inode to the directory, ext2_create calls
ext2_add_nondir(), which eventually calls the ext2_add_link() function.
The ext2_add_link() is quite a simple function. It loops through all the
directory entries present in the directory. It cycles through all the
_pages_ of the directory, to find an appropriate directory entry. It
modifies the directory entry such that the filename and inode information
are added to the directory entry. Once this is done the page is scheduled
to be written to the disk. Also, the directory inode is marked dirty,
since the timestamps of the directory inode change.
ext3:
To add the given inode to the directory, ext3_create calls
ext3_add_nondir(), which eventually calls the ext3_add_entry() function.
The ext3_add_entry() function loops through all the directory entries in
the directory. As you would have guessed, the ext3 code cycles through the
_buffers_ of the directory (rather than its pages) to find an appropriate
directory entry. It modifies the directory entry such that the filename
and inode information are added to it. This change to the contents of the
buffer is made within the journaling calls as well.
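The search for "an appropriate directory entry" can be illustrated with a
simplified directory block in which every entry records its own length, so
that a new name can be placed in the slack left after an existing one.
This is a sketch under stated assumptions: the 64-byte block, the helper
names (entry_len(), fill_entry(), init_dir_block(), add_entry()) and the
use of host byte order are choices made for the example, whereas the
on-disk format stores its fields in little-endian order.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 64
#define HDR_LEN    8   /* inode(4) + rec_len(2) + name_len(1) + type(1) */

/* Space needed by an entry whose name has `n` bytes, rounded up to 4. */
static uint16_t entry_len(size_t n)
{
    return (uint16_t)((HDR_LEN + n + 3) & ~(size_t)3);
}

static uint32_t get_inode(const unsigned char *de)
{
    uint32_t v;
    memcpy(&v, de, 4);
    return v;
}

static uint16_t get_rec_len(const unsigned char *de)
{
    uint16_t v;
    memcpy(&v, de + 4, 2);
    return v;
}

static void fill_entry(unsigned char *de, uint32_t ino, uint16_t rec_len,
                       const char *name)
{
    size_t nlen = strlen(name);

    assert(nlen < 256);
    memcpy(de, &ino, 4);
    memcpy(de + 4, &rec_len, 2);
    de[6] = (unsigned char)nlen;
    de[7] = 0;                      /* file type, unused in this toy */
    memcpy(de + HDR_LEN, name, nlen);
}

/* An empty block is one unused entry (inode 0) spanning the whole block. */
static void init_dir_block(unsigned char *block)
{
    memset(block, 0, BLOCK_SIZE);
    fill_entry(block, 0, BLOCK_SIZE, "");
}

/* Walk the entries and place `name` in the first one that is unused and
 * large enough, or that has enough slack after its own name.
 * Returns the offset of the new entry, or -1 if the block is full. */
static int add_entry(unsigned char *block, uint32_t ino, const char *name)
{
    uint16_t need = entry_len(strlen(name));
    size_t off = 0;

    while (off < BLOCK_SIZE) {
        unsigned char *de = block + off;
        uint16_t rec_len = get_rec_len(de);
        uint16_t used = get_inode(de) ? entry_len(de[6]) : 0;

        if (rec_len < HDR_LEN)
            return -1;              /* corrupt block: avoid looping forever */
        if (rec_len - used >= need) {
            if (used) {
                /* Split: shrink the live entry, new one takes the slack. */
                uint16_t slack = (uint16_t)(rec_len - used);
                memcpy(de + 4, &used, 2);
                de += used;
                off += used;
                rec_len = slack;
            }
            fill_entry(de, ino, rec_len, name);
            return (int)off;
        }
        off += rec_len;
    }
    return -1;
}
```

In ext3 the bytes changed by such an insertion live in one of the
directory's buffers, which is why the modification can be wrapped in the
journaling calls like any other metadata update.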
3 Mounting ext3
Now that we know how information is written to the journal, let us see how
it is actually used by ext3 when a filesystem is mounted. This is one of
the important parts, since this is where a transaction that has been
written completely to the log, but not to its actual location, is
recovered. To stay focused, we will only go through the code that is
relevant to journaling.
Let us look at ext3's read super function to understand this:
if (ext3_load_journal(sb, es))
goto failed_mount2;
The function starts off by initialising most of the variables in the
super_block structure. Once this is done, a call to ext3_load_journal is
made. The ext3_load_journal function performs a set of tasks. The tasks,
along with the jbd functions they use, are listed below:
* journal_init_dev/journal_init_inode : initialises the journal and
  returns a journal structure.
* journal_update_format : updates the journal superblock to the latest
  format.
* journal_wipe : wipes the journal safely. This is done by ext3 only if
  the filesystem is being mounted read-only.
* journal_load : loads the journal into memory. Performs journal recovery,
  if needed.
The translation was initiated by Kedar Sovani on 2007-01-22
Footnotes
[1] ... update
We do not use the term transaction here for a specific reason. The jbd
layer, for performance reasons, bunches all the single atomic updates made
by the ext3 module into one huge transaction and writes this transaction
to the disk atomically. Since this transaction is atomic in nature, the
single atomic updates are also atomic. So basically, the jbd layer still
provides the upper module with the feature of atomically updating a set of
blocks, but internally it does so more efficiently.
[2] ... update
It is necessary for the journaling layer to reserve as many blocks as the
ext3 layer mentions, in order to avoid deadlocks.