[linuxkernelnewbies] Journaling and the ext3 filesystem

Peter Teoh Tue, 25 Nov 2008 07:14:18 -0800


Journaling and the ext3 filesystem


Kedar Sovani ([EMAIL PROTECTED])

Contents
1 Journaling
2 ext3 and journaling
2.1 The handle
2.2 create system call

3 Mounting ext3

1 Journaling

Journaling is a term from the database world. One of the characteristics of a database is the atomic nature of transactions: a transaction either goes through completely or does not go through at all, and its effects are either completely visible or not visible at all. Hard disks provide atomicity at the sector level, which means that a write started on a sector will either go through completely or not at all. But when a transaction spans multiple sectors on the hard disk, some higher-level mechanism is needed to ensure that the modifications to the entire set of sectors occur atomically. For example, if a transaction has to modify 3 sectors and the machine crashes after modifying only 2 of them, the database is left inconsistent, since the 3rd sector still contains stale data.

Databases use a technique called journaling to maintain the atomicity of operations. This technique consists of writing all the modified sectors to a separate portion of the storage called the journal, instead of overwriting the actual locations on the hard disk. The actual locations on the hard disk are later overwritten with the contents of this journal _only_ after making sure that all the sectors associated with the transaction have been flushed to the journal. In order to identify whether all the sectors belonging to a transaction are present in the journal, a commit record is flushed to the journal at the end of the transaction. Let us see what happens when the machine crashes in the following scenarios with the above example:


After the commit record is flushed to the journal.

In this case, when the machine comes back up again, it checks the journal. It finds a transaction with the commit record at the end of it. The commit record indicates that this is a completed transaction and can be written to its actual location. All the sectors belonging to this transaction are written to their actual locations on the disk, overwriting the previous contents (replaying the journal).

After only two sectors are flushed to the journal.

In this case, when the machine comes back up again and checks the journal, it finds a transaction with no commit record at the end of it. This indicates that the transaction may not be complete, and hence NO modifications are made to the disk.

After the 3 sectors are flushed to the journal, but the commit record is not yet flushed.

Even in this case, because of the absence of the commit record, no modifications are made to the disk.

Going a step further, note that in the first case there could be another crash while the journal is being replayed. No harm is caused, since the next time around the same information will still be present in the journal (the contents of the journal have not been modified), which simply recreates the first case.

Thus, with the help of journaling, it can be assured that transactions are committed atomically to the disk.
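The three crash scenarios above can be tried out with a small user-space model. This is only a sketch under stated assumptions: the "disk" and "journal" are in-memory arrays, and every name in it (toy_journal, toy_journal_log, toy_journal_commit, toy_journal_replay) is made up for illustration and is not part of any real database or of the kernel.

```c
#include <assert.h>
#include <string.h>

#define SECTOR_SIZE  8     /* tiny sectors keep the example readable */
#define DISK_SECTORS 4
#define MAX_JOURNAL  4

/* One journaled sector: where it belongs on disk, plus the new contents. */
struct journal_entry {
    int  target;                 /* sector number on the "disk" */
    char data[SECTOR_SIZE];
};

struct toy_journal {
    struct journal_entry entries[MAX_JOURNAL];
    int count;                   /* sectors flushed to the journal so far */
    int committed;               /* 1 once the commit record is flushed */
};

/* Write a modified sector to the journal, not to its real location. */
int toy_journal_log(struct toy_journal *j, int target, const char *data)
{
    if (j->committed || j->count >= MAX_JOURNAL)
        return -1;
    if (target < 0 || target >= DISK_SECTORS)
        return -1;
    j->entries[j->count].target = target;
    strncpy(j->entries[j->count].data, data, SECTOR_SIZE - 1);
    j->entries[j->count].data[SECTOR_SIZE - 1] = '\0';
    j->count++;
    return 0;
}

/* Flush the commit record: the transaction is now complete. */
void toy_journal_commit(struct toy_journal *j)
{
    j->committed = 1;
}

/*
 * Recovery after a crash: replay only if a commit record is present.
 * Returns the number of sectors copied to their real locations.
 * The journal itself is never modified, so replaying again is harmless.
 */
int toy_journal_replay(const struct toy_journal *j,
                       char disk[DISK_SECTORS][SECTOR_SIZE])
{
    int i;

    if (!j->committed)
        return 0;                /* incomplete transaction: touch nothing */
    for (i = 0; i < j->count; i++)
        memcpy(disk[j->entries[i].target], j->entries[i].data, SECTOR_SIZE);
    return j->count;
}
```

Calling toy_journal_replay() before toy_journal_commit() models a crash before the commit record reached the journal, and leaves the disk untouched. Calling it twice after the commit models a second crash during replay, and yields the same disk contents both times.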

In the case of filesystems, all the metadata updates should be journaled in order for the metadata information to stay consistent. In the absence of journaling, a crash may cause inconsistency in the filesystem metadata stored on the disk. Special programs that check for consistency within the filesystem then need to find and fix any inconsistencies caused by the crash. This check is a time-consuming process, and the time taken is directly proportional to the size of the disk.

Journaling filesystems reduce boot time by replacing the usually time-consuming filesystem checks with fast and efficient journal replays.

It should be noted that journaling guarantees the consistency of metadata. It does not make any guarantees about the consistency of the data associated with a file.


2 ext3 and journaling

The ext3 filesystem is a journaling filesystem. It uses the journaling facilities provided by the jbd module in the Linux kernel.

Essentially, the ext3 code base is the ext2 code base with the additional functionality of journaling. Other than the journaling support, it has no changes from the ext2 filesystem. This will become clearer in the code snippets discussed in this section.

The designers of ext3 split the design into two components. They designed a journaling layer which does only journaling and is independent of the ext3 filesystem on top of it. The objective was that this layer could potentially serve other modules which require journaling support as well. Hence, jbd provides an API which can be used by ext3 and other modules to incorporate journaling.


2.1 The handle

ext3 needs some way of informing the journaling layer which set of updates forms a single atomic update [1]. To address this, the journaling layer provides the concept of a handle. When ext3 wants to perform an atomic update, it informs the journaling layer of the number of block updates that constitute this single atomic update [2]. This is accomplished by the journal_start() function call, which returns a handle to ext3. The handle is an opaque structure for the ext3 layer. ext3 uses this handle while communicating with the journaling layer to identify the update in progress. Once the update is complete, the handle can be destroyed by the journal_stop() function call.
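The life cycle of a handle can be pictured with a toy model. The sketch below is not the jbd implementation: toy_journal, toy_handle, toy_journal_start(), toy_journal_use_block() and toy_journal_stop() are hypothetical names, and the model assumes that starting an update simply fails when the journal lacks space, which is a simplification made for this example.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the journal: only tracks free journal blocks. */
struct toy_journal {
    int free_blocks;
};

/* Hypothetical stand-in for the handle; callers treat it as opaque. */
struct toy_handle {
    struct toy_journal *journal;
    int credits;                 /* blocks this update may still modify */
};

/* Begin an atomic update that will modify at most nblocks blocks. */
struct toy_handle *toy_journal_start(struct toy_journal *j, int nblocks)
{
    struct toy_handle *h;

    if (nblocks <= 0 || nblocks > j->free_blocks)
        return NULL;             /* toy model: fail instead of waiting */
    h = malloc(sizeof(*h));
    if (!h)
        return NULL;
    j->free_blocks -= nblocks;   /* reserve journal space up front */
    h->journal = j;
    h->credits = nblocks;
    return h;
}

/* Account for one modified block; fails once the reservation is used up. */
int toy_journal_use_block(struct toy_handle *h)
{
    if (h->credits == 0)
        return -1;
    h->credits--;
    return 0;
}

/* End the update and hand unused reserved blocks back to the journal. */
void toy_journal_stop(struct toy_handle *h)
{
    assert(h != NULL && h->journal != NULL);
    h->journal->free_blocks += h->credits;
    free(h);
}
```

Declaring the block count at the start is what lets the journaling layer set space aside before any block is modified, which is the point made in footnote 2.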


2.2 create system call

To understand how ext3 uses journaling, let us first trace through the code for creating a filesystem object (file/directory) in the ext3 filesystem. Comparing the ext3 object creation code with the corresponding ext2 code will help in understanding the precise changes made for journaling. Further, for simplicity, let us focus on the creation of a file rather than the creation of a directory.

Here is the code for the create system call in ext2 and ext3.

ext2 :

static int ext2_create (struct inode * dir, struct dentry * dentry, int mode)
{
    struct inode * inode = ext2_new_inode (dir, mode);
    int err = PTR_ERR(inode);

    if (!IS_ERR(inode)) {
        inode->i_op = &ext2_file_inode_operations;
        inode->i_fop = &ext2_file_operations;
        inode->i_mapping->a_ops = &ext2_aops;
        mark_inode_dirty(inode);
        err = ext2_add_nondir(dentry, inode);
    }
    return err;
}

ext3 :

static int ext3_create (struct inode * dir, struct dentry * dentry, int mode)
{
    handle_t *handle;
    struct inode * inode;
    int err;

    /* Start an atomic update, declaring how many blocks it may modify. */
    handle = ext3_journal_start(dir, EXT3_DATA_TRANS_BLOCKS + 3);
    if (IS_ERR(handle))
        return PTR_ERR(handle);

    if (IS_SYNC(dir))
        handle->h_sync = 1;

    inode = ext3_new_inode (handle, dir, mode);
    err = PTR_ERR(inode);
    if (!IS_ERR(inode)) {
        inode->i_op = &ext3_file_inode_operations;
        inode->i_fop = &ext3_file_operations;
        inode->i_mapping->a_ops = &ext3_aops;
        err = ext3_add_nondir(handle, dentry, inode);
    }

    /* The atomic update is complete, whether or not it succeeded. */
    ext3_journal_stop(handle, dir);
    return err;
}

As can be seen, the task of the create system call can be logically divided into two sub-tasks:

Allocate and initialise new inode
Add an entry for this inode to the directory

It can also be seen from the source code for ext2 and ext3 that the calls to the ext3 functions are made with an extra parameter called the handle. This is the same handle that is returned by the journal_start() function call discussed earlier (here obtained through ext3_journal_start()).


2.2.1 Allocate and initialise new inode

ext2 :

struct inode * ext2_new_inode(const struct inode * dir, int mode)

This function accepts a directory inode (in which the file is to be created) and the mode with which it has to be created, and returns a newly created and initialised in-memory inode structure. The inode is marked dirty in the function itself.


sb = dir->i_sb;
inode = new_inode(sb);

This code allocates an in-core, uninitialised inode belonging to the directory's superblock. The new_inode function returns an inode with only the minimum set of fields initialised.


group = find_group_other(sb, dir->u.ext2_i.i_block_group);

The ext2 filesystem divides the available space into groups for better management.

This function finds the correct group to which this inode should belong. The ext2 filesystem also keeps track of the number of free inodes in a given group. This information is updated for the group which is found. The function also arranges for the corresponding buffers to be synced by marking them dirty.


bh = load_inode_bitmap (sb, group);
if (IS_ERR(bh))
    goto fail2;

i = ext2_find_first_zero_bit ((unsigned long *) bh->b_data,
                              EXT2_INODES_PER_GROUP(sb));
if (i >= EXT2_INODES_PER_GROUP(sb))
    goto bad_count;
ext2_set_bit(i, bh->b_data);

mark_buffer_dirty(bh);

This code loads the buffer head which contains the inode bitmap for the chosen group. The bitmap stores the allocated/unallocated state of the inodes. The first zero bit (an unallocated inode) is found. This bit is set to mark the inode as allocated, and the buffer is marked dirty so that the updated state can be flushed to the disk at some later point in time.
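The find-then-set step on the bitmap can be reproduced outside the kernel. The following is a minimal sketch, assuming a bitmap stored as an array of bytes with bit i kept in byte i / 8; the toy_ functions are illustrative stand-ins, not the kernel's ext2_find_first_zero_bit() and ext2_set_bit().

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Index of the first clear bit among the first nbits, or nbits if none. */
unsigned toy_find_first_zero_bit(const unsigned char *map, unsigned nbits)
{
    unsigned i;

    for (i = 0; i < nbits; i++)
        if (!(map[i / CHAR_BIT] & (1u << (i % CHAR_BIT))))
            return i;
    return nbits;
}

/* Set bit i and return its previous value (0 or 1). */
int toy_set_bit(unsigned i, unsigned char *map)
{
    unsigned char mask = (unsigned char)(1u << (i % CHAR_BIT));
    int old = (map[i / CHAR_BIT] & mask) != 0;

    map[i / CHAR_BIT] |= mask;
    return old;
}

/* Allocate one inode slot: returns its index, or -1 if the group is full. */
int toy_alloc_inode(unsigned char *map, unsigned inodes_per_group)
{
    unsigned i;

    assert(map != NULL);
    i = toy_find_first_zero_bit(map, inodes_per_group);
    if (i >= inodes_per_group)
        return -1;               /* corresponds to the bad_count path */
    toy_set_bit(i, map);
    return (int)i;
}
```

In the kernel code the bitmap lives in bh->b_data, so setting the bit changes the buffer contents, which is why the buffer must then be marked dirty.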


es->s_free_inodes_count = cpu_to_le32(le32_to_cpu(es->s_free_inodes_count) - 1);
mark_buffer_dirty(sb->u.ext2_sb.s_sbh);

The number of free inodes in the superblock is decremented and the superblock buffer is marked dirty.
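The cpu_to_le32(le32_to_cpu(...) - 1) idiom exists because the on-disk counter is stored in little-endian byte order regardless of the CPU. A portable user-space sketch of the same decode, modify, re-encode step is shown below; it models the on-disk field as four raw bytes, and the toy_ helpers are hypothetical equivalents written for this example rather than the kernel macros themselves.

```c
#include <assert.h>
#include <stdint.h>

/* Decode a 32-bit little-endian on-disk field into a native integer. */
uint32_t toy_le32_to_cpu(const unsigned char b[4])
{
    return (uint32_t)b[0] |
           ((uint32_t)b[1] << 8) |
           ((uint32_t)b[2] << 16) |
           ((uint32_t)b[3] << 24);
}

/* Encode a native integer into a 32-bit little-endian on-disk field. */
void toy_cpu_to_le32(uint32_t v, unsigned char b[4])
{
    b[0] = (unsigned char)(v & 0xff);
    b[1] = (unsigned char)((v >> 8) & 0xff);
    b[2] = (unsigned char)((v >> 16) & 0xff);
    b[3] = (unsigned char)((v >> 24) & 0xff);
}

/* Decrement an on-disk counter: decode, subtract one, re-encode. */
void toy_dec_free_inodes(unsigned char field[4])
{
    toy_cpu_to_le32(toy_le32_to_cpu(field) - 1, field);
}
```

Doing the arithmetic on the decoded value is what keeps the result correct on big-endian machines, where subtracting from the raw field would change the wrong byte.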


mark_inode_dirty(inode);

All the fields in the inode are initialised and then the inode is marked dirty.


ext3:

Now let us take a look at the corresponding ext3 code:

struct inode * ext3_new_inode (handle_t *handle, const struct inode * dir, int mode)

As can be seen, the function accepts the same arguments, the directory inode and the mode with which the file has to be created, along with one more parameter: the handle. This function too returns a newly created and initialised in-memory inode structure. The inode is marked dirty in the function itself.


sb = dir->i_sb;
inode = new_inode(sb);

This code is identical to the ext2 code. Now we have an inode structure with the minimum set of fields initialised.

The next change that we see is that ext3 cannot simply use the find_group_other() function the way ext2 does. Why? This function involves updating metadata, and since all metadata changes should first go to the journal, special care needs to be taken.

ext3 finds the correct group to which the new inode should belong and also keeps a pointer to the buffer head for the group's metadata.


bitmap_nr = load_inode_bitmap (sb, i);
if (bitmap_nr < 0)
    goto fail;

bh = sb->u.ext3_sb.s_inode_bitmap[bitmap_nr];

if ((j = ext3_find_first_zero_bit ((unsigned long *) bh->b_data,
                                   EXT3_INODES_PER_GROUP(sb))) <
    EXT3_INODES_PER_GROUP(sb)) {

This code is much the same as what ext2 was doing. The inode bitmap is loaded and the index of the first zero bit is stored in the variable j.


err = ext3_journal_get_write_access(handle, bh);
if (err)
    goto fail;

if (ext3_set_bit (j, bh->b_data)) {
    ext3_error (sb, "ext3_new_inode", "bit already set for inode %d", j);
    goto repeat;
}

BUFFER_TRACE(bh, "call ext3_journal_dirty_metadata");
err = ext3_journal_dirty_metadata(handle, bh);
if (err)
    goto fail;

The code presented here is clearly different from the code in ext2_new_inode. The difference is that the call to ext3_set_bit() is enclosed between calls to ext3_journal_get_write_access() and ext3_journal_dirty_metadata(). Both of these functions call their journal_ counterparts to do the actual work.

journal_get_write_access() is an indication to the journaling layer that this buffer is going to be written to the journal soon. The journal_dirty_metadata() call, in turn, informs the journaling layer that the changes to this buffer have been made, and that the journaling layer can now journal the buffer as a part of the current atomic update.

As you can see, the modifications to the buffer are made after the journal_get_write_access() call and before the journal_dirty_metadata() call.

Most of the differences between ext2 and ext3 stem from this same concept. The idea is to find the buffer head for the metadata which has to be modified, and to carry out the modifications enclosed between the ext3_journal_get_write_access() and ext3_journal_dirty_metadata() function calls.
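The bracketing discipline described above can be modelled with a small state machine. This is a sketch under assumptions, not jbd code: toy_buffer, toy_handle and the toy_ functions are invented for illustration, and the rule that toy_dirty_metadata() rejects a buffer for which write access was never requested is a choice made for this model to make the ordering visible.

```c
#include <assert.h>

enum toy_buf_state { TOY_CLEAN, TOY_WRITE_ACCESS, TOY_JOURNALED };

struct toy_buffer {
    enum toy_buf_state state;
    unsigned char data[16];
};

struct toy_handle {
    int credits;      /* blocks this update may still add to the journal */
    int journaled;    /* buffers handed to the journal so far */
};

/* Announce that buf is about to be modified as part of this update. */
int toy_get_write_access(struct toy_handle *h, struct toy_buffer *buf)
{
    if (h->credits <= 0)
        return -1;    /* the update declared too few blocks */
    buf->state = TOY_WRITE_ACCESS;
    return 0;
}

/* Report that the modification is done; buf now belongs to the update. */
int toy_dirty_metadata(struct toy_handle *h, struct toy_buffer *buf)
{
    if (buf->state != TOY_WRITE_ACCESS)
        return -1;    /* caller skipped toy_get_write_access() */
    buf->state = TOY_JOURNALED;
    h->credits--;
    h->journaled++;
    return 0;
}

/* The bracket itself: get write access, modify, mark dirty. */
int toy_set_byte_journaled(struct toy_handle *h, struct toy_buffer *buf,
                           unsigned idx, unsigned char value)
{
    int err = toy_get_write_access(h, buf);

    if (err)
        return err;
    buf->data[idx % sizeof(buf->data)] = value;
    return toy_dirty_metadata(h, buf);
}
```

Every metadata change in the ext3 excerpts in this section has the shape of toy_set_byte_journaled(): only the buffer and the modification in the middle differ.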


Continuing with our call trace,

err = ext3_journal_get_write_access(handle, bh2);
if (err)
    goto fail;

gdp->bg_free_inodes_count =
    cpu_to_le16(le16_to_cpu(gdp->bg_free_inodes_count) - 1);
if (S_ISDIR(mode))
    gdp->bg_used_dirs_count =
        cpu_to_le16(le16_to_cpu(gdp->bg_used_dirs_count) + 1);

BUFFER_TRACE(bh2, "call ext3_journal_dirty_metadata");
err = ext3_journal_dirty_metadata(handle, bh2);
if (err)
    goto fail;

BUFFER_TRACE(sb->u.ext3_sb.s_sbh, "get_write_access");
err = ext3_journal_get_write_access(handle, sb->u.ext3_sb.s_sbh);
if (err)
    goto fail;

es->s_free_inodes_count =
    cpu_to_le32(le32_to_cpu(es->s_free_inodes_count) - 1);

BUFFER_TRACE(sb->u.ext3_sb.s_sbh, "call ext3_journal_dirty_metadata");
err = ext3_journal_dirty_metadata(handle, sb->u.ext3_sb.s_sbh);

This section of the code updates the free inodes count in the group descriptor as well as in the superblock. As mentioned earlier, in the case of ext2 the find_group_other() call returns a group with the free inode count already decremented. In the case of ext3, we have to do it explicitly. Again, this is done by enclosing the writes to the metadata within the journaling calls.


err = ext3_mark_inode_dirty(handle, inode);

Eventually, the inode is marked dirty, as is the case with the ext2 code. Again, the difference here is that an extra handle parameter is passed to this function. The ext3_mark_inode_dirty() function will eventually find the buffer head for the on-disk inode and copy the inode information to the buffer containing the on-disk (raw) inode image. Again, this is done by enclosing the changes within the journaling calls.


2.2.2 Add an entry for this inode to the directory

ext2 :

To add the given inode to the directory, ext2_create calls ext2_add_nondir(), which eventually calls the ext2_add_link() function.

ext2_add_link() is quite a simple function. It loops through all the directory entries present in the directory, cycling through all the _pages_ of the directory, to find an appropriate directory entry. It modifies that directory entry so that the filename and inode information are added to it. Once this is done, the page is scheduled to be written to the disk. The directory inode is also marked dirty, since its timestamps change.


ext3:

To add the given inode to the directory, ext3_create calls ext3_add_nondir(), which eventually calls the ext3_add_entry() function.

The ext3_add_entry() function loops through all the directory entries in the directory. As you would have guessed, the ext3 code cycles through the _buffers_ of the directory to find an appropriate directory entry. It modifies that directory entry so that the filename and inode information are added to it. This change to the contents of the buffer is also made within the journaling calls.
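The scan-and-fill logic shared by both versions can be sketched in user space. The model below is deliberately simplified and is not the on-disk format: it assumes fixed-size directory slots where inode number 0 marks a free slot, and toy_dirent, toy_dir_block and toy_add_entry() are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <string.h>

#define TOY_NAME_LEN 15
#define TOY_SLOTS    4     /* entries per directory block (toy value) */

struct toy_dirent {
    unsigned inode;                 /* 0 marks an unused slot */
    char name[TOY_NAME_LEN + 1];
};

struct toy_dir_block {
    struct toy_dirent slot[TOY_SLOTS];
    int dirty;                      /* set when the block must be written out */
};

/*
 * Walk the directory block by block, fill in the first free slot and
 * mark only that block dirty. Returns 0 on success, -1 if the arguments
 * are invalid or no slot is free.
 */
int toy_add_entry(struct toy_dir_block *blocks, int nblocks,
                  const char *name, unsigned inode)
{
    int b, s;

    if (inode == 0 || strlen(name) > TOY_NAME_LEN)
        return -1;
    for (b = 0; b < nblocks; b++) {
        for (s = 0; s < TOY_SLOTS; s++) {
            struct toy_dirent *de = &blocks[b].slot[s];

            if (de->inode != 0)
                continue;           /* slot in use, keep looking */
            de->inode = inode;
            strcpy(de->name, name);
            blocks[b].dirty = 1;
            return 0;
        }
    }
    return -1;
}
```

In ext2 the unit being walked and written back is a page, while in ext3 it is a buffer, and the two assignments that fill the slot would sit between the get-write-access and dirty-metadata calls on that buffer.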


3 Mounting ext3

Now that we know how information is written to the journal, let us see how it is actually used by ext3 when a filesystem is mounted. This is one of the important parts, since this is where a transaction which has been written completely to the log, but not to its actual location, is recovered. To stay focused, we will only go through the code which is relevant to journaling.


Let us look at ext3's read super function to understand this:

if (ext3_load_journal(sb, es))
    goto failed_mount2;

The function starts off by initialising most of the variables in the super_block structure. Once this is done, a call to ext3_load_journal is made. The ext3_load_journal function performs a set of tasks; the tasks, along with the jbd functions they use, are listed below:

journal_init_dev/journal_init_inode : initialises the journal and returns a journal structure.
journal_update_format : updates the journal superblock to the latest format.
journal_wipe : wipes the journal safely. This is done by ext3 only if the filesystem is being mounted as read only.
journal_load : loads the journal into memory. Performs journal recovery, if needed.

The translation was initiated by Kedar Sovani on 2007-01-22
Footnotes
[1]

We do not use the term transaction here for a specific reason. The jbd layer, for performance reasons, bunches all the single atomic updates made by the ext3 module into one large transaction and writes this transaction to the disk atomically. Since this transaction is atomic in nature, the single atomic updates are also atomic. So the jbd layer still provides the upper module with the feature of atomically updating a set of blocks, but internally it does so more efficiently.
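A toy model makes it easier to see why batching preserves the atomicity of each individual update. The sketch below is illustrative only: toy_transaction and the toy_ functions are hypothetical, and the model assumes that a commit is simply refused while any atomic update is still open, so that no update can be split across a commit.

```c
#include <assert.h>

/* Hypothetical compound transaction shared by many atomic updates. */
struct toy_transaction {
    int open_updates;   /* atomic updates started but not yet stopped */
    int nblocks;        /* blocks collected from all updates so far */
    int committed;
};

/* An atomic update joins the running transaction. */
void toy_update_start(struct toy_transaction *t)
{
    assert(!t->committed);
    t->open_updates++;
}

/* The update contributes one modified block to the transaction. */
void toy_update_add_block(struct toy_transaction *t)
{
    t->nblocks++;
}

/* The atomic update is complete. */
void toy_update_stop(struct toy_transaction *t)
{
    assert(t->open_updates > 0);
    t->open_updates--;
}

/*
 * Commit the whole batch at once. Refused (-1) while any update is
 * still open. Returns the number of blocks in the batch on success.
 */
int toy_commit(struct toy_transaction *t)
{
    if (t->open_updates > 0)
        return -1;
    t->committed = 1;
    return t->nblocks;
}
```

Because every update lies entirely inside one batch, and the batch reaches the disk atomically, each update is either fully present or fully absent after a crash.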

[2]

It is necessary for the journaling layer to reserve as many blocks as are mentioned by the ext3 layer, in order to avoid deadlocks.


Kedar Sovani 2007-01-22
