http://www.nondot.org/sabre/os/files/FileSystems/ext2fs/

Analysis of the Ext2fs structure

Louis-Dominique Dubeau


Go to the next section.

Copyright (C) 1994 Louis-Dominique Dubeau.

You may without charge, royalty or other payment, copy and distribute copies of this work and derivative works of this work in source or binary form provided that:

(1) you appropriately publish on each copy an appropriate copyright notice; (2) faithfully reproduce all prior copyright notices included in the original work (you may add your own copyright notice); and (3) agree to indemnify and hold all prior authors, copyright holders and licensors of the work harmless from and against all damages arising from the use of the work.

You may distribute sources of derivative works of the work provided that:

(1) (a) all source files of the original work that have been modified, (b) all source files of the derivative work that contain any party of the original work, and (c) all source files of the derivative work that are necessary to compile, link and run the derivative work without unresolved external calls and with the same functionality of the original work ("Necessary Sources") carry a prominent notice explaining the nature and date of the modification and/or creation. You are encouraged to make the Necessary Sources available under this license in order to further development and acceptance of the work.

EXCEPT AS OTHERWISE RESTRICTED BY LAW, THIS WORK IS PROVIDED WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES OF ANY KIND, INCLUDING BUT NOT LIMITED TO, ANY IMPLIED WARRANTIES OF FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY OR TITLE. EXCEPT AS OTHERWISE PROVIDED BY LAW, NO AUTHOR, COPYRIGHT HOLDER OR LICENSOR SHALL BE LIABLE TO YOU FOR DAMAGES OF ANY KIND, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

Introduction

This document has been written by Louis-Dominique Dubeau. It contains an analysis of the structure of the Second Extended File System and is based on a study of the Linux kernel source files. This document does not contain specifications written by the Ext2fs development team.

Ext2fs was designed by Rémy Card (1) as an extensible and powerful file system for Linux. It is also the most successful file system so far in the Linux community.

The first Linux file system was Minixfs: a file system originally developed for the Minix operating system. This file system had many disadvantages. Among them was: the 64MB limit on partitions, the 14 characters limit on file names and no built in extensibility.

To overcome those problems, Rémy Card wrote extfs. This file system was mostly based upon the original Minixfs code and implementation. However, it removed the 64MB size limit on partitions, and increased the file name size limit to 255 characters.

In his quest for the perfect file system, Rémy was still unsatisfied. So he decided to write an brand new file system: ext2fs. This file system not only has the advantages of extfs but also provides a better space allocation management, allows the use of special flags for file management, the use of access control lists and is extensible.

Will someday Rémy come up with ext3fs? Who knows? However, in the meantime ext2fs is the de-facto standard Linux file system. This document describes the physical layout of an ext2 file system on disk and the management policies that every ext2 file system managers should implement. The information in this document is accurate as of version 0.5 of ext2fs (Linux kernel version 1.0). The information about access control lists is not included because no implementation of ext2fs enforce them anyway.

Go to the next section.


Go to the previous, next section.

Blocks and Fragments

Blocks are the basic building blocks of a file system. The file system manager requests to read or write from the disk are always translated to a query to read or write an integral number of blocks from the disk.

Some blocks on the file system are reserved for the exclusive use of the superuser. This information is recorded in the s_r_blocks_count member of the superblock structure. See section Superblock Whenever the total number of free blocks becomes equal to the number of reserved blocks, the normal users can no longer allocate blocks for their use. Only the superuser may allocate new blocks. Without this provision for reserved blocks, filling up the file system might make the computer unbootable. Whenever the startup tasks would try to allocate a block, the computer would crash. With reserved blocks, we ensure a minimum space for booting and allowing the superuser to clean up the disk.

This is all very simple. However, computer scientists like to complicates things a bit. There are in fact two kinds of blocks, logical blocks and physical blocks. The addressing scheme and size of these two kind of blocks may vary. What happens is that when a request is made to manipulate the range `[a,b]' of some file, this range is first converted by the higher parts of the file system into a request to manipulate an integral number of logical blocks: `a' is rounded down to a logical block boundary and, `b' is rounded up to a logical block boundary. Then, this range of logical blocks is converted by lower parts of the file system into a request to manipulate an integral number of physical blocks. The logical block size must be the physical block size multiplied by a power of two (2). So when going from logical to physical addressing we just have to multiply the address by this power of two.

The logical addresses of the file system goes from zero up to the total number of blocks minus one. Block zero is the boot block and is usually only accessed during special operations.

Now, the problem with blocks is that if we have a file that is not an integral number of blocks, space at the end of the last block is wasted. On average, one half block is wasted per file. On most file systems this means a lot of wasted space.

To circumvent this inconvenience, the file system uses fragments. The fragment size must be the physical block size multiplied by a power of two (3). A file is therefore a sequence of blocks followed by a small sequence of consecutive fragments. When a file has enough ending fragments to fill a block, those fragments are grouped into a block. When a file is shortened, the last block may be broken into many contiguous fragments.

The general relationship between sizes is:

Go to the previous, next section.


Go to the previous, next section.

Groups

The blocks on disk are divided into groups. Each of these groups duplicates critical information of the file system. Moreover, the presence of block groups on disk allow the use of efficient disk allocation algorithms.

Each group contains in that order:

The superblock and group descriptors of each group must carry the same values on disk.

Go to the previous, next section.


Go to the previous, next section.

Superblock

In this section, the layout of a superblock is described. Here is the official structure of an ext2fs superblock [include/linux/ext2_fs.h]:

struct ext2_super_block {
  unsigned long  s_inodes_count;
  unsigned long  s_blocks_count;
  unsigned long  s_r_blocks_count;
  unsigned long  s_free_blocks_count;
  unsigned long  s_free_inodes_count;
  unsigned long  s_first_data_block;
  unsigned long  s_log_block_size;
  long           s_log_frag_size;
  unsigned long  s_blocks_per_group;
  unsigned long  s_frags_per_group;
  unsigned long  s_inodes_per_group;
  unsigned long  s_mtime;
  unsigned long  s_wtime;
  unsigned short s_mnt_count;
  short          s_max_mnt_count;
  unsigned short s_magic;
  unsigned short s_state;
  unsigned short s_errors;
  unsigned short s_pad;
  unsigned long  s_lastcheck;
  unsigned long  s_checkinterval;
  unsigned long  s_reserved[238];
};

s_inodes_count
the total number of inodes on the fs.

s_blocks_count
the total number of blocks on the fs.

s_r_blocks_count
the total number of blocks reserved for the exclusive use of the superuser.

s_free_blocks_count
the total number of free blocks on the fs.

s_free_inodes_count
the total number of free inodes on the fs.

s_first_data_block
the position on the fs of the first data block. Usually, this is block number 1 for fs containing 1024 bytes blocks and is number 0 for other fs.

s_log_block_size
used to compute the logical block size in bytes. The logical block size is in fact 1024 << s_log_block_size.

s_log_frag_size
used to compute the logical fragment size. The logical fragment size is in fact 1024 << s_log_frag_size if s_log_frag_size is positive and 1024 >> -s_log_frag_size if s_log_frag_size is negative.

s_blocks_per_group
the total number of blocks contained in a group.

s_frags_per_group
the total number of fragments contained in a group.

s_inodes_per_group
the total number of inodes contained in a group.

s_mtime
the time at which the last mount of the fs was performed.

s_wtime
the time at which the last write of the superblock on the fs was performed.

s_mnt_count
the number of time the fs has been mounted in read-write mode without having been checked.

s_max_mnt_count
the maximum number of time the fs may be mounted in read-write mode before a check must be done.

s_magic
a magic number that permits the identification of the file system. It is 0xEF53 for a normal ext2fs and 0xEF51 for versions of ext2fs prior to 0.2b.

s_state
the state of the file system. It contains an or'ed value of EXT2_VALID_FS (0x0001) which means: unmounted cleanly; and EXT2_ERROR_FS (0x0002) which means: errors detected by the kernel code.

s_errors
indicates what operation to perform when an error occurs. See section Error Handling

s_pad
unused.

s_lastcheck
the time of the last check performed on the fs.

s_checkinterval
the maximum possible time between checks on the fs.

s_reserved
unused.

Times are measured in seconds since 00:00:00 GMT, January 1, 1970.

Once the superblock is read in memory, the ext2fs kernel code calculates some other information and keeps them in another structure. This structure has the following layout:

struct ext2_sb_info {
	unsigned long s_frag_size;
	unsigned long s_frags_per_block;
	unsigned long s_inodes_per_block;
	unsigned long s_frags_per_group;
	unsigned long s_blocks_per_group;
	unsigned long s_inodes_per_group;
	unsigned long s_itb_per_group;
	unsigned long s_desc_per_block;
	unsigned long s_groups_count;
	struct buffer_head * s_sbh;
	struct ext2_super_block * s_es;
	struct buffer_head * s_group_desc[EXT2_MAX_GROUP_DESC];
	unsigned short s_loaded_inode_bitmaps;
	unsigned short s_loaded_block_bitmaps;
	unsigned long s_inode_bitmap_number[EXT2_MAX_GROUP_LOADED];
	struct buffer_head * s_inode_bitmap[EXT2_MAX_GROUP_LOADED];
	unsigned long s_block_bitmap_number[EXT2_MAX_GROUP_LOADED];
	struct buffer_head * s_block_bitmap[EXT2_MAX_GROUP_LOADED];
	int s_rename_lock;
	struct wait_queue * s_rename_wait;
	unsigned long  s_mount_opt;
	unsigned short s_mount_state;
};

s_frag_size
fragment size in bytes.

s_frags_per_block
number of fragments in a block.

s_inodes_per_block
number of inodes in a block of the inode table.

s_frags_per_group
number of fragments in a group.

s_blocks_per_group
number of blocks in a group.

s_inodes_per_group
number of inodes in a group.

s_itb_per_group
number of inode table blocks per group.

s_desc_per_block
number of group descriptors per block.

s_groups_count
number of groups.

s_sbh
the buffer containing the disk superblock in memory.

s_es
pointer to the superblock in the buffer.

s_group_desc
pointers to the buffers containing the group descriptors.

s_loaded_inode_bitmaps
number of inodes bitmap cache entries used.

s_loaded_block_bitmaps
number of blocks bitmap cache entries used.

s_inode_bitmap_number
indicates to which group the inodes bitmap in the buffers belong.

s_inode_bitmap
inode bitmap cache.

s_block_bitmap_number
indicates to which group the blocks bitmap in the buffers belong.

s_block_bitmap
block bitmap cache.

s_rename_lock
lock used to avoid two simultaneous rename operations on a fs.

s_rename_wait
wait queue used to wait for the completion of a rename operation in progress.

s_mount_opt
the mounting options specified by the administrator.

s_mount_state

Most of those values are computed from the superblock on disk.

Linux ext2fs manager caches access to the inodes and blocks bitmaps. This cache is a list of buffers ordered from the most recently used to the last recently used buffer. Managers should use the same kind of bitmap caching or other similar method of improving access time to disk.

Go to the previous, next section.


Go to the previous, next section.

Group Descriptors

On disk, the group descriptors immediately follow the superblock and each descriptor has the following layout:

struct ext2_group_desc
{
  unsigned long  bg_block_bitmap;
  unsigned long  bg_inode_bitmap;
  unsigned long  bg_inode_table;
  unsigned short bg_free_blocks_count;
  unsigned short bg_free_inodes_count;
  unsigned short bg_used_dirs_count;
  unsigned short bg_pad;
  unsigned long  bg_reserved[3];
};

bg_block_bitmap
points to the blocks bitmap block for the group.

bg_inode_bitmap
points to the inodes bitmap block for the group.

bg_inode_table
points to the inodes table first block.

bg_free_blocks_count
number of free blocks in the group.

bg_free_inodes_count
number of free inodes in the group.

bg_used_dirs_count
number of inodes allocated to directories in the group.

bg_pad
padding.

The information in a group descriptor pertains only to the group it is actually describing.

Go to the previous, next section.

Go to the previous, next section.

Bitmaps

The ext2 file system uses bitmaps to keep track of allocated blocks and inodes.

The blocks bitmap of each group refers to blocks ranging from the first block in the group to the last block in the group. To access the bit of a precise block, we first have to look for the group the block belongs to and then look for the bit of this block in the blocks bitmap contained in the group. It it very important to note that the blocks bitmap refer in fact to the smallest allocation unit supported by the file system: fragments. Since the block size is always a multiple of fragment size, when the file system manager allocates a block, it actually allocates a multiple number of fragments. This use of the blocks bitmap permits to the file system manager to allocate and deallocate space on a fragment basis.

The inode bitmap of each group refer to inodes ranging from the first inode of the group to the last inode of the group. To access the bit of a precise inode, we first have to look for the group the inode belongs to and then look for the bit of this inode in the inode bitmap contained in the group. To obtain the inode information from the inode table, the process is the same, except that the final search is in the inode table of the group instead of the inode bitmap.

Go to the previous, next section.

Go to the previous, next section.

Inodes

An inode uniquely describes a file. Here's what an inode looks like on disk:

struct ext2_inode {
  unsigned short i_mode;
  unsigned short i_uid;
  unsigned long  i_size;
  unsigned long  i_atime;
  unsigned long  i_ctime;
  unsigned long  i_mtime;
  unsigned long  i_dtime;
  unsigned short i_gid;
  unsigned short i_links_count;
  unsigned long  i_blocks;
  unsigned long  i_flags;
  unsigned long  i_reserved1;
  unsigned long  i_block[EXT2_N_BLOCKS];
  unsigned long  i_version;
  unsigned long  i_file_acl;
  unsigned long  i_dir_acl;
  unsigned long  i_faddr;
  unsigned char  i_frag;
  unsigned char  i_fsize;
  unsigned short i_pad1;
  unsigned long  i_reserved2[2];
};

i_mode
type of file (character, block, link, etc.) and access rights on the file.

i_uid
uid of the owner of the file.

i_size
logical size in bytes.

i_atime
last time the file was accessed.

i_ctime
last time the inode information of the file was changed.

i_mtime
last time the file content was modified.

i_dtime
when this file was deleted.

i_gid
gid of the file.

i_links_count
number of links pointing to this file.

i_blocks
number of blocks allocated to this file counted in 512 bytes units.

i_flags
flags (see below).

i_reserved1
reserved.

i_block
pointers to blocks (see below).

i_version
version of the file (used by NFS).

i_file_acl
control access list of the file (not used yet).

i_dir_acl
control access list of the directory (not used yet).

i_faddr
block where the fragment of the file resides.

i_frag
number of the fragment in the block.

i_size
size of the fragment.

i_pad1
padding.

i_reserved2
reserved.

As you can see, the inode contains, EXT2_N_BLOCKS (15 in ext2fs 0.5) pointers to block. Of theses pointers, the first EXT2_NDIR_BLOCKS (12) are direct pointers to data. The following entry points to a block of pointers to data (indirect). The following entry points to a block of pointers to blocks of pointers to data (double indirection). The following entry points to a block of pointers to a block of pointers to a block of pointers to data (triple indirection).

The inode flags may take one or more of the following or'ed values:

EXT2_SECRM_FL 0x0001
secure deletion. This usually means that when this flag is set and we delete the file, random data is written in the blocks previously allocated to the file.

EXT2_UNRM_FL 0x0002
undelete. When this flag is set and the file is being deleted, the file system code must store enough information to ensure the undeletion of the file (to a certain extent).

EXT2_COMPR_FL 0x0004
compress file. The content of the file is compressed, the file system code must use compression/decompression algorithms when accessing the data of this file.

EXT2_SYNC_FL 0x0008
synchronous updates. The disk representation of this file must be kept in sync with it's in core representation. Asynchronous I/O on this kind of file is not possible. The synchronous updates only apply to the inode itself and to the indirect blocks. Data blocks are always written asynchronously on the disk.

Some inodes have a special meaning:

EXT2_BAD_INO 1
a file containing the list of bad blocks on the file system.

EXT2_ROOT_INO 2
the root directory of the file system.

EXT2_ACL_IDX_INO 3
ACL inode.

EXT2_ACL_DATA_INO 4
ACL inode.

EXT2_BOOT_LOADER_INO 5
the file containing the boot loader. (Not used yet it seems.)

EXT2_UNDEL_DIR_INO 6
the undelete directory of the system.

EXT2_FIRST_INO 11
this is the first inode that does not have a special meaning.

Go to the previous, next section.





Reply via email to