[RFC 01/16] NOVA: Documentation

Steven Swanson Thu, 03 Aug 2017 00:49:14 -0700

A brief overview is in README.md.

Implementation and usage details are in Documentation/filesystems/nova.txt.


These two papers provide a detailed, high-level description of NOVA's design 
goals and approach:

   NOVA: A Log-structured File system for Hybrid Volatile/Non-volatile Main 
Memories (http://cseweb.ucsd.edu/~swanson/papers/FAST2016NOVA.pdf)

   Hardening the NOVA File System 
(http://cseweb.ucsd.edu/~swanson/papers/TechReport2017HardenedNOVA.pdf)

Signed-off-by: Steven Swanson <swan...@cs.ucsd.edu>
---
 Documentation/filesystems/00-INDEX |    2 
 Documentation/filesystems/nova.txt |  771 ++++++++++++++++++++++++++++++++++++
 MAINTAINERS                        |    8 
 README.md                          |  173 ++++++++
 4 files changed, 954 insertions(+)
 create mode 100644 Documentation/filesystems/nova.txt
 create mode 100644 README.md

diff --git a/Documentation/filesystems/00-INDEX 
b/Documentation/filesystems/00-INDEX
index b7bd6c9009cc..dc5c72273957 100644
--- a/Documentation/filesystems/00-INDEX
+++ b/Documentation/filesystems/00-INDEX
@@ -95,6 +95,8 @@ nfs/
        - nfs-related documentation.
 nilfs2.txt
        - info and mount options for the NILFS2 filesystem.
+nova.txt
+       - info on the NOVA filesystem.
 ntfs.txt
        - info and mount options for the NTFS filesystem (Windows NT).
 ocfs2.txt
diff --git a/Documentation/filesystems/nova.txt 
b/Documentation/filesystems/nova.txt
new file mode 100644
index 000000000000..af90da1c3fb1
--- /dev/null
+++ b/Documentation/filesystems/nova.txt
@@ -0,0 +1,771 @@
+The NOVA Filesystem
+===================
+
+NOVA is a DAX file system designed to maximize performance on hybrid DRAM and
+non-volatile main memory (NVMM) systems while providing strong consistency
+guarantees. NOVA adapts conventional log-structured file system techniques to
+exploit the fast random access that NVMs provide. In particular, it maintains
+separate logs for each inode to improve concurrency, and stores file data
+outside the log to minimize log size and reduce garbage collection costs. 
NOVA's
+logs provide metadata, data, and mmap atomicity and focus on simplicity and
+reliability, keeping complex metadata structures in DRAM to accelerate lookup
+operations.
+
+The main NOVA features include:
+
+  * POSIX semantics
+  * Directly access (DAX) byte-addressable NVMM without page caching
+  * Per-CPU NVMM pool to maximize concurrency
+  * Strong consistency guarantees with 8-byte atomic stores
+  * Full filesystem snapshot with DAX-mmap support
+  * Checksums on metadata and file data (crc32c)
+  * Full metadata replication and RAID-5 parity per file page
+  * Online filesystem integrity check and corruption recovery
+
+Filesystem Design
+=================
+NOVA divides NVMM into five regions. NOVA's 512 B superblock contains global
+file system information and the recovery inode. The recovery inode represents a
+special file that stores recovery information (e.g., the list of unallocated
+NVMM pages). NOVA divides its inode tables into per-CPU stripes. It also
+provides per-CPU journals for complex file operations that involve multiple
+inodes. The rest of the available NVMM stores logs and file data.
+
+NOVA is log-structured and stores a separate log for each inode to maximize
+concurrency and provide atomicity for operations that affect a single file. The
+logs only store metadata and comprise a linked list of 4 KB pages. Log entries
+are small – between 32 and 64 bytes. Logs are generally non-contiguous, and log
+pages may reside anywhere in NVMM.
+
+NOVA keeps read-only copies of most file metadata in DRAM during normal
+operations, eliminating the need to access metadata in NVMM during reads.
+
+NOVA uses copy-on-write to provide atomic updates for file data and appends
+metadata about the write to the log. For operations that affect multiple inodes
+NOVA uses lightweight, fixed-length journals – one per core.
+
+NOVA divides the allocatable NVMM into multiple regions, one region per CPU
+core. A per-core allocator manages each of the regions, minimizing contention
+during memory allocation.
+
+After a system crash, NOVA must scan all the logs to rebuild the memory
+allocator state. Since, there are many logs, NOVA aggressively parallelizes the
+scan.
+
+Using NOVA
+==========
+
+NOVA runs on a pmem non-volatile memory region.  You can create one of these
+regions with the `memmap` kernel command line option.  For instance, adding
+`memmap=16G!8G` to the kernel boot parameters will reserve 16GB memory starting
+from address 8GB, and the kernel will create a `pmem0` block device under the
+`/dev` directory.
+
+After the OS has booted, you can initialize a NOVA instance with the following 
commands:
+
+
+# modprobe nova
+# mount -t NOVA -o init /dev/pmem0 /mnt/ramdisk
+
+
+The above commands create a NOVA instance on `/dev/pmem0` and mounts it on
+`/mnt/ramdisk`.
+
+Nova support several module command line options:
+
+ * metadata_csum: Enable metadata replication and checksums (default 0)
+
+ * data_csum: Compute checksums on file data. (default: 0)
+
+ * data_parity: Compute parity for file data. (default: 0)
+
+ * inplace_data_updates:  Update data in place rather than with COW (default: 
0)
+
+ * wprotect: Make PMEM unwritable and then use CR0.WP to enable writes as
+   needed (default: 0).  You must also install the nd_pmem module as with
+   wprotect =1 (e.g., modprobe nd_pmem readonly=1).
+
+For instance to enable all Nova's data protection features:
+
+# modprobe nova metadata_csum=1\
+              data_csum=1\
+              data_parity=1\
+              wprotect=1
+
+Currently, remounting file systems with different combinations of options may
+not work.
+
+To recover an existing NOVA instance, mount NOVA without the init option, for 
example:
+
+# mount -t NOVA /dev/pmem0 /mnt/ramdisk
+
+### Taking Snapshots
+
+To create a snapshot:
+
+# echo 1 > /proc/fs/NOVA/<device>/create_snapshot
+
+To list the current snapshots:
+
+# cat /proc/fs/NOVA/<device>/snapshots
+
+To mount a snapshot, mount NOVA and specifying the snapshot index, for example:
+
+# mount -t NOVA -o snapshot=<index> /dev/pmem0 /mnt/ramdisk
+
+Users should not write to the file system after mounting a snapshot.
+
+Source File Structure
+=====================
+
+  * nova_def.h/nova.h
+   Defines NOVA macros and key inline functions.
+    
+  * balloc.{h,c}
+    NOVA's block allocator implementation.
+    
+  * bbuild.c
+    Implements recovery routines to restore the in-use inode list, the NVMM
+    allocator information, and the snapshot table.
+
+  * checksum.c
+    Contains checksum-related functions to compute and verify checksums on NOVA
+    data structures and file pages, and also performs recovery actions when
+    corruptions are detected.
+
+  * dax.c
+    Implements DAX read/write functions to access file data. NOVA uses
+    copy-on-write to modify file pages by default, unless inplace data update 
is
+    enabled at mount-time. There are also functions to update and verify the
+    file data integrity information.
+
+  * dir.c
+    Contains functions to create, update, and remove NOVA dentries.
+
+  * file.c
+    Implements file-related operations such as open, fallocate, llseek, fsync,
+    and flush.
+
+  * gc.c
+    NOVA's garbage collection functions. 
+
+  * inode.{h,c}
+    Creates, reads, and frees NOVA inode tables and inodes.
+
+  * ioctl.c
+    Implements some ioctl commands to call NOVA's internal functions.
+
+  * journal.{h,c}
+    For operations that affect multiple inodes NOVA uses lightweight,
+    fixed-length journals – one per core. This file contains functions to
+    create and manage the lite journals.
+
+  * log.{h,c}
+    Functions to manipulate NOVA inode logs, including log page allocation, log
+    entry creation, commit, modification, and deletion.
+
+  * mprotect.{h,c}
+    Implements inline functions to enable/disable writing to different NOVA
+    data structures.
+    
+  * namei.c
+    Functions to create/remove files, directories, and links. It also looks for
+    the NOVA inode number for a given path name.
+
+  * parity.c
+    Functions to compute file page parity bits. Each file page is striped in to
+    equally sized segments (or strips), and one parity strip is calculated 
using
+    RAID-5 method. A function to restore a broken data strip is also 
implemented
+    in this file.
+
+  * perf.{h,c}
+    Function performance measurements. It defines
+    function IDs and call prototypes.  Measures primitive functions'
+    performance, including memory copy functions for DRAM and NVMM, checksum
+    functions, and XOR parity functions.
+
+  * rebuild.c
+    When mounting NOVA after a crash, rebuilds NOVA inodes from its logs. There
+    are also functions to re-calculate checksums and parity bits for file pages
+    that were mmapped during the crash.
+
+  * snapshot.{h,c}
+    Code and data structures for taking snapshots.
+    
+  * stats.h
+    Defines data structures and macros that are relevant to gather NOVA usage
+    statistics.
+
+  * stats.c
+    Implements routines to gather and print NOVA usage statistics.
+
+  * super.{h,c}
+    Super block structures and Nova FS layout and entry points for NOVA
+    mounting and unmounting, initializing or recovering the NOVA super block
+    and other global file system information.
+
+  * symlink.c
+    Implements functions to create and read symbolic links in the filesystem.
+
+  * sysfs.c
+    Implements sysfs entries to take user inputs for taking snapshots, printing
+    NOVA statistics, and measuring function's performance.
+
+
+FS Layout
+======================
+
+A Nova file systems resides in single PMEM device. Nova divides the device int
+4KB blocks.
+
+ block
++-----------------------------------------------------+
+|  0  | primary super block (struct nova_super_block) |
++-----------------------------------------------------+
+|  1  | Reserved inodes                               |
++-----------------------------------------------------+
+|  2  | reserved                                      |
++-----------------------------------------------------+
+|  3  | Journal pointers                              |
++-----------------------------------------------------+
+| 4-5 | Inode pointer tables                          |
++-----------------------------------------------------+
+|  6  | reserved                                      |
++-----------------------------------------------------+
+|  7  | reserved                                      |
++-----------------------------------------------------+
+| ... | data pages                                    |
++-----------------------------------------------------+
+| n-2 | replica reserved Inodes                       |
++-----------------------------------------------------+
+| n-1 | replica super block                           |
++-----------------------------------------------------+
+
+
+
+Superblock and Associated Structures
+====================================
+
+The beginning of the PMEM device hold the super block and its associated
+tables.  These include reserved inodes, a table of pointers to the journals
+Nova uses for complex operations, and pointers to inodes tables.  Nova
+maintains replicas of the super block and reserved inodes in the last two
+blocks of the PMEM area.
+
+
+Block Allocator/Free Lists
+==========================
+
+Nova uses per-CPU allocators to manage free PMEM blocks.  On initialization,
+NOVA divides the range of blocks in the PMEM device among the CPUs, and those
+blocks are managed solely by that CPU.  We call these ranges of "allocation 
regions".
+
+Some of the blocks in an allocation region have fixed roles.  Here's the
+layout:
+
++-------------------------------+
+| data checksum blocks          |
++-------------------------------+
+| data parity blocks            |
++-------------------------------+
+|                               |
+| Allocatable blocks            |
+|                               |
++-------------------------------+
+| replica data parity blocks    |
++-------------------------------+
+| replica data checksum blocks  |
++-------------------------------+
+
+The first and last allocation regions, also contain the super block, inode
+tables, etc. and their replicas, respectively.
+
+Each allocator maintains a red-black tree of unallocated ranges (struct
+nova_range_node).
+
+Allocation Functions
+--------------------
+
+Nova allocate PMEM blocks using two mechanisms:
+
+1.  Static allocation as defined in super.h
+
+2.  Allocation for log and data pages via nova_new_log_blocks() and
+nova_new_data_blocks().
+
+Both of these functions allow the caller to control whether the allocator
+preferes higher addresses for allocation or lower addresses.  We use this to
+encourage meta data structures and their replicas to be far from one another.
+
+PMEM Address Translation
+------------------------
+
+In Nova's persistent data structures, memory locations are given as offsets
+from the beginning of the PMEM region.  nova_get_block() translates offsets to
+PMEM addresses.  nova_get_addr_off() performs the reverse translation.
+
+
+Inodes
+======
+
+Nova maintains per-CPU inode tables, and inode numbers are striped across the
+tables (i.e., inos 0, n, 2n,... on cpu 0; inos 1, n + 1, 2n + 1, ... on cpu 1).
+
+The inodes themselves live in a set of linked lists (one per CPU) of 2MB
+blocks.  The last 8 bytes of each block points to the next block.  Pointers to
+heads of these list live in PMEM block INODE_TABLE0_START and are replicated in
+PMEM block INODE_TABLE1_START.  Additional space for inodes is allocated on
+demand.
+
+To allocate inodes, Nova maintains a per-cpu "inuse_list" in DRAM holds a RB
+tree that holds ranges of unallocated inode numbers.
+
+Logs
+====
+
+Nova maintains a log for each inode that records updates to the inode's
+metadata and holds pointers to the file data.  Nova makes updates to file data
+and metadata atomic by atomically appending log entries to the log.
+
+Each inode contains pointers to head and tail of the inode's log.  When the log
+grows past the end of the last page, nova allocates additional space.  For
+short logs (less than 1MB) , it doubles the length.  For longer logs, it adds a
+fixed amount of additional space (1MB).
+
+Log space is reclaimed during garbage collection.
+
+Log Entries
+-----------
+
+There are eight kinds of log entry, documented in log.h.  The log entries have
+several entries in common:
+
+   1.  'epoch_id' gives the epoch during which the log entry was created.
+   Creating a snapshot increiments the epoch_id for the file systems.
+
+   2.  'trans_id' is filesystem-wide, monotone increasing, number assigned each
+   log entry.  It provides an ordering over all FS operations.
+
+   3.  'invalid' is true if the effects of this entry are dead and the log
+   entry can be garbage collected.
+
+   4.  'csum' is a CRC32 checksum for the entry.
+
+Log structure
+-------------
+
+The logs comprise a linked list of PMEM blocks.  The tail of each block
+
+contains some metadata about the block and pointers to the next block and
+block's replica (struct nova_inode_page_tail).
+
++----------------+
+| log entry      |
++----------------+
+| log entry      |
++----------------+
+| ...            |
++----------------+
+| tail           |
+|  metadata      |
+|  -> next block |
++----------------+
+
+
+Journals
+========
+
+Nova uses a lightweight journaling mechanisms to provide atomicity for
+operations that modify more than one on inode.  The journals providing logging
+for two operations:
+
+1.  Single word updates (JOURNAL_ENTRY)
+2.  Copying inodes (JOURNAL_INODE)
+                                                  
+The journals are undo logs: Nova creates the journal entries for an operation,
+and if the operation does not complete due to a system failure, the recovery
+process rolls back the changes using the journal entries.
+
+To commit, Nova drops the log.
+
+Nova maintains one journal per CPU.  The head and tail pointers for each
+journal live in a reserved page near the beginning of the file system.  
+
+During recovery, Nova scans the journals and undoes the operations described by
+each entry.
+
+
+File and Directory Access
+=========================
+
+To access file data via read(), Nova maintains a radix tree in DRAM for each
+inode (nova_inode_info_header.tree) that maps file offsets to write log
+entries.  For directories, the same tree maps a hash of filenames to their
+corresponding dentry.
+
+In both cases, the nova populates the tree when the file or directory is opened
+by scanning its log.
+
+MMap and DAX
+============
+
+NOVA leverages the kernel's DAX mechanisms for mmap and file data access.  Nova
+maintains a red-black tree in DRAM (nova_inode_info_header.vma_tree) to track
+which portions of a file have been mapped.
+
+Garbage Collection
+==================
+
+Nova recovers log space with a two-phase garbage collection system.  When a log
+reaches the end of its allocated pages, Nova allocates more space.  Then, the
+fast GC algorithm scans the log to remove pages that have no valid entries.
+Then, it estimates how many pages the logs valid entries would fill.  If this
+is less than half the number of pages in the log, the second GC phase copies
+the valid entries to new pages.
+
+For example (V=valid; I=invalid):
+
++---+          +---+           +---+
+| I |         | I |            | V |
++---+         +---+  Thorough  +---+
+| V |         | V |     GC     | V |
++---+         +---+   =====>   +---+
+| I |         | I |            | V |
++---+         +---+            +---+
+| V |         | V |            | V |
++---+         +---+            +---+   
+  |             |             
+  V             V             
++---+         +---+           
+| I |         | V |           
++---+         +---+           
+| I | fast GC  | I |          
++---+  ====>   +---+          
+| I |         | I |           
++---+         +---+           
+| I |         | V |           
++---+         +---+           
+  |            
+  V            
++---+          
+| V |          
++---+          
+| I |          
++---+          
+| I |          
++---+          
+| V |          
++---+            
+
+
+Replication and Checksums
+=========================
+
+Nova protects data and metadat from corruption due to media errors and
+"scribbles" -- software errors in the kernels that may overwrite Nova data.
+
+Replication
+-----------
+
+Nova replicates all PMEM metadata structures (there are a few exceptions.  They
+are WIP).  For structure, there is a primary and an "alternate" (denoted as
+"alter" in the code).  To ensure that Nova can recover a consistent copy of the
+data in case of a failure, Nova first updates the primary, and issues a persist
+barrier to ensure that data is written to NVMM.  Then it does the same for the
+alternate.
+
+Detection
+---------
+
+Nova uses two techniques to detect data corruption.  For media errors, Nova
+should always uses memcpy_from_pmem() to read data from PMEM, usually by
+copying the PMEM data structure into DRAM.
+
+To detect software-caused corruption, Nova uses CRC32 checksums.  All the PMEM
+data structures in Nova include csum field for this purpose.  Nova also
+computes CRC32 checksums each 512-byte slice of each data page.
+
+The checksums are stored in dedicated pages in each CPU's allocation region.
+
+                                                          replica
+                                                 parity   parity       
+                                                page     page    
+            +---+---+---+---+---+---+---+---+    +---+    +---+       
+data page 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 |    | 0 |    | 0 |        
+            +---+---+---+---+---+---+---+---+    +---+    +---+        
+data page 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 |    | 1 |    | 1 |        
+            +---+---+---+---+---+---+---+---+    +---+    +---+        
+data page 2 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 |    | 0 |    | 0 |        
+            +---+---+---+---+---+---+---+---+    +---+    +---+        
+data page 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |    | 0 |    | 0 |        
+            +---+---+---+---+---+---+---+---+    +---+    +---+        
+    ...                    ...                    ...      ...   
+
+Recovery
+--------
+
+Nova uses replication to support recovery of metadata structures and
+RAID4-style parity to recover corrupted data.
+
+If Nova detects corruption of a metadata structure, it restores the structure
+using the replica.
+
+If it detects a corrupt slice of data page, it uses RAID4-style recovery to
+restore it.  The CRC32 checksums for the page slices are replicated.
+
+Cautious allocation
+-------------------
+
+To maximize its resilience to software scribbles, Nova allocate metadata
+structures and their replicas far from one another.  It tries to allocate the
+primary copy at a low address and the replica at a high address within the PMEM
+region.
+
+Write Protection
+----------------
+
+Finally, Nova supports can prevent unintended writes PMEM by mapping the entire
+PMEM device as read-only and then disabling _all_ write protection by clearing
+the WP bit the CR0 control register when Nova needs to perform a write.  The
+wprotect mount-time option controls this behavior.
+
+To map the PMEM device as read-only, we have added a readonly module command
+line option to nd_pmem.  There is probably a better approach to achieving this
+goal. 
+
+Unsafe modes
+============
+
+Nova support modes that disable some of the protections it provides to improve
+perforamnce.
+
+File data
+---------
+
+Nova can disable parity and/or checksums on file data (options 'data_parity=0'
+and 'data_checksum=0').  Without parity, Nova can detect but not recover from
+data corruption.  Without checksums, Nova will still detect and recover from
+media errors, but not scribbles.
+
+Nova also supports in-place file updates (option: inplace_data_updates=1).
+This breaks atomicity for writes, but improve performance, especially for
+sub-page writes, since these require a full page COW in the default mode.
+
+Metadata
+--------
+
+Nova can disable metadata checksums and replication (option 'metadata_csum=0').
+
+
+Snapshots
+=========
+
+Nova supports snapshots to facilitate backups.
+
+Taking a snapshot
+-----------------
+
+Each Nova file systems has a current epoch_id in the super block and each log
+entry has the epoch_id attached to it at creation.  When the user creates a
+snaphot, Nova increments the epoch_id for the file system and the old epoch_id
+identifies the moment the snapshot was taken.
+
+Nova records the epoch_id and a timestamp in a new log entry (struct
+snapshot_info_log_entry) and appends it to the log of the reserved snapshot
+inode (NOVA_SNAPSHOT_INODE) in the superblock.
+
+Nova also maintains a radix tree (nova_sb_info.snapshot_info_tree) of struct
+snapshot_info in DRAM indexed by epoch_id.
+
+Nova also marks all mmap'd pages as read-only and uses COW to preserve file
+contents after the snapshot.
+
+Tracking Live Data
+------------------
+
+Supporting snapshots requires Nova to preserve file contents from previous
+snapshots while also being able to recover the space a snapshot occupied after
+its deletion.
+
+Preserving file contents requires a small change to how Nova implements write
+operations.  To perform a write, Nova appends a write log entry to the file's
+log.  The log entry includes pointers to newly-allocated and populated NVMM
+pages that hold the written data.  If the write overwrites existing data, Nova
+locates the previous write log entry for that portion of the file, and performs
+an "epoch check" that compares the old log entry's epoch_id to the file
+system's current epoch_id.  If the comparison matches, the old write log entry
+and the file data blocks it points to no longer belong to any snapshot, and
+Nova reclaims the data blocks.
+
+If the epoch_id's do not match, then the data in the old log entry belongs to
+an earlier snapshot and Nova leaves the log entry in place.
+
+Determining when to reclaim data belonging to deleted snapshots requires
+additional bookkeeping.  For each snapshot, Nova maintains a "snapshot log"
+that records the inodes and blocks that belong to that snapshot, but are not
+part of the current file system image.
+
+Nova populates the snapshot log during the epoch check: If the epoch_ids for
+the new and old log entries do not match, it appends a log entry (either struct
+snapshot_inode_entry or struct snapshot_file_write_entry) to the snapshot log
+that the old log entry belongs to.  The log entry contains a pointer to the old
+log entry, and the filesystem's current epoch_id as the delete_epoch_id.
+
+To delete a snapshot, Nova removes the snapshot from the list of live snapshots
+and appends its log to the following snapshot's log.  Then, a background thread
+traverses the combined log and reclaims dead inode/data based on the delete
+epoch_id: If the delete epoch_id for an entry in the log is less than or equal
+to the snapshot's epoch_id, it means the log entry and/or the associated data
+blocks are now dead.
+
+Snapshots and DAX
+-----------------
+
+Taking consistent snapshots while applications are modifying files using
+DAX-style mmap requires NOVA to reckon with the order in which stores to NVMM
+become persistent (i.e., reach physical NVMM so they will survive a system
+failure).  These applications rely on the processor's ``memory persistence
+model'' [http://dl.acm.org/citation.cfm?id=2665671.2665712] to make guarantees
+about when and in what order stores become persistent.  These guarantees allow
+the application to restore their data to a consistent state during recovery
+from a system failure.
+
+From the application's perspective, reading a snapshot is equivalent to
+recovering from a system failure.  In both cases, the contents of the
+memory-mapped file reflect its state at a moment when application operations
+might be in-flight and when the application had no chance to shut down cleanly.
+
+A naive approach to checkpointing mmap()'d files in NOVA would simply mark each
+of the read/write mapped pages as read-only and then do copy-on-write when a
+store occurs to preserve the old pages as part of the snapshot.
+
+However, this approach can leave the snapshot in an inconsistent state:
+Setting the page to read-only captures its contents for the
+snapshot, and the kernel requires NOVA to set the pages as read-only
+one at a time.  So, if the order in which NOVA marks pages as read-only
+is incompatible with ordering that the application requires, the snapshot will
+contain an inconsistent version of the file.
+
+To resolve this problem, when NOVA starts marking pages as read-only, it blocks
+page faults to the read-only mmap()'d pages until it has marked all the pages
+read-only and finished taking the snapshot.
+
+More detail is available in the technical report referenced at the top of this
+document.
+
+We have implemented this functionality in NOVA by adding the 'original_write'
+flag to struct vm_area_struct that tracks whether the vm_area_struct is created
+with write permission, but has been marked read-only in the course of taking a
+snapshot.  We have also added a 'dax_cow' operation to struct
+vm_operations_struct that the page fault handler runs when applications write
+to a page with original_write = 1.  NOVA's dax_cow operation
+(nova_restore_page_write()) performs the COW, maps the page to a new physical
+page and allows writing.
+
+Saving Snapshot State
+---------------------
+
+During a clean shutdown, Nova stores the snapshot information to PMEM.
+
+Nova reserves an inode for storing snapshot information.  The log for the inode
+contains an entry for each snapshot (struct snapshot_info_log_entry).  On
+shutdown, Nova allocates one page (struct snapshot_nvmm_page) to store an array
+of struct snapshot_nvmm_list.
+
+Each of these lists (one per CPU) contains head and tail pointers to a linked
+list of blocks (just like an inode log).  The lists contain a struct
+snapshot_file_write_entry or struct snapshot_inode_entry for each operation
+that modified file data or an inode.
+
+Superblock
++--------------------+
+|   ...              |
++--------------------+
+| Reserved Inodes    |
++---+----------------+
+|   |     ...        |
++---+----------------+
+| 7 | Snapshot Inode |
+|   | head           |
++---+----------------+
+        /
+       /
+      / 
++---------+---------+---------+
+|  Snap   |  Snap   |  Snap   |
+| epoch=1 | epoch=4 | epoch=11|
+|         |         |         |
+|nvmm_page|nvmm_page|nvmm_page|
++---------+---------+---------+
+     |
+     |
++----------+   +--------+--------+
+|  cpu 0   |   | snap  | snap   |      
+|   head   |-->| inode | write  |
+|          |   | entry  | entry  |      
+|          |   +--------+--------+
++----------+   +--------+--------+
+|  cpu 1   |   | snap  | snap   |
+|   head   |-->| write | write  |
+|          |   | entry  | entry  |
+|          |   +--------+--------+
++----------+ 
+|    ...   | 
++----------+   +--------+
+|  cpu 128 |   | snap  |
+|   head   |-->| inode |
+|          |   | entry  |
+|          |   +--------+
++----------+
+
+
+Umount and Recovery
+===================
+
+Clean umount/mount
+------------------
+
+On a clean unmount, Nova saves the contents of many of its DRAM data structures
+to PMEM to accelerate the next mount:
+
+1. Nova stores the allocator state for each of the per-cpu allocators to the
+   log of a reserved inode (NOVA_BLOCK_NODE_INO).
+    
+2. Nova stores the per-CPU lists of available inodes (the inuse_list) to the
+   NOVA_BLOCK_INODELIST1_INO reserved inode.
+
+3. Nova stores the snapshot state to PMEM as described above.
+
+After a clean unmount, the following mount restores these data and then
+invalidates them.
+
+Recovery after failures
+------------------------
+
+In case of a unclean dismount (e.g., system crash), Nova must rebuild these
+DRAM structures by scanning the inode logs.  Nova log scanning is fast because
+per-CPU inode tables and per-inode logs allow for parallel recovery.
+
+The number of live log entries in an inode log is roughly the number of extents
+in the file.  As a result, Nova only needs to scan a small fraction of the NVMM
+during recovery.
+
+The Nova failure recovery consists of two steps:
+
+First, Nova checks its lite weight journals and rolls back any uncommitted
+transactions to restore the file system to a consistent state.
+
+Second, Nova starts a recovery thread on each CPU and scans the inode tables in
+parallel, performing log scanning for every valid inode in the inode table.
+Nova use different recovery mechanisms for directory inodes and file inodes:
+For a directory inode, Nova scans the log's linked list to enumerate the pages
+it occupies, but it does not inspect the log's contents.  For a file inode,
+Nova reads the write entries in the log to enumerate the data pages.
+
+During the recovery scan Nova builds a bitmap of occupied pages, and rebuilds
+the allocator based on the result. After this process completes, the file
+system is ready to accept new requests.
+
+During the same scan, it rebuilds the snapshot information and the list
+available inodes.
+
diff --git a/MAINTAINERS b/MAINTAINERS
index 767e9d202adf..cfcee556acc6 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9108,6 +9108,14 @@ F:       drivers/power/supply/bq27xxx_battery_i2c.c
 F:     drivers/power/supply/isp1704_charger.c
 F:     drivers/power/supply/rx51_battery.c
 
+NOVA FILE SYSTEM
+M:     Andiry Xu <jix...@cs.ucsd.edu>
+M:     Steven Swanson <swan...@cs.ucsd.edu>
+L:     linux-fsde...@vger.kernel.org
+L:     linux-nvdimm@lists.01.org
+F:     Documentation/filesystems/nova.txt
+F:     fs/nova/
+
 NTB DRIVER CORE
 M:     Jon Mason <jdma...@kudzu.us>
 M:     Dave Jiang <dave.ji...@intel.com>
diff --git a/README.md b/README.md
new file mode 100644
index 000000000000..4f778e99a79e
--- /dev/null
+++ b/README.md
@@ -0,0 +1,173 @@
+# NOVA: NOn-Volatile memory Accelerated log-structured file system
+
+NOVA's goal is to provide a high-performance, full-featured, production-ready
+file system tailored for byte-addressable non-volatile memories (e.g., NVDIMMs
+and Intel's soon-to-be-released 3DXpoint DIMMs).  It combines design elements
+from many other file systems to provide a combination of high-performance,
+strong consistency guarantees, and comprehensive data protection.  NOVA support
+DAX-style mmap and making DAX performs well is a first-order priority in NOVA's
+design.  NOVA was developed by the [Non-Volatile Systems Laboratory][NVSL] in
+the [Computer Science and Engineering Department][CSE] at the [University of
+California, San Diego][UCSD].
+
+
+NOVA is primarily a log-structured file system, but rather than maintain a
+single global log for the entire file system, it maintains separate logs for
+each file (inode).  NOVA breaks the logs into 4KB pages, they need not be
+contiguous in memory.  The logs only contain metadata.
+
+File data pages reside outside the log, and log entries for write operations
+point to data pages they modify.  File modification uses copy-on-write (COW) to
+provide atomic file updates.
+
+For file operations that involve multiple inodes, NOVA use small, fixed-sized
+redo logs to atomically append log entries to the logs of the inodes involned.
+
+This structure keeps logs small and make garbage collection very fast.  It also
+enables enormous parallelism during recovery from an unclean unmount, since
+threads can scan logs in parallel.
+
+NOVA replicates and checksums all metadata structures and protects file data
+with RAID-4-style parity.  It supports checkpoints to facilitate backups.
+
+A more thorough discussion of NOVA's design is avaialable in these two papers:
+
+**NOVA: A Log-structured File system for Hybrid Volatile/Non-volatile Main 
Memories** 
+[PDF](http://cseweb.ucsd.edu/~swanson/papers/FAST2016NOVA.pdf)<br>
+*Jian Xu and Steven Swanson*<br>
+Published in [FAST 2016][FAST2016]
+
+**Hardening the NOVA File System**
+[PDF](http://cseweb.ucsd.edu/~swanson/papers/TechReport2017HardenedNOVA.pdf) 
<br>
+UCSD-CSE Techreport CS2017-1018
+*Jian Xu, Lu Zhang, Amirsaman Memaripour, Akshatha Gangadharaiah, Amit Borase, 
Tamires Brito Da Silva, Andy Rudoff, Steven Swanson*<br>
+
+Read on for further details about NOVA's overall design and its current status 
+
+### Compatibilty with Other File Systems
+
+NOVA aims to be compatible with other Linux file systems.  To help verify that 
it achieves this we run several test suites against NOVA each night.
+
+* The latest version of XFSTests. ([Current 
failures](https://github.com/NVSL/linux-nova/issues?q=is%3Aopen+is%3Aissue+label%3AXFSTests))
+* The (Linux testing project)(https://linux-test-project.github.io/) file 
system tests.
+* The (fstest POSIX test suite)[POSIXtest].
+
+Currently, nearly all of these tests pass for the `master` branch, and we have
+run complex programs on NOVA.  There are, of course, many bugs left to fix.
+
+NOVA uses the standard PMEM kernel interfaces for accessing and managing
+persistent memory.
+
+### Atomicity
+
+By default, NOVA makes all metadata and file data operations atomic.
+
+Strong atomicity guarantees make it easier to build reliable applications on
+NOVA, and NOVA can provide these guarantees with sacrificing much performance
+because NVDIMMs support very fast random access.
+
+NOVA also supports "unsafe data" and "unsafe metadata" modes that
+improve performance in some cases and allows for non-atomic updates of file
+data and metadata, respectively.
+
+### Data Protection
+
+NOVA aims to protect data against both misdirected writes in the kernel (which
+can easily "scribble" over the contents of an NVDIMM) as well as media errors.
+
+NOVA protects all of its metadata data structures with a combination of
+replication and checksums.  It protects file data using RAID-5 style parity.
+
+NOVA can detects data corruption by verifying checksums on each access and by
+catching and handling machine check exceptions (MCEs) that arise when the
+system's memory controller detects at uncorrectable media error.
+
+We use a fault injection tool that allows testing of these recovery mechanisms.
+
+To facilitate backups, NOVA can take snapshots of the current filesystem state
+that can be mounted read-only while the current file system is mounted
+read-write.
+
+The tech report list above describes the design of NOVA's data protection 
system in detail.
+
+### DAX Support
+
+Supporting DAX efficiently is a core feature of NOVA and one of the challenges
+in designing NOVA is reconciling DAX support which aims to avoid file system
+intervention when file data changes, and other features that require such
+intervention.
+
+NOVA's philosophy with respect to DAX is that when a program uses DAX mmap to
+to modify a file, the program must take full responsibility for that data and
+NOVA must ensure that the memory will behave as expected.  At other times, the
+file system provides protection.  This approach has several implications:
+
+1. Implementing `msync()` in user space works fine.
+
+2. While a file is mmap'd, it is not protected by NOVA's RAID-style parity
+mechanism, because protecting it would be too expensive.  When the file is
+unmapped and/or during file system recovery, protection is restored.
+
+3. The snapshot mechanism must be careful about the order in which in adds
+pages to the file's snapshot image.
+
+### Performance
+
+The research paper and technical report referenced above compare NOVA's
+performance to other file systems.  In almost all cases, NOVA outperforms other
+DAX-enabled file systems.  A notable exception is sub-page updates which incur
+COW overheads for the entire page.
+
+The technical report also illustrates the trade-offs between our protection
+mechanisms and performance.
+
+## Gaps, Missing Features, and Development Status
+
+Although NOVA is a fully-functional file system, there is still much work left
+to be done.  In particular, (at least) the following items are currently 
missing:
+
+1.  There is no mkfs or fsk utility (`mount` takes `-o init` to create a NOVA 
file system)
+2.  NOVA doesn't scrub data to prevent corruption from accumulating in 
infrequently accessed data.
+3.  NOVA doesn't read bad block information on mount and attempt recovery of 
the effected data.
+4.  NOVA only works on x86-64 kernels.
+5.  NOVA does not currently support extended attributes or ACL.
+6.  NOVA does not currently prevent writes to mounted snapshots.
+7.  Using `write()` to modify pages that are mmap'd is not supported.
+8.  NOVA deoesn't provide quota support.
+9.  Moving NOVA file systems between machines with different numbers of CPUs 
does not work.
+10. Remounting a NOVA file system with different mount options may fail.
+
+None of these are fundamental limitations of NOVA's design.  Additional bugs
+and issues are here [here][https://github.com/NVSL/linux-nova/issues].
+
+NOVA is complete and robust enough to run a range of complex applications, but
+it is not yet ready for production use.  Our current focus is on adding a few
+missing features list above and finding/fixing bugs.
+
+## Building and Using NOVA
+
+This repo contains a version of the Linux with NOVA included.  You should be
+able to build and install it just as you would the mainline Linux source.
+
+### Building NOVA
+
+To build NOVA, build the kernel with PMEM (`CONFIG_BLK_DEV_PMEM`), DAX 
(`CONFIG_FS_DAX`) and NOVA (`CONFIG_NOVA_FS`) support.  Install as usual.
+
+## Hacking and Contributing
+
+The NOVA source code is almost completely contains in the `fs/nova` directory.
+The execptions are some small changes in the kernel's memory management system
+to support checkpointing.
+
+`Documentation/filesystems/nova.txt` describes the internals of Nova in more 
detail.
+
+If you find bugs, please [report 
them](https://github.com/NVSL/linux-nova/issues).
+
+If you have other questions or suggestions you can contact the NOVA developers 
at [cse-nova-hack...@eng.ucsd.edu](mailto:cse-nova-hack...@eng.ucsd.edu).
+
+
+[NVSL]: http://nvsl.ucsd.edu/ "http://nvsl.ucsd.edu";
+[POSIXtest]: http://www.tuxera.com/community/posix-test-suite/ 
+[FAST2016]: https://www.usenix.org/conference/fast16/technical-sessions
+[CSE]: http://cs.ucsd.edu
+[UCSD]: http://www.ucsd.edu
\ No newline at end of file

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

[RFC 01/16] NOVA: Documentation

Reply via email to