[ https://issues.apache.org/jira/browse/GSOC-257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829351#comment-17829351 ]
Hao Ding commented on GSOC-257:
-------------------------------

https://docs.google.com/document/d/1huy8vHcoCTf-GausabR3PCwXIXfRJAEckeyTulp2gDU/edit#heading=h.j5qg8liqdqtz

Apache OpenDAL OVFS Project Proposal
-------------------------------------

Key: GSOC-257
URL: https://issues.apache.org/jira/browse/GSOC-257
Project: Comdev GSOC
Issue Type: New Feature
Reporter: Hao Ding
Priority: Major
Labels: OpenDAL, gsoc2024, mentor

h1. *1 Project Abstract*

Virtio is an open standard designed to enhance I/O performance between virtual machines (VMs) and host systems in virtualized environments. VirtioFS is an extension of the Virtio standard specifically crafted for file system sharing between VMs and the host. This is particularly beneficial in scenarios where seamless access to shared files and data between VMs and the host is essential. VirtioFS has been widely adopted in virtualization technologies such as QEMU and Kata Containers.

Apache OpenDAL is a data access layer that allows users to easily and efficiently retrieve data from various storage services in a unified manner. In this project, our goal is to implement VirtioFS based on OpenDAL, using virtiofsd (a standard vhost-user backend and a pure-Rust implementation of VirtioFS backed by the local file system) as a reference.

This storage-system-as-a-service approach hides the details of the distributed storage system's file system from VMs. It keeps storage services secure, because VMs do not need to know the information, configuration, or access credentials of the storage service being accessed. It also makes it possible to adopt a new backend storage system without reconfiguring all VMs. Through this project, VMs can access numerous data services through the file system interface with the assistance of the OpenDAL service deployed on the host, without being aware of it, while VirtioFS support keeps file system reads and writes in the VMs efficient.

h1. *2 Project Detailed Description*

This chapter introduces the overall structure of the project and outlines the design ideas and principles of its critical components. It covers the OVFS architecture, interaction principles, design philosophy, metadata operations across various storage backends, cache pool design, configuration support, the expected POSIX interface support, and potential usage scenarios of OVFS.

h2. *2.1 The Architecture of OVFS*

(The OVFS architecture diagram is included in the linked proposal document.) OVFS is a file system implementation based on the VirtioFS protocol and OpenDAL. It serves as a bridge for semantic access to file system interfaces between VMs and external storage systems. Leveraging the multiple-service access capabilities and unified abstraction provided by OpenDAL, OVFS can conveniently mount shared directories in VMs on various existing distributed storage services.

The complete OVFS architecture consists of three crucial components:

1) A FUSE client in the VMs that supports the VirtioFS protocol and implements the VirtioFS Virtio device specification. An appropriately configured Linux 5.4 or later guest kernel can be used with OVFS. The VirtioFS protocol is built on FUSE and uses the VirtioFS Virtio device to transmit FUSE messages. In contrast to traditional FUSE, where the file system daemon runs in the guest user space, the VirtioFS protocol forwards file system requests from the guest to the host, enabling a process on the host to act as the guest's local file system.
2) A hypervisor that implements the VirtioFS Virtio device specification, such as QEMU. The hypervisor must adhere to the VirtioFS Virtio device specification, support the devices used while the VMs are running, manage the file system operations of the VMs, and delegate these operations to a specific vhost-user device backend implementation.

3) A vhost-user backend implementation, namely OVFSD (the OVFS daemon). This is the component that requires the most attention in this project. It is a file system daemon running on the host side, responsible for handling all file system operations issued by VMs against the shared directory. virtiofsd offers a practical example of a vhost-user backend implementation: it is written in pure Rust and forwards the VMs' file system requests to the local file system on the host side.

h2. *2.2 How OVFSD Interacts with VMs and the Hypervisor*

The Virtio specification defines device emulation and communication between VMs and the hypervisor. The virtio queue is a core component of this communication mechanism and the key to efficient communication between VMs and the hypervisor. A virtio queue is essentially a shared memory area, called a vring, between VMs and the hypervisor, through which the guest sends and receives data to and from the host.

The Virtio specification also provides several Virtio device models and forms of data interaction. The vhost-user backend implemented by OVFSD transmits information through the vhost-user protocol, which shares virtio queues via communication over Unix domain sockets. OVFSD interacts with the VMs and the hypervisor by listening on the corresponding sockets provided by the hypervisor.

In terms of implementation, the vm-memory crate, the virtio-queue crate, and the vhost-user-backend crate play crucial roles in managing the interaction between OVFSD, the VMs, and the hypervisor.

The vm-memory crate encapsulates VM memory and decouples its usage: through it, OVFSD can access the relevant memory without knowing the implementation details of the VMs' memory. The Virtio specification defines two virtio queue formats, the split virtio queue and the packed virtio queue; the virtio-queue crate supports the split virtio queue. Through the DescriptorChain abstraction provided by the virtio-queue crate, OVFSD can parse the corresponding virtio queue structure from the raw vring data. The vhost-user-backend crate provides a way to start and stop the file system daemon, as well as an encapsulation of vring access. OVFSD implements the vhost-user backend service on the framework provided by the vhost-user-backend crate and uses it to implement the event loop through which the file system process handles requests. A sketch of the resulting request-dispatch shape follows below.
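To make the division of labor concrete, the sketch below (not part of the proposal's code) shows the request-dispatch shape OVFSD could take once a FUSE message has been pulled off the vring. The FuseOpcode, FuseRequest, and FuseReply types and the dispatch function are hypothetical stand-ins: in the real daemon the request would be parsed from a DescriptorChain provided by the virtio-queue crate, the event loop would be driven by the vhost-user-backend crate, and each arm would call into an OpenDAL operator instead of returning a canned payload.

{code}
// Minimal, self-contained sketch of OVFSD's request-dispatch shape.
// All types here are hypothetical stand-ins for data parsed from the vring.

#[derive(Debug)]
enum FuseOpcode {
    Lookup,
    Getattr,
    Read { offset: u64, size: u32 },
    Write { offset: u64, data: Vec<u8> },
    Unsupported(u32),
}

/// One FUSE message after it has been read out of a descriptor chain.
struct FuseRequest {
    unique: u64, // request id echoed back in the reply
    nodeid: u64, // inode the request targets
    opcode: FuseOpcode,
}

/// Reply payload to be written back into the guest-visible buffers.
struct FuseReply {
    unique: u64,
    payload: Vec<u8>,
}

fn dispatch(req: FuseRequest) -> FuseReply {
    // In OVFSD each arm would call into an OpenDAL-backed handler rather than
    // the host's local file system that virtiofsd uses.
    let payload = match req.opcode {
        FuseOpcode::Lookup => b"lookup reply".to_vec(),
        FuseOpcode::Getattr => b"attr reply".to_vec(),
        FuseOpcode::Read { offset, size } => {
            format!("read {size} bytes at {offset} from node {}", req.nodeid).into_bytes()
        }
        FuseOpcode::Write { offset, data } => {
            format!("wrote {} bytes at offset {offset}", data.len()).into_bytes()
        }
        FuseOpcode::Unsupported(op) => format!("ENOSYS for opcode {op}").into_bytes(),
    };
    FuseReply { unique: req.unique, payload }
}

fn main() {
    // Stand-in for the event loop: pop a request, dispatch it, push the reply.
    let reply = dispatch(FuseRequest {
        unique: 1,
        nodeid: 1,
        opcode: FuseOpcode::Read { offset: 0, size: 4096 },
    });
    println!("reply #{}: {} bytes", reply.unique, reply.payload.len());
}
{code}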
h2. *2.3 OVFS Design Philosophy*

In this section, we present the design philosophy of the OVFS project. The concepts introduced here permeate the entire design and implementation of OVFS and are fully reflected in the other sections of the proposal.

*Stateless Services*

The mission of OVFS is to provide efficient and flexible data access for VMs using Virtio and VirtioFS technologies. Through a stateless service design, OVFS can easily support large-scale deployment, expansion, restarts, and error recovery in a cluster environment running multiple VMs. Because it integrates seamlessly into existing distributed cluster environments, users do not need to perceive or maintain any additional stateful service on account of OVFS.

To achieve stateless services, OVFS refrains from persisting any metadata. Instead, it maintains and synchronizes all file system state during operation through the backend storage system. This has two implications: OVFS does not need to retain additional operational state at runtime, and it does not need to maintain additional file system metadata when retrieving data from the backend storage system. Consequently, OVFS does not require exclusive access to the storage system: any other application may read and write data in the storage system while it serves as the storage backend for OVFS. Furthermore, OVFS keeps the usage semantics of data in the storage system unchanged, so all data in the storage system remains visible to and interpretable by other external applications.

Under this design, OVFS avoids the synchronization overhead and potential consistency issues that would arise when external operations alter data in the storage system, which lowers the barrier and the risks of using OVFS.

*Storage System As A Service*

We intend OVFS to serve as a fundamental storage layer within a VM cluster. With OVFS's assistance, VMs can flexibly and conveniently read and write data through existing distributed storage system clusters. OVFS can create distinct mount points for different storage systems under the VMs' mount point. This service design pattern allows mounting once to access multiple existing storage systems: by accessing different sub-mount points beneath the root mount point of the file system, VMs can seamlessly reach various storage services, imperceptibly to users (a sketch of this sub-mount-point routing follows at the end of this section).

This design pattern allows users to customize the data access pipeline of VMs in distributed clusters according to their needs and standardizes the data reading, writing, and synchronization processes of VMs. If a network or internal error occurs in one mounted storage system, it does not disrupt the normal operation of the other storage systems under different mount points.

*User-Friendly Interface*

OVFS must offer a user-friendly operating interface. This means OVFS should be easy to configure, intuitive, and predictable in its behavior. OVFS accomplishes this through the following aspects:

1) It offers configurations for different storage systems that align with OpenDAL, so users familiar with OpenDAL face no additional learning curve.

2) OVFS is deployed using a structured configuration file. Operating and maintaining OVFS only requires a TOML file with clear content.

3) It offers clear documentation, including usage and deployment instructions, along with descriptions of relevant scenarios.
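As a concrete illustration of the sub-mount-point idea mentioned above (a sketch only, not the proposal's code): the first path component under the shared directory selects a backend profile, and the remainder of the path is resolved inside that backend. The mount-point names below reuse the mount_point values from the configuration example in section 2.6; the route function and its types are hypothetical.

{code}
use std::collections::HashMap;

/// Sketch of sub-mount-point routing: map the first path component under the
/// OVFS root to a configured backend profile, and hand the rest of the path
/// to that backend.
fn route<'a>(
    mounts: &'a HashMap<&'a str, &'a str>, // mount point -> backend profile
    guest_path: &'a str,
) -> Option<(&'a str, &'a str)> {
    let rest = guest_path.strip_prefix('/')?;
    let (mount, inner) = rest.split_once('/').unwrap_or((rest, ""));
    mounts.get(mount).map(|profile| (*profile, inner))
}

fn main() {
    let mut mounts = HashMap::new();
    mounts.insert("s3_fs", "profiles.s3");
    mounts.insert("hdfs_fs", "profiles.hdfs");

    // "/s3_fs/models/checkpoint.bin" is served by the S3 profile with the
    // in-backend path "models/checkpoint.bin".
    assert_eq!(
        route(&mounts, "/s3_fs/models/checkpoint.bin"),
        Some(("profiles.s3", "models/checkpoint.bin"))
    );
    // A path under an unknown mount point is rejected without touching any backend.
    assert_eq!(route(&mounts, "/unknown/file"), None);
    println!("routing checks passed");
}
{code}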
h2. *2.4 Metadata Operations Across Various Storage Backends*

OVFS implements a file system model based on OpenDAL. A file system model that provides POSIX semantics must include access to file data and metadata, maintenance of directory trees (the hierarchical relationships between files), and additional POSIX interfaces.

*Lazy Metadata Fetch In OVFS*

OpenDAL natively supports various storage systems, including object storage, file storage, key-value storage, and more. However, not all storage systems directly offer a file system abstraction. Take AWS S3 as an example: it provides object storage built on the concepts of buckets and objects, allowing users to create multiple buckets and multiple objects within each bucket. Representing this classic two-level relationship of object storage directly in the nested structure of a file system directory tree poses a challenge.

To let OVFS support various storage systems as file data storage backends, OVFS adopts different conventions for constructing directory tree semantics for the different types of storage systems. This design allows OVFS to obtain metadata lazily, without storing and maintaining additional metadata. Additional metadata would not only lead to synchronization and consistency issues that are hard to handle, but would also complicate OVFS's implementation of stateless services; stateful services are difficult to maintain and scale and are not suitable for the virtualization scenarios OVFS targets.

*Metadata Operations Based On An Object Storage Backend*

OVFS on an object storage backend works by translating the names of buckets and objects in object storage into directories and files in the file system. A complete directory tree is realized by treating a bucket name as a full path in the file system and treating the slash character "/" in the bucket name as a directory delimiter. All objects in a bucket are considered files in the corresponding directory. File system operations in the VMs interact with the object storage system through this kind of translation to read and write data with file system semantics. The following table lists the mapping of some file system operations onto the object storage system; a small path-translation sketch follows the table.

|Metadata Operations|Object Storage Backend Operations|
|create a directory with the full path "/xxx/yyy"|create a bucket named "/xxx/yyy"|
|remove a directory with the full path "/xxx/yyy"|remove a bucket named "/xxx/yyy"|
|read all directory entries under the directory with the full path "/xxx/yyy"|list all objects under the bucket named "/xxx/yyy" and the buckets whose names are prefixed with "/xxx/yyy/"|
|create a file named "zzz" in a directory with the full path "/xxx/yyy"|create an object named "zzz" under the bucket named "/xxx/yyy"|
|remove a file named "zzz" in a directory with the full path "/xxx/yyy"|remove an object named "zzz" under the bucket named "/xxx/yyy"|
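The sketch below (illustrative only, not the final OVFS interface) spells out the path translation in the table above, assuming the convention that a directory's full path is used directly as a bucket name and each file becomes an object named after it inside its parent directory's bucket.

{code}
/// Where a file system path lands in the object storage backend.
#[derive(Debug, PartialEq)]
enum Location {
    /// A directory maps to the bucket whose name is the full directory path.
    Bucket(String),
    /// A file maps to an object named after the file, inside the bucket
    /// named after its parent directory.
    Object { bucket: String, object: String },
}

fn resolve(path: &str, is_dir: bool) -> Location {
    if is_dir {
        return Location::Bucket(path.to_string());
    }
    // Split "/xxx/yyy/zzz" into the bucket "/xxx/yyy" and the object "zzz".
    let (parent, name) = path.rsplit_once('/').unwrap_or(("", path));
    Location::Object {
        bucket: if parent.is_empty() { "/".to_string() } else { parent.to_string() },
        object: name.to_string(),
    }
}

fn main() {
    assert_eq!(resolve("/xxx/yyy", true), Location::Bucket("/xxx/yyy".to_string()));
    assert_eq!(
        resolve("/xxx/yyy/zzz", false),
        Location::Object { bucket: "/xxx/yyy".to_string(), object: "zzz".to_string() }
    );
    // Reading the directory "/xxx/yyy" would combine two backend calls: list the
    // objects in the bucket "/xxx/yyy" and list the buckets prefixed with "/xxx/yyy/".
    println!("mapping checks passed");
}
{code}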
*Metadata Operations Based On A File Storage Backend*

Unlike distributed object storage systems, distributed file systems already support file system semantics natively. Therefore, OVFS on a distributed file system backend does not require additional processing of file system requests and can achieve file system semantics simply by forwarding requests.

*Limitations Under OVFS Metadata Management*

While OVFS strives to provide a unified file system access interface over various storage backends, users still need to be aware of its limitations and the differences between backends. OVFS supports a range of file system interfaces, but this does not imply POSIX compliance: OVFS cannot support some of the file system calls specified in the POSIX standard.

h2. *2.5 Multi-Granularity Object Size Cache Pool*

To improve data read and write performance and avoid the significant overhead of repeatedly transferring hot data between the storage system and the host, OVFSD builds a data cache in host memory.

*Cache Pool Based On Multiple Linked Lists*

OVFSD creates a memory pool to cache file data during file system reads and writes. This large memory pool is divided into cache blocks of different granularities (such as 4 KiB, 16 KiB, and 64 KiB) to fit file data blocks of different sizes (a sketch of this layout follows at the end of this section).

Unused cache blocks of the same size are organized in a linked list. When a cache block needs to be allocated, an unused block is taken directly from the head of the list; when a block is no longer needed, it is returned to the tail of the list. With linked lists, allocation and recycling are O(1), and lock-free concurrency can be achieved with CAS operations.

*Write Back Strategy*

OVFSD manages reads and writes with a write-back strategy. When writing, data is first written to the cache, and dirty data is gradually synchronized to the backend storage system asynchronously. When reading, on a cache miss or expiration the data is requested from the backend storage system, the new data is inserted into the cache, and its expiration time is set.

OVFSD writes dirty data in the cache back to the storage system in these cases:

1) When VMs call fsync or fdatasync, or use related flags during data writing.

2) When the cache pool is full and dirty data must be written back to make space. This is cache eviction, and the eviction order can be maintained with an LRU algorithm.

3) When background threads that periodically clean dirty or expired data run.

*DAX Window Support (Experimental)*

The VirtioFS protocol extends the FUSE protocol with the experimental DAX window feature, which allows memory mapping of file contents in virtualization scenarios. A mapping is set up by issuing a FUSE request to OVFSD, which then communicates with QEMU to establish the VMs' memory map; VMs can remove a mapping in a similar manner. The size of the DAX window can be configured based on the available VM address space and the memory mapping requirements.

Using the mmap and memfd mechanisms, OVFSD can back an anonymous memory mapping area with the data in its cache and share this mapping with VMs to implement the DAX window. The best performance is achieved when file contents are fully mapped, eliminating the need for file I/O communication with OVFSD. A smaller DAX window can be used, but it incurs more overhead for setting up and removing memory maps.
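The sketch below illustrates the multi-granularity cache pool described in this section. For brevity, each size class is a Mutex-protected Vec used as a stack; the design above would replace the mutex with a lock-free linked list maintained through CAS operations, and the class sizes and block counts here are illustrative assumptions rather than final values.

{code}
use std::sync::Mutex;

/// Block size classes the pool is carved into (illustrative values).
const CLASSES: [usize; 3] = [4 * 1024, 16 * 1024, 64 * 1024];

/// One free list per size class; index i holds unused blocks of CLASSES[i].
struct CachePool {
    free_lists: Vec<Mutex<Vec<Box<[u8]>>>>,
}

impl CachePool {
    fn new(blocks_per_class: usize) -> Self {
        let free_lists: Vec<Mutex<Vec<Box<[u8]>>>> = CLASSES
            .iter()
            .map(|&size| {
                let blocks = (0..blocks_per_class)
                    .map(|_| vec![0u8; size].into_boxed_slice())
                    .collect();
                Mutex::new(blocks)
            })
            .collect();
        CachePool { free_lists }
    }

    /// Pick the smallest class that fits `len`, then pop a free block: O(1).
    fn alloc(&self, len: usize) -> Option<Box<[u8]>> {
        let class = CLASSES.iter().position(|&size| size >= len)?;
        self.free_lists[class].lock().unwrap().pop()
    }

    /// Return a block to the free list of its own size class: O(1).
    fn free(&self, block: Box<[u8]>) {
        if let Some(class) = CLASSES.iter().position(|&size| size == block.len()) {
            self.free_lists[class].lock().unwrap().push(block);
        }
    }
}

fn main() {
    let pool = CachePool::new(4);
    // A 10 KiB write is served from the 16 KiB class.
    let block = pool.alloc(10 * 1024).expect("a 16 KiB block should be free");
    assert_eq!(block.len(), 16 * 1024);
    pool.free(block);
    println!("cache pool checks passed");
}
{code}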
h2. *2.6 Flexible Configuration Support*

*Running QEMU With OVFSD*

As described in the architecture, deploying OVFS involves three parts: a guest kernel with VirtioFS support, QEMU with VirtioFS support, and the VirtioFS daemon (OVFSD). Here is an example of running QEMU with OVFSD:

{code}
host# ovfsd --config-file=./config.toml

host# qemu-system \
    -blockdev file,node-name=hdd,filename=<image file> \
    -device virtio-blk,drive=hdd \
    -chardev socket,id=char0,path=/tmp/vfsd.sock \
    -device vhost-user-fs-pci,queue-size=1024,chardev=char0,tag=<fs tag> \
    -object memory-backend-memfd,id=mem,size=4G,share=on \
    -numa node,memdev=mem \
    -accel kvm -m 4G

guest# mount -t virtiofs <fs tag> <mount point>
{code}

The configuration above creates two devices for the VMs in QEMU. The block device named hdd serves as the backend for the virtio-blk device in the VMs; it stores the VMs' disk image files and acts as the primary disk of the VMs. The character device named char0 serves as the backend for the vhost-user-fs-pci device, which uses the VirtioFS protocol in the VMs. This character device is of socket type and is connected, through the socket path, to the file system daemon in OVFS, forwarding file system messages and requests to OVFSD.

It is worth noting that this configuration largely follows virtiofsd's and omits many VM configuration options related to file system access permissions and edge-case handling.

*Enabling Different Distributed Storage Systems*

For OVFS to use the extensive service support provided by OpenDAL, the corresponding service configuration must be provided when running OVFSD. The parameters in the configuration file support access to the storage systems, including the data root address and permission credentials. Below is an example of a configuration file, using a TOML format similar to that of oli (a command-line tool based on OpenDAL); a sketch of parsing this format follows at the end of this section:

{code}
[ovfsd_settings]
socket_path = "/tmp/vfsd.sock"
enabled_services = "s3,hdfs"
enabled_cache = true
enabled_cache_write_back = false
enabled_cache_expiration = true
cache_expiration_time = "60s"

[profiles.s3]
type = "s3"
mount_point = "s3_fs"
bucket = "<bucket>"
endpoint = "https://s3.amazonaws.com"
access_key_id = "<access_key_id>"
secret_access_key = "<secret_access_key>"

[profiles.swift]
type = "swift"
mount_point = "swift_fs"
endpoint = "https://openstack-controller.example.com:8080/v1/account"
container = "container"
token = "access_token"

[profiles.hdfs]
type = "hdfs"
mount_point = "hdfs_fs"
name_node = "hdfs://127.0.0.1:9000"
{code}

OVFS can achieve hot reloading by monitoring changes to the configuration file. This lets OVFS avoid restarting the entire service when certain storage system access settings or mounts are modified, and thus avoids blocking request processing for all file systems in the VMs.
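To show how OVFSD could consume such a configuration, here is a sketch of loading it with serde and the toml crate. The struct and field names simply mirror the sample configuration above and are assumptions, not a finalized schema; backend-specific keys are collected into a flattened map and handed to OpenDAL as-is.

{code}
// Assumed dependencies: serde = { version = "1", features = ["derive"] }, toml = "0.8".
use serde::Deserialize;
use std::collections::HashMap;

#[derive(Debug, Deserialize)]
struct OvfsdConfig {
    ovfsd_settings: Settings,
    profiles: HashMap<String, Profile>,
}

#[derive(Debug, Deserialize)]
struct Settings {
    socket_path: String,
    enabled_services: String,
    enabled_cache: bool,
    enabled_cache_write_back: bool,
    enabled_cache_expiration: bool,
    cache_expiration_time: String,
}

#[derive(Debug, Deserialize)]
struct Profile {
    #[serde(rename = "type")]
    service_type: String,
    mount_point: String,
    /// Backend-specific keys (bucket, endpoint, name_node, ...) passed through to OpenDAL.
    #[serde(flatten)]
    options: HashMap<String, String>,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let text = std::fs::read_to_string("config.toml")?;
    let config: OvfsdConfig = toml::from_str(&text)?;
    println!(
        "listening on {} with {} profile(s)",
        config.ovfsd_settings.socket_path,
        config.profiles.len()
    );
    for (name, profile) in &config.profiles {
        println!("  [{name}] type={} mounted at /{}", profile.service_type, profile.mount_point);
    }
    Ok(())
}
{code}

Hot reloading could then be implemented by watching config.toml for changes and rebuilding only the affected profiles, matching the behavior described above.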
h2. *2.7 Expected POSIX Interface Support*

Finally, the table below lists the POSIX system call support OVFS is expected to provide, by type of distributed storage system accessed through OpenDAL.

|System Call|Object Storage|File Storage|Key-Value Storage|
|getattr|Support|Support|Not Support|
|mknod/unlink|Support|Support|Not Support|
|mkdir/rmdir|Support|Support|Not Support|
|open/release|Support|Support|Not Support|
|read/write|Support|Support|Not Support|
|truncate|Support|Support|Not Support|
|opendir/releasedir|Support|Support|Not Support|
|readdir|Support|Support|Not Support|
|rename|Support|Support|Not Support|
|flush/fsync|Support|Support|Not Support|
|getxattr/setxattr|Not Support|Not Support|Not Support|
|chmod/chown|Not Support|Not Support|Not Support|
|access|Not Support|Not Support|Not Support|

Since an individual file may be large, which conflicts with the design of key-value storage, we do not intend to support key-value storage in this project. Linux's complex permission system is also out of scope; users can restrict file system access behavior through the storage system access permissions configured in the OVFS configuration file.

h2. *2.8 Potential Usage Scenarios*

In this section we list some potential usage scenarios and application areas for OVFS, based on the detailed description of the project in this proposal. As the project progresses, more application scenarios and areas of advantage may emerge, leading to a deeper understanding of the positioning of the OVFS project.

1) Unified data management infrastructure within distributed clusters.

2) Large-scale data analysis applications and machine learning training projects. OVFS offers applications within VM clusters a way to read and write data, models, checkpoints, and logs on various distributed storage systems through common file system interfaces.

h1. *3 Deliverables*

This chapter describes the items the OVFS project needs to deliver during the GSoC 2024 implementation cycle.

1) A code repository implementing the functions described in the project details. The services implemented by OVFS in the code repository need to meet the following requirements: (1) A VirtioFS implementation, well integrated with VMs and QEMU, that correctly handles the VMs' read and write requests to the file system. (2) Support for distributed object storage systems and distributed file systems as storage backends, with complete and correct support for at least one concrete storage service of each type; S3 can be the target object storage system and HDFS the target distributed file system. (3) Support for the relevant configuration of the various storage systems, so users can configure storage system access according to their actual needs and, when an error occurs, restart the service from the configuration file.

2) An OVFS test suite. Testing for the project consists of two parts: (1) Unit tests in the code components. Unit testing guarantees that the code and related functions are implemented correctly, and accompanies the entire implementation process. (2) CI tests based on GitHub Actions. The OpenDAL project integrates a large number of CI tests to ensure correct behavior across storage backends; OVFS needs good CI coverage to catch potential errors at code submission time.
3) A performance test report for OVFS. The report should run basic metadata operation and data read/write performance tests on VMs with OVFS mounted, and summarize OVFS's performance from the results. The report can be based on file system benchmarking tools such as fio, sysbench, and mdtest, and compared against virtiofsd where appropriate.

4) Documentation introducing OVFS and describing how to use it, with the goal of getting the OVFS documentation included in the official OpenDAL documentation once the GSoC project is completed.

h1. Mentor

Mentor: Xuanwo, Apache OpenDAL PMC Chair, [xua...@apache.org|mailto:xua...@apache.org]
Mailing List: [d...@opendal.apache.org|mailto:d...@opendal.apache.org]