[ https://issues.apache.org/jira/browse/GSOC-257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829351#comment-17829351 ]

Hao Ding commented on GSOC-257:
-------------------------------

https://docs.google.com/document/d/1huy8vHcoCTf-GausabR3PCwXIXfRJAEckeyTulp2gDU/edit#heading=h.j5qg8liqdqtz

>  Apache OpenDAL OVFS Project Proposal
> -------------------------------------
>
>                 Key: GSOC-257
>                 URL: https://issues.apache.org/jira/browse/GSOC-257
>             Project: Comdev GSOC
>          Issue Type: New Feature
>            Reporter: Hao Ding
>            Priority: Major
>              Labels: OpenDAL, gsoc2024, mentor
>
> h1. *1 Project Abstract*
>  
> Virtio is an open standard designed to enhance I/O performance between 
> virtual machines (VMs) and host systems in virtualized environments. VirtioFS 
> is an extension of the Virtio standard specifically crafted for file system 
> sharing between VMs and the host. This is particularly beneficial in 
> scenarios where seamless access to shared files and data between VMs and the 
> host is essential. VirtioFS has been widely adopted in virtualization 
> technologies such as QEMU and Kata Container.
>  
> Apache OpenDAL is a data access layer that allows users to easily and 
> efficiently retrieve data from various storage services in a unified manner. 
> In this project, our goal is to reference virtiofsd (a standard vhost-user 
> backend, a pure Rust implementation of VirtioFS based on the local file 
> system) and implement VirtioFS based on OpenDAL.
>  
> This storage-system-as-a-service approach conceals the details of the 
> distributed storage system's file system from VMs. This ensures the security 
> of storage services, as VMs do not need to be aware of the information, 
> configuration and permission credentials of the accessed storage service. 
> Additionally, it enables the utilization of a new backend storage system 
> without reconfiguring all VMs. Through this project, VMs can access numerous 
> data services through the file system interface with the assistance of the 
> OpenDAL service deployed on the host, all without their awareness. 
> Furthermore, it ensures the efficiency of file system reading and writing in 
> VMs through VirtioFS support.
> h1. *2 Project Detailed Description*
>  
> This chapter serves as an introduction to the overall structure of the 
> project, outlining the design ideas and principles of critical components. It 
> covers the OVFS architecture, interaction principles, design philosophy, 
> metadata operations beyond various storage backends, cache pool design, 
> configuration support, the expected POSIX interface support, and potential 
> usage scenarios of OVFS.
> h2. *2.1 The Architecture of OVFS*
> OVFS (shown in the architecture diagram in the full proposal document) is a 
> file system implementation based on the VirtioFS protocol and OpenDAL. It serves as a 
> bridge for semantic access to file system interfaces between VMs and external 
> storage systems. Leveraging the multiple service access capabilities and 
> unified abstraction provided by OpenDAL, OVFS can conveniently mount shared 
> directories in VMs on various existing distributed storage services.
>  
> The complete OVFS architecture consists of three crucial components:
>  
> 1) A FUSE client in the VMs that supports the VirtioFS protocol and implements the 
> VirtioFS Virtio device specification. An appropriately configured Linux kernel 5.4 
> or later can be used for OVFS. The VirtioFS protocol is built on FUSE and 
> utilizes the VirtioFS Virtio device to transmit FUSE messages. In contrast to 
> traditional FUSE, where the file system daemon runs in the guest user space, 
> the VirtioFS protocol supports forwarding file system requests from the guest 
> to the host, enabling related processes on the host to function as the 
> guest's local file system.
>  
> 2) A hypervisor that implements the VirtioFS Virtio device specification, 
> such as QEMU. The hypervisor needs to adhere to the VirtioFS Virtio device 
> specification, supporting devices used during the operation of VMs, managing 
> the file system operations of the VMs, and delegating these operations to a 
> specific vhost-user device backend implementation.
>  
> 3) A vhost-user backend implementation, namely OVFSD (OVFS daemon). This is a 
> crucial aspect that requires particular attention in this project. This 
> backend is a file system daemon running on the host side, responsible for 
> handling all file system operations from VMs to access the shared directory. 
> virtiofsd offers a practical example of a vhost-user backend implementation, 
> based on pure Rust, forwarding VMs' file system requests to the local file 
> system on the host side.
> h2. *2.2 How OVFSD Interacts with VMs and Hypervisor*
>  
> The Virtio specification defines device emulation and communication between 
> VMs and the hypervisor. Among these, the virtio queue is a core component of 
> the communication mechanism in the Virtio specification and a key mechanism 
> for achieving efficient communication between VMs and the hypervisor. The 
> virtio queue is essentially a shared memory area, called a vring, between VMs and 
> the hypervisor, through which the guest sends data to and receives data from the host.
>  
> Simultaneously, the Virtio specification provides various forms of Virtio 
> device models and data interaction support. The vhost-user backend 
> implemented by OVFSD achieves information transmission through the vhost-user 
> protocol. The vhost-user protocol enables the sharing of virtio queues 
> through communication over Unix domain sockets. Interaction with VMs and the 
> hypervisor is accomplished by listening on the corresponding sockets provided 
> by the hypervisor.
>  
> In terms of specific implementation, the vm-memory crate, virtio-queue crate 
> and vhost-user-backend crate play crucial roles in managing the interaction 
> between OVFSD, VMs, and the hypervisor.
>  
> The vm-memory crate provides encapsulation of VMs memory and achieves 
> decoupling of memory usage. Through the vm-memory crate, OVFSD can access 
> relevant memory without knowing the implementation details of the VMs memory. 
> Two formats of virtio queues are defined in the Virtio specification: split 
> virtio queue and packed virtio queue. The virtio-queue crate provides support 
> for the split virtio queue. Through the DescriptorChain package provided by 
> the virtio-queue crate, OVFSD can parse the corresponding virtio queue 
> structure from the original vring data. The vhost-user-backend crate provides 
> a way to start and stop the file system daemon, as well as encapsulation of 
> vring access. OVFSD implements the vhost-user backend service based on the 
> framework provided by the vhost-user-backend crate and implements the event 
> loop for the file system process to handle requests through this crate.
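> 
> To make the division of labor between these crates concrete, the sketch below 
> shows the rough shape of the OVFSD request loop: descriptor chains are popped 
> from a vring-backed queue, decoded into FUSE requests, and dispatched. The 
> types here (FuseRequest, DescriptorChain, VringQueue) are simplified stand-ins 
> for what the crates above actually provide, not their real APIs.
> 
> {code}
> // Simplified stand-ins for what the virtio-queue and vhost-user-backend
> // crates actually provide; all names here are illustrative only.
> enum FuseRequest {
>     Lookup { parent: u64, name: String },
>     Read { inode: u64, offset: u64, size: u32 },
>     Write { inode: u64, offset: u64, data: Vec<u8> },
> }
> 
> struct DescriptorChain {
>     // In the real implementation this is decoded from guest memory (via the
>     // vm-memory crate) out of the vring buffers the guest filled in.
>     request: FuseRequest,
> }
> 
> struct VringQueue {
>     pending: Vec<DescriptorChain>,
> }
> 
> impl VringQueue {
>     fn pop(&mut self) -> Option<DescriptorChain> {
>         self.pending.pop()
>     }
> }
> 
> // The loop OVFSD runs when the hypervisor signals that the guest has placed
> // new FUSE messages on the shared virtio queue.
> fn handle_queue_event(queue: &mut VringQueue) {
>     while let Some(chain) = queue.pop() {
>         match chain.request {
>             FuseRequest::Lookup { parent, name } => {
>                 // Resolve the name against the OpenDAL backend and reply
>                 // with the attributes of the matching entry.
>                 let _ = (parent, name);
>             }
>             FuseRequest::Read { inode, offset, size } => {
>                 // Serve from the host-side cache, falling back to OpenDAL.
>                 let _ = (inode, offset, size);
>             }
>             FuseRequest::Write { inode, offset, data } => {
>                 // Write into the cache and schedule asynchronous write-back.
>                 let _ = (inode, offset, data);
>             }
>         }
>     }
> }
> 
> fn main() {
>     let mut queue = VringQueue { pending: Vec::new() };
>     handle_queue_event(&mut queue);
> }
> {code}
> 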
> h2. *2.3 OVFS Design Philosophy*
>  
> In this section, we will present the design philosophy of the OVFS project. 
> The concepts introduced here will permeate throughout the entire design and 
> implementation of OVFS, fully manifesting in other sections of the proposal.
>  
> *Stateless Services*
>  
> The mission of OVFS is to provide efficient and flexible data access methods 
> for VMs using Virtio and VirtioFS technologies. Through a stateless service 
> design, OVFS can easily facilitate large-scale deployment, expansion, 
> restarts, and error recovery in a cluster environment running multiple VMs. 
> This seamless integration into existing distributed cluster environments 
> means that users do not need to perceive or maintain additional stateful 
> services because of OVFS.
>  
> To achieve stateless services, OVFS refrains from persisting any metadata 
> information. Instead, it maintains and synchronizes all state information of 
> the OVFS file system during operation through the backend storage system. 
> There are two implications here: OVFS doesn't need to retain additional 
> operational status during runtime, and it doesn't require the maintenance of 
> additional file system metadata when retrieving data from the backend storage 
> system. Consequently, OVFS doesn't necessitate exclusive access to the 
> storage system. It permits any other application to read and write data to 
> the storage system when it serves as the storage backend for OVFS. 
> Furthermore, OVFS ensures that the usage semantics of data in the storage 
> system remain unchanged. All data in the storage system is visible and 
> interpretable to other external applications.
>  
> Under this design, OVFS alleviates concerns regarding synchronization 
> overhead and potential consistency issues stemming from data alterations in 
> the storage system due to external operations, thereby reducing the threshold 
> and risks associated with OVFS usage.
>  
> *Storage System As A Service*
>  
> We aspire for OVFS to serve as a fundamental storage layer within a VM 
> cluster. With OVFS's assistance, VMs can flexibly and conveniently execute 
> data read and write operations through existing distributed storage system 
> clusters. OVFS enables the creation of distinct mount points for various 
> storage systems under the VMs' mount point. This service design pattern 
> facilitates mounting once to access multiple existing storage systems. By 
> accessing different sub-mount points beneath the root mount point of the file 
> system, VMs can seamlessly access various storage services, imperceptible to 
> users.
>  
> This design pattern allows users to customize the data access pipeline of VMs 
> in distributed clusters according to their needs and standardizes the data 
> reading, writing, and synchronization processes of VMs. In case of a network 
> or internal error in a mounted storage system, it will not disrupt the normal 
> operation of other storage systems under different mount points.
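> 
> As a small sketch of this sub-mount-point design (the mount-point names and 
> the route helper below are illustrative, not part of the proposal), a guest 
> path is routed to the storage profile that owns its first path component:
> 
> {code}
> /// Route a path seen by the guest to the storage profile that owns the
> /// matching sub-mount point, plus the path relative to that mount point.
> /// Mount-point names ("s3_fs", "hdfs_fs") are illustrative and match the
> /// configuration example later in the proposal.
> fn route<'a>(guest_path: &'a str, mount_points: &[&'a str]) -> Option<(&'a str, &'a str)> {
>     let rest = guest_path.strip_prefix('/')?;
>     let (mount, relative) = rest.split_once('/').unwrap_or((rest, ""));
>     mount_points.iter().find(|&&m| m == mount).map(|&m| (m, relative))
> }
> 
> fn main() {
>     let mounts = ["s3_fs", "hdfs_fs"];
>     // "/s3_fs/models/ckpt.bin" is served by the s3_fs profile as "models/ckpt.bin".
>     assert_eq!(route("/s3_fs/models/ckpt.bin", &mounts), Some(("s3_fs", "models/ckpt.bin")));
>     // A failure in one backend never touches paths under the other mount point.
>     assert_eq!(route("/hdfs_fs/logs/run.log", &mounts), Some(("hdfs_fs", "logs/run.log")));
> }
> {code}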
>  
> *User-Friendly Interface*
>  
> OVFS must offer users a user-friendly operating interface. This entails 
> ensuring that OVFS is easy to configure, intuitive, and controllable in terms 
> of behavior. OVFS accomplishes this through the following aspects:
>  
> 1) It's essential to offer configurations for different storage systems that 
> align with OpenDAL. For users familiar with OpenDAL, there's no additional 
> learning curve.
>  
> 2) OVFS is deployed using a structured configuration file. The operation and 
> maintenance of OVFS only require a TOML file with clear contents.
>  
> 3) Offer clear documentation, including usage and deployment instructions, 
> along with relevant scenario descriptions.
> h2. *2.4 Metadata Operations Beyond Various Storage Backends*
>  
> OVFS implements a file system model based on OpenDAL. A file system model 
> that provides POSIX semantics should include access to file data and 
> metadata, maintenance of directory trees (hierarchical relationships between 
> files), and additional POSIX interfaces.
>  
> *Lazy Metadata Fetch In OVFS*
>  
> OpenDAL natively supports various storage systems, including object storage, 
> file storage, key-value storage, and more. However, not all storage systems 
> directly offer an abstraction of file systems. Take AWS S3 as an example, 
> which provides object storage services. It abstracts the concepts of buckets 
> and objects, enabling users to create multiple buckets and multiple objects 
> within each bucket. Representing this classic two-level relationship in 
> object storage directly within the nested structure of a file system 
> directory tree poses a challenge.
>  
> To enable OVFS to support various storage systems as file data storage 
> backends, OVFS will offer different assumptions for constructing directory 
> tree semantics for different types of storage systems to achieve file system 
> semantics. This design approach allows OVFS to lazily obtain metadata 
> information without the need to store and maintain additional metadata. 
> Additional metadata not only leads to synchronization and consistency issues 
> that are challenging to handle but also complicates OVFS's implementation of 
> stateless services. Stateful services are difficult to maintain and expand, 
> and they are not suitable for potential virtualization scenarios of OVFS.
>  
> *Metadata Operations Based On Object Storage Backend*
>  
> When OVFS uses an object storage backend, it translates bucket and object 
> names in the object store into directories and files in the file system. A 
> complete directory tree is realized by treating the bucket name as a full 
> directory path in the file system, with the slash character "/" in the bucket 
> name acting as a directory delimiter. All objects in a bucket are treated as 
> files in the corresponding directory. File system operations in the VMs 
> interact with the object storage system through these translations to achieve 
> file-system-based data reading and writing. The following table lists the 
> mapping of some file system operations onto the object storage backend; a 
> short sketch of this path translation follows the table.
> |Metadata Operations|Object Storage Backend Operations|
> |create a directory with the full path "/xxx/yyy"|create a bucket named "/xxx/yyy"|
> |remove a directory with the full path "/xxx/yyy"|remove a bucket named "/xxx/yyy"|
> |read all directory entries under the directory with the full path "/xxx/yyy"|list all objects under the bucket named "/xxx/yyy" and the buckets whose names are prefixed with "/xxx/yyy/"|
> |create a file named "zzz" in a directory with the full path "/xxx/yyy"|create an object named "zzz" under the bucket named "/xxx/yyy"|
> |remove a file named "zzz" in a directory with the full path "/xxx/yyy"|remove an object named "zzz" under the bucket named "/xxx/yyy"|
> *Metadata Operations Based On File Storage Backend*
>  
> Unlike distributed object storage systems, distributed file systems already 
> offer operational support for file system semantics. Therefore, OVFS based on 
> a distributed file system doesn't require additional processing of file 
> system requests and can achieve file system semantics simply by forwarding 
> requests.
>  
> *Limitations Under OVFS Metadata Management*
>  
> While OVFS strives to implement a unified file system access interface for 
> various storage system backends, users still need to be aware of its 
> limitations and potential differences. OVFS supports a range of file system 
> interfaces, but this doesn't imply POSIX standard compliance. OVFS cannot 
> support some file system calls specified in the POSIX standard.
> h2. *2.5 Multi Granular Object Size Cache Pool*
>  
> In order to improve data read and write performance and avoid the significant 
> overhead caused by repeated transmission of hot data between the storage 
> system and the host, OVFSD needs to build a data cache in the memory on the 
> host side.
>  
> *Cache Pool Based On Multi Linked List*
>  
> OVFSD will create a memory pool to cache file data during the file system 
> read and write process. This large memory pool is divided into blocks of 
> different granularities (such as 4 KiB, 16 KiB, and 64 KiB) to accommodate 
> file data blocks of different sizes.
>  
> Unused cache blocks of the same size in the memory pool are organized through 
> a linked list. When a cache block needs to be allocated, the unused cache 
> block can be obtained directly from the head of the linked list. When a cache 
> block that is no longer used needs to be recycled, the cache block is added 
> to the tail of the linked list. With linked lists, allocation and recycling 
> have O(1) algorithmic complexity, and lock-free concurrency can be achieved 
> using CAS operations.
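> 
> A simplified sketch of this multi-granularity pool is shown below. The 
> proposal's design uses CAS-updated lock-free linked lists for the per-size 
> free lists; for brevity this sketch stands them in with mutex-guarded stacks 
> while keeping the O(1) allocate/recycle shape. All names are illustrative.
> 
> {code}
> use std::collections::HashMap;
> use std::sync::Mutex;
> 
> /// A simplified sketch of the multi-granularity cache pool. Each size class
> /// (4 KiB, 16 KiB, 64 KiB, ...) keeps a free list of unused blocks; the
> /// proposal's design organizes these as lock-free linked lists updated with
> /// CAS, for which a mutex-guarded stack stands in here.
> struct CachePool {
>     free_lists: HashMap<usize, Mutex<Vec<Box<[u8]>>>>,
> }
> 
> impl CachePool {
>     fn new(class_sizes: &[usize]) -> Self {
>         let free_lists = class_sizes
>             .iter()
>             .map(|&size| (size, Mutex::new(Vec::new())))
>             .collect();
>         CachePool { free_lists }
>     }
> 
>     /// Allocate a block: O(1) pop from the head of the matching free list,
>     /// falling back to a fresh allocation when the list is empty.
>     fn allocate(&self, size: usize) -> Option<Box<[u8]>> {
>         let class = self.free_lists.keys().copied().filter(|&c| c >= size).min()?;
>         let mut list = self.free_lists[&class].lock().unwrap();
>         Some(list.pop().unwrap_or_else(|| vec![0u8; class].into_boxed_slice()))
>     }
> 
>     /// Recycle a block: O(1) push back onto the free list of its size class.
>     fn recycle(&self, block: Box<[u8]>) {
>         if let Some(list) = self.free_lists.get(&block.len()) {
>             list.lock().unwrap().push(block);
>         }
>     }
> }
> 
> fn main() {
>     let pool = CachePool::new(&[4 * 1024, 16 * 1024, 64 * 1024]);
>     let block = pool.allocate(3000).expect("a 4 KiB block fits a 3000-byte read");
>     pool.recycle(block);
> }
> {code}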
>  
> *Write Back Strategy*
>  
> OVFSD manages the data reading and writing process through a write-back 
> strategy. When writing data, the data is first written to the cache, and 
> dirty data is gradually synchronized to the backend storage system 
> asynchronously. When reading file data, the data is requested from the 
> backend storage system after a cache miss or expiration, the new data is 
> placed in the cache, and its expiration time is set.
>  
> OVFSD will update the dirty data in the cache to the storage system in these 
> cases:
>  
> 1) When VMs call fsync or fdatasync, or use related flags during data 
> writing.
>  
> 2) The cache pool is full, and dirty data needs to be written to make space 
> in the cache. This is also known as cache eviction, and the eviction order 
> can be maintained using the LRU algorithm.
>  
> 3) When background threads that periodically clean dirty or expired data run 
> (a short sketch of these flush conditions is given below).
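> 
> The decision helper below encodes the three triggers above; the names, fields 
> and thresholds are illustrative, not OVFSD's actual data structures.
> 
> {code}
> use std::time::{Duration, Instant};
> 
> /// A cached block with the state the write-back strategy needs.
> struct CachedBlock {
>     dirty: bool,
>     last_updated: Instant,
> }
> 
> struct CacheState {
>     used_blocks: usize,
>     capacity: usize,
>     expiration: Duration,
> }
> 
> /// Decide whether a dirty block must be flushed to the backend storage
> /// system, mirroring the three cases listed above: an explicit fsync /
> /// fdatasync from the guest, cache pressure (eviction), or expiration
> /// reached by the periodic cleaner.
> fn should_flush(block: &CachedBlock, cache: &CacheState, guest_requested_sync: bool) -> bool {
>     if !block.dirty {
>         return false;
>     }
>     guest_requested_sync
>         || cache.used_blocks >= cache.capacity
>         || block.last_updated.elapsed() >= cache.expiration
> }
> 
> fn main() {
>     let cache = CacheState { used_blocks: 10, capacity: 1024, expiration: Duration::from_secs(60) };
>     let block = CachedBlock { dirty: true, last_updated: Instant::now() };
>     assert!(should_flush(&block, &cache, true)); // guest called fsync
> }
> {code}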
>  
> *DAX Window Support (Experimental)*
>  
> The VirtioFS protocol adds an experimental DAX window feature on top of the 
> FUSE protocol. This feature allows memory mapping of file contents to be 
> supported in virtualization scenarios. A mapping is set up by issuing a 
> FUSE request to OVFSD, which then communicates with QEMU to establish the VMs' 
> memory map. VMs can remove mappings in a similar manner. The size of the DAX 
> window can be configured based on available VM address space and memory 
> mapping requirements.
>  
> By using the mmap and memfd mechanisms, OVFSD can use the data in the cache 
> to create an anonymous memory mapping area and share this memory mapping with 
> VMs to implement the DAX Window. The best performance is achieved when the 
> file contents are fully mapped, eliminating the need for file I/O 
> communication with OVFSD. It is possible to use a small DAX window, but this 
> incurs more memory map setup/removal overhead.
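> 
> As a rough host-side illustration of the memfd/mmap mechanism mentioned above 
> (the size and name are hypothetical, and the VirtioFS DAX setup messages 
> exchanged with QEMU are not shown), OVFSD could create and map an anonymous 
> memfd-backed region like this:
> 
> {code}
> // Illustrative only: create an anonymous memfd-backed region and map it,
> // the host-side mechanism OVFSD could use to back a DAX window.
> use std::ffi::CString;
> 
> fn main() -> std::io::Result<()> {
>     let size = 2 * 1024 * 1024; // hypothetical 2 MiB mapping
>     let name = CString::new("ovfsd-dax").unwrap();
>     unsafe {
>         let fd = libc::memfd_create(name.as_ptr(), libc::MFD_CLOEXEC);
>         if fd < 0 {
>             return Err(std::io::Error::last_os_error());
>         }
>         if libc::ftruncate(fd, size as libc::off_t) != 0 {
>             return Err(std::io::Error::last_os_error());
>         }
>         let addr = libc::mmap(
>             std::ptr::null_mut(),
>             size,
>             libc::PROT_READ | libc::PROT_WRITE,
>             libc::MAP_SHARED,
>             fd,
>             0,
>         );
>         if addr == libc::MAP_FAILED {
>             return Err(std::io::Error::last_os_error());
>         }
>         // File data cached by OVFSD could be copied into this region and the
>         // mapping shared with the hypervisor to populate the guest's DAX window.
>         libc::munmap(addr, size);
>         libc::close(fd);
>     }
>     Ok(())
> }
> {code}
> 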
> h2. *2.6 Flexible Configuration Support*
>  
> *Running QEMU With OVFSD*
>  
> As described in the architecture, deploying OVFS involves three parts: a 
> guest kernel with VirtioFS support, QEMU with VirtioFS support, and the 
> VirtioFS daemon (OVFSD). Here is an example of running QEMU with OVFSD:
>  
> _host# ovfsd --config-file=./config.toml_
>  
> _host# qemu-system \_
> _    -blockdev file,node-name=hdd,filename=<image file> \_
> _    -device virtio-blk,drive=hdd \_
> _    -chardev socket,id=char0,path=/tmp/vfsd.sock \_
> _    -device vhost-user-fs-pci,queue-size=1024,chardev=char0,tag=<fs tag> \_
> _    -object memory-backend-memfd,id=mem,size=4G,share=on \_
> _    -numa node,memdev=mem \_
> _    -accel kvm -m 4G_
>  
> _guest# mount -t virtiofs <fs tag> <mount point>_
>  
> The configurations above will generate two devices for the VMs in QEMU. The 
> block device named hdd serves as the backend for the virtio-blk device within 
> the VMs. It functions to store the VMs' disk image files and acts as the 
> primary device within the VMs. Another character device named char0 is 
> implemented as the backend for the vhost-user-fs-pci device using the 
> VirtioFS protocol in the VMs. This character device is of socket type and is 
> connected to the file system daemon in OVFS using the socket path to forward 
> file system messages and requests to OVFSD.
>  
> It is worth noting that this configuration method largely follows the one 
> used by virtiofsd, and omits many VM configurations related to file system 
> access permissions or boundary handling.
>  
> *Enable Different Distributed Storage Systems*
>  
> In order for OVFS to utilize the extensive service support provided by 
> OpenDAL, the corresponding service configuration file needs to be provided 
> when running OVFSD. The parameters in the configuration file are used to 
> support access to the storage system, including data root address and 
> permission authentication. Below is an example of a configuration file, using 
> a TOML format similar to oli (a command line tool based on OpenDAL):
>  
> _[ovfsd_settings]_
> _socket_path = "/tmp/vfsd.sock"_
> _enabled_services = "s3,hdfs"_
> _enabled_cache = true_
> _enabled_cache_write_back = false_
> _enabled_cache_expiration = true_
> _cache_expiration_time = "60s"_
>  
> _[profiles.s3]_
> _type = "s3"_
> _mount_point = "s3_fs"_
> _bucket = "<bucket>"_
> _endpoint = "https://s3.amazonaws.com"_
> _access_key_id = "<access_key_id>"_
> _secret_access_key = "<secret_access_key>"_
>  
> _[profiles.swift]_
> _type = "swift"_
> _mount_point = "swift_fs"_
> _endpoint = "https://openstack-controller.example.com:8080/v1/account"_
> _container = "container"_
> _token = "access_token"_
>  
> _[profiles.hdfs]_
> _type = "hdfs"_
> _mount_point = "hdfs_fs"_
> _name_node = "hdfs://127.0.0.1:9000"_
>  
> OVFS can achieve hot reloading by monitoring changes in the configuration 
> file. This allows OVFS to avoid restarting the entire service when the access 
> configuration or mount settings of individual storage systems are modified, 
> and thus avoids blocking request processing for all file systems in the VMs.
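> 
> A minimal sketch of how OVFSD might parse this configuration is given below, 
> using the serde and toml crates. Struct and field names simply follow the 
> example file above; this is illustrative, not the project's actual code.
> 
> {code}
> use std::collections::HashMap;
> use serde::Deserialize;
> 
> #[derive(Debug, Deserialize)]
> struct OvfsdConfig {
>     ovfsd_settings: Settings,
>     // Each profile is a flat table of string keys; service-specific options
>     // (bucket, endpoint, credentials, name_node, ...) are passed to OpenDAL.
>     profiles: HashMap<String, HashMap<String, String>>,
> }
> 
> #[derive(Debug, Deserialize)]
> struct Settings {
>     socket_path: String,
>     enabled_services: String,
>     enabled_cache: bool,
>     enabled_cache_write_back: bool,
>     enabled_cache_expiration: bool,
>     cache_expiration_time: String,
> }
> 
> fn main() -> Result<(), Box<dyn std::error::Error>> {
>     let text = std::fs::read_to_string("./config.toml")?;
>     let config: OvfsdConfig = toml::from_str(&text)?;
>     println!("vhost-user socket: {}", config.ovfsd_settings.socket_path);
>     for (name, profile) in &config.profiles {
>         let service = profile.get("type").map(String::as_str).unwrap_or("unknown");
>         let mount = profile.get("mount_point").map(String::as_str).unwrap_or("?");
>         println!("profile {name}: service {service} mounted at {mount}");
>     }
>     Ok(())
> }
> {code}
> 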
> h2. *2.7 Expected POSIX Interface Support*
>  
> Finally, the table below lists the expected POSIX system call support to be 
> provided by OVFS, along with the corresponding types of distributed storage 
> systems used by OpenDAL.
>  
> |System Call|Object Storage|File Storage|Key-Value Storage|
> |getattr|Support|Support|Not Support|
> |mknod/unlink|Support|Support|Not Support|
> |mkdir/rmdir|Support|Support|Not Support|
> |open/release|Support|Support|Not Support|
> |read/write|Support|Support|Not Support|
> |truncate|Support|Support|Not Support|
> |opendir/releasedir|Support|Support|Not Support|
> |readdir|Support|Support|Not Support|
> |rename|Support|Support|Not Support|
> |flush/fsync|Support|Support|Not Support|
> |getxattr/setxattr|Not Support|Not Support|Not Support|
> |chmod/chown|Not Support|Not Support|Not Support|
> |access|Not Support|Not Support|Not Support|
>  
> Since the data volume of an individual file may be substantial, which 
> conflicts with the design of key-value storage, we do not intend to include 
> support for key-value storage in this project. The complex permission system control of 
> Linux is not within the scope of this project. Users can restrict file system 
> access behavior based on the configuration of storage system access 
> permissions in the OVFS configuration file.
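> 
> As a library-style sketch, a few of the supported calls in the table above 
> could be served through OpenDAL's async Operator API. The method names (stat, 
> create_dir, delete, write) follow current OpenDAL, but the signatures here are 
> simplified and FUSE reply/error mapping is omitted; treat this as an 
> assumption about the API, not the project's actual code.
> 
> {code}
> use opendal::{Metadata, Operator};
> 
> async fn getattr(op: &Operator, path: &str) -> opendal::Result<Metadata> {
>     // stat() returns the metadata needed to fill a FUSE getattr reply.
>     op.stat(path).await
> }
> 
> async fn mkdir(op: &Operator, path: &str) -> opendal::Result<()> {
>     // Directory paths in OpenDAL end with '/'.
>     op.create_dir(&format!("{}/", path.trim_end_matches('/'))).await
> }
> 
> async fn unlink(op: &Operator, path: &str) -> opendal::Result<()> {
>     op.delete(path).await
> }
> 
> async fn write(op: &Operator, path: &str, data: Vec<u8>) -> opendal::Result<()> {
>     // Later OpenDAL versions return extra information from write(); it is
>     // discarded here since a FUSE write reply only needs success or an error.
>     op.write(path, data).await?;
>     Ok(())
> }
> {code}
> 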
> h2. *2.8 Potential Usage Scenarios*
>  
> In this section, we list some potential OVFS usage scenarios and application 
> areas through the detailed description of the OVFS project in the proposal. 
> It's worth mentioning that as the project progresses, more application 
> scenarios and areas of advantage may expand, leading to a deeper 
> understanding of the positioning of the OVFS project.
>  
> 1) Unified data management basic software within distributed clusters.
>  
> 2) The OVFS project could prove highly beneficial for large-scale data 
> analysis applications and machine learning training projects. It offers a 
> means for applications within VM clusters to read and write data, models, 
> checkpoints, and logs through common file system interfaces across various 
> distributed storage systems.
> h1. *3 Deliverables*
>  
> This chapter describes the items that the OVFS project needs to deliver 
> during the implementation cycle of GSoC 2024.
>  
> 1) A code repository that implements the functions described in the project 
> details. The services implemented by OVFS in the code repository need to meet 
> the following requirements: (1) VirtioFS implementation, well integrated with 
> VMs and QEMU, able to correctly handle VMs read and write requests to the 
> file system. (2) Supports the use of distributed object storage systems and 
> distributed file systems as storage backends, and provides complete and 
> correct support for at least one specific storage service type for each 
> storage system type. S3 can be used as the target for object storage systems, 
> and HDFS can be used as the target for distributed file systems. (3) Supports 
> related configurations of various storage systems. Users can configure 
> storage system access and use according to actual needs. When an error 
> occurs, users can use the configuration file to restart services.
>  
> 2) An OVFS-related test suite. Testing for the project should consist of two 
> parts: (1) Unit tests for code components. Unit tests guarantee that the code 
> and related functions are implemented correctly, and their implementation 
> accompanies the entire coding process. (2) CI testing based on GitHub 
> Actions. The OpenDAL project integrates a large number of CI tests to ensure 
> correct behavior of OpenDAL under various storage backends. OVFS needs good 
> CI testing to catch potential errors during code submission.
>  
> 3) A performance test report for OVFS. The report needs to cover basic 
> metadata operations and data read/write performance tests on VMs with OVFS 
> mounted, and summarize OVFS's performance from the test results. The report 
> can be based on file system performance testing tools such as fio, sysbench, 
> and mdtest, with comparisons against virtiofsd where necessary.
>  
> 4) Documentation introducing OVFS and explaining its use, with the goal of 
> including the OVFS documentation in the official OpenDAL documentation when 
> the GSoC project is completed.
> h1. Mentor
> Mentor: Xuanwo, Apache OpenDAL PMC Chair, 
> [xua...@apache.org|mailto:xua...@apache.org]
> Mailing List: [d...@opendal.apache.org|mailto:d...@opendal.apache.org]



