[ 
https://issues.apache.org/jira/browse/HDDS-12659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Andika updated HDDS-12659:
-------------------------------
    Description: 
In the future, Ozone could support a "one file per container" storage layout. 
Currently, Ozone supports FilePerBlock (current) and FilePerChunk (deprecated).

The current FilePerBlock storage layout has the following benefits:
 * No write contention when writing blocks belonging to the same container
 ** However, for Ratis pipelines this is also guaranteed by the sequential nature of the Raft algorithm
 * A block file can be deleted as soon as the datanode receives the deletion command

However, the FilePerBlock layout does not handle large numbers of small files well, since each block is stored as a separate file. This increases the inode tree size on the datanodes and causes memory issues when we need to check all the block files (e.g. the scanner, or computing volume size with "du"). So while Ozone alleviates the small-files problem on the metadata side (by storing metadata persistently in RocksDB), we have not fully addressed the small-file issues on the data side (the storage layout). The number of inodes in use can be checked with the "df -i" command.
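
As a purely illustrative sketch of the file explosion under FilePerBlock, counting the per-block files beneath a container's chunks directory might look like the following; the directory path and class name are hypothetical examples, not the exact Ozone layout:

{code:java}
// Illustrative only: with FilePerBlock, each block is a separate file, so the
// number of files (and inodes) under a container's chunks directory grows with
// the number of blocks stored. The directory path below is a made-up example.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class BlockFileCount {
  public static void main(String[] args) throws IOException {
    Path chunksDir = Path.of("/data/hdds/container-1/chunks"); // example path
    try (Stream<Path> entries = Files.list(chunksDir)) {
      long blockFiles = entries.filter(Files::isRegularFile).count();
      System.out.println("block files (one inode each): " + blockFiles);
    }
  }
}
{code}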

For example, we recently saw one DN with high read traffic exhibit high memory usage due to the inode and dentry caches and the FS buffer cache (buffer_head):

slabtop -sc

{code}
Active / Total Objects (% used)    : 48293416 / 58851801 (82.1%)
 Active / Total Slabs (% used)      : 1326850 / 1326850 (100.0%)
 Active / Total Caches (% used)     : 103 / 140 (73.6%)
 Active / Total Size (% used)       : 8047950.31K / 13247836.52K (60.7%)
 Minimum / Average / Maximum Object : 0.01K / 0.22K / 16.69K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME                   
10807090 3342402  30%    0.57K 192996       56   6175872K radix_tree_node
30966195 30764318  99%    0.10K 794005       39   3176020K buffer_head
788121 184537  23%    1.05K  26277       30    840864K ext4_inode_cache
3066378 2012259  65%    0.19K  73010       42    584080K dentry
3323376 3323376 100%    0.14K  59346       56    474768K ext4_groupinfo_4k
432557 426347  98%    0.65K   8847       49    283104K proc_inode_cache
244750 241424  98%    0.58K   4450       55    142400K inode_cache
144109 143451  99%    0.81K   3696       39    118272K sock_inode_cache
 13932  13480  96%    7.62K   3486        4    111552K task_struct
444608 433987  97%    0.25K  13895       32    111160K filp
490680 487837  99%    0.20K  12267       40     98136K vm_area_struct
 18632  18486  99%    4.00K   2329        8     74528K kmalloc-4k
146208 146080  99%    0.50K   4569       32     73104K kmalloc-512
 66348  65889  99%    1.00K   2076       32     66432K kmalloc-1k
 25958  25381  97%    2.19K   1856       14     59392K TCP
 28866  28792  99%    2.00K   1805       16     57760K kmalloc-2k
1003750 212013  21%    0.05K  13750       73     55000K Acpi-Parse
 44940  44906  99%    1.12K   1605       28     51360K signal_cache
700736 699330  99%    0.06K  10949       64     43796K anon_vma_chain
 42016  42016 100%    1.00K   1313       32     42016K UNIX
335392 333507  99%    0.12K  10481       32     41924K pid
 19605  19504  99%    2.06K   1307       15     41824K sighand_cache
 36540  36540 100%    1.06K   1218       30     38976K mm_struct
 34209  34209 100%    1.06K   1143       30     36576K UDP
 52118  52118 100%    0.69K   1133       46     36256K files_cache
410090 408997  99%    0.09K   8915       46     35660K anon_vma
410536 410532  99%    0.07K   7331       56     29324K Acpi-Operand
148512 148135  99%    0.19K   3536       42     28288K cred_jar
 38180  37593  98%    0.69K    830       46     26560K shmem_inode_cache
392448  54238  13%    0.06K   6132       64     24528K vmap_area
 79712  79158  99%    0.25K   2491       32     19928K skbuff_head_cache
  2424   2424 100%    8.00K    606        4     19392K kmalloc-8k
139980 139980 100%    0.13K   2333       60     18664K kernfs_node_cache
 12750  12750 100%    1.25K    510       25     16320K UDPv6
250752 249441  99%    0.06K   3918       64     15672K kmalloc-64
497664 491010  98%    0.03K   3888      128     15552K kmalloc-32
 80976  80811  99%    0.19K   1928       42     15424K kmalloc-192
 14532  14532 100%    1.00K    455       32     14560K RAW
193144 192752  99%    0.07K   3449       56     13796K eventpoll_pwq
349758 319683  91%    0.04K   3429      102     13716K ext4_extent_status
165087 164781  99%    0.08K   3237       51     12948K task_delay_info
 66486  66364  99%    0.19K   1583       42     12664K skbuff_ext_cache
  4927   4927 100%    2.31K    379       13     12128K TCPv6
  5586   5522  98%    2.00K    350       16     11200K biovec-128
  9248   9248 100%    1.00K    289       32      9248K biovec-64
209814 209814 100%    0.04K   2057      102      8228K avtab_extended_perms
126784 122402  96%    0.06K   1981       64      7924K kmalloc-rcl-64
 32930  32634  99%    0.21K    890       37      7120K file_lock_cache
 14144  14112  99%    0.50K    442       32      7072K skbuff_fclone_cache
178092 178092 100%    0.04K   1746      102      6984K pde_opener
  1648   1586  96%    4.00K    206        8      6592K biovec-max
421632 421363  99%    0.02K   1647      256      6588K kmalloc-16
 67032  65446  97%    0.09K   1596       42      6384K kmalloc-rcl-96
 13536  13178  97%    0.44K    376       36      6016K kmem_cache
 17493  17212  98%    0.31K    343       51      5488K mnt_cache
  1192   1120  93%    4.00K    149        8      4768K names_cache
  4356   4356 100%    0.94K    129       34      4128K PING
 41244  40645  98%    0.09K    982       42      3928K kmalloc-96
 15520  15296  98%    0.25K    485       32      3880K dquot
482816 482816 100%    0.01K    943      512      3772K kmalloc-8
 15072  14952  99%    0.25K    471       32      3768K kmalloc-256
 28704  15324  53%    0.12K    897       32      3588K kmalloc-128
 55104  55104 100%    0.06K    861       64      3444K ext4_io_end
 64090  64090 100%    0.05K    754       85      3016K ftrace_event_field
189440 188001  99%    0.02K    740      256      2960K lsm_file_cache
 93952  92800  98%    0.03K    734      128      2936K jbd2_revoke_record_s
 11220  11220 100%    0.24K    340       33      2720K tw_sock_TCP
 34272  34272 100%    0.08K    672       51      2688K Acpi-State
{code}


An alternative storage layout could be one file per container. This is implemented in some existing distributed object stores / file systems, such as SeaweedFS volumes (similar to Facebook's Haystack).

This has the benefit of reducing the number of small files on the datanode: one container file can hold hundreds or thousands of logical files.

Additionally, we could move the block metadata into the container file itself instead of the container DB, to ensure O(1) disk seeks per read. Currently, we need to check the container DB first and then fetch the associated blocks, which may incur more disk seeks than necessary (depending on the read amplification of RocksDB, etc.). A rough illustration of such a read path is sketched below.
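
The following is a minimal, hypothetical sketch (none of these classes exist in Ozone) of what a single-seek read against a container file plus an index could look like; the index itself could be loaded from a separate index file or from a header/superblock:

{code:java}
// Hypothetical sketch of a "one file per container" read path.
// All names and the layout are illustrative; this is not an Ozone API.
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class ContainerFileReader {
  // Where a block's bytes live inside the single container file (Java record for brevity).
  record BlockLocation(long offset, int length, boolean deleted) {}

  private final RandomAccessFile containerFile;
  // Populated from a separate "index file" or from the data file's header/superblock.
  private final Map<Long, BlockLocation> index = new ConcurrentHashMap<>();

  ContainerFileReader(String path) throws IOException {
    this.containerFile = new RandomAccessFile(path, "r");
  }

  // One seek + one read per block, i.e. O(1) disk seeks, assuming the index is in memory.
  byte[] readBlock(long blockId) throws IOException {
    BlockLocation loc = index.get(blockId);
    if (loc == null || loc.deleted()) {
      throw new IOException("Block " + blockId + " not found or deleted");
    }
    byte[] data = new byte[loc.length()];
    synchronized (containerFile) {      // RandomAccessFile is not thread-safe
      containerFile.seek(loc.offset());
      containerFile.readFully(data);
    }
    return data;
  }
}
{code}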

However, this also comes with some drawbacks:
 * Bookkeeping is required
 ** We need to keep some metadata (e.g. to track which block lives at which offset of the container file), which could be implemented as a separate "index file" or in the header (superblock) of the data file
 * Deletion is delayed until a compaction / reclamation task runs
 ** Deleting a block only marks that block as deleted
 ** A separate background task runs the compaction (garbage collection), creating a new container file with the deleted blocks removed (a rough sketch of such a pass follows this list)
 *** This can momentarily increase datanode space usage, since a new file needs to be created
 * Write contention on the same file
 ** If two clients write to the same container file at the same time, a file lock needs to be used to prevent race conditions
 ** This introduces write contention and will reduce write throughput.
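
A rough, hypothetical sketch of such a compaction pass (all names are made up, and readers would also need to be switched over to the new index atomically) could look like this:

{code:java}
// Hypothetical compaction (garbage collection) pass over a container file:
// copies only live blocks into a new file, then renames it over the old one.
// Illustrative only; not an existing Ozone component.
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;
import java.util.LinkedHashMap;
import java.util.Map;

class ContainerCompactor {
  record BlockLocation(long offset, int length, boolean deleted) {}

  void compact(Path containerFile, Map<Long, BlockLocation> index) throws IOException {
    Path tmp = containerFile.resolveSibling(containerFile.getFileName() + ".compacting");
    Map<Long, BlockLocation> newIndex = new LinkedHashMap<>();
    try (FileChannel in = FileChannel.open(containerFile, StandardOpenOption.READ);
         FileChannel out = FileChannel.open(tmp, StandardOpenOption.CREATE,
             StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
      long writeOffset = 0;
      for (Map.Entry<Long, BlockLocation> e : index.entrySet()) {
        BlockLocation loc = e.getValue();
        if (loc.deleted()) {
          continue;                               // skip tombstoned blocks to reclaim space
        }
        // Note: live data is temporarily duplicated on disk until the rename below.
        long transferred = 0;
        while (transferred < loc.length()) {
          transferred += in.transferTo(loc.offset() + transferred,
              loc.length() - transferred, out);
        }
        newIndex.put(e.getKey(), new BlockLocation(writeOffset, loc.length(), false));
        writeOffset += loc.length();
      }
      out.force(true);                            // flush data before the rename
    }
    // Atomic rename on most POSIX file systems; the new index must be published too.
    Files.move(tmp, containerFile, StandardCopyOption.REPLACE_EXISTING);
    index.clear();
    index.putAll(newIndex);
  }
}
{code}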

We might also store small files directly in RocksDB (e.g. using [https://github.com/facebook/rocksdb/wiki/BlobDB]), as in the sketch below.
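
A minimal sketch, assuming a RocksJava version recent enough to expose the integrated BlobDB options; the key scheme, DB path, and thresholds below are placeholders, not recommendations:

{code:java}
// Minimal sketch of keeping small block data as RocksDB values backed by BlobDB.
// Assumes a recent RocksJava release with the blob options; values are placeholders.
import java.nio.charset.StandardCharsets;
import org.rocksdb.CompressionType;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class SmallFileBlobSketch {
  public static void main(String[] args) throws RocksDBException {
    RocksDB.loadLibrary();
    try (Options options = new Options()
             .setCreateIfMissing(true)
             .setEnableBlobFiles(true)                 // large values go to blob files
             .setMinBlobSize(4 * 1024L)                // values >= 4 KiB stored as blobs
             .setEnableBlobGarbageCollection(true)     // reclaim space from deleted blobs
             .setBlobCompressionType(CompressionType.LZ4_COMPRESSION);
         RocksDB db = RocksDB.open(options, "/tmp/small-file-blobdb")) {
      byte[] key = "container-1/block-42".getBytes(StandardCharsets.UTF_8); // hypothetical key scheme
      db.put(key, new byte[8 * 1024]);                 // small block payload
      byte[] value = db.get(key);
      System.out.println("read back " + value.length + " bytes");
    }
  }
}
{code}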

This is a long-term wish to kickstart a discussion on the feasibility of this storage layout in Ozone.

> One File per Container Storage Layout
> -------------------------------------
>
>                 Key: HDDS-12659
>                 URL: https://issues.apache.org/jira/browse/HDDS-12659
>             Project: Apache Ozone
>          Issue Type: Wish
>            Reporter: Ivan Andika
>            Assignee: Ivan Andika
>            Priority: Major
>



