[
https://issues.apache.org/jira/browse/HDDS-12659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ivan Andika updated HDDS-12659:
-------------------------------
Description:
In the future, Ozone could support a "one file per container" storage layout.
Currently, Ozone supports FilePerBlock (current) and FilePerChunk (deprecated).
The current FilePerBlock storage layout has the following benefits:
* No write contention when writing blocks belonging to the same container
** However, for Ratis pipelines this is also guaranteed by the sequential nature of the Raft algorithm
* A block file can be deleted as soon as the datanode receives the deletion command
However, the FilePerBlock layout does not handle a lot of small files well, since each block is stored as a separate file. This increases the inode tree size on the datanodes and causes memory issues when we need to walk all the block files (e.g. the scanner, or computing the volume size with "du"). So while Ozone alleviates the small files problem on the metadata side (by storing it persistently in RocksDB), we haven't fully addressed the small file issues on the data side (storage layout). We can check the number of inodes using the "df -i" command.
For example, we recently saw one DN with high read traffic showing high memory usage due to the inode and dentry caches and the FS buffer cache (buffer_head). Output of "slabtop -sc":
{code:none}
Active / Total Objects (% used) : 48293416 / 58851801 (82.1%)
Active / Total Slabs (% used) : 1326850 / 1326850 (100.0%)
Active / Total Caches (% used) : 103 / 140 (73.6%)
Active / Total Size (% used) : 8047950.31K / 13247836.52K (60.7%)
Minimum / Average / Maximum Object : 0.01K / 0.22K / 16.69K
OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
10807090 3342402 30% 0.57K 192996 56 6175872K radix_tree_node
30966195 30764318 99% 0.10K 794005 39 3176020K buffer_head
788121 184537 23% 1.05K 26277 30 840864K ext4_inode_cache
3066378 2012259 65% 0.19K 73010 42 584080K dentry
3323376 3323376 100% 0.14K 59346 56 474768K ext4_groupinfo_4k
432557 426347 98% 0.65K 8847 49 283104K proc_inode_cache
244750 241424 98% 0.58K 4450 55 142400K inode_cache
144109 143451 99% 0.81K 3696 39 118272K sock_inode_cache
13932 13480 96% 7.62K 3486 4 111552K task_struct
444608 433987 97% 0.25K 13895 32 111160K filp
490680 487837 99% 0.20K 12267 40 98136K vm_area_struct
18632 18486 99% 4.00K 2329 8 74528K kmalloc-4k
146208 146080 99% 0.50K 4569 32 73104K kmalloc-512
66348 65889 99% 1.00K 2076 32 66432K kmalloc-1k
25958 25381 97% 2.19K 1856 14 59392K TCP
28866 28792 99% 2.00K 1805 16 57760K kmalloc-2k
1003750 212013 21% 0.05K 13750 73 55000K Acpi-Parse
44940 44906 99% 1.12K 1605 28 51360K signal_cache
700736 699330 99% 0.06K 10949 64 43796K anon_vma_chain
42016 42016 100% 1.00K 1313 32 42016K UNIX
335392 333507 99% 0.12K 10481 32 41924K pid
19605 19504 99% 2.06K 1307 15 41824K sighand_cache
36540 36540 100% 1.06K 1218 30 38976K mm_struct
34209 34209 100% 1.06K 1143 30 36576K UDP
52118 52118 100% 0.69K 1133 46 36256K files_cache
410090 408997 99% 0.09K 8915 46 35660K anon_vma
410536 410532 99% 0.07K 7331 56 29324K Acpi-Operand
148512 148135 99% 0.19K 3536 42 28288K cred_jar
38180 37593 98% 0.69K 830 46 26560K shmem_inode_cache
392448 54238 13% 0.06K 6132 64 24528K vmap_area
79712 79158 99% 0.25K 2491 32 19928K skbuff_head_cache
2424 2424 100% 8.00K 606 4 19392K kmalloc-8k
139980 139980 100% 0.13K 2333 60 18664K kernfs_node_cache
12750 12750 100% 1.25K 510 25 16320K UDPv6
250752 249441 99% 0.06K 3918 64 15672K kmalloc-64
497664 491010 98% 0.03K 3888 128 15552K kmalloc-32
80976 80811 99% 0.19K 1928 42 15424K kmalloc-192
14532 14532 100% 1.00K 455 32 14560K RAW
193144 192752 99% 0.07K 3449 56 13796K eventpoll_pwq
349758 319683 91% 0.04K 3429 102 13716K ext4_extent_status
165087 164781 99% 0.08K 3237 51 12948K task_delay_info
66486 66364 99% 0.19K 1583 42 12664K skbuff_ext_cache
4927 4927 100% 2.31K 379 13 12128K TCPv6
5586 5522 98% 2.00K 350 16 11200K biovec-128
9248 9248 100% 1.00K 289 32 9248K biovec-64
209814 209814 100% 0.04K 2057 102 8228K avtab_extended_perms
126784 122402 96% 0.06K 1981 64 7924K kmalloc-rcl-64
32930 32634 99% 0.21K 890 37 7120K file_lock_cache
14144 14112 99% 0.50K 442 32 7072K skbuff_fclone_cache
178092 178092 100% 0.04K 1746 102 6984K pde_opener
1648 1586 96% 4.00K 206 8 6592K biovec-max
421632 421363 99% 0.02K 1647 256 6588K kmalloc-16
67032 65446 97% 0.09K 1596 42 6384K kmalloc-rcl-96
13536 13178 97% 0.44K 376 36 6016K kmem_cache
17493 17212 98% 0.31K 343 51 5488K mnt_cache
1192 1120 93% 4.00K 149 8 4768K names_cache
4356 4356 100% 0.94K 129 34 4128K PING
41244 40645 98% 0.09K 982 42 3928K kmalloc-96
15520 15296 98% 0.25K 485 32 3880K dquot
482816 482816 100% 0.01K 943 512 3772K kmalloc-8
15072 14952 99% 0.25K 471 32 3768K kmalloc-256
28704 15324 53% 0.12K 897 32 3588K kmalloc-128
55104 55104 100% 0.06K 861 64 3444K ext4_io_end
64090 64090 100% 0.05K 754 85 3016K ftrace_event_field
189440 188001 99% 0.02K 740 256 2960K lsm_file_cache
93952 92800 98% 0.03K 734 128 2936K jbd2_revoke_record_s
11220 11220 100% 0.24K 340 33 2720K tw_sock_TCP
34272 34272 100% 0.08K 672 51 2688K Acpi-State
{code}
An alternative storage layout is one file per container. This is implemented in some existing distributed object stores / file systems, such as SeaweedFS volumes (similar to Facebook's Haystack).
This has the benefit of reducing the number of small files on the datanode: one container file can hold hundreds or thousands of logical files.
Additionally, we could move the container metadata into the container file itself instead of the container DB to ensure O(1) disk seeks per read. Currently, we need to check the container DB first and then read the associated blocks, which might incur more disk seeks than necessary (depending on the read amplification of RocksDB, etc.).
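As a rough illustration of the intended read path, here is a minimal Java sketch; the ContainerFileReader class, the BlockLocation holder, and the in-memory block index are hypothetical and not part of any existing Ozone API. Given an index that maps each block ID to its (offset, length) inside the container file, reading a block becomes a single positioned read:
{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Map;

// Hypothetical sketch, not an existing Ozone class. Assumes the block index
// (blockId -> offset/length inside the container file) has already been
// loaded from an index file or from the container file header.
public class ContainerFileReader implements AutoCloseable {

  /** Location of one block inside the single container file. */
  public static final class BlockLocation {
    final long offset;
    final int length;
    BlockLocation(long offset, int length) {
      this.offset = offset;
      this.length = length;
    }
  }

  private final FileChannel channel;
  private final Map<Long, BlockLocation> blockIndex;

  public ContainerFileReader(Path containerFile,
      Map<Long, BlockLocation> blockIndex) throws IOException {
    this.channel = FileChannel.open(containerFile, StandardOpenOption.READ);
    this.blockIndex = blockIndex;
  }

  /** Reads a whole block with positioned reads (no directory or inode walk). */
  public ByteBuffer readBlock(long blockId) throws IOException {
    BlockLocation loc = blockIndex.get(blockId);
    if (loc == null) {
      throw new IOException("Block " + blockId + " not found in container index");
    }
    ByteBuffer buf = ByteBuffer.allocate(loc.length);
    long pos = loc.offset;
    // A positioned read does not move the channel's shared position; loop in
    // case a read returns fewer bytes than requested.
    while (buf.hasRemaining()) {
      int n = channel.read(buf, pos);
      if (n < 0) {
        throw new IOException("Unexpected EOF while reading block " + blockId);
      }
      pos += n;
    }
    buf.flip();
    return buf;
  }

  @Override
  public void close() throws IOException {
    channel.close();
  }
}
{code}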
However, this also comes with some drawbacks:
* Bookkeeping required
** We need to keep some metadata (e.g. to track which block sits at which offset of the container file), which can be implemented as a separate "index file" or in the header (superblock) of the data file
* Deletion is delayed until a compaction / reclamation task
** Deleting a block only marks the particular block as deleted
** A separate background task will run a compaction (garbage collection) pass that creates a new container file with the deleted blocks removed (see the sketch after this list)
*** This can momentarily increase the datanode space usage since a new file needs to be created
* Write contention on the same file
** If two clients write to the same container file at the same time, a file lock needs to be used to prevent race conditions
** This introduces write contention and will reduce the write throughput.
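To make the compaction and space-usage point concrete, here is a minimal Java sketch (the ContainerCompactor class and LiveBlock holder are hypothetical, not an existing Ozone implementation): it copies only the live blocks into a new container file and then atomically swaps it in, so the temporary second file is what momentarily increases space usage.
{code:java}
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;
import java.util.List;

// Hypothetical sketch of a compaction (garbage collection) pass,
// not an existing Ozone implementation.
public final class ContainerCompactor {

  /** A live (not deleted) block's location in the old container file. */
  public static final class LiveBlock {
    final long blockId;
    final long offset;
    final int length;
    LiveBlock(long blockId, long offset, int length) {
      this.blockId = blockId;
      this.offset = offset;
      this.length = length;
    }
  }

  /**
   * Copies the live blocks into a new container file and atomically replaces
   * the old one. Needs temporary extra space for the new file until the old
   * one is replaced by the move.
   */
  public static void compact(Path containerFile, List<LiveBlock> liveBlocks)
      throws IOException {
    Path tmp = containerFile.resolveSibling(containerFile.getFileName() + ".compacting");
    try (FileChannel src = FileChannel.open(containerFile, StandardOpenOption.READ);
         FileChannel dst = FileChannel.open(tmp,
             StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE)) {
      for (LiveBlock block : liveBlocks) {
        // The block's new offset is dst.position() before the copy; a real
        // implementation would record it in the rewritten block index here.
        long remaining = block.length;
        long position = block.offset;
        // transferTo may copy fewer bytes than requested, so loop until done.
        while (remaining > 0) {
          long copied = src.transferTo(position, remaining, dst);
          position += copied;
          remaining -= copied;
        }
      }
      dst.force(true); // flush the new file before swapping it in
    }
    // Atomic swap; the old file's space is reclaimed only at this point.
    Files.move(tmp, containerFile,
        StandardCopyOption.REPLACE_EXISTING, StandardCopyOption.ATOMIC_MOVE);
  }
}
{code}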
We might also store small files directly in RocksDB (e.g. using BlobDB: [https://github.com/facebook/rocksdb/wiki/BlobDB]).
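For reference, here is a minimal sketch of what that could look like with RocksDB's integrated BlobDB (key-value separation) via the RocksJava bindings; this assumes a RocksJava version that exposes the blob options, and the thresholds below are illustrative only:
{code:java}
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

// Minimal sketch of enabling the integrated BlobDB options, assuming a
// RocksJava version that exposes them. The sizes are illustrative only.
public final class BlobDbExample {
  public static void main(String[] args) throws RocksDBException {
    RocksDB.loadLibrary();
    try (Options options = new Options()
            .setCreateIfMissing(true)
            .setEnableBlobFiles(true)              // store large values in blob files
            .setMinBlobSize(16 * 1024)             // values >= 16 KB go to blob files
            .setEnableBlobGarbageCollection(true); // reclaim space from deleted blobs
         RocksDB db = RocksDB.open(options, "/tmp/blobdb-example")) {
      byte[] key = "block-1".getBytes();
      byte[] smallFileData = new byte[64 * 1024]; // stand-in for a small object's data
      db.put(key, smallFileData);
      byte[] read = db.get(key);
      System.out.println("Read back " + read.length + " bytes");
    }
  }
}
{code}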
This is a long-term wish, intended to kickstart discussion on the feasibility of this storage layout in Ozone.
> One File per Container Storage Layout
> -------------------------------------
>
> Key: HDDS-12659
> URL: https://issues.apache.org/jira/browse/HDDS-12659
> Project: Apache Ozone
> Issue Type: Wish
> Reporter: Ivan Andika
> Assignee: Ivan Andika
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)