[
https://issues.apache.org/jira/browse/KUDU-3371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yingchun Lai updated KUDU-3371:
-------------------------------
Description:
h1. Motivation
The current LBM container uses separate .data and .metadata files. The .data
file stores the real user data, and hole punching can be used to reclaim disk
space for deleted blocks. The metadata is written to the .metadata file as
serialized protobuf records, in append-only mode. Each record is a
BlockRecordPB:
{code:java}
message BlockRecordPB {
  required BlockIdPB block_id = 1;       // int64
  required BlockRecordType op_type = 2;  // CREATE or DELETE
  required uint64 timestamp_us = 3;
  optional int64 offset = 4;             // Required for CREATE.
  optional int64 length = 5;             // Required for CREATE.
}
{code}
That means each record is either of CREATE or DELETE type. To mark a block as
deleted, there will be two records in the metadata file: one of CREATE type and
the other of DELETE type.
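For illustration, the append-only model can be sketched as follows; this is a
simplified stand-in for the real protobuf records and container files, not the
actual Kudu code.
{code:cpp}
// Minimal sketch of the append-only record model, using a simplified
// in-memory struct in place of the protobuf-generated BlockRecordPB class.
#include <cstdint>
#include <vector>

enum class Op { kCreate, kDelete };

struct BlockRecord {
  int64_t block_id;
  Op op_type;
  uint64_t timestamp_us;
  int64_t offset;  // meaningful for kCreate only
  int64_t length;  // meaningful for kCreate only
};

// Creating a block appends a CREATE record; deleting it later appends a
// DELETE record. The earlier CREATE record is never rewritten or reclaimed
// in place, so a dead block keeps occupying two records until the metadata
// file is compacted.
void AppendCreate(std::vector<BlockRecord>* metadata, int64_t block_id,
                  uint64_t now_us, int64_t offset, int64_t length) {
  metadata->push_back({block_id, Op::kCreate, now_us, offset, length});
}

void AppendDelete(std::vector<BlockRecord>* metadata, int64_t block_id,
                  uint64_t now_us) {
  metadata->push_back({block_id, Op::kDelete, now_us, 0, 0});
}
{code}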
There are some weak points in the current LBM metadata storage mechanism:
h2. 1. Disk space amplification
The ratio of live blocks in a metadata file may be very low. In the worst case
only 1 block is alive (supposing the container hasn't reached the runtime
compaction threshold), while all the other thousands of blocks are dead (i.e.
stored as CREATE-DELETE pairs).
So the disk space amplification can be very serious.
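As a back-of-the-envelope illustration (the serialized record size and block
counts below are assumed figures, not measurements from Kudu):
{code:cpp}
// Rough arithmetic with assumed numbers: record size and block counts are
// hypothetical, chosen only to illustrate the amplification.
#include <cstdio>

int main() {
  const double kRecordBytes = 40.0;      // assumed average serialized BlockRecordPB size
  const long long kDeadBlocks = 100000;  // assumed dead blocks, each a CREATE + DELETE pair
  const long long kLiveBlocks = 1;       // the worst case described above

  const double dead_bytes = kDeadBlocks * 2 * kRecordBytes;
  const double live_bytes = kLiveBlocks * kRecordBytes;
  std::printf("metadata for dead blocks: %.1f MiB\n", dead_bytes / (1024.0 * 1024.0));
  std::printf("metadata for live blocks: %.0f bytes\n", live_bytes);
  return 0;
}
{code}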
h2. 2. Long bootstrap time
In the Kudu server bootstrap stage, all the metadata files have to be replayed
to find out which blocks are alive. In the worst case, we may replay thousands
of block records in the metadata but find that only a very few blocks are
alive.
This may waste much time in almost all cases, since a Kudu cluster in a
production environment typically runs for several months without a restart
(and thus without bootstrapping), so the LBM containers may have accumulated a
large proportion of dead blocks.
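Conceptually, the bootstrap replay looks like the following sketch (a
simplification, not the actual Kudu implementation): every record has to be
visited, even though records for dead blocks cancel out and contribute nothing
to the result.
{code:cpp}
// Conceptual sketch of metadata replay: rebuild the set of live blocks from
// the append-only record stream.
#include <cstdint>
#include <unordered_map>
#include <vector>

enum class Op { kCreate, kDelete };

struct BlockRecord {
  int64_t block_id;
  Op op_type;
  int64_t offset;  // meaningful for kCreate only
  int64_t length;  // meaningful for kCreate only
};

struct LiveBlock {
  int64_t offset;
  int64_t length;
};

// Every record is read and applied in order. When the live ratio is low,
// almost all of this work just inserts and then erases the same block id,
// which is the bootstrap cost described above.
std::unordered_map<int64_t, LiveBlock> ReplayContainer(
    const std::vector<BlockRecord>& records) {
  std::unordered_map<int64_t, LiveBlock> live;
  for (const BlockRecord& r : records) {
    if (r.op_type == Op::kCreate) {
      live[r.block_id] = LiveBlock{r.offset, r.length};
    } else {
      live.erase(r.block_id);
    }
  }
  return live;
}
{code}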
h2. 3. Metadata compaction
To resolve the issues above, there is a metadata compaction mechanism in LBM,
> Use RocksDB to store LBM metadata
> ---------------------------------
>
> Key: KUDU-3371
> URL: https://issues.apache.org/jira/browse/KUDU-3371
> Project: Kudu
> Issue Type: Improvement
> Components: fs
> Reporter: Yingchun Lai
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.7#820007)