[ 
https://issues.apache.org/jira/browse/KUDU-3371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yingchun Lai updated KUDU-3371:
-------------------------------
    Description: 
h1. Motivation

The current LBM container uses separate .data and .metadata files. The .data 
file stores the real user data, and hole punching can be used to reclaim the disk 
space of deleted blocks. The .metadata file stores protobuf-serialized records in 
append-only mode. Each record is a BlockRecordPB:

 
{code:java}
message BlockRecordPB {
  required BlockIdPB block_id = 1;  // int64
  required BlockRecordType op_type = 2;  // CREATE or DELETE
  required uint64 timestamp_us = 3;
  optional int64 offset = 4; // Required for CREATE.
  optional int64 length = 5; // Required for CREATE.
} {code}
Each record is therefore either a CREATE or a DELETE. To mark a block as 
deleted, there will be 2 records in the metadata: one CREATE record and one 
DELETE record.
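
For illustration, a block that is created and later deleted conceptually leaves a pair of records like the following in the .metadata file (rendered here as protobuf text format with invented values; the actual file stores serialized protobufs, not text):
{code:java}
// Conceptual rendering of the two records left behind by block 123.
// Values are made up for the example.
block_id { id: 123 } op_type: CREATE timestamp_us: 1650000000000000 offset: 0 length: 4096
block_id { id: 123 } op_type: DELETE timestamp_us: 1650000600000000
{code}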

The current LBM metadata storage mechanism has several weak points:
h2. 1. Disk space amplification

The ratio of live blocks in a metadata file may be very low. In the worst case, 
only 1 block is alive (supposing the container hasn't reached the runtime 
compaction threshold) while all the other thousands of blocks are dead (i.e. 
stored as CREATE-DELETE pairs). For example, a container with 1 live block out 
of 1000 still stores 1999 records, of which only one is useful.

So the disk space amplification can be very serious.
h2. 2. Long bootstrap time

In the bootstrap stage, a Kudu server has to replay all the metadata files to 
find out which blocks are alive. In the worst case, it may replay thousands of 
block records in the metadata but find that only a very few blocks are alive.

This wastes much time in almost all cases, because a production Kudu cluster 
typically runs for several months between restarts, so the LBM may become very 
loose (i.e. most records are dead).
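
As a minimal sketch (in no way Kudu's actual implementation), the replay work amounts to the following; the types below are a simplified in-memory mirror of BlockRecordPB:
{code:cpp}
// Sketch of bootstrap replay: scan every record of a container's
// .metadata file and keep only the blocks whose CREATE was never
// followed by a DELETE.
#include <cstdint>
#include <unordered_map>
#include <vector>

enum class OpType { kCreate, kDelete };

// Simplified in-memory form of BlockRecordPB.
struct BlockRecord {
  int64_t block_id;
  OpType op_type;
  int64_t offset;  // Only meaningful for kCreate.
  int64_t length;  // Only meaningful for kCreate.
};

struct LiveBlock {
  int64_t offset;
  int64_t length;
};

// Even if only one block is still alive, every dead CREATE/DELETE pair
// must be read and applied before the live set is known.
std::unordered_map<int64_t, LiveBlock> ReplayMetadata(
    const std::vector<BlockRecord>& records) {
  std::unordered_map<int64_t, LiveBlock> live;
  for (const BlockRecord& r : records) {
    if (r.op_type == OpType::kCreate) {
      live[r.block_id] = LiveBlock{r.offset, r.length};
    } else {
      live.erase(r.block_id);
    }
  }
  return live;
}
{code}
Note that the loop must touch every record ever written, so the replay cost is proportional to the total record count, not to the number of live blocks.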
h2. 3. Metadata compaction

To resolve the issues above, there is a metadata compaction mechanism in LBM.
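
As a sketch of what compaction amounts to (again illustrative only, reusing the types and the ReplayMetadata helper from the sketch above): rewrite the metadata with exactly one CREATE record per live block, dropping all dead CREATE-DELETE pairs.
{code:cpp}
// Rewrite the metadata so it contains one CREATE record per live block.
std::vector<BlockRecord> CompactMetadata(
    const std::vector<BlockRecord>& records) {
  std::vector<BlockRecord> compacted;
  for (const auto& [id, b] : ReplayMetadata(records)) {
    compacted.push_back(BlockRecord{id, OpType::kCreate, b.offset, b.length});
  }
  return compacted;
}
{code}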

 


> Use RocksDB to store LBM metadata
> ---------------------------------
>
>                 Key: KUDU-3371
>                 URL: https://issues.apache.org/jira/browse/KUDU-3371
>             Project: Kudu
>          Issue Type: Improvement
>          Components: fs
>            Reporter: Yingchun Lai
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.7#820007)