[jira] [Updated] (TUBEMQ-124) Structured index storage

Guocheng Zhang (Jira) Sat, 16 May 2020 20:09:24 -0700


     [ 
https://issues.apache.org/jira/browse/TUBEMQ-124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Guocheng Zhang updated TUBEMQ-124:
----------------------------------
    Description: 
1. Structured index storage: optimize the current index storage, for example 
by adding a structured index so that data can be located quickly through the 
index at read time. The added index structure may make write requests slower, 
and checking and restoring the index will take more time when the system 
restarts.

--------------------------------------------------------------------------

To solve this problem, I plan to implement it like this:
 !image-2020-05-17-10-42-15-736.png! 

Divide the data in each index segment file into buckets, and use a Bloom 
filter to record which filter items occur among the data in each bucket; 
finally, append 2 bytes of version information at the end of the segment 
file. After this change, the index segment file holds a two-level index: the 
Bloom filter bitmap is the first level, and the index storage area with the 
message index information is the second level.
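
A minimal sketch of the first-level bitmap for one bucket, to make the idea 
concrete. The bitmap size, the number of hash functions, the SHA-256 based 
hashing, and all function names are assumptions for illustration only; none 
of them come from the TubeMQ code or from the attached design image.

```python
import hashlib

BITMAP_BITS = 1024  # hypothetical bitmap size per bucket, in bits
NUM_HASHES = 3      # hypothetical number of hash functions


def bit_positions(filter_item):
    """Derive NUM_HASHES bit positions for one filter item."""
    digest = hashlib.sha256(filter_item.encode("utf-8")).digest()
    return [
        int.from_bytes(digest[4 * i:4 * i + 4], "big") % BITMAP_BITS
        for i in range(NUM_HASHES)
    ]


def build_bitmap(filter_items):
    """Build the Bloom filter bitmap for the filter items of one bucket."""
    bitmap = bytearray(BITMAP_BITS // 8)
    for item in filter_items:
        for pos in bit_positions(item):
            bitmap[pos // 8] |= 1 << (pos % 8)
    return bitmap


def might_contain(bitmap, filter_item):
    """False means definitely absent; True means possibly present."""
    return all(
        bitmap[pos // 8] & (1 << (pos % 8))
        for pos in bit_positions(filter_item)
    )
```

A bucket built with build_bitmap(["a", "b"]) always answers True for "a" and 
"b"; for any other item it usually answers False, but may answer True when 
the bit positions collide.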

During filtered consumption, the system first checks at the first level 
whether the filter item may exist in a data bucket. If it does not, the 
system moves on to the next data bucket, and once the index segment file is 
exhausted it switches to the next index segment file. If the filter item may 
be in a data bucket, the data in that bucket is read using the current index 
file retrieval method.
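
The lookup flow can be sketched as follows. This is an illustration under 
stated assumptions, not the planned implementation: buckets are modelled as 
in-memory dicts, the bitmap is a deliberately small single-hash bitmap based 
on CRC32, and every name here is hypothetical.

```python
import zlib

BITMAP_BITS = 64  # hypothetical and deliberately small


def bit_for(filter_item):
    """Map a filter item to one bit of the bucket bitmap."""
    return zlib.crc32(filter_item.encode("utf-8")) % BITMAP_BITS


def make_bucket(entries):
    """entries: list of (filter_item, message_offset) index records."""
    bitmap = 0
    for item, _ in entries:
        bitmap |= 1 << bit_for(item)
    return {"bitmap": bitmap, "entries": entries}


def find_offsets(segments, filter_item):
    """Scan segment by segment, bucket by bucket.

    A bucket's entries are read only when its bitmap reports a possible
    match; the entries are then checked item by item, as is done today.
    """
    mask = 1 << bit_for(filter_item)
    offsets = []
    buckets_read = 0
    for segment in segments:          # one list of buckets per segment file
        for bucket in segment:
            if not bucket["bitmap"] & mask:
                continue              # definitely absent, skip the bucket
            buckets_read += 1
            offsets.extend(
                off for item, off in bucket["entries"] if item == filter_item
            )
    return offsets, buckets_read
```

The second return value counts how many buckets actually had to be read, 
which is the quantity the first-level bitmap is meant to reduce.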

Estimated effect: a Bloom filter can report false positives, so a match in 
the bitmap does not guarantee that the filter item is really in the bucket, 
but the result should still improve on the current item-by-item inspection. 
In the worst case the filtering effect is the same as today, and the filter 
helps a great deal when the index items are sparse and rarely collide. The 
cost is additional index storage space, and index file recovery requires 
special attention.
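
To put a rough number on the false positives, the standard Bloom filter 
approximation p = (1 - e^(-k*n/m))^k can be used, where m is the bitmap size 
in bits, k the number of hash functions, and n the number of distinct filter 
items in a bucket. The parameter values in the usage note are hypothetical 
and are not taken from the design.

```python
import math


def false_positive_rate(bits, hashes, items):
    """Approximate Bloom filter false-positive probability.

    bits   -- bitmap size m, in bits
    hashes -- number of hash functions k
    items  -- number of distinct filter items n stored in the bucket
    """
    return (1.0 - math.exp(-hashes * items / bits)) ** hashes
```

For example, false_positive_rate(1024, 3, 100) is roughly 0.016, so with 
these assumed values about 1.6% of the buckets that do not contain the 
filter item would still be read. The rate grows as more distinct items are 
packed into one bucket.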

If the design is to be implemented, I think the following points need to be 
considered:
1. Because a bitmap index is added, a checkpoint file needs to be added to 
the index store, so that when the system restarts we know the starting 
checkpoint of the index file;
2. Because the file structure changes, before releasing the version with 
this feature we first need to release a version of the existing code that is 
compatible with the new file structure, so that the system can be rolled 
back if the upgrade to the feature version fails. I think this is a one-time 
operation, and the price is worth it.
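
One way the 2-byte version trailer could support the compatibility check in 
point 2 is sketched below. The big-endian unsigned encoding, the set of 
supported versions, and the function names are all assumptions; the 
description only states that 2 bytes of version information are added at 
the end of the segment file.

```python
import struct

TRAILER = struct.Struct(">H")   # assumption: 2-byte big-endian unsigned int
SUPPORTED_VERSIONS = {1, 2}     # hypothetical versions this build can read


def append_version(segment_body, version):
    """Return the segment bytes with the version trailer appended."""
    return segment_body + TRAILER.pack(version)


def split_version(segment):
    """Split a segment into (body, version) using the last 2 bytes."""
    if len(segment) < TRAILER.size:
        raise ValueError("segment too short to hold a version trailer")
    (version,) = TRAILER.unpack(segment[-TRAILER.size:])
    return segment[:-TRAILER.size], version


def open_segment(segment):
    """Return the body, refusing versions this build does not understand."""
    body, version = split_version(segment)
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported index segment version {version}")
    return body
```

A compatibility release would simply carry the new version number in its 
supported set before any broker starts writing files in the new structure.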

  was:
1. Structured index storage: optimize the current index storage, for example, 
increase the structured index storage, which can be quickly retrieved through 
the index when in use to quickly locate the data; the increase in the index 
structure may make the write request slower, At the same time, it takes more 
time to check and restore the index when the system restarts

--------------------------------------------------------------------------

To solve this problem, I plan to implement it like this:
 !image-2020-05-17-10-42-15-736.png! 

Divide the datas to bucket in index segment file, and use the Bloom filter 
algorithm to save the position of the filter item for each data in the bucket, 
and finally add 2 bytes of version information at the end of the segment file. 
After this improvement, there are level 2 indexes in the index segment file, 
the Bloom filter bitmap is the first level, and the index bucket with message 
index information is the second level. 

When filtering consumption, the system first searches whether the filter item 
exists in the corresponding data bucket from the first level. If it does not 
exist, it continues to search for the existence of the next data bucket until 
the index segment file is completed and the filter is switched to the next 
index segment file; if the filter item is in a data bucket, the data in the 
corresponding data bucket will be read according to the current index file 
retrieval method. 

Implementation effect estimation: The results of using the Bloom filter 
algorithm to locate the results are not guaranteed to be unique, but they 
should be improved compared to the current item-by-item inspection, at least in 
the worst case, the filtering effect is consistent; and it will be a very good 
help if the sparse and non-colliding index item collection. The impact is that 
we need additional index storage space, and index file recovery requires 
special attention.

If the design needs to be implemented, I think the following points need to be 
considered:
1. Due to the addition of a bitmap index,  the checkpoint file needs to be 
added to the index store, so, when the system is restarted we can know the 
starting checkpoint of the index file;
2. Due to the change in file structure, before releasing the version of this 
feature, we need to first release a historical version compatible with this 
feature to solve the system rollback problem after this feature version is 
upgraded abnormally. I think that this is a one-time operation, the price is 
worth it.


> Structured index storage
> ------------------------
>
>                 Key: TUBEMQ-124
>                 URL: https://issues.apache.org/jira/browse/TUBEMQ-124
>             Project: Apache TubeMQ
>          Issue Type: Sub-task
>            Reporter: Guocheng Zhang
>            Priority: Major
>         Attachments: image-2020-05-17-10-42-15-736.png
>
>
> 1. Structured index storage: optimize the current index storage, for example 
> by adding a structured index so that data can be located quickly through the 
> index at read time. The added index structure may make write requests 
> slower, and checking and restoring the index will take more time when the 
> system restarts.
> --------------------------------------------------------------------------
> To solve this problem, I plan to implement it like this:
>  !image-2020-05-17-10-42-15-736.png! 
> Divide the data in each index segment file into buckets, and use a Bloom 
> filter to record which filter items occur among the data in each bucket; 
> finally, append 2 bytes of version information at the end of the segment 
> file. After this change, the index segment file holds a two-level index: 
> the Bloom filter bitmap is the first level, and the index storage area with 
> the message index information is the second level.
> During filtered consumption, the system first checks at the first level 
> whether the filter item may exist in a data bucket. If it does not, the 
> system moves on to the next data bucket, and once the index segment file is 
> exhausted it switches to the next index segment file. If the filter item 
> may be in a data bucket, the data in that bucket is read using the current 
> index file retrieval method.
> Estimated effect: a Bloom filter can report false positives, so a match in 
> the bitmap does not guarantee that the filter item is really in the bucket, 
> but the result should still improve on the current item-by-item inspection. 
> In the worst case the filtering effect is the same as today, and the filter 
> helps a great deal when the index items are sparse and rarely collide. The 
> cost is additional index storage space, and index file recovery requires 
> special attention.
> If the design is to be implemented, I think the following points need to be 
> considered:
> 1. Because a bitmap index is added, a checkpoint file needs to be added to 
> the index store, so that when the system restarts we know the starting 
> checkpoint of the index file;
> 2. Because the file structure changes, before releasing the version with 
> this feature we first need to release a version of the existing code that 
> is compatible with the new file structure, so that the system can be rolled 
> back if the upgrade to the feature version fails. I think this is a 
> one-time operation, and the price is worth it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
