[ 
https://issues.apache.org/jira/browse/TUBEMQ-124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135518#comment-17135518
 ] 

Jeff Zhou commented on TUBEMQ-124:
----------------------------------

Yes, that's correct about the overhead of the index. The current question is at
WHICH level of the index we COULD place it to best exploit the
negative-prediction efficiency, and then the practical question of whether the
"interested clients" are sparse enough.

Since the index touches such a low layer of the system, ultimately I think it
would be better controlled by something like a "coefficient of sparsity" inside
the SegmentList. Still, making it a low-level system option is a good start,
since there is also a chance to aggregate such options into a per-use-case
option later.
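To make the idea concrete, here is a minimal sketch of gating the index on a measured sparsity coefficient. All names (`SparsityGate`, `sparsityCoefficient`, `useBloomIndex`, the threshold) are hypothetical illustrations, not TubeMQ APIs:

```java
// Hypothetical sketch: enable the negative-prediction index only when
// matching items are rare enough that most lookups can be rejected early.
public class SparsityGate {

    // Fraction of stored entries that match any interested filter item.
    static double sparsityCoefficient(long matchedItems, long totalItems) {
        return totalItems == 0 ? 0.0 : (double) matchedItems / totalItems;
    }

    // Turn the Bloom-filter index on only below a sparsity threshold;
    // above it, item-by-item inspection is no worse and saves index space.
    static boolean useBloomIndex(long matchedItems, long totalItems, double threshold) {
        return sparsityCoefficient(matchedItems, totalItems) < threshold;
    }

    public static void main(String[] args) {
        // e.g. 500 interested messages out of 100,000 stored -> sparse
        System.out.println(useBloomIndex(500, 100_000, 0.05)); // prints "true"
    }
}
```

The threshold itself could then be the single low-level option exposed to operators, as suggested above.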

Thanks for your advice.

> Structured index storage
> ------------------------
>
>                 Key: TUBEMQ-124
>                 URL: https://issues.apache.org/jira/browse/TUBEMQ-124
>             Project: Apache TubeMQ
>          Issue Type: Sub-task
>            Reporter: Guocheng Zhang
>            Assignee: Jeff Zhou
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: s15917036711158.png, screenshot-1.png
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> 1. Structured index storage: optimize the current index storage, for example 
> by adding a structured index that can be searched quickly at read time to 
> locate the data. Adding the index structure may slow down write requests, and 
> it also takes more time to check and restore the index when the system 
> restarts.
> --------------------------------------------------------------------------
> To solve this problem, I plan to implement it like this:
>   !screenshot-1.png! 
> First, add 2 bytes of version information at the end of the segment file. 
> Then divide the data in the index segment file into buckets, and use the 
> Bloom filter algorithm to record the presence of the filter item of each 
> entry in the bucket. After this improvement there are two levels of index in 
> the index segment file: the Bloom filter bitmap is the first level, and the 
> index bucket holding the message index information is the second level. 
> When filtering consumption, the system first checks at the first level 
> whether the filter item may exist in the corresponding data bucket. If it 
> does not exist, the system moves on to the next data bucket, until the index 
> segment file is exhausted and filtering switches to the next index segment 
> file. If the filter item may be in a data bucket, the data in that bucket is 
> read using the current index file retrieval method. 
> Implementation effect estimation: results located with the Bloom filter are 
> not guaranteed to be unique (false positives are possible), but this should 
> still be an improvement over the current item-by-item inspection; in the 
> worst case the filtering cost is the same as today, and it will help greatly 
> when the set of index items is sparse and non-colliding. The cost is that 
> additional index storage space is needed, and index file recovery requires 
> special attention.
> If this design is to be implemented, I think the following points need to be 
> considered:
> 1. Because of the added bitmap index, a checkpoint file needs to be added to 
> the index store, so that when the system restarts we know the starting 
> checkpoint of the index file;
> 2. Because of the change in file structure, before releasing the version with 
> this feature we need to first release a historical version compatible with 
> it, to solve the rollback problem in case an upgrade to this feature version 
> fails. I think this is a one-time operation, and the price is worth it.
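The two-level lookup described in the quoted design can be sketched as follows. This is an illustrative model only, not TubeMQ code: the bucket count, bitmap size, hash count, and all class and method names (`TwoLevelIndexSketch`, `mightContain`, etc.) are assumptions for the example.

```java
import java.util.BitSet;

// Sketch of the proposed two-level index: level 1 is a Bloom-filter
// bitmap per index bucket, level 2 is the bucket's message-index
// entries (not modeled here). A "false" at level 1 means the bucket
// definitely contains no entry for the filter item and can be skipped.
public class TwoLevelIndexSketch {
    static final int BITS_PER_BUCKET = 1024; // illustrative bitmap size
    static final int HASH_COUNT = 3;         // illustrative hash count

    final BitSet[] bloomPerBucket;

    TwoLevelIndexSketch(int buckets) {
        bloomPerBucket = new BitSet[buckets];
        for (int i = 0; i < buckets; i++) {
            bloomPerBucket[i] = new BitSet(BITS_PER_BUCKET);
        }
    }

    // Double hashing: derive the i-th bit position from one filter item.
    static int bitPos(String filterItem, int i) {
        int h1 = filterItem.hashCode();
        int h2 = (h1 >>> 16) | 1; // force odd second hash
        return Math.floorMod(h1 + i * h2, BITS_PER_BUCKET);
    }

    // Called at write time when an entry lands in a bucket.
    void add(int bucket, String filterItem) {
        for (int i = 0; i < HASH_COUNT; i++) {
            bloomPerBucket[bucket].set(bitPos(filterItem, i));
        }
    }

    // Level-1 check: false is a definite negative (skip the bucket);
    // true means the level-2 bucket entries must be read and checked.
    boolean mightContain(int bucket, String filterItem) {
        for (int i = 0; i < HASH_COUNT; i++) {
            if (!bloomPerBucket[bucket].get(bitPos(filterItem, i))) {
                return false;
            }
        }
        return true;
    }
}
```

This also shows why the worst case matches today's behavior: when every level-1 check returns true (a fully saturated bitmap), every bucket is still scanned item by item, exactly as without the index.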



--
This message was sent by Atlassian Jira
(v8.3.4#803005)