[ https://issues.apache.org/jira/browse/TUBEMQ-124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135518#comment-17135518 ]
Jeff Zhou commented on TUBEMQ-124:
----------------------------------

Yes, that's correct about the overhead of the index. The current phase is about WHICH level of index we COULD use to best exploit the negative-prediction efficiency, and then about whether the "interest clients" in the use case are sparse enough. Since the index affects such a low layer of the system, I think it is ultimately better controlled by something like a "Coefficient of Sparse" inside SegmentList, though making it a low-level system option is always a good start, since there is also a chance to aggregate such options into a use-case option. Thanks for your advice.

> Structured index storage
> ------------------------
>
> Key: TUBEMQ-124
> URL: https://issues.apache.org/jira/browse/TUBEMQ-124
> Project: Apache TubeMQ
> Issue Type: Sub-task
> Reporter: Guocheng Zhang
> Assignee: Jeff Zhou
> Priority: Major
> Labels: pull-request-available
> Attachments: s15917036711158.png, screenshot-1.png
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> 1. Structured index storage: optimize the current index storage, for
> example by adding a structured index that can be used to locate data
> quickly at read time. The added index structure may make write requests
> slower, and it will also take more time to check and restore the index
> when the system restarts.
> --------------------------------------------------------------------------
> To solve this problem, I plan to implement it like this:
> !screenshot-1.png!
> First, add 2 bytes of version information at the end of the segment file.
> Then divide the data in the index segment file into buckets, and use the
> Bloom filter algorithm to record the positions of the filter items for the
> data in each bucket. After this improvement, the index segment file has a
> two-level index: the Bloom filter bitmap is the first level, and the index
> bucket holding the message index information is the second level.
> During filtered consumption, the system first checks the first level to
> see whether the filter item may exist in the corresponding data bucket. If
> it does not, the system moves on to the next data bucket, and so on until
> the index segment file is exhausted and filtering switches to the next
> index segment file. If the filter item may be in a data bucket, the data in
> that bucket is read using the current index file retrieval method.
> Implementation effect estimation: a Bloom filter lookup is not guaranteed
> to be exact (the filter can report false positives), but it should improve
> on the current item-by-item inspection. Even in the worst case the
> filtering effect is the same as today, and it helps a great deal when the
> set of index items is sparse and non-colliding. The cost is additional
> index storage space, and index file recovery requires special attention.
> If the design is to be implemented, I think the following points need to
> be considered:
> 1. Because a bitmap index is added, a checkpoint file needs to be added to
> the index store, so that when the system restarts we know the starting
> checkpoint of the index file;
> 2. Because the file structure changes, before releasing the version that
> contains this feature we first need to release a prior version that is
> compatible with it, so that the system can be rolled back if the upgrade
> to the feature version goes wrong. I think this is a one-time operation,
> and the price is worth it.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
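
For illustration only, the two-level lookup in the quoted design could be sketched roughly as follows. This is a minimal in-memory Python sketch, not TubeMQ code: the names `BloomFilter`, `IndexSegment`, `bucket_size`, `num_bits`, and `num_hashes` are all hypothetical, the on-disk layout, version bytes, and checkpointing are left out, and SHA-256 is used only as a convenient stand-in hash.

```python
import hashlib


class BloomFilter:
    """First level: a small bitmap summarising the filter items in one bucket."""

    def __init__(self, num_bits=256, num_hashes=3):
        self.num_bits = num_bits      # assumed bitmap size, not from the issue
        self.num_hashes = num_hashes  # assumed number of hash functions
        self.bits = 0                 # a Python int used as the bitmap

    def _positions(self, item):
        # Derive num_hashes bit positions by salting the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # False means "definitely absent"; True may be a false positive.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


class IndexSegment:
    """Second level: buckets of (filter_item, offset) index entries."""

    def __init__(self, bucket_size=4):
        self.bucket_size = bucket_size
        self.buckets = []  # list of (BloomFilter, entries) pairs

    def append(self, filter_item, offset):
        # Start a new bucket when there is none yet or the last one is full.
        if not self.buckets or len(self.buckets[-1][1]) >= self.bucket_size:
            self.buckets.append((BloomFilter(), []))
        bloom, entries = self.buckets[-1]
        bloom.add(filter_item)
        entries.append((filter_item, offset))

    def find(self, filter_item):
        """Return (matching offsets, number of buckets actually scanned)."""
        offsets, scanned = [], 0
        for bloom, entries in self.buckets:
            if not bloom.might_contain(filter_item):
                continue  # negative prediction: skip the whole bucket
            scanned += 1
            # Entries are still checked one by one, so false positives
            # only cost extra reads and never produce wrong results.
            offsets.extend(off for item, off in entries if item == filter_item)
        return offsets, scanned


if __name__ == "__main__":
    segment = IndexSegment(bucket_size=4)
    for offset in range(16):
        item = "rare" if offset in (5, 13) else f"common-{offset}"
        segment.append(item, offset)
    offsets, scanned = segment.find("rare")
    print(offsets, "found after scanning", scanned, "of", len(segment.buckets), "buckets")
```

In this sketch a bucket is read only when its bitmap reports a possible hit, so a sparse filter item skips most buckets, while a false positive degrades to the existing item-by-item scan of that bucket. That matches the worst-case behaviour estimated in the description.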