Hi, Zhonghang.

I actually only want to see how these two specific options compare in the benchmark:
1. Use only a ByteBuffer to load the header, and test the EQ predicate.
2. Build on option 1 by introducing secondary indexes, and test the EQ predicate.

I just want to know how much improvement secondary indexes bring once the IO problem is fixed. If Option 1 works well in the most common situations, you can simply go ahead and use it. If it doesn't quite meet your needs, then Option 2 would be the better choice.

Best,
JiaLiang.

zhonghang <1649067...@qq.com.invalid> wrote on Thu, Jan 23, 2025 at 15:01:

> Hi, JiaLiang.
>
> When the size of the serialized structure is unknown and reading it may
> take multiple IO operations, we can wrap the inputStream directly in a
> BufferedInputStream, which greatly improves IO performance.
> Although this method is not the optimal solution, its advantage is that it
> does not require modifying our serialization structure.
>
> I added this method to the bitmap benchmark:
>
> bitmap32-deserialize-benchmark:
>
>                                                                                                Best/Avg Time(ms)  Row Rate(K/s)  Per Row(ns)  Relative
> ------------------------------------------------------------------------------------------------------------------------------------------------------
> OPERATORTEST_bitmap32-deserialize-benchmark_deserialize(DataInput)                             153 / 156          0.7            1531722.1    1.0X
> OPERATORTEST_bitmap32-deserialize-benchmark_deserialize(DataInputStream(BufferedInputStream))  3 / 3              32.7           30615.0      50.0X
> OPERATORTEST_bitmap32-deserialize-benchmark_deserialize(DataInput, byte[])                     2 / 2              60.8           16434.2      93.2X
> OPERATORTEST_bitmap32-deserialize-benchmark_deserialize(ByteBuffer)                            1 / 2              86.6           11553.2      132.6X
>
> BufferedInputStream can already largely solve the IO performance problem
> caused by too many IO operations, but that is not the same problem my
> secondary index design addresses. When the index cardinality is very high,
> even IO optimized to the extreme cannot make up for the overhead caused by
> the high cardinality.
>
> Best,
> Zhonghang.
>
> zhonghang
> 1649067...@qq.com
>
> ------------------ Original Message ------------------
> From: "dev" <tanjialiang1...@gmail.com>;
> Sent: Wednesday, January 22, 2025, 11:34 AM
> To: "dev" <dev@paimon.apache.org>;
> Subject: Re: PIP-30: Optimize bitmap index format
>
> Hi, Zhonghang.
>
> Since I guess there will be a buffer in the implementation of
> SeekableInputStream to reduce the number of IO operations, the length of
> the header and the length of the bitmap are not recorded in the first version.
>
> Same here, so the BSI index I contributed also has a similar problem that
> I need to spend some time fixing. But that is another topic.
>
> Additionally, would you be able to take some time to run a benchmark
> comparison between your design and ByteBuffer?
> If it turns out that ByteBuffer can resolve this issue, we might consider
> postponing the introduction of secondary indexes for now.
>
> Best,
> JiaLiang.
>
> jialiang tan <tanjialiang1...@gmail.com> wrote on Tue, Jan 21, 2025 at 19:51:
>
> > Thank you very much, Zhonghang, for driving this initiative.
> >
> > I noticed this issue as well, and I think your design looks great!
> >
> > Here are a few suggestions I have:
> >
> > If my memory serves me correctly, the core problem is that using
> > SeekableInputStream to load the header is too slow.
> > Specifically, loading the values and offsets requires multiple calls to
> > SeekableInputStream.
> > Therefore, I propose we consider adding a header size to the metadata and
> > using ByteBuffer to retrieve the entire header in one go.
> > This approach could also be applied to your designs.
> >
> > Best,
> > Jialiang.
> >
> > Jingsong Li <jingsongl...@gmail.com> wrote on Mon, Jan 20, 2025 at 10:24:
> >
> >> Thanks zhonghang,
> >>
> >> Overall design looks good to me!
> >>
> >> I have some questions:
> >>
> >> 1. Are there any other bitmap designs that can be referenced, such as
> >> the bitmap structure of StarRocks' internal tables?
> >> 2. We should store the size of the roaring bitmap to make
> >> deserialization faster. See
> >> https://github.com/apache/incubator-paimon/pull/4765
> >> 3. Maybe we can find better names for the dictionaries; for example,
> >> three parts: Index header + Index blocks + Bitmap blocks
> >>
> >> Best,
> >> Jingsong
> >>
> >> On Fri, Jan 17, 2025 at 5:16 PM Jingsong Li <jingsongl...@gmail.com>
> >> wrote:
> >> >
> >> > Hi zhonghang,
> >> >
> >> > Please grant public access to the document.
> >> >
> >> > Best,
> >> > Jingsong
> >> >
> >> > On Fri, Jan 17, 2025 at 3:46 PM zhonghang <1649067...@qq.com.invalid>
> >> > wrote:
> >> > >
> >> > > Hi, devs:
> >> > > We found that the first version of the bitmap index had
> >> > > some performance issues in high cardinality scenarios.
> >> > > We have now made some optimizations and hope to discuss
> >> > > them with you. The detailed design document is as follows [1].
> >> > >
> >> > > [1]:
> >> > > https://docs.google.com/document/d/11dJlGlSX3dwYKKrPN0DQ2XQTsx6d9wI6DTBIiiBwomM/edit?usp=sharing
> >> > >
> >> > > Thanks
> >> > > ZhonghangLiu.
> >> > >
> >> > > zhonghang
> >> > > 1649067...@qq.com
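
For readers following the thread, the three header-loading strategies discussed above (plain stream reads, a BufferedInputStream wrapper, and a single bulk read into a ByteBuffer using a stored header size) can be sketched roughly as follows. This is a minimal illustration and not Paimon code: the class HeaderReadSketch, the header layout (an int entry count followed by int value/offset pairs), and CountingInputStream (which stands in for SeekableInputStream by counting the reads that reach the underlying stream) are all assumptions made for the example.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;

class HeaderReadSketch {

    /** Hypothetical stand-in for real IO: counts read calls that reach the wrapped stream. */
    static final class CountingInputStream extends FilterInputStream {
        long reads = 0;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            reads++;
            return super.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            reads++;
            return super.read(b, off, len);
        }
    }

    /** Assumed header layout: int entryCount, then (int value, int offset) pairs, big-endian. */
    static byte[] buildHeader(int entries) {
        ByteBuffer buf = ByteBuffer.allocate(4 + 8 * entries);
        buf.putInt(entries);
        for (int i = 0; i < entries; i++) {
            buf.putInt(i).putInt(i * 16);
        }
        return buf.array();
    }

    /** Parses field by field from a stream; returns {underlying reads, sum of offsets}. */
    private static long[] parse(InputStream source, CountingInputStream raw) {
        try {
            DataInputStream in = new DataInputStream(source);
            int n = in.readInt();
            long offsetSum = 0;
            for (int i = 0; i < n; i++) {
                in.readInt();              // value
                offsetSum += in.readInt(); // offset
            }
            return new long[] {raw.reads, offsetSum};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Every field read goes to the underlying stream. */
    static long[] readUnbuffered(byte[] header) {
        CountingInputStream raw = new CountingInputStream(new ByteArrayInputStream(header));
        return parse(raw, raw);
    }

    /** Same parsing code, but a BufferedInputStream absorbs the small reads. */
    static long[] readBuffered(byte[] header) {
        CountingInputStream raw = new CountingInputStream(new ByteArrayInputStream(header));
        return parse(new BufferedInputStream(raw), raw);
    }

    /** Assumes the header size is known (e.g. stored in metadata): one bulk read, then in-memory parsing. */
    static long[] readWithByteBuffer(byte[] header, int headerSize) {
        CountingInputStream raw = new CountingInputStream(new ByteArrayInputStream(header));
        byte[] bytes = new byte[headerSize];
        try {
            new DataInputStream(raw).readFully(bytes);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        int n = buf.getInt();
        long offsetSum = 0;
        for (int i = 0; i < n; i++) {
            buf.getInt();              // value
            offsetSum += buf.getInt(); // offset
        }
        return new long[] {raw.reads, offsetSum};
    }

    public static void main(String[] args) {
        byte[] header = buildHeader(1000);
        System.out.println("unbuffered reads: " + readUnbuffered(header)[0]);
        System.out.println("buffered reads:   " + readBuffered(header)[0]);
        System.out.println("ByteBuffer reads: " + readWithByteBuffer(header, header.length)[0]);
    }
}
```

The exact unbuffered count depends on how the JDK's DataInputStream implements readInt, but it grows with the number of header entries, while the buffered and ByteBuffer variants need only a small, fixed number of underlying reads when the header fits in the buffer. None of this reduces the number of entries that have to be parsed, which is the high-cardinality cost the secondary index proposal targets.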