Re: PIP-30: Optimize bitmap index format

jialiang tan Thu, 23 Jan 2025 01:11:14 -0800

Hi, Zhonghang.

I am actually only interested in how these two specific options compare in
the benchmark:
1. use only a ByteBuffer to load the header, then test the EQ predicate
2. building on option 1, introduce secondary indexes, then test the EQ predicate


I just want to know how much improvement I can get by introducing secondary
indexes if the IO problem is fixed.

If Option 1 works well in most common situations, you can simply go ahead
and use it. But if it doesn't quite meet your needs, then Option 2 would be
a better choice.
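
As a rough illustration of Option 1, here is a minimal sketch. It assumes a hypothetical header layout (a header size, an entry count, then value/offset pairs of ints); this is not the actual Paimon bitmap index format, and the class and method names are made up for the example. The point is only that, once a header size is stored, the whole header can be fetched with a single readFully and parsed from a ByteBuffer instead of issuing one stream read per field.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.util.LinkedHashMap;
import java.util.Map;

public class HeaderLoadSketch {

    /**
     * Hypothetical layout (an assumption for this sketch only):
     * [int headerSize][int entryCount][(int value, int offset) x entryCount]
     */
    static byte[] writeHeader(Map<Integer, Integer> valueToOffset) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        // Size of everything that follows this field: the count plus the pairs.
        out.writeInt(4 + 8 * valueToOffset.size());
        out.writeInt(valueToOffset.size());
        for (Map.Entry<Integer, Integer> e : valueToOffset.entrySet()) {
            out.writeInt(e.getKey());
            out.writeInt(e.getValue());
        }
        out.flush();
        return bytes.toByteArray();
    }

    /** Reads the size, then pulls the rest of the header in one bulk read. */
    static Map<Integer, Integer> readHeader(InputStream source) throws IOException {
        DataInputStream in = new DataInputStream(source);
        int headerSize = in.readInt();
        byte[] header = new byte[headerSize];
        in.readFully(header);
        // ByteBuffer is big-endian by default, which matches DataOutputStream.
        ByteBuffer buffer = ByteBuffer.wrap(header);
        int count = buffer.getInt();
        Map<Integer, Integer> valueToOffset = new LinkedHashMap<>();
        for (int i = 0; i < count; i++) {
            int value = buffer.getInt();
            int offset = buffer.getInt();
            valueToOffset.put(value, offset);
        }
        return valueToOffset;
    }

    /** EQ predicate lookup: bitmap offset for the value, or -1 if absent. */
    static int offsetFor(Map<Integer, Integer> header, int value) {
        return header.getOrDefault(value, -1);
    }

    public static void main(String[] args) throws IOException {
        Map<Integer, Integer> original = new LinkedHashMap<>();
        original.put(7, 0);
        original.put(42, 128);
        Map<Integer, Integer> loaded =
                readHeader(new ByteArrayInputStream(writeHeader(original)));
        System.out.println("offset for 42: " + offsetFor(loaded, 42));
    }
}
```

With a very high cardinality the header itself becomes large, which is the case Option 2 is meant to address; the sketch does not model that.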

Best,
JiaLiang.

zhonghang <1649067...@qq.com.invalid> wrote on Thu, Jan 23, 2025 at 15:01:

> Hi, JiaLiang.
>
>
> When the size of the serialized structure is unknown and reading it may
> take multiple IO calls, we can simply wrap the inputStream in a
> BufferedInputStream, which can greatly improve IO performance.
> Although this method is not the optimal solution, its advantage is that
> it does not require modifying our serialization structure.
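
A minimal sketch of the wrapping described above, for illustration only. The payload layout (an int count followed by that many ints) and the CountingInputStream helper are assumptions made for this example and are not part of the benchmark; the helper simply counts how many read calls reach the underlying stream, as a stand-in for real IO.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class BufferedReadSketch {

    /** Hypothetical helper: counts read calls that reach the underlying stream. */
    static final class CountingInputStream extends InputStream {
        private final InputStream in;
        int calls = 0;

        CountingInputStream(InputStream in) {
            this.in = in;
        }

        @Override
        public int read() throws IOException {
            calls++;
            return in.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            calls++;
            return in.read(b, off, len);
        }
    }

    /** Hypothetical payload: an int count followed by that many ints. */
    static byte[] sample(int n) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(n);
        for (int i = 0; i < n; i++) {
            out.writeInt(i);
        }
        out.flush();
        return bytes.toByteArray();
    }

    /** Reads the payload field by field and returns the sum of the values. */
    static long readAll(InputStream source) throws IOException {
        DataInputStream in = new DataInputStream(source);
        int n = in.readInt();
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += in.readInt();
        }
        return sum;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = sample(1000);

        // Field-by-field reads go straight to the underlying stream.
        CountingInputStream raw = new CountingInputStream(new ByteArrayInputStream(data));
        readAll(raw);

        // Same reads, but a BufferedInputStream absorbs them in memory.
        CountingInputStream wrapped = new CountingInputStream(new ByteArrayInputStream(data));
        readAll(new BufferedInputStream(wrapped));

        System.out.println("unbuffered calls: " + raw.calls
                + ", buffered calls: " + wrapped.calls);
    }
}
```

The parsing code is identical in both runs; only the stream it is handed changes, which is why this approach needs no change to the serialized format.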
>
>
> I added this method to the bitmap benchmark:
>
>
>
> bitmap32-deserialize-benchmark:                                                                  Best/Avg Time(ms)  Row Rate(K/s)  Per Row(ns)  Relative
> --------------------------------------------------------------------------------------------------------------------------------------------------------
> OPERATORTEST_bitmap32-deserialize-benchmark_deserialize(DataInput)                                       153 / 156            0.7    1531722.1      1.0X
> OPERATORTEST_bitmap32-deserialize-benchmark_deserialize(DataInputStream(BufferedInputStream))                3 / 3           32.7      30615.0     50.0X
> OPERATORTEST_bitmap32-deserialize-benchmark_deserialize(DataInput, byte[])                                   2 / 2           60.8      16434.2     93.2X
> OPERATORTEST_bitmap32-deserialize-benchmark_deserialize(ByteBuffer)                                          1 / 2           86.6      11553.2    132.6X
>
>
>
> BufferedInputStream can already largely solve the IO performance problem
> caused by too many IO calls, but this is not the same problem that the
> secondary index I designed addresses. When the index cardinality is very
> high, even if the IO is optimized to the extreme, it cannot make up for the
> overhead caused by the high cardinality.
>
>
> Best,
> Zhonghang.
>
>
>
>
> zhonghang
> 1649067...@qq.com
>
> ------------------ Original Message ------------------
> From: "dev" <tanjialiang1...@gmail.com>;
> Sent: Wednesday, January 22, 2025, 11:34 AM
> To: "dev" <dev@paimon.apache.org>;
>
> Subject: Re: PIP-30: Optimize bitmap index format
>
>
>
> Hi, Zhonghang.
>
> > Since I guess there will be a buffer in the implementation of
> > SeekableInputStream to reduce the number of IO calls, the length of the
> > header and the length of the bitmap are not recorded in the first version.
> I assumed the same, so the BSI index I contributed has a similar problem
> that I need to spend some time fixing. But that is another topic.
>
>
> Additionally, would you be able to take some time to conduct a benchmark
> comparison between your design and ByteBuffer?
> If it turns out that ByteBuffer can resolve this issue, we might consider
> postponing the introduction of secondary indexes for now.
>
> Best,
> JiaLiang.
>
>
> jialiang tan <tanjialiang1...@gmail.com> wrote on Tue, Jan 21, 2025 at 19:51:
>
> > Thank you very much, Zhonghang, for driving this initiative.
> >
> > I noticed this issue as well, and I think your design looks great!
> >
> > Here are a few suggestions I have:
> >
> > If my memory serves me correctly, the core problem is that using
> > SeekableInputStream to load the header is too slow.
> > Specifically, loading the value and offset requires multiple calls to
> > SeekableInputStream.
> > Therefore, I propose we consider adding a header size in the metadata and
> > use ByteBuffer to retrieve the entire header in one go.
> > This approach could also be applied to your designs.
> >
> > Best,
> > Jialiang.
> >
> > Jingsong Li <jingsongl...@gmail.com> wrote on Mon, Jan 20, 2025 at 10:24:
> >
> >> Thanks zhonghang,
> >>
> >> Overall design looks good to me!
> >>
> >> I have some questions:
> >>
> >> 1. Are there any other bitmap designs that can be referenced, such as
> >> the bitmap structure of starrocks' internal tables?
> >> 2. We should store the size of the roaring bitmap to make
> >> deserialization faster. See
> >> https://github.com/apache/incubator-paimon/pull/4765
> >> 3. Maybe we can have better names for dictionaries, maybe we can have
> >> three parts: Index header + Index blocks + Bitmap blocks
> >>
> >> Best,
> >> Jingsong
> >>
> >> On Fri, Jan 17, 2025 at 5:16 PM Jingsong Li <jingsongl...@gmail.com>
> >> wrote:
> >> >
> >> > Hi zhonghang,
> >> >
> >> > Please grant public access to the document.
> >> >
> >> > Best,
> >> > Jingsong
> >> >
> >> > On Fri, Jan 17, 2025 at 3:46 PM zhonghang <1649067...@qq.com.invalid>
> >> > wrote:
> >> > >
> >> > > Hi, devs:
> >> > >     We found that the first version of the bitmap index had
> >> > > some performance issues in high cardinality scenarios.
> >> > > We have now made some optimizations and hope to discuss
> >> > > them with you. The detailed design document is as follows [1].
> >> > >
> >> > > [1]:
> >> > > https://docs.google.com/document/d/11dJlGlSX3dwYKKrPN0DQ2XQTsx6d9wI6DTBIiiBwomM/edit?usp=sharing
> >> > >
> >> > > Thanks
> >> > > ZhonghangLiu.
> >> > >
> >> > > zhonghang
> >> > > 1649067...@qq.com
> >> > >
