Re: [DISCUSS] PIP-17: Introduce secondary column index

Aitozi Tue, 19 Mar 2024 21:14:46 -0700

Thanks for your input. I have no other questions; +1 for this.
Looking forward to this feature.


Best,
Aitozi.

JUNHAO YE <[email protected]> wrote on Wed, Mar 20, 2024 at 10:19:

> Hi, Aitozi
>
> Thanks a lot for your comments! I have read your questions and reply here:
>
> (1) For now, the secondary index is mainly designed for append-only tables.
> More and more users are migrating from Hive and Hudi to Paimon, and their
> main table format is append-only. In the future, once deletion files are
> done, I think the secondary index will also be useful for primary key
> tables with deletion files.
> See PIP-16 (
> https://cwiki.apache.org/confluence/display/PAIMON/PIP-16%3A+Paimon+position+delete+mode
> )
> But that is not the job of this phase. I should add this to the PIP.
>
> (2) The answer is yes. I referred to the approaches of Hudi and Delta Lake.
> Hudi puts the index bytes in the user metadata space of the ORC and Parquet
> files, while Delta Lake uses an extra file to support the index. I want it
> to be more flexible, so I chose a separate file.
> Indeed, it will double the number of files, but the index files themselves
> will not be touched often. Later, we can consider combining these index
> files to reduce the pressure on the filesystem, but I think we can
> implement it this way for now.
>
> (3) Correct. If you want to drop one column index (this does not happen
> often), we just rewrite the index file: discard the corresponding bytes,
> write the result back to a file, and finally rewrite the DataFileMeta in
> the ManifestEntry.
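
To make the drop step in (3) concrete, here is a minimal sketch in Python. It assumes a purely hypothetical index file layout (an entry count followed by length-prefixed column names and index bytes); this is not Paimon's actual format, and the function names are invented for illustration. The point it shows is that dropping one column's index only touches the index bytes, never the data file.

```python
import struct


def pack_index(entries: dict) -> bytes:
    """Serialize {column: index_bytes} using a hypothetical layout:
    entry count, then (name length, name, body length, body) per column."""
    parts = [struct.pack(">I", len(entries))]
    for name, body in entries.items():
        encoded = name.encode("utf-8")
        parts.append(struct.pack(">I", len(encoded)) + encoded)
        parts.append(struct.pack(">I", len(body)) + body)
    return b"".join(parts)


def unpack_index(blob: bytes) -> dict:
    """Inverse of pack_index: parse the blob back into {column: index_bytes}."""
    (count,) = struct.unpack_from(">I", blob, 0)
    pos = 4
    entries = {}
    for _ in range(count):
        (name_len,) = struct.unpack_from(">I", blob, pos)
        pos += 4
        name = blob[pos:pos + name_len].decode("utf-8")
        pos += name_len
        (body_len,) = struct.unpack_from(">I", blob, pos)
        pos += 4
        entries[name] = blob[pos:pos + body_len]
        pos += body_len
    return entries


def drop_column_index(blob: bytes, column: str) -> bytes:
    """Rewrite the index blob without the bytes of the given column.
    Only the index file content is read; the data file is not involved."""
    entries = unpack_index(blob)
    entries.pop(column, None)
    return pack_index(entries)
```

In a real implementation the new blob would be written to a new index file and the file metadata (DataFileMeta in the ManifestEntry, as described above) would be updated to point at it; that part is omitted here.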
>
> Thanks again for your comments!
>
> Best,
> Junhao
>
>
>
>
> > On Mar 19, 2024, at 11:07 PM, Aitozi <[email protected]> wrote:
> >
> > Hi, Junhao
> >
> > It's nice to see the secondary index feature in Paimon. After reading the
> > PIP, I have several questions.
> >
> > (1) For primary key tables, we only push down filters on the primary
> > key, because we cannot filter on a value if it still has to be merged
> > with data from other levels. So will primary key tables benefit from the
> > secondary column index? Or is the main improvement for append tables?
> >
> > (2) On the storage of the index file, "one file for one datafile of one
> > index type": will this create too many extra files? Will one index type
> > double the number of files?
> >
> > (3) "While drop column index, for example, I have indexed column a and b,
> > I don't want to index a anymore. I just need to drop the target index
> > bytes from index file, and don't have to read the data file again."
> >
> > Do you mean we will have to rewrite the index file when dropping one
> > column index in it?
> >
> > Best,
> > Aitozi
> >
> > JUNHAO YE <[email protected]> wrote on Tue, Mar 19, 2024 at 19:26:
> >
> >> Hi, Zhang YiLong
> >>
> >> You are right; as I mentioned in PIP-17, we should have priorities among
> >> different index types. We should also consider combining the results of
> >> different index types.
> >>
> >> Best, junhao.
> >>
> >>
> >>> On Mar 18, 2024, at 10:49 AM, Zhang YiLong <[email protected]> wrote:
> >>>
> >>> This is a big improvement, but I don't think it suits low-cardinality
> >>> fields, because the index is at the file level, and a low-cardinality
> >>> field (e.g. gender with only male and female) is, in most cases (when
> >>> the field is not sorted), present in all files.
> >>>
> >>> For specific business needs, we want a JSON index, bitmap index,
> >>> inverted index, etc. to adapt to different query conditions. So we also
> >>> need a priority, using different indexes for different query filters and
> >>> finally combining the results (based on the actual filter criteria, AND/OR).
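
One way to picture "combining the results" for AND/OR filters is sketched below in Python. It is an assumption-laden illustration, not the PIP's design: each index is modeled as a hypothetical callable that returns the set of files that may contain a value, and the `Leaf`, `And`, and `Or` classes are invented here. AND intersects the candidate file sets, OR unions them, and a column without an index cannot exclude any file. Index priority is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Leaf:
    """Equality predicate on a single column (hypothetical model)."""
    column: str
    value: object


@dataclass(frozen=True)
class And:
    children: tuple


@dataclass(frozen=True)
class Or:
    children: tuple


def candidate_files(pred, indexes, all_files):
    """Return the files that may contain rows matching pred.

    indexes maps a column name to a callable value -> set of files that
    may contain that value. A missing index keeps every file.
    """
    if isinstance(pred, Leaf):
        lookup = indexes.get(pred.column)
        if lookup is None:
            return set(all_files)  # no index on this column: skip nothing
        return set(lookup(pred.value)) & set(all_files)
    results = [candidate_files(child, indexes, all_files) for child in pred.children]
    if isinstance(pred, And):
        # Every child must match, so a file must survive every index.
        return set.intersection(*results) if results else set(all_files)
    # OR: a file is needed if any child may match in it.
    return set.union(*results) if results else set()
```

For example, with a `city` index that maps "Hangzhou" to files f1 and f2 and a `device` index that maps "phone" to f2 and f3, the AND of both predicates leaves only f2 to read, while the OR keeps all three files.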
> >>>
> >>> ________________________________
> >>> From: yu zelin <[email protected]>
> >>> Sent: Mar 15, 2024 14:43
> >>> To: [email protected] <[email protected]>
> >>> Subject: Re: [DISCUSS] PIP-17: Introduce secondary column index
> >>>
> >>> An exciting feature, +1.
> >>>
> >>> Best Regards,
> >>> Zelin Yu
> >>>
> >>> On Thu, Mar 14, 2024 at 5:53 PM yejunhao <[email protected]> wrote:
> >>>
> >>>> Hi, Paimon Devs, I’d like to start a discussion about PIP-17[1].
> >>>>
> >>>> Up to now, Paimon has used zorder, order, and hilbert sort compaction
> >>>> to speed up queries. After sort compaction, files are sorted by the
> >>>> order of the specified columns. But in some situations, for example,
> >>>> we have tens of columns that may appear in filters; sometimes all of
> >>>> them come up together, sometimes just a few of them. Zorder or order
> >>>> compaction can't handle this situation, because too many columns
> >>>> reduce the effect of sorting. So if the cardinality of these columns
> >>>> is small, we can use a bloom filter or other indexes to speed up
> >>>> queries. That's why this PIP comes up. I want to introduce an index
> >>>> framework to give Paimon a flexible index system.
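
As a rough illustration of how a per-file bloom filter can let a query skip files, here is a small self-contained Python sketch. The filter size, the SHA-256 based hashing, and the helper names are all assumptions made for this example and are not taken from PIP-17. A bloom filter never gives false negatives, so a file whose filter says "not present" can be skipped safely; it may give false positives, so some files are still read unnecessarily.

```python
import hashlib


class BloomFilter:
    """Tiny bloom filter; sizes and hashing scheme are illustrative only."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value):
        data = repr(value).encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(value))


def build_file_filters(files):
    """files: {file name: iterable of values of the indexed column}."""
    filters = {}
    for name, values in files.items():
        bloom = BloomFilter()
        for value in values:
            bloom.add(value)
        filters[name] = bloom
    return filters


def files_to_read(filters, value):
    """Keep only the files whose filter cannot rule the value out."""
    return [name for name, bloom in filters.items() if bloom.might_contain(value)]
```

Usage: build the filters once per data file at write time, then call `files_to_read(filters, "u1")` at query time to get the files an equality filter on that column still has to scan.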
> >>>>
> >>>> I look forward to your questions and suggestions.
> >>>>
> >>>> Best, junhao
> >>>>
> >>>> [1]
> >>>> https://cwiki.apache.org/confluence/display/PAIMON/PIP-17%3A+Introduce+secondary+column+index
> >>
> >>
>
>
