Re: [DISCUSS] PIP-17: Introduce secondary column index

Yong Fang Thu, 09 May 2024 19:21:09 -0700

Hi yejunhao,

What is the status of this feature? Are you still working on it? Thanks.


Best,
Fang Yong

On Wed, Mar 20, 2024 at 12:14 PM Aitozi <[email protected]> wrote:

> Thanks for your inputs, I have no other questions, +1 for this.
> Looking forward to this feature.
>
> Best,
> Aitozi.
>
> JUNHAO YE <[email protected]> wrote on Wed, Mar 20, 2024 at 10:19:
>
> > Hi, Aitozi
> >
> > Thanks a lot for the comments! I have read your questions and reply here:
> >
> > (1) For now, the secondary index is mainly designed for append-only
> > tables. More and more users migrate from Hive and Hudi to Paimon, and
> > their main table format is append-only. In the future, once deletion
> > files are done, I think the secondary index will also be useful for
> > primary key tables with deletion files. See PIP-16 (
> > https://cwiki.apache.org/confluence/display/PAIMON/PIP-16%3A+Paimon+position+delete+mode
> > ).
> > But that is not the job of this phase. I should add this to the PIP.
> >
> > (2) The answer is yes. I referred to the approaches of Hudi and Delta
> > Lake. Hudi puts the index bytes in the user metadata space of the ORC
> > and Parquet files, while Delta Lake uses an extra file to support the
> > index. As a result, I want it to be more flexible.
> > Indeed, it will double the number of files, but the index file itself
> > will not be touched often. Later on, we can consider combining these
> > index files to reduce the pressure on the filesystem, but I think we can
> > implement it this way for now.
> >
> > (3) Correct. If you want to drop one column's index (this does not
> > happen often), we just rewrite the index file: discard the corresponding
> > bytes, write the result back to a file, and finally rewrite the
> > DataFileMeta in the ManifestEntry.
> >
> > Thanks again for the comments!
> >
> > Best,
> > Junhao
> >
> >
> >
> >
> > > On Mar 19, 2024, at 11:07 PM, Aitozi <[email protected]> wrote:
> > >
> > > Hi, junhao
> > >
> > > It's nice to see the secondary index feature in Paimon. After reading
> > > the PIP, I have several questions.
> > >
> > > (1) For the primary key table, we only push down filters on the
> > > primary key, because we cannot filter on a value if it still has to be
> > > merged with data from other levels. So will the primary key table
> > > benefit from the secondary column index? Or is the main improvement
> > > for the append table?
> > >
> > > (2) On the storage of the index file, "one file for one datafile of
> > > one index type": will this bring too many extra files? Will one index
> > > type double the number of files?
> > >
> > > (3) "While drop column index, for example, I have indexed column a and
> > > b, I don't want to index a anymore. I just need to drop the target
> > > index bytes from index file, and don't have to read the data file
> > > again."
> > >
> > > Do you mean we will have to rewrite the index file when dropping one
> > > column's index from it?
> > >
> > > Best,
> > > Aitozi
> > >
> > > JUNHAO YE <[email protected]> wrote on Tue, Mar 19, 2024 at 19:26:
> > >
> > >> Hi, Zhang YiLong
> > >>
> > >> You are right. As I mentioned in PIP-17, we should have priorities
> > >> among the different index types, and we should consider combining the
> > >> results of different index types.
> > >>
> > >> Best, junhao.
> > >>
> > >>
> > >>> On Mar 18, 2024, at 10:49 AM, Zhang YiLong <[email protected]> wrote:
> > >>>
> > >>> This is a big improvement, but I don't think it is suited to
> > >>> low-cardinality fields, because the index is at the file level, and
> > >>> a low-cardinality field (e.g. gender, with only male and female) is
> > >>> in most cases (when the field is not sorted) present in all files.
> > >>>
> > >>> For specific business needs, we want a JSON index, bitmap index,
> > >>> inverted index, etc. to adapt to different query conditions. So we
> > >>> also need a priority: use different indexes for different query
> > >>> filters and finally combine the results (based on the actual and/or
> > >>> filter criteria).
> > >>>
> > >>> ________________________________
> > >>> From: yu zelin <[email protected]>
> > >>> Sent: March 15, 2024 14:43
> > >>> To: [email protected] <[email protected]>
> > >>> Subject: Re: [DISCUSS] PIP-17: Introduce secondary column index
> > >>>
> > >>> An exciting feature, +1.
> > >>>
> > >>> Best Regards,
> > >>> Zelin Yu
> > >>>
> > >>> On Thu, Mar 14, 2024 at 5:53 PM yejunhao <[email protected]> wrote:
> > >>>
> > >>>> Hi, Paimon Devs, I’d like to start a discussion about PIP-17[1].
> > >>>>
> > >>>> Up to now, Paimon has used zorder, order, and hilbert sort compaction
> > >>>> to speed up queries. After sort compaction, files are sorted by the
> > >>>> specified columns. But in some situations we have, for example, tens
> > >>>> of columns that may appear in filters: sometimes all of them come up
> > >>>> together, sometimes just a few of them. Zorder or order compaction
> > >>>> can't handle this situation, because too many columns reduce the
> > >>>> effect of sorting. So if the cardinality of these columns is small,
> > >>>> we can use a bloom filter or other indexes to speed up queries. That's
> > >>>> why this PIP comes up. I want to introduce an index framework to give
> > >>>> Paimon a flexible index system.
> > >>>>
> > >>>> Looking forward to your questions and suggestions.
> > >>>>
> > >>>> Best, junhao
> > >>>>
> > >>>> [1] https://cwiki.apache.org/confluence/display/PAIMON/PIP-17%3A+Introduce+secondary+column+index
> > >>
> > >>
> >
> >
>
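
The file-level index discussed in this thread can be illustrated with a small
sketch. This is not Paimon's implementation: the BloomFilter class,
build_file_index, and files_to_read are hypothetical names, and the bit size
and hash count are arbitrary. It only shows how a per-file bloom filter lets a
reader skip files for a selective column, and why, as Zhang YiLong points out,
it cannot skip anything for an unsorted low-cardinality column such as gender.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter; num_bits and num_hashes are illustrative, not tuned."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, value):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(value))


def build_file_index(rows, column):
    """Build one filter per data file for a single column (hypothetical helper)."""
    bf = BloomFilter()
    for row in rows:
        bf.add(row[column])
    return bf


def files_to_read(file_indexes, value):
    """Keep only the files whose filter says the value may be present."""
    return [name for name, bf in file_indexes.items() if bf.might_contain(value)]


# Two made-up data files of an append-only table.
data_files = {
    "file-1": [{"user_id": "u1", "gender": "male"},
               {"user_id": "u2", "gender": "female"}],
    "file-2": [{"user_id": "u3", "gender": "female"},
               {"user_id": "u4", "gender": "male"}],
}
user_index = {n: build_file_index(r, "user_id") for n, r in data_files.items()}
gender_index = {n: build_file_index(r, "gender") for n, r in data_files.items()}

# Always includes file-2; file-1 is very likely (not certainly) skipped.
print(files_to_read(user_index, "u3"))
# Both files: every file holds every gender value, so nothing can be skipped.
print(files_to_read(gender_index, "male"))
```

A bloom filter never reports a present value as absent, so a file that holds
the value is always kept; the only cost of a false positive is reading a file
that could have been skipped.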

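Point (3) in Junhao's reply, dropping one column's index by rewriting only the
index file, might look roughly like the sketch below. The on-disk layout used
here (a 4-byte header length, a JSON header of offsets, then the concatenated
index bytes) and all function names are assumptions made for illustration; the
thread does not specify Paimon's actual index file format, and the
DataFileMeta update in the ManifestEntry is left out.

```python
import json
import struct


def write_index_file(column_indexes):
    """Serialize {column: index_bytes} as: header length, JSON header, body."""
    header = {}
    body = b""
    for column, data in column_indexes.items():
        # Record where this column's index bytes start and how long they are.
        header[column] = [len(body), len(data)]
        body += data
    header_bytes = json.dumps(header).encode("utf-8")
    return struct.pack(">I", len(header_bytes)) + header_bytes + body


def read_index_file(blob):
    """Parse the assumed layout back into {column: index_bytes}."""
    (header_len,) = struct.unpack(">I", blob[:4])
    header = json.loads(blob[4:4 + header_len].decode("utf-8"))
    body = blob[4 + header_len:]
    return {col: body[off:off + length] for col, (off, length) in header.items()}


def drop_column_index(blob, column):
    """Rewrite the index file without one column; the data file is never read."""
    indexes = read_index_file(blob)
    indexes.pop(column, None)
    return write_index_file(indexes)


# Index bytes for columns a and b (placeholder payloads).
original = write_index_file({"a": b"AAAA", "b": b"BB"})
rewritten = drop_column_index(original, "a")
print(sorted(read_index_file(rewritten)))  # ['b']
```

Only the index file is read and rewritten, which matches the claim in the PIP
quote that the data file does not have to be read again.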