Thanks for your input. I have no other questions, +1 for this. Looking forward to this feature.

Best,
Aitozi

JUNHAO YE <[email protected]> wrote on Wed, Mar 20, 2024 at 10:19:

> Hi, Aitozi
>
> Thanks a lot for the comments! I have read your questions and reply here:
>
> (1) For now, the secondary index is mainly designed for append-only tables.
> More and more users are migrating from Hive and Hudi to Paimon, and their
> main table format is append-only. In the future, once deletion files land,
> I think the secondary index will also be useful for primary key tables
> with deletion files. See PIP-16 (
> https://cwiki.apache.org/confluence/display/PAIMON/PIP-16%3A+Paimon+position+delete+mode
> ). But that is not the job of this phase. I should add this to the PIP.
>
> (2) The answer is yes. I referred to the approaches of Hudi and Delta Lake.
> Hudi puts the index bytes in the user metadata space of the ORC and Parquet
> files, while Delta Lake uses an extra file to hold the index. I want the
> index to be more flexible, so it lives in its own file. Indeed, this will
> double the number of files, but the index files themselves will not be
> touched often. Later we can consider combining these index files to reduce
> the pressure on the filesystem, but I think we can implement it this way
> for now.
>
> (3) Correct. If you want to drop one column index (this does not happen
> often), we just rewrite the index file: discard the corresponding bytes,
> write the result back to a file, and then rewrite the DataFileMeta in the
> ManifestEntry.
>
> Thanks again for the comments!
>
> Best,
> Junhao
>
> On Mar 19, 2024, at 11:07 PM, Aitozi <[email protected]> wrote:
> >
> > Hi, Junhao
> >
> > It's nice to see the secondary index feature in Paimon. After reading
> > the PIP, I have several questions.
> >
> > (1) For the primary key table, we only push down filters on the primary
> > key, because we cannot filter on a value if that value still has to be
> > merged with data from other levels. So will the primary key table benefit
> > from the secondary column index? Or is the main improvement for the
> > append table?
> >
> > (2) The storage of the index file, "one file for one datafile of one
> > index type": will this bring too many extra files? Will one index type
> > double the file count?
> >
> > (3) "While drop column index, for example, I have indexed column a and b,
> > I don't want to index a anymore. I just need to drop the target index
> > bytes from index file, and don't have to read the data file again."
> >
> > Do you mean we will have to rewrite the index file when dropping one
> > column index from it?
> >
> > Best,
> > Aitozi
> >
> > JUNHAO YE <[email protected]> wrote on Tue, Mar 19, 2024 at 19:26:
> >
> >> Hi, Zhang YiLong
> >>
> >> You are right, as I mentioned in PIP-17. We should have priorities among
> >> the different index types, and we should consider combining the results
> >> of different index types.
> >>
> >> Best, junhao.
> >>
> >>> On Mar 18, 2024, at 10:49 AM, Zhang YiLong <[email protected]> wrote:
> >>>
> >>> This is a big improvement, but I don't think it suits low-cardinality
> >>> fields, because the index is at the file level, and a low-cardinality
> >>> field (e.g. gender, which is only male and female) is in most cases
> >>> (when the field is not sorted) present in all files.
> >>>
> >>> For specific business needs, we want a JSON index, bitmap index, reverse
> >>> index, etc. to adapt to different query conditions. So we also need a
> >>> priority: use different indexes for different query filters and finally
> >>> combine the results (according to the AND/OR of the actual filter
> >>> conditions).
> >>>
> >>> ________________________________
> >>> From: yu zelin <[email protected]>
> >>> Sent: March 15, 2024 14:43
> >>> To: [email protected] <[email protected]>
> >>> Subject: Re: [DISCUSS] PIP-17: Introduce secondary column index
> >>>
> >>> An exciting feature, +1.
> >>>
> >>> Best Regards,
> >>> Zelin Yu
> >>>
> >>> On Thu, Mar 14, 2024 at 5:53 PM yejunhao <[email protected]> wrote:
> >>>
> >>>> Hi, Paimon Devs, I’d like to start a discussion about PIP-17[1].
> >>>>
> >>>> Up to now, Paimon has used z-order, order, and Hilbert sort compaction
> >>>> to speed up queries. After sort compaction, files are sorted by the
> >>>> specified columns. But in some situations, for example when we have
> >>>> tens of columns that may appear in filters, sometimes all of them come
> >>>> up together and sometimes just a few of them. Z-order or order
> >>>> compaction can't handle this situation, because too many columns reduce
> >>>> the effect of sorting. So if the cardinality of these columns is small,
> >>>> we can use a bloom filter or other indexes to speed up queries. That's
> >>>> why this PIP comes up. I want to introduce an index framework to give
> >>>> Paimon a flexible index system.
> >>>>
> >>>> I look forward to your questions and suggestions.
> >>>>
> >>>> Best, junhao
> >>>>
> >>>> [1] https://cwiki.apache.org/confluence/display/PAIMON/PIP-17%3A+Introduce+secondary+column+index
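
For readers of this thread, here is a minimal sketch of the idea discussed above: one index file per data file, holding one bloom filter per indexed column, used to skip files and rewritten (without reading the data file) when a column index is dropped. It is only an illustration under assumptions. The names BloomFilter, FileIndex, build_index and files_to_read are hypothetical and are not Paimon APIs, the filter sizes are arbitrary, and a real implementation would also have to update the DataFileMeta in the ManifestEntry as described in the thread.

```python
import hashlib
from dataclasses import dataclass, field


class BloomFilter:
    """Tiny bloom filter; sizes are illustrative, not tuned."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, value: str):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


@dataclass
class FileIndex:
    """Hypothetical index file: one per data file, one filter per indexed column."""
    data_file: str
    columns: dict = field(default_factory=dict)  # column name -> BloomFilter

    def drop_column(self, column: str) -> "FileIndex":
        # Rewrite the index without the dropped column; the data file is not read.
        kept = {c: f for c, f in self.columns.items() if c != column}
        return FileIndex(self.data_file, kept)


def build_index(data_file: str, rows: list, indexed_columns: list) -> FileIndex:
    index = FileIndex(data_file)
    for column in indexed_columns:
        bloom = BloomFilter()
        for row in rows:
            bloom.add(str(row[column]))
        index.columns[column] = bloom
    return index


def files_to_read(indexes: list, equals: dict) -> list:
    """Keep a file unless an indexed column proves an AND of equality filters cannot match."""
    selected = []
    for index in indexes:
        skip = any(
            column in index.columns
            and not index.columns[column].might_contain(str(value))
            for column, value in equals.items()
        )
        if not skip:
            selected.append(index.data_file)
    return selected
```

With indexes built for two data files, files_to_read(indexes, {"user_id": "u1"}) returns the files whose filter may contain "u1". After drop_column("user_id") is applied to every index, the same query can no longer skip any file, which is the trade-off raised in question (3).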
