Re: [DISCUSS] PIP-17: Introduce secondary column index

JUNHAO YE Tue, 19 Mar 2024 19:19:26 -0700

Hi, aitozi

Thanks a lot for the comments! I have read your questions and reply below:


(1) For now, the secondary index is mainly designed for append-only tables.
More and more users are migrating from Hive and Hudi to Paimon, and their
main table format is append-only. In the future, once deletion files are
done, I think the secondary index will also be useful for primary key
tables with deletion files. See PIP-16
(https://cwiki.apache.org/confluence/display/PAIMON/PIP-16%3A+Paimon+position+delete+mode).
But that is not in scope for this phase. I should add this to the PIP.

(2) The answer is yes. I referred to the approaches of Hudi and Delta Lake.
Hudi puts the index bytes in the user metadata section of the ORC and
Parquet files, while Delta Lake uses an extra file to hold the index. I
went with an extra file as well, because I want it to be more flexible.
Indeed, it will double the number of files, but the index files themselves
will not be touched often. Later on, we can consider combining these index
files to reduce the pressure on the filesystem, but I think we can
implement it this way for now.
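
To make the layout in (2) concrete, here is a minimal sketch under stated
assumptions. It is not Paimon's actual format: the BloomFilter class, the
index_file_name helper and the ".bloom.index" suffix are all hypothetical.
It keeps one bloom filter index file next to each data file, so a reader
can skip a data file without opening it, at the cost of doubling the file
count.

```python
import hashlib


class BloomFilter:
    """Tiny bloom filter; bit size and hash count are illustrative only."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, value):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(value))


def index_file_name(data_file):
    # Hypothetical naming: one index file per data file per index type.
    return data_file + ".bloom.index"


def build_indexes(data_files):
    """data_files maps a data file name to the values of the indexed column."""
    indexes = {}
    for name, values in data_files.items():
        bloom = BloomFilter()
        for value in values:
            bloom.add(value)
        indexes[index_file_name(name)] = bloom
    return indexes


def files_to_read(data_files, indexes, wanted):
    """Keep only the data files whose index says the value may be present."""
    return [name for name in data_files
            if indexes[index_file_name(name)].might_contain(wanted)]


data_files = {
    "data-0.orc": ["u1", "u2", "u3"],
    "data-1.orc": ["u4", "u5", "u6"],
}
indexes = build_indexes(data_files)
total_files = len(data_files) + len(indexes)  # data files plus index files
candidates = files_to_read(data_files, indexes, "u5")
```

A bloom filter never gives a false negative, so "data-1.orc" is always in
candidates; "data-0.orc" is skipped unless the filter reports a false
positive.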

(3) Correct. If you want to drop one column index (this does not happen
often), we just read the index file, discard the corresponding bytes, write
the rest back to the file, and finally update the DataFileMeta in the
ManifestEntry.
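
The drop flow in (3) can be sketched as follows. This is only an
illustration under assumptions: the index file layout (a length-prefixed
JSON header followed by per-column byte segments) and all function names
are hypothetical, and a plain dict stands in for the DataFileMeta kept in
the ManifestEntry.

```python
import json
import struct


def write_index_file(column_indexes):
    """Serialize {column: index bytes} as [4-byte header length][JSON header][body]."""
    header = {}
    body = b""
    for column, data in column_indexes.items():
        header[column] = [len(body), len(data)]  # offset and length in the body
        body += data
    header_bytes = json.dumps(header).encode("utf-8")
    return struct.pack(">I", len(header_bytes)) + header_bytes + body


def read_index_file(blob):
    """Parse the hypothetical layout back into {column: index bytes}."""
    (header_len,) = struct.unpack(">I", blob[:4])
    header = json.loads(blob[4:4 + header_len].decode("utf-8"))
    body = blob[4 + header_len:]
    return {col: body[off:off + length] for col, (off, length) in header.items()}


def drop_column_index(blob, column):
    """Rewrite the index file without `column`; the data file is never read."""
    remaining = read_index_file(blob)
    remaining.pop(column, None)
    return write_index_file(remaining)


# Index file covering columns a and b, then drop the index on a.
old_blob = write_index_file({"a": b"\x01\x02\x03", "b": b"\x0a\x0b"})
new_blob = drop_column_index(old_blob, "a")

# Stand-in for the metadata update: the entry now describes the rewritten index file.
data_file_meta = {"file": "data-0.orc", "index_file": "data-0.index",
                  "index_size": len(old_blob)}
data_file_meta["index_size"] = len(new_blob)
```

Only the index file and the metadata entry change; the data file itself is
untouched, which is why the drop does not require reading the data again.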

Thanks again for the comments!

Best,
Junhao




> On Mar 19, 2024, at 11:07 PM, Aitozi <[email protected]> wrote:
> 
> Hi, junhao
> 
> It's nice to see the secondary index feature in Paimon. After reading the
> PIP, I have several questions.
> 
> (1) For the primary key table, we only push down filters on the primary
> key, because we cannot filter on a value if that value still has to be
> merged with data from other levels. So will the primary key table benefit
> from the secondary column index? Or is the main improvement for the
> append table?
> 
> (2) About the storage of the index file, "one file for one datafile of one
> index type": will this bring too many extra files? Will each index type
> double the file number?
> 
> (3) "While drop column index, for example, I have indexed column a and b, I
> don't want to index a anymore. I just need to drop the target index bytes
> from index file,
> and don't have to read the data file again."
> 
> Do you mean we will have to rewrite the index file when dropping one
> column index in it?
> 
> Best,
> Aitozi
> 
> On Tue, Mar 19, 2024 at 19:26, JUNHAO YE <[email protected]> wrote:
> 
>> Hi, Zhang YiLong
>> 
>> You are right, as I mentioned in PIP-17. We should assign priorities to
>> the different index types, and we should consider how to combine the
>> results of different index types.
>> 
>> Best, junhao.
>> 
>> 
>>> On Mar 18, 2024, at 10:49 AM, Zhang YiLong <[email protected]> wrote:
>>> 
>>> This is a big improvement, but I don't think it suits low-cardinality
>>> fields, because the index is at the file level, and a low-cardinality
>>> field (e.g. gender is only male and female) is in most cases (when the
>>> field is not sorted) present in all files.
>>> 
>>> For specific business needs, we want a JSON index, bitmap index, inverted
>>> index, etc. to adapt to different query conditions. So we also need a
>>> priority: use different indexes for different query filters and finally
>>> combine the results (based on the actual filter criteria, and/or).
>>> 
>>> ________________________________
>>> From: yu zelin <[email protected]>
>>> Sent: March 15, 2024 14:43
>>> To: [email protected] <[email protected]>
>>> Subject: Re: [DISCUSS] PIP-17: Introduce secondary column index
>>> 
>>> An exciting feature, +1.
>>> 
>>> Best Regards,
>>> Zelin Yu
>>> 
>>> On Thu, Mar 14, 2024 at 5:53 PM yejunhao <[email protected]> wrote:
>>> 
>>>> Hi, Paimon Devs, I’d like to start a discussion about PIP-17 [1].
>>>>
>>>> Up to now, Paimon has used zorder, order and hilbert sort compaction to
>>>> speed up queries. After sort compaction, files are sorted by the
>>>> specified columns. But in some situations, for example when we have tens
>>>> of columns that may appear in the filter, sometimes all of them come up
>>>> together and sometimes just a few of them. Zorder or order compaction
>>>> can't handle this situation, because too many columns reduce the effect
>>>> of sorting. So if the cardinality of these columns is small, we can use
>>>> a bloom filter or other indexes to speed up queries. That's why this PIP
>>>> comes up. I want to introduce an index framework to give Paimon a
>>>> flexible index system.
>>>> 
>>>> Looking forward to your questions and suggestions.
>>>> 
>>>> Best, junhao
>>>> 
>>>> [1]
>>>> https://cwiki.apache.org/confluence/display/PAIMON/PIP-17%3A+Introduce+secondary+column+index
