Re: [DISCUSS] PIP-41: Introduce FilePath Global Index And Optimizations For Lookup In Append Table

Yong Fang Mon, 19 Jan 2026 21:26:24 -0800

Hi Jingsong,

1> For PIP-42, `clustering.columns` sounds good to me, and we can add an
option `clustering.mode` to distinguish the scope of data clustering, such
as `local` and `partition`.
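
To make the shape of these two options concrete, here is a minimal sketch
that models them as a plain options map. It is only an illustration:
`clustering.mode` is the option proposed in this reply and is not an existing
setting, the comma-separated format for `clustering.columns` is an
assumption, and `build_clustering_options` is a hypothetical helper.

```python
# Hypothetical sketch: models the table options discussed above as a dict.
# `clustering.mode` is only proposed in this thread; `local` and `partition`
# are the example values named above, not a finalized list.
ALLOWED_MODES = {"local", "partition"}


def build_clustering_options(columns, mode):
    """Return a table-options dict for the proposed clustering settings."""
    if not columns:
        raise ValueError("clustering.columns needs at least one column")
    if mode not in ALLOWED_MODES:
        raise ValueError(f"unsupported clustering.mode: {mode!r}")
    return {
        # Assumed format: column names joined by commas.
        "clustering.columns": ",".join(columns),
        "clustering.mode": mode,
    }


# Example column names are made up for illustration.
options = build_clustering_options(["user_id", "item_id"], mode="partition")
```

Validating the mode up front keeps an unsupported scope from being silently
accepted, which matters if more scopes are added later.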


2> PIP-41 mainly consists of two parts: a) a global index that maps sort
fields to file names, which can speed up plan construction during queries;
b) the storage implementation of the file path global index.
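
As a rough illustration of part a), the sketch below shows how a mapping from
sort-field values to file names could prune files while building a plan. It
is a toy in-memory model under stated assumptions, not the design in the PIP:
the range-based layout, the integer keys, and the names `FileEntry` and
`SortFieldFileIndex` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEntry:
    file_name: str
    min_key: int  # smallest sort-field value stored in the file
    max_key: int  # largest sort-field value stored in the file


class SortFieldFileIndex:
    """Toy in-memory index from sort-field value ranges to file names."""

    def __init__(self, entries):
        # Keep entries ordered by their lower bound for stable output.
        self._entries = sorted(entries, key=lambda e: e.min_key)

    def files_for_key(self, key):
        """Return names of files whose [min_key, max_key] may contain key."""
        return [
            e.file_name
            for e in self._entries
            if e.min_key <= key <= e.max_key
        ]


# File names and key ranges are made up for illustration.
index = SortFieldFileIndex([
    FileEntry("data-0.parquet", 0, 99),
    FileEntry("data-1.parquet", 100, 199),
    FileEntry("data-2.parquet", 150, 299),
])

# Only files whose key range covers the lookup key need to enter the plan.
candidates = index.files_for_key(160)
```

With such an index, a lookup plans over the candidate files only instead of
listing every file in the table.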

I understand the concerns about Paimon introducing external storage.
Ideally, Paimon should include a high-performance key-value index framework
itself, and I'm glad to hear that the community is currently promoting the
development.
So I'd like to update PIP-41 to mainly define the file path global index
and use it in Paimon's queries, without external index storage
implementations.
After the community improves the index framework, we can directly integrate
it. This way, our internal work can evolve together with the community.
What do you think? THX

Best,
Fang Yong

On Mon, Jan 19, 2026 at 10:18 AM Jingsong Li <[email protected]> wrote:

> Hi Yong,
>
> +1 to PIP-42. I think we could combine it with the `clustering.columns`
> option.
>
> -0 to PIP-41; I'm quite concerned about it. An integration with an
> external system is very difficult to maintain. Perhaps we could
> consider implementing something similar on the periphery of Paimon?
>
> In addition, we are currently developing Paimon's own global index,
> which is not as powerful as HBase. However, there may be some
> conflicts along the way, or we may consider embedding the ability to
> use external indexes within the current global index framework.
>
> Best,
> Jingsong
>
> On Thu, Jan 15, 2026 at 3:17 PM Yong Fang <[email protected]> wrote:
> >
> > Hi Lei,
> >
> > Thanks for your input. We anticipate that the number of row groups
> > within a single parquet file will not be too large. Therefore, we can
> > first implement direct sorting by key to ensure data order within
> > individual files. Later, we can further consider more complex sorting
> > algorithms such as Z-Order based on query requirements and the number
> > of row groups.
> >
> > Best,
> > FangYong
> >
> >
> > On Thu, Jan 15, 2026 at 2:33 PM Yong Fang <[email protected]> wrote:
> >
> > > Hi Jingsong,
> > >
> > > I have separated PIP-41 into PIP-41: Introduce FilePath Global Index
> > > And Optimizations For Lookup In Append Table [1] and PIP-42: Local
> > > Sort And Parquet Lookup Optimizations For In Append Table [2]. The
> > > Google Docs are
> > > https://docs.google.com/document/d/1qV7lUW5GsZ72IuF9FV96fX3wtJO_gzEWY6aLG97caxU/edit?tab=t.0
> > > and
> > > https://docs.google.com/document/d/1M6R5V6zRwlj0LeuKnhZcUK4Ss-9JLs9pF42n822k-Bo/edit?tab=t.0
> > >
> > > [1]
> > > https://cwiki.apache.org/confluence/display/PAIMON/PIP-41%3A+Introduce+FilePath+Global+Index+And+Optimizations+For+Lookup+In+Append+Table
> > > [2]
> > > https://cwiki.apache.org/confluence/display/PAIMON/PIP-42%3A+Local+Sort+And+Parquet+Lookup+Optimizations+For+In+Append+Table
> > >
> > > Best,
> > > FangYong
> > >
> > > On Wed, Jan 14, 2026 at 8:31 PM lei li <[email protected]> wrote:
> > >
> > >> Hi Yong,
> > >>
> > >>
> > >> Thank you for sharing this fascinating PIP proposal!
> > >>
> > >>
> > >> I'm particularly intrigued by the sort buffer concept for Append
> > >> tables. I'd like to ask: does the current sorting process only
> > >> support multi-field sequential sorting, or would it be possible to
> > >> introduce more advanced sorting strategies, such as Z-order sorting
> > >> or other space-filling curve algorithms?
> > >>
> > >>
> > >> The reason I'm curious about this is that in high-volume batch write
> > >> scenarios, having the data already sorted upon completion of the batch
> > >> write could significantly improve query efficiency. Z-order sorting,
> > >> for instance, could provide better data locality for multi-dimensional
> > >> range queries, which are quite common in analytical workloads.
> > >>
> > >>
> > >> Would love to hear your thoughts on whether such sorting enhancements
> > >> align with the current design goals! Looking forward to your insights
> > >> and the continued development of this proposal!
> > >>
> > >>
> > >> Best regards,
> > >>
> > >> Lei Li
> > >>
> > >> On Jan 13, 2026, at 17:32, Yong Fang <[email protected]> wrote:
> > >>
> > >> Hi devs,
> > >>
> > >> I'd like to initiate a discussion on PIP-41: Introduce FilePath Global
> > >> Index And Optimizations For Lookup In Append Table [1].
> > >>
> > >> We use Paimon as cold storage for sample data of businesses such as
> > >> search, recommendation, and advertising. Given the extremely large
> > >> volume of sample data, we adopt Append Tables for data storage.
> > >>
> > >> Batch jobs read and process this sample data, and a lookup-by-key
> > >> capability for historical data is also required during sample data
> > >> processing. To support hybrid queries for such ultra-high-dimensional
> > >> data in Paimon, we introduce the FilePath Global Index, along with
> > >> optimizations for reading Parquet metadata and Parquet file data in
> > >> Append Tables (some of the major optimization designs come from our
> > >> partner teams, cc @lingpeng, @guanziyue, thx), aiming to enhance the
> > >> lookup capability of Paimon Append Tables.
> > >>
> > >> Looking forward to hearing from you, thanks
> > >>
> > >>
> > >> [1]
> > >> https://cwiki.apache.org/confluence/display/PAIMON/PIP-41%3A+Introduce+FilePath+Global+Index+And+Optimizations+For+Lookup+In+Append+Table
> > >>
> > >> Best,
> > >> Fang Yong
> > >>
> > >>
>
