Re: [DISCUSS] PIP-41: Introduce FilePath Global Index And Optimizations For Lookup In Append Table

Yong Fang Tue, 27 Jan 2026 03:19:19 -0800

Thanks to Jingsong's suggestion, I have added `clustering.incremental.mode`,
which supports two file-selection strategies: `LSM` (the default) and
`SINGLE` for a single file.
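
For illustration only, here is a minimal Python sketch of how such an option
might be resolved from a table's options. The option key and the values `LSM`
and `SINGLE` come from this thread; the enum, the function name, the
dict-based options, and the case-insensitive parsing are all assumptions and
not the actual Paimon API.

```python
# Sketch only: names below are hypothetical, not the actual Paimon API.
from enum import Enum

# Option key as named in this thread.
OPTION_KEY = "clustering.incremental.mode"


class IncrementalMode(Enum):
    LSM = "LSM"        # default strategy, per this thread
    SINGLE = "SINGLE"  # single-file strategy


def resolve_mode(table_options: dict) -> IncrementalMode:
    """Read the mode from an options mapping, defaulting to LSM."""
    raw = table_options.get(OPTION_KEY, IncrementalMode.LSM.value)
    try:
        # Assumption: values are matched case-insensitively.
        return IncrementalMode(raw.strip().upper())
    except ValueError:
        raise ValueError(f"Unsupported {OPTION_KEY}: {raw!r}") from None


print(resolve_mode({}))
print(resolve_mode({OPTION_KEY: "single"}))
```

Rejecting unknown values early, instead of silently falling back to the
default, is a design choice of this sketch.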


If there are no more comments, I will start a vote thread for PIP-42 in 72
hours.

After that, I will update the content of PIP-41 based on the current global
index API and start a new discussion. Thank you.

Best,
Fang Yong

On Mon, Jan 26, 2026 at 11:57 AM Jingsong Li <[email protected]> wrote:

> Thanks Yong!
>
> +1 to PIP-42, maybe we should unify 'clustering.incremental'.
>
> Best,
> Jingsong
>
> On Thu, Jan 22, 2026 at 7:02 PM Yong Fang <[email protected]> wrote:
> >
> > Hi Jingsong,
> >
> > Thanks for your feedback. I've updated PIP-41 and PIP-42; please review
> > them again when you're free.
> >
> >
> > Best,
> > Fang Yong
> >
> > On Tue, Jan 20, 2026 at 3:58 PM Jingsong Li <[email protected]> wrote:
> >
> > > Hi Yong,
> > >
> > > Sounds good to me!
> > >
> > > You are right. The first thing we need is a framework that indexes
> > > the file name and position, not just the global row id. Then we can
> > > implement it using the Paimon index in the community. Using this
> > > framework, you can implement an HBase index in your internal environment.
> > >
> > > Best,
> > > Jingsong
> > >
> > > On Tue, Jan 20, 2026 at 1:26 PM Yong Fang <[email protected]> wrote:
> > > >
> > > > Hi Jingsong,
> > > >
> > > > 1> For PIP-42, `clustering.columns` sounds good to me, and we can add an
> > > > option `clustering.mode` to distinguish the scope of data clustering,
> > > > such as `local` and `partition`.
> > > >
> > > > 2> PIP-41 mainly consists of two parts: a) add a sort fields ->
> > > > filename global index, which can speed up plan construction during
> > > > queries; b) the storage implementation of the file path global index.
> > > >
> > > > I understand the concerns about Paimon introducing external storage.
> > > > Ideally, Paimon should include a high-performance key-value index
> > > > framework itself, and I'm glad to hear that the community is currently
> > > > driving this development.
> > > > So I'd like to update PIP-41 to mainly define the file path global index
> > > > and use it in Paimon's queries, without external index storage
> > > > implementations.
> > > > After the community improves the index framework, we can directly
> > > > integrate it. This way, our internal work can evolve together with the
> > > > community.
> > > > What do you think? Thanks
> > > >
> > > > Best,
> > > > Fang Yong
> > > >
> > > > On Mon, Jan 19, 2026 at 10:18 AM Jingsong Li <[email protected]> wrote:
> > > >
> > > > > Hi Yong,
> > > > >
> > > > > +1 to PIP-42. I think maybe we can combine it with the
> > > > > `clustering.columns` option.
> > > > >
> > > > > -0 to PIP-41. I'm quite concerned about it: an integration with an
> > > > > external system is very difficult to maintain. Perhaps we could
> > > > > consider implementing something similar on the periphery of Paimon?
> > > > >
> > > > > In addition, we are currently developing Paimon's own global index,
> > > > > which is not as powerful as HBase. However, there may be some
> > > > > conflicts along the way, or we may consider embedding the ability to
> > > > > use external indexes within the current global index framework.
> > > > >
> > > > > Best,
> > > > > Jingsong
> > > > >
> > > > > On Thu, Jan 15, 2026 at 3:17 PM Yong Fang <[email protected]> wrote:
> > > > > >
> > > > > > Hi Lei,
> > > > > >
> > > > > > Thanks for your input. We anticipate that the number of row groups
> > > > > > within a single parquet file will not be too large. Therefore, we can
> > > > > > first implement direct sorting by key to ensure data order within
> > > > > > individual files. Later, we can further consider more complex sorting
> > > > > > algorithms such as Z-Order based on query requirements and the number
> > > > > > of row groups.
> > > > > >
> > > > > > Best,
> > > > > > FangYong
> > > > > >
> > > > > >
> > > > > > On Thu, Jan 15, 2026 at 2:33 PM Yong Fang <[email protected]> wrote:
> > > > > >
> > > > > > > Hi Jingsong,
> > > > > > >
> > > > > > > I have separated PIP-41 into PIP-41: Introduce FilePath Global Index
> > > > > > > And Optimizations For Lookup In Append Table [1] and PIP-42: Local
> > > > > > > Sort And Parquet Lookup Optimizations For In Append Table [2]. The
> > > > > > > Google Docs are
> > > > > > > https://docs.google.com/document/d/1qV7lUW5GsZ72IuF9FV96fX3wtJO_gzEWY6aLG97caxU/edit?tab=t.0
> > > > > > > and
> > > > > > > https://docs.google.com/document/d/1M6R5V6zRwlj0LeuKnhZcUK4Ss-9JLs9pF42n822k-Bo/edit?tab=t.0
> > > > > > >
> > > > > > > [1]
> > > > > > > https://cwiki.apache.org/confluence/display/PAIMON/PIP-41%3A+Introduce+FilePath+Global+Index+And+Optimizations+For+Lookup+In+Append+Table
> > > > > > > [2]
> > > > > > > https://cwiki.apache.org/confluence/display/PAIMON/PIP-42%3A+Local+Sort+And+Parquet+Lookup+Optimizations+For+In+Append+Table
> > > > > > >
> > > > > > > Best,
> > > > > > > FangYong
> > > > > > >
> > > > > > > On Wed, Jan 14, 2026 at 8:31 PM lei li <[email protected]> wrote:
> > > > > > >
> > > > > > >> Hi Yong,
> > > > > > >>
> > > > > > >>
> > > > > > >> Thank you for sharing this fascinating PIP proposal!
> > > > > > >>
> > > > > > >>
> > > > > > > >> I'm particularly intrigued by the sort buffer concept for Append
> > > > > > > >> tables. I'd like to ask: does the current sorting process only support
> > > > > > > >> multi-field sequential sorting, or would it be possible to introduce
> > > > > > > >> more advanced sorting strategies, such as Z-order sorting or other
> > > > > > > >> space-filling curve algorithms?
> > > > > > >>
> > > > > > >>
> > > > > > > >> The reason I'm curious about this is that in high-volume batch write
> > > > > > > >> scenarios, having the data already sorted upon completion of the batch
> > > > > > > >> write could significantly improve query efficiency. Z-order sorting,
> > > > > > > >> for instance, could provide better data locality for multi-dimensional
> > > > > > > >> range queries, which is quite common in analytical workloads.
> > > > > > >>
> > > > > > >>
> > > > > > > >> Would love to hear your thoughts on whether such sorting enhancements
> > > > > > > >> align with the current design goals! Looking forward to your insights
> > > > > > > >> and the continued development of this proposal!
> > > > > > >>
> > > > > > >>
> > > > > > >> Best regards,
> > > > > > >>
> > > > > > >> Lei Li
> > > > > > >>
> > > > > > > >> On Jan 13, 2026, at 17:32, Yong Fang <[email protected]> wrote:
> > > > > > >>
> > > > > > >> Hi devs,
> > > > > > >>
> > > > > > > >> I'd like to initiate a discussion on PIP-41: Introduce FilePath Global
> > > > > > > >> Index And Optimizations For Lookup In Append Table [1].
> > > > > > >>
> > > > > > > >> We use Paimon as cold storage for sample data of businesses such as
> > > > > > > >> search, recommendation, and advertising. Given the extremely large
> > > > > > > >> volume of sample data, we adopt Append Tables for data storage.
> > > > > > >>
> > > > > > > >> Batch jobs read and process this sample data, while lookup-by-key
> > > > > > > >> capability for historical data is also required during the sample data
> > > > > > > >> processing. To support hybrid queries for such ultra-high-dimensional
> > > > > > > >> data in Paimon, we introduce the FilePath Global Index, along with
> > > > > > > >> optimizations for reading Parquet metadata and Parquet file data in
> > > > > > > >> Append Tables (some of the major optimization designs come from our
> > > > > > > >> partner teams, cc @lingpeng, @guanziyue, thx), aiming to enhance the
> > > > > > > >> lookup capability of Paimon Append Tables.
> > > > > > >>
> > > > > > >> Looking forward to hearing from you, thanks
> > > > > > >>
> > > > > > >>
> > > > > > >> [1]
> > > > > > >>
> > > > > > >>
> > > > >
> > >
> https://cwiki.apache.org/confluence/display/PAIMON/PIP-41%3A+Introduce+FilePath+Global+Index+And+Optimizations+For+Lookup+In+Append+Table
> > > > > > >>
> > > > > > >> Best,
> > > > > > >> Fang Yong
> > > > > > >>
> > > > > > >>
> > > > >
> > >
>
