Re: [DISCUSS] PIP-41: Introduce FilePath Global Index And Optimizations For Lookup In Append Table

Jingsong Li Sun, 25 Jan 2026 19:57:41 -0800

Thanks Yong!

+1 to PIP-42. Maybe we should unify it with 'clustering.incremental'.


Best,
Jingsong

On Thu, Jan 22, 2026 at 7:02 PM Yong Fang <[email protected]> wrote:
>
> Hi Jingsong,
>
> Thanks for your feedback. I've updated PIP-41 and PIP-42; please help review
> them again when you're free.
>
>
> Best,
> Fang Yong
>
> On Tue, Jan 20, 2026 at 3:58 PM Jingsong Li <[email protected]> wrote:
>
> > Hi Yong,
> >
> > Sounds good to me!
> >
> > You are right: first we should have a framework that indexes the file
> > name and position, not just the global row id. Then we can implement it
> > using the Paimon index in the community. Using this framework, you can
> > implement an HBase index in your internal environment.
> >
> > Best,
> > Jingsong
> >
> > On Tue, Jan 20, 2026 at 1:26 PM Yong Fang <[email protected]> wrote:
> > >
> > > Hi Jingsong,
> > >
> > > 1> For PIP-42, `clustering.columns` sounds good to me, and we can add an
> > > option `clustering.mode` to distinguish the scope of data clustering, such
> > > as `local` and `partition`.
> > >
> > > 2> PIP-41 mainly consists of two parts: a) add a sort fields -> file name
> > > global index, which can speed up plan construction during queries; b) the
> > > storage implementation of the file path global index.
> > >
> > > I understand the concerns about Paimon introducing external storage.
> > > Ideally, Paimon should include a high-performance key-value index framework
> > > itself, and I'm glad to hear that the community is currently driving its
> > > development.
> > > So I'd like to update PIP-41 to mainly define the file path global index
> > > and use it in Paimon's queries, without external index storage
> > > implementations.
> > > After the community improves the index framework, we can directly integrate
> > > it. This way, our internal work can evolve together with the community.
> > > What do you think? Thanks.
> > >
> > > Best,
> > > Fang Yong
> > >
> > > On Mon, Jan 19, 2026 at 10:18 AM Jingsong Li <[email protected]> wrote:
> > >
> > > > Hi Yong,
> > > >
> > > > +1 to PIP-42. I think maybe we can combine it with the
> > > > `clustering.columns` option.
> > > >
> > > > -0 to PIP-41; I'm quite concerned about it. Integrating an external
> > > > system is very difficult to maintain. Perhaps we could consider
> > > > implementing something similar on the periphery of Paimon?
> > > >
> > > > In addition, we are currently developing Paimon's own global index,
> > > > which is not as powerful as HBase. However, there may be some
> > > > conflicts along the way, or we may consider embedding the ability to
> > > > use external indexes within the current global index framework.
> > > >
> > > > Best,
> > > > Jingsong
> > > >
> > > > On Thu, Jan 15, 2026 at 3:17 PM Yong Fang <[email protected]> wrote:
> > > > >
> > > > > Hi Lei,
> > > > >
> > > > > Thanks for your input. We anticipate that the number of row groups within
> > > > > a single Parquet file will not be too large. Therefore, we can first
> > > > > implement direct sorting by key to ensure data order within individual
> > > > > files. Later, we can further consider more complex sorting algorithms such
> > > > > as Z-Order based on query requirements and the number of row groups.
> > > > >
> > > > > Best,
> > > > > FangYong
> > > > >
> > > > >
> > > > > On Thu, Jan 15, 2026 at 2:33 PM Yong Fang <[email protected]> wrote:
> > > > >
> > > > > > Hi Jingsong,
> > > > > >
> > > > > > I have separated PIP-41 into PIP-41: Introduce FilePath Global Index And
> > > > > > Optimizations For Lookup In Append Table [1] and PIP-42: Local Sort And
> > > > > > Parquet Lookup Optimizations For In Append Table [2]. The Google Docs are
> > > > > > https://docs.google.com/document/d/1qV7lUW5GsZ72IuF9FV96fX3wtJO_gzEWY6aLG97caxU/edit?tab=t.0
> > > > > > and
> > > > > > https://docs.google.com/document/d/1M6R5V6zRwlj0LeuKnhZcUK4Ss-9JLs9pF42n822k-Bo/edit?tab=t.0
> > > > > >
> > > > > > [1]
> > > > > > https://cwiki.apache.org/confluence/display/PAIMON/PIP-41%3A+Introduce+FilePath+Global+Index+And+Optimizations+For+Lookup+In+Append+Table
> > > > > > [2]
> > > > > > https://cwiki.apache.org/confluence/display/PAIMON/PIP-42%3A+Local+Sort+And+Parquet+Lookup+Optimizations+For+In+Append+Table
> > > > > >
> > > > > > Best,
> > > > > > FangYong
> > > > > >
> > > > > > On Wed, Jan 14, 2026 at 8:31 PM lei li <[email protected]> wrote:
> > > > > >
> > > > > >> Hi Yong,
> > > > > >>
> > > > > >>
> > > > > >> Thank you for sharing this fascinating PIP proposal!
> > > > > >>
> > > > > >>
> > > > > >> I'm particularly intrigued by the sort buffer concept for Append
> > > > tables.
> > > > > >> I'd like to ask: does the current sorting process only support
> > > > multi-field
> > > > > >> sequential sorting, or would it be possible to introduce more
> > advanced
> > > > > >> sorting strategies, such as Z-order sorting or other space-filling
> > > > curve
> > > > > >> algorithms?
> > > > > >>
> > > > > >>
> > > > > >> The reason I'm curious about this is that in high-volume batch
> > write
> > > > > >> scenarios, having the data already sorted upon completion of the
> > batch
> > > > > >> write could significantly improve query efficiency. Z-order
> > sorting,
> > > > for
> > > > > >> instance, could provide better data locality for multi-dimensional
> > > > range
> > > > > >> queries, which is quite common in analytical workloads.
> > > > > >>
> > > > > >>
> > > > > >> Would love to hear your thoughts on whether such sorting
> > enhancements
> > > > > >> align with the current design goals! Looking forward to your
> > insights
> > > > and
> > > > > >> the continued development of this proposal!
> > > > > >>
> > > > > >>
> > > > > >> Best regards,
> > > > > >>
> > > > > >> Lei Li
> > > > > >>
> > > > > > >> On Jan 13, 2026, at 17:32, Yong Fang <[email protected]> wrote:
> > > > > >>
> > > > > >> Hi devs,
> > > > > >>
> > > > > > >> I'd like to initiate a discussion on PIP-41: Introduce FilePath Global
> > > > > > >> Index And Optimizations For Lookup In Append Table [1].
> > > > > > >>
> > > > > > >> We use Paimon as cold storage for sample data of businesses such as
> > > > > > >> search, recommendation, and advertising. Given the extremely large volume
> > > > > > >> of sample data, we adopt Append Tables for data storage.
> > > > > > >>
> > > > > > >> Batch jobs read and process this sample data, while lookup-by-key
> > > > > > >> capability for historical data is also required during the sample data
> > > > > > >> processing. To support hybrid queries for such ultra-high-dimensional data
> > > > > > >> in Paimon, we introduce the FilePath Global Index, along with
> > > > > > >> optimizations for reading Parquet metadata and Parquet file data in Append
> > > > > > >> Tables (some of the major optimization designs come from our partner
> > > > > > >> teams, cc @lingpeng, @guanziyue, thanks), aiming to enhance the lookup
> > > > > > >> capability of Paimon Append Tables.
> > > > > > >>
> > > > > > >> Looking forward to hearing from you, thanks.
> > > > > >>
> > > > > >>
> > > > > >> [1]
> > > > > >>
> > > > > >>
> > > >
> > https://cwiki.apache.org/confluence/display/PAIMON/PIP-41%3A+Introduce+FilePath+Global+Index+And+Optimizations+For+Lookup+In+Append+Table
> > > > > >>
> > > > > >> Best,
> > > > > >> Fang Yong
> > > > > >>
> > > > > >>
> > > >
> >
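
For readers following the thread: the point about indexing the file name and position, rather than only a global row id, can be pictured with a small sketch. Everything below (`FileLocation`, `FilePathIndex`, `plan_lookup`, the file names and keys) is hypothetical and is not Paimon's API or the PIP-41 design. It only assumes a key -> (file, position) mapping and shows how such a mapping could let query planning touch only the files that hold the requested keys.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FileLocation:
    """Hypothetical index entry: which data file holds a key, and where."""
    file_path: str
    position: int  # row position inside the file


class FilePathIndex:
    """Toy in-memory key -> FileLocation index (a stand-in for a real KV store)."""

    def __init__(self):
        self._entries = {}

    def put(self, key, file_path, position):
        self._entries[key] = FileLocation(file_path, position)

    def plan_lookup(self, keys):
        """Group the requested keys by file so only those files are read."""
        plan = defaultdict(list)
        for key in keys:
            loc = self._entries.get(key)
            if loc is not None:  # keys missing from the index are skipped
                plan[loc.file_path].append(loc.position)
        return {path: sorted(positions) for path, positions in plan.items()}


index = FilePathIndex()
index.put("user-1", "data-0001.parquet", 42)
index.put("user-2", "data-0002.parquet", 7)
index.put("user-3", "data-0001.parquet", 3)

# Only data-0001.parquet needs to be opened for these keys.
plan = index.plan_lookup(["user-3", "user-1", "user-9"])
```

Because the index sits behind a small interface, the backing store could in principle be swapped (for example, a built-in index in the community and an external store internally), which is the separation the thread converges on.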

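
The sorting question raised in the thread (plain multi-field sort versus Z-order) can also be illustrated with a minimal sketch. It is not taken from PIP-42; `interleave_bits` and the 4x4 grid of sample rows are assumptions made for illustration. The sketch computes a Z-order (Morton) code by interleaving the bits of two integer columns and compares the resulting row order with an ordinary sort on (x, y).

```python
def interleave_bits(x, y, bits=16):
    """Return the Z-order (Morton) code of two non-negative integers."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)      # bits of x go to even positions
        code |= ((y >> i) & 1) << (2 * i + 1)  # bits of y go to odd positions
    return code


# A 4x4 grid of (x, y) rows, standing in for two sort columns.
rows = [(x, y) for x in range(4) for y in range(4)]

by_key = sorted(rows)                                        # multi-field sort
by_zorder = sorted(rows, key=lambda r: interleave_bits(*r))  # Z-order sort

# Treat every 4 consecutive rows as one chunk (e.g. one row group).
first_chunk_by_key = by_key[:4]
first_chunk_by_zorder = by_zorder[:4]
```

With the plain sort, the first chunk holds x = 0 with every value of y, so min/max statistics on y cannot prune it. With Z-order, the first chunk is the 2x2 block where both x and y are in [0, 1], so a range filter on either column can skip it. That is the locality benefit for multi-dimensional range queries mentioned in the thread, at the cost of weaker ordering on the leading column.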