Thanks Yong! +1 to PIP-42, maybe we should unify 'clustering.incremental'.
Best,
Jingsong

On Thu, Jan 22, 2026 at 7:02 PM Yong Fang <[email protected]> wrote:
>
> Hi Jingsong,
>
> Thanks for your feedback, I've updated PIP-41 and PIP-42, pls help to review them again when you're free.
>
> Best,
> Fang Yong
>
> On Tue, Jan 20, 2026 at 3:58 PM Jingsong Li <[email protected]> wrote:
>
> > Hi Yong,
> >
> > Sounds good to me!
> >
> > You are right, the first thing is that we should have a framework to index the file name and position, not just the global row id. Then we can implement it using the Paimon index in the community. Using this framework, you can implement the HBase index in your internal environment.
> >
> > Best,
> > Jingsong
> >
> > On Tue, Jan 20, 2026 at 1:26 PM Yong Fang <[email protected]> wrote:
> > >
> > > Hi Jingsong,
> > >
> > > 1> For PIP-42, `clustering.columns` sounds good to me, and we can add an option `clustering.mode` to distinguish the scope of data clustering, such as `local` and `partition`.
> > >
> > > 2> For PIP-41, it mainly consists of two parts: a) Add a sort fields -> filename global index, which can speed up plan construction during the query process; b) The storage implementation of the file path global index.
> > >
> > > I understand the concerns about Paimon introducing external storage. Ideally, Paimon should include a high-performance key-value index framework itself, and I'm glad to hear that the community is currently promoting the development.
> > > So I'd like to update PIP-41 to mainly define the file path global index and use it in Paimon's queries, without external index storage implementations.
> > > After the community improves the index framework, we can directly integrate it. This way, our internal work can evolve together with the community.
> > > What do you think? THX
> > >
> > > Best,
> > > Fang Yong
> > >
> > > On Mon, Jan 19, 2026 at 10:18 AM Jingsong Li <[email protected]> wrote:
> > >
> > > > Hi Yong,
> > > >
> > > > +1 to PIP-42, I think maybe we can combine it into the `clustering.columns` options.
> > > >
> > > > -0 to PIP-41, I'm quite concerned about PIP-41. An integration with an external system is very difficult to maintain. Perhaps we could consider implementing something similar on the periphery of Paimon?
> > > >
> > > > In addition, we are currently developing Paimon's own global index, which is not as powerful as HBase. However, there may be some conflicts along the way, or we may consider embedding the ability to use external indexes within the current global index framework.
> > > >
> > > > Best,
> > > > Jingsong
> > > >
> > > > On Thu, Jan 15, 2026 at 3:17 PM Yong Fang <[email protected]> wrote:
> > > > >
> > > > > Hi Lei,
> > > > >
> > > > > Thanks for your input. We anticipate that the number of row groups within a single parquet file will not be too large. Therefore, we can first implement direct sorting by key to ensure data order within individual files. Later, we can further consider more complex sorting algorithms such as Z-Order based on query requirements and the number of row groups.
> > > > >
> > > > > Best,
> > > > > FangYong
> > > > >
> > > > > On Thu, Jan 15, 2026 at 2:33 PM Yong Fang <[email protected]> wrote:
> > > > >
> > > > > > Hi Jingsong,
> > > > > >
> > > > > > I have separated PIP-41 into PIP-41: Introduce FilePath Global Index And Optimizations For Lookup In Append Table [1] and PIP-42: Local Sort And Parquet Lookup Optimizations For In Append Table [2], google docs are
> > > > > >
> > > > > > https://docs.google.com/document/d/1qV7lUW5GsZ72IuF9FV96fX3wtJO_gzEWY6aLG97caxU/edit?tab=t.0
> > > > > > and
> > > > > > https://docs.google.com/document/d/1M6R5V6zRwlj0LeuKnhZcUK4Ss-9JLs9pF42n822k-Bo/edit?tab=t.0
> > > > > >
> > > > > > [1] https://cwiki.apache.org/confluence/display/PAIMON/PIP-41%3A+Introduce+FilePath+Global+Index+And+Optimizations+For+Lookup+In+Append+Table
> > > > > > [2] https://cwiki.apache.org/confluence/display/PAIMON/PIP-42%3A+Local+Sort+And+Parquet+Lookup+Optimizations+For+In+Append+Table
> > > > > >
> > > > > > Best,
> > > > > > FangYong
> > > > > >
> > > > > > On Wed, Jan 14, 2026 at 8:31 PM lei li <[email protected]> wrote:
> > > > > >
> > > > > > > Hi Yong,
> > > > > > >
> > > > > > > Thank you for sharing this fascinating PIP proposal!
> > > > > > >
> > > > > > > I'm particularly intrigued by the sort buffer concept for Append tables. I'd like to ask: does the current sorting process only support multi-field sequential sorting, or would it be possible to introduce more advanced sorting strategies, such as Z-order sorting or other space-filling curve algorithms?
> > > > > > >
> > > > > > > The reason I'm curious about this is that in high-volume batch write scenarios, having the data already sorted upon completion of the batch write could significantly improve query efficiency. Z-order sorting, for instance, could provide better data locality for multi-dimensional range queries, which is quite common in analytical workloads.
> > > > > > >
> > > > > > > Would love to hear your thoughts on whether such sorting enhancements align with the current design goals! Looking forward to your insights and the continued development of this proposal!
> > > > > > >
> > > > > > > Best regards,
> > > > > > >
> > > > > > > Lei Li
> > > > > > >
> > > > > > > On Jan 13, 2026, at 17:32, Yong Fang <[email protected]> wrote:
> > > > > > >
> > > > > > > Hi devs,
> > > > > > >
> > > > > > > I'd like to initiate a discussion on PIP-41: Introduce FilePath Global Index And Optimizations For Lookup In Append Table [1].
> > > > > > >
> > > > > > > We use Paimon as cold storage for sample data of businesses such as search, recommendation, and advertising. Given the extremely large volume of sample data, we adopt Append Tables for data storage.
> > > > > > >
> > > > > > > Batch jobs read and process this sample data, while the ability to look up historical data by key is also required during the sample data processing. To support hybrid queries for such ultra-high-dimensional data in Paimon, we introduce the FilePath Global Index, along with optimizations for reading Parquet metadata and Parquet file data in Append Tables (Some of the major optimization designs come from our partner teams cc @lingpeng, @guanziyue, thx), aiming to enhance the lookup capability of Paimon Append Tables.
> > > > > > >
> > > > > > > Looking forward to hearing from you, thanks
> > > > > > >
> > > > > > > [1] https://cwiki.apache.org/confluence/display/PAIMON/PIP-41%3A+Introduce+FilePath+Global+Index+And+Optimizations+For+Lookup+In+Append+Table
> > > > > > >
> > > > > > > Best,
> > > > > > > Fang Yong
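
For readers following the sorting question raised in this thread (direct sorting by key first, Z-order possibly later), the difference between the two orderings can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not code from PIP-42 or from Paimon: the class `ZOrderSketch` and its methods are hypothetical names, and rows are assumed to carry exactly two non-negative `int` sort fields.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; not part of Paimon or of PIP-42.
class ZOrderSketch {

    // Interleaves the bits of x and y into one 64-bit Morton (Z-order) code:
    // bit i of x goes to position 2*i, bit i of y goes to position 2*i + 1.
    static long interleave(int x, int y) {
        long z = 0L;
        for (int i = 0; i < 32; i++) {
            z |= (((long) (x >>> i)) & 1L) << (2 * i);
            z |= (((long) (y >>> i)) & 1L) << (2 * i + 1);
        }
        return z;
    }

    // "Direct sorting by key": order by the first field, then by the second.
    static List<int[]> sortByKey(List<int[]> rows) {
        List<int[]> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator.<int[]>comparingInt(r -> r[0]).thenComparingInt(r -> r[1]));
        return sorted;
    }

    // Z-order: order by the interleaved code, so both fields influence locality.
    static List<int[]> sortByZOrder(List<int[]> rows) {
        List<int[]> sorted = new ArrayList<>(rows);
        sorted.sort((a, b) ->
                Long.compareUnsigned(interleave(a[0], a[1]), interleave(b[0], b[1])));
        return sorted;
    }

    public static void main(String[] args) {
        List<int[]> rows = List.of(
                new int[] {0, 2}, new int[] {1, 1}, new int[] {2, 0}, new int[] {0, 0});
        for (int[] r : sortByKey(rows)) {
            System.out.println("key order: (" + r[0] + ", " + r[1] + ")");
        }
        for (int[] r : sortByZOrder(rows)) {
            System.out.println("z order:   (" + r[0] + ", " + r[1] + ")");
        }
    }
}
```

Sorting by key keeps rows with the same first field adjacent, which suits lookups on a leading key; the interleaved ordering trades some of that for locality across both fields, which is the property Lei Li's question refers to for multi-dimensional range queries.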
