Thanks, everyone, for your perspectives. As a concrete next step, I'll try to pull together a Google doc that covers the topics raised here, as I think that might be a more productive way to further the conversation (I don't want the threads to get split too much).
On Tue, May 14, 2024 at 8:33 AM wish maple <maplewish...@gmail.com> wrote:
> I also think most of the proposed benefits from these new formats can be
> achieved using the current Parquet format and improved implementations.
>
> My concerns are:
> 1. For encoding, though many interesting encodings have been introduced,
>    most implementations currently just use and implement PLAIN and
>    Dictionary. We can make full use of the current encodings and
>    introduce some new encodings that allow skipping, compressing, and
>    reading data in specific scenarios.
> 2. We can start optimizing for semi-structured and ML data, and we can do
>    specific optimizations for these cases, like [1]. Rep-Level and
>    Def-Level are feature rich; however, we can also optimize for cases
>    where it is not necessary to read them. Besides, we can support
>    some types like geo within Parquet.
>
> [1] https://github.com/apache/arrow/issues/34510#issuecomment-2109768275