Sorry I am late to the party! It's great to see this discussion! Thank you
everyone for the many good points and thank you, Micah, for starting the
discussion and putting it together into a document, which is very helpful!
I agree with most of the points discussed above: we need to keep improving
Parquet, and in some areas even accelerate, to keep pace with industry changes.

With that said, we need people to work on it, as Julien mentioned. The
document
<https://docs.google.com/document/d/19hQLYcU5_r5nJB7GtnjfODLlSDiNS24GXAtKg9b0_ls/edit>
that Micah created covers pretty much everything we discussed here. I
encourage all of us to contribute by raising questions, providing
suggestions, adding missing functionality, etc. Once we reach consensus
on each topic, we can create separate tracks and work streams to kick
off the implementations.

I believe continuously improving Parquet would benefit the industry more
than creating a new format, which could add friction. These improvement
ideas are exciting opportunities. If you, your team members, or friends
have time and interest, please encourage them to contribute.

Our Parquet community meeting is next week, on May 28, 2024. We can have
discussions there if you can join. Currently, it is scheduled for 7:00 am
PDT, but I can change it according to the majority's availability.

On Fri, May 17, 2024 at 3:58 PM Rok Mihevc <rok.mih...@gmail.com> wrote:

> Hi all,
>
> I've discussed this with my colleagues, and we would dedicate two
> engineers for 4-6 months to tasks related to implementing the format
> changes. We're
> already active in design discussions and can help with C++, Rust and C#
> implementations. I thought it'd be good to state this explicitly FWIW.
>
> Our main areas of interest are efficient reads for tables with wide
> schemas and faster random row group access [1].
>
> To work around the wide-schema issue, we implemented an internal tool [2]
> for storing index information in a separate file, which allows reading
> only the necessary subset of metadata. We offer this approach for
> consideration as a possible way to solve the wide-schema problem.
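The sidecar-index idea described above can be sketched roughly as follows. This is a minimal plain-Python illustration, not PalletJack's actual API; all names here are hypothetical. Per-column metadata blobs are written behind a small offset table in a separate file, so a reader fetches and parses only the entries for the columns it projects, instead of decoding the entire footer.

```python
import json
import struct

def write_sidecar(path, column_metadata):
    """Write a sidecar file: a length-prefixed offset table, then one
    metadata blob per column. `column_metadata` maps name -> dict."""
    names = sorted(column_metadata)
    blobs = [json.dumps(column_metadata[n]).encode() for n in names]
    index, offset = [], 0
    for name, blob in zip(names, blobs):
        index.append((name, offset, len(blob)))
        offset += len(blob)
    with open(path, "wb") as f:
        header = json.dumps(index).encode()
        f.write(struct.pack("<I", len(header)))  # header length prefix
        f.write(header)
        for blob in blobs:
            f.write(blob)

def read_sidecar(path, wanted):
    """Read metadata only for the column names in `wanted`; other
    columns' blobs are never read or parsed."""
    with open(path, "rb") as f:
        (hlen,) = struct.unpack("<I", f.read(4))
        index = json.loads(f.read(hlen))
        base = 4 + hlen
        out = {}
        for name, off, length in index:
            if name in wanted:
                f.seek(base + off)
                out[name] = json.loads(f.read(length))
        return out
```

With a wide schema (say, thousands of columns), a projection of two columns touches only two small blobs; the cost of reading metadata scales with the projection rather than the schema width, which is the property the workaround targets.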
>
> [1] https://github.com/apache/arrow/issues/39676
> [2] https://github.com/G-Research/PalletJack
>
> Rok
>
> On Sun, May 12, 2024 at 12:59 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
> > Hi Parquet Dev,
> > I wanted to start a conversation within the community about working on a
> > new revision of Parquet.  For context there have been a bunch of new
> > formats [1][2][3] that show there is decent room for improvement across
> > data encodings and how metadata is organized.
> >
> > Specifically, in a new format revision I think we should be thinking
> > about the following areas for improvements:
> > 1.  More efficient encodings that allow for data skipping and SIMD
> > optimizations.
> > 2.  More efficient metadata handling for deserialization and projection,
> > to address cases where metadata deserialization time is non-trivial [4].
> > 3.  Possibly considering different encodings, instead of
> > repetition/definition levels, for repeated and nested fields.
> > 4.  Support for optimizing semi-structured data (e.g., JSON or the
> > Variant type) that can shred elements into individual columns (a recent
> > thread in Iceberg mentions doing this at the metadata level [5]).
> >
> > I think the goals of V3 would be to provide existing API compatibility as
> > broadly as possible (possibly with some performance loss) and expose new
> > API surface areas where appropriate to make use of new elements.  New
> > encodings could be backported so they can be made use of without metadata
> > changes.  I think unfortunately that for points 2 and 3 we would want to
> > break file level compatibility.  More thought would be needed to consider
> > whether 4 could be backported effectively.
> >
> > This is a non-trivial amount of work to get good coverage across
> > implementations, so before putting together a more formal proposal it
> > would be nice to know:
> > 1.  Whether there is an appetite in the general community to consider
> > these changes.
> > 2.  Whether anyone from the community is interested in collaborating on
> > proposals/implementation in this area.
> >
> > Thanks,
> > Micah
> >
> > [1] https://github.com/maxi-k/btrblocks
> > [2] https://github.com/facebookincubator/nimble
> > [3] https://blog.lancedb.com/lance-v2/
> > [4] https://github.com/apache/arrow/issues/39676
> > [5] https://lists.apache.org/thread/xnyo1k66dxh0ffpg7j9f04xgos0kwc34
> >
>


-- 
Xinli Shang
