Thanks everybody for the input.  I'll try to summarize the main points and
my thoughts below.

1.  "V3" branding is problematic and getting adoption is difficult
with V2.  I agree, we should not lump all potential improvements into a
single V3 milestone (I used V3 to indicate that at least some changes might
be backward incompatible with existing format revisions).   In my mind, I
think the way to make it more likely that new features are used would be
starting to think about a more formal release process for them.  For
example:
    a.  A clear cadence of major library version releases (e.g. maybe once
per year).
    b.  A clear policy for when a new feature becomes the default in a
library release (e.g. as a strawman, once a feature lands in the reference
implementation, it is eligible to become the default in the next major
release that occurs >1 year later).
    c.  For reference implementations that are effectively doing major
version releases on each release, I think following parquet-mr's lead for
flipping defaults would make sense (see the sketch below).
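
As a concrete illustration, this is roughly what the opt-in period looks
like in parquet-mr today (a minimal sketch; the path and schema are made
up, and the policy comments describe the strawman above, not any current
project policy):

    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.column.ParquetProperties.WriterVersion;
    import org.apache.parquet.example.data.Group;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.example.ExampleParquetWriter;
    import org.apache.parquet.schema.MessageType;
    import org.apache.parquet.schema.MessageTypeParser;

    public class OptInExample {
      public static void main(String[] args) throws Exception {
        MessageType schema =
            MessageTypeParser.parseMessageType("message m { required int32 id; }");
        try (ParquetWriter<Group> writer =
            ExampleParquetWriter.builder(new Path("/tmp/example.parquet"))
                .withType(schema)
                // Explicit opt-in today; under the strawman policy above,
                // this would flip to the default in a later major release.
                .withWriterVersion(WriterVersion.PARQUET_2_0)
                .build()) {
          // ... write Group records here ...
        }
      }
    }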

2.  How many of the improvements require a clean slate vs
evolutionary/implementation optimizations?  I really think this depends on
which aspects we are tackling.  For metadata issues, it might pay to
rethink things from the ground up, but any proposals along these lines
should obviously have clear rationales and benchmarks to justify the
decisions.  For better encodings, the work can most likely be added to the
existing format.  I don't think allowing arbitrary plugin encodings would
be a good thing: I believe one of the reasons Parquet has been successful
is its specification, which guarantees that any compliant reader can
decode any compliant file (see the sketch below).
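
To make that concrete: because encodings form a closed set in the
specification (surfaced in parquet-mr as the
org.apache.parquet.column.Encoding enum), a reader can state its
capabilities up front instead of loading arbitrary plugin code.  A minimal
sketch (the helper and supported set are hypothetical, for illustration
only):

    import java.util.EnumSet;
    import java.util.Set;
    import org.apache.parquet.column.Encoding;

    public class EncodingCheck {
      // Hypothetical capability set for an imaginary reader.
      static final Set<Encoding> SUPPORTED =
          EnumSet.of(Encoding.PLAIN, Encoding.RLE, Encoding.DELTA_BINARY_PACKED);

      static void requireSupported(Encoding pageEncoding) {
        // With plugin encodings there would be no closed enum to check
        // against, and no compatibility guarantee across implementations.
        if (!SUPPORTED.contains(pageEncoding)) {
          throw new UnsupportedOperationException(
              "Encoding not supported by this reader: " + pageEncoding);
        }
      }
    }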

3.  Amount of effort required/sustainability of effort.  I agree this is a
big risk.  It will take a lot of work to cover the major Parquet bindings,
which is why I started the thread.  Personally, I am fairly time
constrained, and unless my employer is willing to approve devoting work
hours to the project, I likely won't be able to contribute much.  However,
it seems like there is enough interest from the community that I can
potentially make the case for doing so.

Thanks,
Micah

On Mon, May 13, 2024 at 10:41 AM Ed Seidl <etse...@live.com> wrote:

> I think the whole "V1" vs "V2" mess is unfortunate. IMO there is only one
> version of the Parquet file format. At its core, the data layout (row
> groups composed of column chunks composed of Dremel-encoded pages) has
> never changed. Encodings/codecs/structures have been added to that core,
> but always in a backwards-compatible way.
>
> I agree that many of the perceived shortcomings might be addressed without
> breaking changes to the file format. I myself would be interested in
> exploring ways to address the point-lookup and wide-table issues while
> maintaining backwards compatibility. That said, if there are large
> performance gains to be had that would necessitate an actual new file
> format version (such as replacing Thrift, a new metadata organization, or
> some alternative to Dremel), I'd be open to exploring those options as
> well.
>
> Thanks,
> Ed
>
> On 5/11/24 3:58 PM, Micah Kornfield wrote:
> > Hi Parquet Dev,
> > I wanted to start a conversation within the community about working on a
> > new revision of Parquet.  For context, there have been a bunch of new
> > formats [1][2][3] that show there is decent room for improvement across
> > data encodings and how metadata is organized.
> >
> > Specifically, in a new format revision I think we should be thinking
> > about the following areas for improvements:
> > 1.  More efficient encodings that allow for data skipping and SIMD
> > optimizations.
> > 2.  More efficient metadata handling for deserialization and projection,
> > to address cases where metadata deserialization time is non-trivial [4].
> > 3.  Possibly thinking about different encodings instead of
> > repetition/definition levels for repeated and nested fields.
> > 4.  Support for optimizing semi-structured data (e.g. JSON or Variant
> > type) that can shred elements into individual columns (a recent thread
> > in Iceberg mentions doing this at the metadata level [5]).
> >
> > I think the goals of V3 would be to preserve existing API compatibility
> > as broadly as possible (possibly with some performance loss) and to
> > expose new API surface area where appropriate to make use of new
> > elements.  New encodings could be backported so they can be used without
> > metadata changes.  Unfortunately, I think that for points 2 and 3 we
> > would need to break file-level compatibility.  More thought would be
> > needed to consider whether point 4 could be backported effectively.
> >
> > This is a non-trivial amount of work to get good coverage across
> > implementations, so before putting together a more formal proposal it
> > would be nice to know:
> >
> > 1.  Whether there is an appetite in the general community to consider
> > these changes.
> > 2.  Whether anybody from the community is interested in collaborating on
> > proposals/implementation in this area.
> >
> > Thanks,
> > Micah
> >
> > [1] https://github.com/maxi-k/btrblocks
> > [2] https://github.com/facebookincubator/nimble
> > [3] https://blog.lancedb.com/lance-v2/
> > [4] https://github.com/apache/arrow/issues/39676
> > [5] https://lists.apache.org/thread/xnyo1k66dxh0ffpg7j9f04xgos0kwc34
> >
>
>
