I would support it as long as we maintain a list of the implementations
that we consider "accredited" as reference implementations ("we" meaning
a PMC vote here).
Not all implementations are created equal from an adoption point of view.
Originally the Impala implementation was the second implementation for
interop. Later on the parquet-cpp implementation was added as a standalone
implementation in the Parquet project. This is the implementation that
lives in the arrow repository.
The parquet java implementation and the parquet cpp implementation in the
arrow repo are on top of that list IMO.


On Thu, May 16, 2024 at 6:17 AM Rok Mihevc <rok.mih...@gmail.com> wrote:

> I would support a "two interoperable open source implementations"
> requirement.
>
> Rok
>
> On Thu, May 16, 2024 at 10:06 AM Antoine Pitrou <anto...@python.org>
> wrote:
>
> >
> > I'm in (non-binding) agreement with Ed here. I would just add that the
> > requirement for two interoperable implementations should mandate that
> > these are open source implementations.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > On Tue, 14 May 2024 14:48:09 -0700
> > Ed Seidl <etse...@live.com> wrote:
> > > Given the breadth of the parquet community at this point, I don't think
> > > we should be singling out one or two "reference" implementations. Even
> > > parquet-mr, AFAIK, still doesn't implement DELTA_LENGTH_BYTE_ARRAY
> > > encoding in a user-accessible way (it's only available as part of the
> > > DELTA_BYTE_ARRAY writer). There are many situations in which the
> > > former would be the superior choice, and in fact the specification
> > > documentation still lists DLBA as "always preferred over PLAIN for byte
> > > array columns" [1]. Similarly, DELTA_BYTE_ARRAY encoding was only added
> > > to parquet-cpp in the last year [2], and column indexes a few months
> > > before that [3].
> > >
> > > > Instead, I think we should leave out any mention of a reference
> > > > implementation, and continue to require two independent, interoperable
> > > > implementations before adopting a change to the spec. This, IMO, would
> > > > go a long way towards increasing excitement for Parquet outside the
> > > > parquet-mr/arrow world.
> > >
> > > Just my (non-binding) two cents.
> > >
> > > Cheers,
> > > Ed
> > >
> > > [1]
> > >
> >
> https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-length-byte-array-delta_length_byte_array--6
> > > [2] https://github.com/apache/arrow/pull/14341
> > > [3] https://github.com/apache/arrow/pull/34054
> > >
> > > On 5/14/24 9:44 AM, Julien Le Dem wrote:
> > > > > I agree that a parquet-mr implementation is a requirement to evolve
> > > > > the spec.
> > > > > It makes sense to me that we call parquet-mr the reference
> > > > > implementation and make it a requirement to evolve the spec.
> > > > > I would add the requirement to implement it in the parquet cpp
> > > > > implementation that lives in Apache Arrow:
> > > > > https://github.com/apache/arrow/tree/main/cpp/src/parquet
> > > > > This code used to live in the parquet-cpp repo in the Parquet project.
> > > > Being language agnostic is an important feature of the format.
> > > > Interoperability tests should also be included.
> > > >
> > > > > On Tue, May 14, 2024 at 9:31 AM Antoine Pitrou <antoine-+zn9apsxkcednm+yrof...@public.gmane.org> wrote:
> > > >
> > > > >> AFAIK, the only Parquet implementation under the Apache Parquet
> > > > >> project is parquet-mr :-)
> > > >>
> > > >>
> > > >> On Tue, 14 May 2024 10:58:58 +0200
> > > >> Rok Mihevc <rok.mih...@gmail.com> wrote:
> > > >>> Second Raphael's point.
> > > > >>> Would it be reasonable to say a specification change requires
> > > > >>> implementation in two parquet implementations within the Apache
> > > > >>> Parquet project?
> > > >>>
> > > >>> Rok
> > > >>>
> > > > >>> On Tue, May 14, 2024 at 10:50 AM Gang Wu <ustcwg-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:
> > > > >>>> IMHO, it looks more reasonable if a reference implementation is
> > > > >>>> required to support most (not all) elements from the specification.
> > > > >>>>
> > > > >>>> Another question is: should we discuss (and vote for) each
> > > > >>>> candidate one by one? We can start with parquet-mr, which is the
> > > > >>>> most well-known implementation.
> > > >>>>
> > > >>>> Best,
> > > >>>> Gang
> > > >>>>
> > > >>>> On Tue, May 14, 2024 at 4:41 PM Raphael Taylor-Davies
> > > >>>> <r.taylordav...@googlemail.com.invalid> wrote:
> > > >>>>
> > > > >>>>> Potentially it would be helpful to flip the question around. As
> > > > >>>>> Andrew articulates, a reference implementation is required to
> > > > >>>>> implement all elements from the specification, and therefore the
> > > > >>>>> major consequence of labeling parquet-mr thusly would be that any
> > > > >>>>> specification change would have to be implemented within
> > > > >>>>> parquet-mr as part of the standardisation process. It would be
> > > > >>>>> insufficient for it to be implemented in, for example, two of the
> > > > >>>>> parquet implementations maintained by the arrow project. I
> > > > >>>>> personally think that would be a shame and likely exclude many
> > > > >>>>> people who would otherwise be interested in evolving the parquet
> > > > >>>>> specification, but think that is at the core of this question.
> > > >>>>>
> > > >>>>> Kind Regards,
> > > >>>>>
> > > >>>>> Raphael
> > > >>>>>
> > > >>>>> On 13/05/2024 20:55, Andrew Lamb wrote:
> > > > >>>>>> Question: Should we label parquet-mr or any other parquet
> > > > >>>>>> implementations "reference implementations"?
> > > >>>>>>
> > > > >>>>>> This came up as part of Vinoo's great PR to list different
> > > > >>>>>> parquet reference implementations[1][2].
> > > >>>>>>
> > > > >>>>>> The term "reference implementation" often has an official
> > > > >>>>>> connotation. For example the Wikipedia definition is "a program
> > > > >>>>>> that implements all requirements from a corresponding
> > > > >>>>>> specification. The reference implementation ... should be
> > > > >>>>>> considered the "correct" behavior of any other implementation
> > > > >>>>>> of it."[3]
> > > >>>>>>
> > > > >>>>>> Given the close association of parquet-mr to the parquet
> > > > >>>>>> standard, it is a natural candidate to label as "reference
> > > > >>>>>> implementation." However, it is not clear to me if there is
> > > > >>>>>> consensus that it should be thusly labeled.
> > > > >>>>>> I have a strong opinion that a consensus on this question would
> > > > >>>>>> be very helpful. I don't actually have a strong opinion about
> > > > >>>>>> the answer.
> > > >>>>>> Andrew
> > > >>>>>>
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> [1]:
> > > >>
> https://github.com/apache/parquet-site/pull/53#discussion_r1582882267
> >
> > > >>>>>> [2]:
> > > >>
> https://github.com/apache/parquet-site/pull/53#discussion_r1598283465
> >
> > > >>>>>> [3]:  https://en.wikipedia.org/wiki/Reference_implementation
> > > >>>>>>
> > > >>
> > > >>
> > > >>
> > >
> > >
> >
> >
> >
> >
>
