I would support a "two interoperable open source implementations"
requirement.

Rok

On Thu, May 16, 2024 at 10:06 AM Antoine Pitrou <anto...@python.org> wrote:

>
> I'm in (non-binding) agreement with Ed here. I would just add that the
> requirement for two interoperable implementations should mandate that
> these are open source implementations.
>
> Regards
>
> Antoine.
>
>
> On Tue, 14 May 2024 14:48:09 -0700
> Ed Seidl <etse...@live.com> wrote:
> > Given the breadth of the parquet community at this point, I don't think
> > we should be singling out one or two "reference" implementations. Even
> > parquet-mr, AFAIK, still doesn't implement DELTA_LENGTH_BYTE_ARRAY
> > encoding in a user-accessible way (it's only available as part of the
> > DELTA_BYTE_ARRAY writer). There are many situations in which the
> > former would be the superior choice, and in fact the specification
> > documentation still lists DLBA as "always preferred over PLAIN for byte
> > array columns" [1]. Similarly, DELTA_BYTE_ARRAY encoding was only added
> > to parquet-cpp in the last year [2], and column indexes a few months
> > before that [3].
> >
> > Instead, I think we should leave out any mention of a reference
> > implementation, and continue to require two independent, interoperable
> > implementations before adopting a change to the spec. This, IMO, would go
> > a long way towards increasing excitement for Parquet outside the
> > parquet-mr/arrow world.
> >
> > Just my (non-binding) two cents.
> >
> > Cheers,
> > Ed
> >
> > [1] https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-length-byte-array-delta_length_byte_array--6
> > [2] https://github.com/apache/arrow/pull/14341
> > [3] https://github.com/apache/arrow/pull/34054
> >
> > On 5/14/24 9:44 AM, Julien Le Dem wrote:
> > > I agree that parquet-mr implementation is a requirement to evolve the
> > > spec.
> > > It makes sense to me that we call parquet-mr the reference
> > > implementation and make it a requirement to evolve the spec.
> > > I would add the requirement to implement it in the parquet cpp
> > > implementation that lives in apache Arrow:
> > > https://github.com/apache/arrow/tree/main/cpp/src/parquet
> > > This code used to live in the parquet-cpp repo in the Parquet project.
> > > Being language agnostic is an important feature of the format.
> > > Interoperability tests should also be included.
> > >
> > > On Tue, May 14, 2024 at 9:31 AM Antoine Pitrou <antoine-+zn9apsxkcednm+yrof...@public.gmane.org> wrote:
> > >
> > >> AFAIK, the only Parquet implementation under the Apache Parquet
> > >> project is parquet-mr :-)
> > >>
> > >>
> > >> On Tue, 14 May 2024 10:58:58 +0200
> > >> Rok Mihevc <rok.mih...@gmail.com> wrote:
> > >>> Second Raphael's point.
> > >>> Would it be reasonable to say that a specification change requires
> > >>> implementation in two Parquet implementations within the Apache
> > >>> Parquet project?
> > >>>
> > >>> Rok
> > >>>
> > >>> On Tue, May 14, 2024 at 10:50 AM Gang Wu <ustcwg-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:
> > >>>> IMHO, it looks more reasonable if a reference implementation is
> > >>>> required to support most (not all) elements from the specification.
> > >>>>
> > >>>> Another question is: should we discuss (and vote for) each candidate
> > >>>> one by one? We can start with parquet-mr, which is the most well-known
> > >>>> implementation.
> > >>>>
> > >>>> Best,
> > >>>> Gang
> > >>>>
> > >>>> On Tue, May 14, 2024 at 4:41 PM Raphael Taylor-Davies
> > >>>> <r.taylordav...@googlemail.com.invalid> wrote:
> > >>>>
> > >>>>> Potentially it would be helpful to flip the question around. As Andrew
> > >>>>> articulates, a reference implementation is required to implement all
> > >>>>> elements from the specification, and therefore the major consequence of
> > >>>>> labeling parquet-mr thusly would be that any specification change would
> > >>>>> have to be implemented within parquet-mr as part of the standardisation
> > >>>>> process. It would be insufficient for it to be implemented in, for
> > >>>>> example, two of the parquet implementations maintained by the arrow
> > >>>>> project. I personally think that would be a shame and likely exclude
> > >>>>> many people who would otherwise be interested in evolving the parquet
> > >>>>> specification, but think that is at the core of this question.
> > >>>>>
> > >>>>> Kind Regards,
> > >>>>>
> > >>>>> Raphael
> > >>>>>
> > >>>>> On 13/05/2024 20:55, Andrew Lamb wrote:
> > >>>>>> Question: Should we label parquet-mr or any other parquet
> > >>>>>> implementations "reference implementations"?
> > >>>>>>
> > >>>>>> This came up as part of Vinoo's great PR to list different parquet
> > >>>>>> reference implementations[1][2].
> > >>>>>>
> > >>>>>> The term "reference implementation" often has an official connotation.
> > >>>>>> For example the wikipedia definition is "a program that implements all
> > >>>>>> requirements from a corresponding specification. The reference
> > >>>>>> implementation ... should be considered the "correct" behavior of any
> > >>>>>> other implementation of it."[3]
> > >>>>>>
> > >>>>>> Given the close association of parquet-mr to the parquet standard, it is
> > >>>>>> a natural candidate to label as "reference implementation." However, it
> > >>>>>> is not clear to me if there is consensus that it should be thusly
> > >>>>>> labeled.
> > >>>>>> I have a strong opinion that a consensus on this question would be very
> > >>>>>> helpful. I don't actually have a strong opinion about the answer
> > >>>>>>
> > >>>>>> Andrew
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>> [1]: https://github.com/apache/parquet-site/pull/53#discussion_r1582882267
> > >>>>>> [2]: https://github.com/apache/parquet-site/pull/53#discussion_r1598283465
> > >>>>>> [3]: https://en.wikipedia.org/wiki/Reference_implementation
> > >>>>>>
