I'm in (non-binding) agreement with Ed here. I would just add that the requirement for two interoperable implementations should mandate that these are open source implementations.
Regards

Antoine.


On Tue, 14 May 2024 14:48:09 -0700
Ed Seidl <etse...@live.com> wrote:
> Given the breadth of the parquet community at this point, I don't think
> we should be singling out one or two "reference" implementations. Even
> parquet-mr, AFAIK, still doesn't implement DELTA_LENGTH_BYTE_ARRAY
> encoding in a user-accessible way (it's only available as part of the
> DELTA_BYTE_ARRAY writer). There are many situations in which the
> former would be the superior choice, and in fact the specification
> documentation still lists DLBA as "always preferred over PLAIN for byte
> array columns" [1]. Similarly, DELTA_BYTE_ARRAY encoding was only added
> to parquet-cpp in the last year [2], and column indexes a few months
> before that [3].
>
> Instead, I think we should leave out any mention of a reference
> implementation, and continue to require two, independent, interoperable
> implementations before adopting a change to the spec. This, IMO, would go
> a long way towards increasing excitement for Parquet outside the
> parquet-mr/arrow world.
>
> Just my (non-binding) two cents.
>
> Cheers,
> Ed
>
> [1] https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-length-byte-array-delta_length_byte_array--6
> [2] https://github.com/apache/arrow/pull/14341
> [3] https://github.com/apache/arrow/pull/34054
>
> On 5/14/24 9:44 AM, Julien Le Dem wrote:
> > I agree that parquet-mr implementation is a requirement to evolve the spec.
> > It makes sense to me that we call parquet-mr the reference implementation
> > and make it a requirement to evolve the spec.
> > I would add the requirement to implement it in the parquet cpp
> > implementation that lives in apache Arrow:
> > https://github.com/apache/arrow/tree/main/cpp/src/parquet
> > This code used to live in the parquet-cpp repo in the Parquet project.
> > Being language agnostic is an important feature of the format.
> > Interoperability tests should also be included.
> >
> > On Tue, May 14, 2024 at 9:31 AM Antoine Pitrou
> > <antoine-+zn9apsxkcednm+yrof...@public.gmane.org> wrote:
> >
> >> AFAIK, the only Parquet implementation under the Apache Parquet project
> >> is parquet-mr :-)
> >>
> >>
> >> On Tue, 14 May 2024 10:58:58 +0200
> >> Rok Mihevc <rok.mih...@gmail.com> wrote:
> >>> Second Raphael's point.
> >>> Would it be reasonable to say specification change requires
> >>> implementation in two parquet implementations within Apache Parquet
> >>> project?
> >>>
> >>> Rok
> >>>
> >>> On Tue, May 14, 2024 at 10:50 AM Gang Wu
> >>> <ustcwg-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:
> >>>> IMHO, it looks more reasonable if a reference implementation is
> >>>> required to support most (not all) elements from the specification.
> >>>>
> >>>> Another question is: should we discuss (and vote for) each candidate
> >>>> one by one? We can start with parquet-mr which is most well-known
> >>>> implementation.
> >>>>
> >>>> Best,
> >>>> Gang
> >>>>
> >>>> On Tue, May 14, 2024 at 4:41 PM Raphael Taylor-Davies
> >>>> <r.taylordav...@googlemail.com.invalid> wrote:
> >>>>
> >>>>> Potentially it would be helpful to flip the question around. As
> >>>>> Andrew articulates, a reference implementation is required to
> >>>>> implement all elements from the specification, and therefore the
> >>>>> major consequence of labeling parquet-mr thusly would be that any
> >>>>> specification change would have to be implemented within parquet-mr
> >>>>> as part of the standardisation process. It would be insufficient for
> >>>>> it to be implemented in, for example, two of the parquet
> >>>>> implementations maintained by the arrow project. I personally think
> >>>>> that would be a shame and likely exclude many people who would
> >>>>> otherwise be interested in evolving the parquet specification, but
> >>>>> think that is at the core of this question.
> >>>>>
> >>>>> Kind Regards,
> >>>>>
> >>>>> Raphael
> >>>>>
> >>>>> On 13/05/2024 20:55, Andrew Lamb wrote:
> >>>>>> Question: Should we label parquet-mr or any other parquet
> >>>>>> implementations "reference implementations"?
> >>>>>>
> >>>>>> This came up as part of Vinoo's great PR to list different parquet
> >>>>>> reference implementations[1][2].
> >>>>>>
> >>>>>> The term "reference implementation" often has an official
> >>>>>> connotation. For example the Wikipedia definition is "a program
> >>>>>> that implements all requirements from a corresponding
> >>>>>> specification. The reference implementation ... should be
> >>>>>> considered the "correct" behavior of any other implementation of
> >>>>>> it."[3]
> >>>>>>
> >>>>>> Given the close association of parquet-mr to the parquet standard,
> >>>>>> it is a natural candidate to label as "reference implementation."
> >>>>>> However, it is not clear to me if there is consensus that it should
> >>>>>> be thusly labeled.
> >>>>>> I have a strong opinion that a consensus on this question would be
> >>>>>> very helpful. I don't actually have a strong opinion about the
> >>>>>> answer.
> >>>>>>
> >>>>>> Andrew
> >>>>>>
> >>>>>> [1]: https://github.com/apache/parquet-site/pull/53#discussion_r1582882267
> >>>>>> [2]: https://github.com/apache/parquet-site/pull/53#discussion_r1598283465
> >>>>>> [3]: https://en.wikipedia.org/wiki/Reference_implementation