I'm in (non-binding) agreement with Ed here. I would just add that the
requirement for two interoperable implementations should mandate that
these are open source implementations.

Regards

Antoine.


On Tue, 14 May 2024 14:48:09 -0700
Ed Seidl <etse...@live.com> wrote:
> Given the breadth of the parquet community at this point, I don't think
> we should be singling out one or two "reference" implementations. Even
> parquet-mr, AFAIK, still doesn't implement DELTA_LENGTH_BYTE_ARRAY
> encoding in a user-accessible way (it's only available as part of the
> DELTA_BYTE_ARRAY writer). There are many situations in which the
> former would be the superior choice, and in fact the specification
> documentation still lists DLBA as "always preferred over PLAIN for byte
> array columns" [1]. Similarly, DELTA_BYTE_ARRAY encoding was only added
> to parquet-cpp in the last year [2], and column indexes a few months
> before that [3].
> 
> Instead, I think we should leave out any mention of a reference implementation,
> and continue to require two, independent, interoperable implementations
> before adopting a change to the spec. This, IMO, would go a long way towards
> increasing excitement for Parquet outside the parquet-mr/arrow world.
> 
> Just my (non-binding) two cents.
> 
> Cheers,
> Ed
> 
> [1] https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-length-byte-array-delta_length_byte_array--6
> [2] https://github.com/apache/arrow/pull/14341
> [3] https://github.com/apache/arrow/pull/34054
> 
> On 5/14/24 9:44 AM, Julien Le Dem wrote:
> > I agree that parquet-mr implementation is a requirement to evolve the spec.
> > It makes sense to me that we call parquet-mr the reference implementation
> > and make it a requirement to evolve the spec.
> > I would add the requirement to implement it in the Parquet C++
> > implementation that lives in Apache Arrow:
> > https://github.com/apache/arrow/tree/main/cpp/src/parquet
> > This code used to live in the parquet-cpp repo in the Parquet project.
> > Being language agnostic is an important feature of the format.
> > Interoperability tests should also be included.
> >
> > On Tue, May 14, 2024 at 9:31 AM Antoine Pitrou 
> > <antoine-+zn9apsxkcednm+yrof...@public.gmane.org> wrote:
> >  
> >> AFAIK, the only Parquet implementation under the Apache Parquet project
> >> is parquet-mr :-)
> >>
> >>
> >> On Tue, 14 May 2024 10:58:58 +0200
> >> Rok Mihevc <rok.mih...@gmail.com> wrote:  
> >>> Second Raphael's point.
> >>> Would it be reasonable to say specification change requires implementation
> >>> in two parquet implementations within Apache Parquet project?
> >>>
> >>> Rok
> >>>
> >>> On Tue, May 14, 2024 at 10:50 AM Gang Wu
> >>> <ustcwg-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:
> >>>> IMHO, it looks more reasonable if a reference implementation is required
> >>>> to support most (not all) elements from the specification.
> >>>>
> >>>> Another question is: should we discuss (and vote for) each candidate
> >>>> one by one? We can start with parquet-mr, which is the most well-known
> >>>> implementation.
> >>>>
> >>>> Best,
> >>>> Gang
> >>>>
> >>>> On Tue, May 14, 2024 at 4:41 PM Raphael Taylor-Davies
> >>>> <r.taylordav...@googlemail.com.invalid> wrote:
> >>>>  
> >>>>> Potentially it would be helpful to flip the question around. As Andrew
> >>>>> articulates, a reference implementation is required to implement all
> >>>>> elements from the specification, and therefore the major consequence of
> >>>>> labeling parquet-mr thusly would be that any specification change would
> >>>>> have to be implemented within parquet-mr as part of the standardisation
> >>>>> process. It would be insufficient for it to be implemented in, for
> >>>>> example, two of the parquet implementations maintained by the arrow
> >>>>> project. I personally think that would be a shame and likely exclude
> >>>>> many people who would otherwise be interested in evolving the parquet
> >>>>> specification, but think that is at the core of this question.
> >>>>>
> >>>>> Kind Regards,
> >>>>>
> >>>>> Raphael
> >>>>>
> >>>>> On 13/05/2024 20:55, Andrew Lamb wrote:  
> >>>>>> Question: Should we label parquet-mr or any other parquet implementations
> >>>>>> "reference implementations"?
> >>>>>>
> >>>>>> This came up as part of Vinoo's great PR to list different parquet
> >>>>>> reference implementations[1][2].
> >>>>>>
> >>>>>> The term "reference implementation" often has an official connotation. For
> >>>>>> example the wikipedia definition is "a program that implements all
> >>>>>> requirements from a corresponding specification. The reference
> >>>>>> implementation ... should be considered the "correct" behavior of any other
> >>>>>> implementation of it."[3]
> >>>>>>
> >>>>>> Given the close association of parquet-mr to the parquet standard, it is a
> >>>>>> natural candidate to label as "reference implementation." However, it is
> >>>>>> not clear to me if there is consensus that it should be thusly labeled.
> >>>>>> I have a strong opinion that a consensus on this question would be very
> >>>>>> helpful. I don't actually have a strong opinion about the answer.
> >>>>>>
> >>>>>> Andrew
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> [1]: https://github.com/apache/parquet-site/pull/53#discussion_r1582882267
> >>>>>> [2]: https://github.com/apache/parquet-site/pull/53#discussion_r1598283465
> >>>>>> [3]: https://en.wikipedia.org/wiki/Reference_implementation
> >>>>>>  
> >>
> >>
> >>  
> 
> 


