I'd argue that compatibility across implementations comes down to "can they
correctly read the data generated by the others?", so there's less need for
an RI than for compliance testing, which is how closed source stuff often works.

Specification

   1. Files generated by the implementation which are believed to match the
   specification
   2. Assertions about the contents of these files (this is something which
   needs to be declared in a way that can be used by the test runners of the
   different implementations, so tricky).
   3. Tests which validate those assertions on the parsed contents
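
As a rough sketch of items 2 and 3, assuming the assertions are declared in
something neutral like JSON: the field names ("row_count", "columns", "min",
"max", "null_count") and the file name are invented here for illustration,
and the rows are a stand-in for whatever one implementation's reader returns.

```python
import json

# Hypothetical assertion document shipped next to a test file.
# Every key name here is an assumption made for this sketch.
ASSERTIONS = json.loads("""
{
  "file": "delta_byte_array.parquet",
  "row_count": 3,
  "columns": {
    "id":   {"min": 1, "max": 3, "null_count": 0},
    "name": {"null_count": 1}
  }
}
""")


def check(rows, spec):
    """Return a list of readable failures; an empty list means the file passed."""
    failures = []
    if len(rows) != spec["row_count"]:
        failures.append(f"row_count: expected {spec['row_count']}, got {len(rows)}")
    for col, expected in spec["columns"].items():
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None]
        actual = {
            "null_count": len(values) - len(present),
            "min": min(present) if present else None,
            "max": max(present) if present else None,
        }
        # Only compare the properties the assertion document declares.
        for key, want in expected.items():
            if actual[key] != want:
                failures.append(f"{col}.{key}: expected {want}, got {actual[key]}")
    return failures


# Stand-in for the parsed contents produced by one implementation's reader.
rows = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": None},
    {"id": 3, "name": "c"},
]
print(check(rows, ASSERTIONS))
```

Each implementation would only need its own small runner that loads the same
assertion document and feeds it the rows it parsed.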


I've never done anything like this before. Maybe anyone who has tried to
implement an SQL standard has some suggestions. Indeed, SQL might be the
language for those assertions, which would then have to go through
spark/hive/impala/etc for validation. Which is ultimately what you want,
just a lot harder to build, test, debug and identify what is broken.
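
To make the SQL idea concrete, here is a minimal sketch in which each
assertion is a query plus the single value it should return. sqlite3 is only
a stand-in for spark/hive/impala, and the table name "t" and the sample rows
are assumptions made for the example.

```python
import sqlite3

# Hypothetical assertions: (query, expected single value).
SQL_ASSERTIONS = [
    ("SELECT COUNT(*) FROM t", 3),
    ("SELECT COUNT(*) FROM t WHERE name IS NULL", 1),
    ("SELECT MAX(id) FROM t", 3),
]


def run_assertions(conn, assertions):
    """Run each query and collect (query, expected, actual) for mismatches."""
    failures = []
    for query, expected in assertions:
        (actual,) = conn.execute(query).fetchone()
        if actual != expected:
            failures.append((query, expected, actual))
    return failures


# Stand-in for an engine that has loaded the file under test as table "t".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, "a"), (2, None), (3, "c")])
print(run_assertions(conn, SQL_ASSERTIONS))
```

The catch is the one already mentioned: a failure here could be in the file,
the reader or the SQL engine, and the harness alone cannot tell which.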

On Fri, 17 May 2024 at 09:40, Antoine Pitrou <anto...@python.org> wrote:

>
> +1 (non-binding :-)) on the idea of having a shortlist of "accredited"
> implementations.
>
> I would suggest to add a third implementation such as parquet-rs, since
> its authors are active here; especially as the Parquet Java and C++
> teams seem to have some overlap historically, and a third
> implementation helps bring different perspectives.
>
> Regards
>
> Antoine.
>
>
> On Thu, 16 May 2024 17:37:35 -0700
> Julien Le Dem <jul...@apache.org> wrote:
> > I would support it as long as we maintain a list of the implementations
> > that we consider "accredited" to be reference implementations (we being a
> > PMC vote here).
> > Not all implementations are created equal from an adoption point of view.
> > Originally the Impala implementation was the second implementation for
> > interop. Later on the parquet-cpp implementation was added as a standalone
> > implementation in the Parquet project. This is the implementation that
> > lives in the arrow repository.
> > The parquet java implementation and the parquet cpp implementation in the
> > arrow repo are on top of that list IMO.
> >
> >
> > On Thu, May 16, 2024 at 6:17 AM Rok Mihevc <
> rok.mihevc-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:
> >
> > > I would support a "two interoperable open source implementations"
> > > requirement.
> > >
> > > Rok
> > >
> > > On Thu, May 16, 2024 at 10:06 AM Antoine Pitrou <anto...@python.org>
> > > wrote:
> > >
> > > >
> > > > > I'm in (non-binding) agreement with Ed here. I would just add that the
> > > > > requirement for two interoperable implementations should mandate that
> > > > > these are open source implementations.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > > > On Tue, 14 May 2024 14:48:09 -0700
> > > > Ed Seidl <etse...@live.com> wrote:
> > > > > > Given the breadth of the parquet community at this point, I don't think
> > > > > > we should be singling out one or two "reference" implementations. Even
> > > > > > parquet-mr, AFAIK, still doesn't implement DELTA_LENGTH_BYTE_ARRAY
> > > > > > encoding in a user-accessible way (it's only available as part of the
> > > > > > DELTA_BYTE_ARRAY writer). There are many situations in which the
> > > > > > former would be the superior choice, and in fact the specification
> > > > > > documentation still lists DLBA as "always preferred over PLAIN for byte
> > > > > > array columns" [1]. Similarly, DELTA_BYTE_ARRAY encoding was only added
> > > > > > to parquet-cpp in the last year [2], and column indexes a few months
> > > > > > before that [3].
> > > > >
> > > > > > Instead, I think we should leave out any mention of a reference
> > > > > > implementation,
> > > > > > and continue to require two, independent, interoperable implementations
> > > > > > before adopting a change to the spec. This, IMO, would go a long way towards
> > > > > > increasing excitement for Parquet outside the parquet-mr/arrow world.
> > > > >
> > > > > Just my (non-binding) two cents.
> > > > >
> > > > > Cheers,
> > > > > Ed
> > > > >
> > > > > [1]
> > > > >
> > > >
> > >
> https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-length-byte-array-delta_length_byte_array--6
>
> > > > > [2] https://github.com/apache/arrow/pull/14341
> > > > > [3] https://github.com/apache/arrow/pull/34054
> > > > >
> > > > > On 5/14/24 9:44 AM, Julien Le Dem wrote:
> > > > > > > I agree that parquet-mr implementation is a requirement to evolve the spec.
> > > > > > > It makes sense to me that we call parquet-mr the reference implementation
> > > > > > > and make it a requirement to evolve the spec.
> > > > > > > I would add the requirement to implement it in the parquet cpp
> > > > > > > implementation that lives in apache Arrow:
> > > > > > > https://github.com/apache/arrow/tree/main/cpp/src/parquet
> > > > > > > This code used to live in the parquet-cpp repo in the Parquet project.
> > > > > > Being language agnostic is an important feature of the format.
> > > > > > Interoperability tests should also be included.
> > > > > >
> > > > > > On Tue, May 14, 2024 at 9:31 AM Antoine Pitrou <
> > > >
> antoine-+zn9apsxkcednm+yrofe0a-xmd5yjdbdmrexy1tmh2...@public.gmane.org>
> wrote:
> > > > > >
> > > > > > >> AFAIK, the only Parquet implementation under the Apache Parquet project
> > > > > > >> is parquet-mr :-)
> > > > > >>
> > > > > >>
> > > > > >> On Tue, 14 May 2024 10:58:58 +0200
> > > > > >> Rok Mihevc <rok.mih...@gmail.com> wrote:
> > > > > > >>> Second Raphael's point.
> > > > > > >>> Would it be reasonable to say specification change requires implementation
> > > > > > >>> in two parquet implementations within Apache Parquet project?
> > > > > >>>
> > > > > >>> Rok
> > > > > >>>
> > > > > >>> On Tue, May 14, 2024 at 10:50 AM Gang Wu <
> > > > > >>
> ustcwg-re5jqeeqqe8avxtiumwx3w-xmd5yjdbdmrexy1tmh2...@public.gmane.org>
> wrote:
> > > > > > >>>> IMHO, it looks more reasonable if a reference implementation is required
> > > > > > >>>> to support most (not all) elements from the specification.
> > > > > > >>>>
> > > > > > >>>> Another question is: should we discuss (and vote for) each candidate
> > > > > > >>>> one by one? We can start with parquet-mr which is most well-known
> > > > > > >>>> implementation.
> > > > > >>>>
> > > > > >>>> Best,
> > > > > >>>> Gang
> > > > > >>>>
> > > > > >>>> On Tue, May 14, 2024 at 4:41 PM Raphael Taylor-Davies
> > > > > >>>> <r.taylordavies-gM/Ye1E23mxENrl/
> abpvzguzikbjl...@public.gmane.org> wrote:
> > > > > >>>>
> > > > > > >>>>> Potentially it would be helpful to flip the question around. As Andrew
> > > > > > >>>>> articulates, a reference implementation is required to implement all
> > > > > > >>>>> elements from the specification, and therefore the major consequence of
> > > > > > >>>>> labeling parquet-mr thusly would be that any specification change would
> > > > > > >>>>> have to be implemented within parquet-mr as part of the standardisation
> > > > > > >>>>> process. It would be insufficient for it to be implemented in, for
> > > > > > >>>>> example, two of the parquet implementations maintained by the arrow
> > > > > > >>>>> project. I personally think that would be a shame and likely exclude
> > > > > > >>>>> many people who would otherwise be interested in evolving the parquet
> > > > > > >>>>> specification, but think that is at the core of this question.
> > > > > >>>>>
> > > > > >>>>> Kind Regards,
> > > > > >>>>>
> > > > > >>>>> Raphael
> > > > > >>>>>
> > > > > >>>>> On 13/05/2024 20:55, Andrew Lamb wrote:
> > > > > > >>>>>> Question: Should we label parquet-mr or any other parquet implementations
> > > > > > >>>>>> "reference implementations"?
> > > > > > >>>>>>
> > > > > > >>>>>> This came up as part of Vinoo's great PR to list different parquet
> > > > > > >>>>>> reference implementations[1][2].
> > > > > > >>>>>>
> > > > > > >>>>>> The term "reference implementation" often has an official connotation. For
> > > > > > >>>>>> example the wikipedia definition is "a program that implements all
> > > > > > >>>>>> requirements from a corresponding specification. The reference
> > > > > > >>>>>> implementation ... should be considered the "correct" behavior of any other
> > > > > > >>>>>> implementation of it."[3]
> > > > > > >>>>>>
> > > > > > >>>>>> Given the close association of parquet-mr to the parquet standard, it is a
> > > > > > >>>>>> natural candidate to label as "reference implementation." However, it is
> > > > > > >>>>>> not clear to me if there is consensus that it should be thusly labeled.
> > > > > > >>>>>> I have a strong opinion that a consensus on this question would be very
> > > > > > >>>>>> helpful. I don't actually have a strong opinion about the answer
> > > > > >>>>>>
> > > > > >>>>>> Andrew
> > > > > >>>>>>
> > > > > >>>>>>
> > > > > >>>>>>
> > > > > >>>>>> [1]:
> > > > > >>
> > > https://github.com/apache/parquet-site/pull/53#discussion_r1582882267
>
> > > >
> > > > > >>>>>> [2]:
> > > > > >>
> > > https://github.com/apache/parquet-site/pull/53#discussion_r1598283465
>
> > > >
> > > > > >>>>>> [3]:
> https://en.wikipedia.org/wiki/Reference_implementation
> > > > > >>>>>>
> > > > > >>
> > > > > >>
> > > > > >>
> > > > >
> > > > >
> > > >
> > > >
> > > >
> > > >
> > >
> >
>
>
>
>
