Given the breadth of the parquet community at this point, I don't think
we should be singling out one or two "reference" implementations. Even
parquet-mr, AFAIK, still doesn't implement DELTA_LENGTH_BYTE_ARRAY
encoding in a user-accessible way (it's only available as part of the
DELTA_BYTE_ARRAY writer). There are many situations in which the
former would be the superior choice, and in fact the specification
documentation still lists DLBA as "always preferred over PLAIN for byte
array columns" [1]. Similarly, DELTA_BYTE_ARRAY encoding was only added
to parquet-cpp in the last year [2], and column indexes a few months
before that [3].

Instead, I think we should leave out any mention of a reference implementation,
and continue to require two independent, interoperable implementations
before adopting a change to the spec. This, IMO, would go a long way towards
increasing excitement for Parquet outside the parquet-mr/arrow world.

Just my (non-binding) two cents.

Cheers,
Ed

[1] https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-length-byte-array-delta_length_byte_array--6
[2] https://github.com/apache/arrow/pull/14341
[3] https://github.com/apache/arrow/pull/34054

On 5/14/24 9:44 AM, Julien Le Dem wrote:
I agree that a parquet-mr implementation is a requirement to evolve the spec.
It makes sense to me that we call parquet-mr the reference implementation
and make it a requirement to evolve the spec.
I would add the requirement to implement it in the parquet cpp
implementation that lives in apache Arrow:
https://github.com/apache/arrow/tree/main/cpp/src/parquet
This code used to live in the parquet-cpp repo in the Parquet project.
Being language agnostic is an important feature of the format.
Interoperability tests should also be included.

On Tue, May 14, 2024 at 9:31 AM Antoine Pitrou <anto...@python.org> wrote:

AFAIK, the only Parquet implementation under the Apache Parquet project
is parquet-mr :-)


On Tue, 14 May 2024 10:58:58 +0200
Rok Mihevc <rok.mih...@gmail.com> wrote:
Second Raphael's point.
Would it be reasonable to say that a specification change requires
implementation in two parquet implementations within the Apache Parquet
project?

Rok

On Tue, May 14, 2024 at 10:50 AM Gang Wu <
ustcwg-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:
IMHO, it looks more reasonable if a reference implementation is required
to support most (not all) elements from the specification.

Another question is: should we discuss (and vote for) each candidate
one by one? We can start with parquet-mr, which is the most well-known
implementation.

Best,
Gang

On Tue, May 14, 2024 at 4:41 PM Raphael Taylor-Davies
<r.taylordav...@googlemail.com.invalid> wrote:

Potentially it would be helpful to flip the question around. As Andrew
articulates, a reference implementation is required to implement all
elements from the specification, and therefore the major consequence of
labeling parquet-mr thusly would be that any specification change would
have to be implemented within parquet-mr as part of the standardisation
process. It would be insufficient for it to be implemented in, for
example, two of the parquet implementations maintained by the arrow
project. I personally think that would be a shame and likely exclude
many people who would otherwise be interested in evolving the parquet
specification, but think that is at the core of this question.

Kind Regards,

Raphael

On 13/05/2024 20:55, Andrew Lamb wrote:
Question: Should we label parquet-mr or any other parquet implementations
"reference implementations"?

This came up as part of Vinoo's great PR to list different parquet
reference implementations[1][2].

The term "reference implementation" often has an official connotation.
For example the wikipedia definition is "a program that implements all
requirements from a corresponding specification. The reference
implementation ... should be considered the "correct" behavior of any
other implementation of it."[3]

Given the close association of parquet-mr to the parquet standard, it is
a natural candidate to label as "reference implementation." However, it
is not clear to me if there is consensus that it should be thusly
labeled. I have a strong opinion that a consensus on this question would
be very helpful. I don't actually have a strong opinion about the answer.

Andrew



[1]: https://github.com/apache/parquet-site/pull/53#discussion_r1582882267
[2]: https://github.com/apache/parquet-site/pull/53#discussion_r1598283465
[3]: https://en.wikipedia.org/wiki/Reference_implementation




