Re: [C++] Parquet and Arrow overlap

Julien Le Dem Thu, 16 May 2024 18:23:52 -0700

> Hmm... I'm not sure I understand your point here. The Parquet spec and
> the Java implementation are already living in distinct repos and have
> distinct versioning schemes. The main thing that they share in common is
> the JIRA instance (while the C++ Parquet implementation mostly relies on
> Arrow's GH issue tracker), but is that really important?

It is not a problem that they are in separate repos. The problem is the
friction created by having parquet-cpp in the arrow repo: it makes access
control difficult and creates confusion about governance.
This thread "[DISCUSS] Parquet C++ under which PMC?" is a clear example of
it: https://lists.apache.org/thread/128wv5cwv51scm8vdfn1g9gskw717qyt

All I'm suggesting is that if the inconvenience this creates around unclear
governance discussions is greater than the convenience of being in the same
repo, we should revisit that decision.
It would be less inconvenient today to have parquet-cpp in its own repo
than it was at the time.

As discussed, that code was moved into the arrow repo for convenience:
https://lists.apache.org/thread/gkvbm6yyly1r4cg3f6xtnqkjz6ogn6o2

To take an excerpt of that original decision:

4) The Parquet and Arrow C++ communities will collaborate to provide
development workflows to enable contributors working exclusively on the
Parquet core functionality to be able to work unencumbered with unnecessary
build or test dependencies from the rest of the Arrow codebase. Note that
parquet-cpp already builds a significant portion of Apache Arrow en route
to creating its libraries

5) The Parquet community can create scripts to "cut" Parquet C++ releases
by packaging up the appropriate components and ensuring that they can be
built and installed independently as now

The alternative is to live up to the part where we agreed that the two
communities collaborate on making it easy for the Parquet community to
govern its code base in the arrow repo.
Would you agree?

On Thu, May 16, 2024 at 1:00 AM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> From my perspective I agree, that I don't think there is benefit of moving
> parquet C++ out of arrow given what it would actually cost to make clean
> boundaries.  I also don't think it will hurt iteration speed.
>
> I think the main challenge could be in compatibility testing, but Arrow has
> solved this between implementations that live in different repositories so
> I think the same solutions could apply for Parquet.
>
> On Thu, May 16, 2024 at 12:57 AM Antoine Pitrou <anto...@python.org>
> wrote:
>
> > On Tue, 14 May 2024 10:22:37 -0700
> > Julien Le Dem <jul...@apache.org> wrote:
> > > 1. I think we should make it easy for people contributing to the C++
> > > codebase. (which is why I voted for the move at the time)
> > > 2. If merging repos removes the need to deal with the circular
> dependency
> > > between repos issue for the C++ code bases, it does it at the expense
> of
> > > making it easy to evolve the parquet spec and the java and c++
> > > implementations together.
> >
> > Hmm... I'm not sure I understand your point here. The Parquet spec and
> > the Java implementation are already living in distinct repos and have
> > distinct versioning schemes. The main thing that they share in common is
> > the JIRA instance (while the C++ Parquet implementation mostly relies on
> > Arrow's GH issue tracker), but is that really important?
> >
> > > parquet-cpp depends only on arrow-core that does not have to depend on
> > > parquet-cpp.
> >
> > That is true.
> >
> > > Other components like
> > > arrow-dataset and pyarrow can depend on parquet-cpp just like they
> depend
> > > on orc externally.
> >
> > Ideally yes. In practice there are two problems:
> > 1) it creates a circular dependency between *repositories*.
> > 2) the C++ Arrow Datasets component is not built independently, it is an
> > optional component when building Arrow C++. So we would also have a
> > chicken-and-egg problem when building Arrow C++ and Parquet C++.
> >
> > > I realize that would be work to make it happen, but the current
> location
> > of
> > > the parquet-cpp codebase is a big trade-off of prioritizing quick
> > iteration
> > > on the C++ implementations over iteration on the format.
> >
> > Having recently worked on a format addition and its respective
> > implementations (in Java and C++), I haven't found the current setup
> > more difficult to work with for Parquet C++ than it was for Parquet
> > Java. Admittedly I'm biased, being a heavy contributor to Arrow C++,
> > but I'm curious why the current situation would be detrimental to
> > iteration on the format.
> >
> > Regards
> >
> > Antoine.
> >
> >
> >
>
