On Tue, 14 May 2024 10:22:37 -0700
Julien Le Dem <jul...@apache.org> wrote:
> 1. I think we should make it easy for people contributing to the C++
> codebase. (which is why I voted for the move at the time)
> 2. If merging repos removes the need to deal with the circular dependency
> between repos issue for the C++ code bases, it does it at the expense of
> making it easy to evolve the parquet spec and the java and c++
> implementations together.

Hmm... I'm not sure I understand your point here. The Parquet spec and
the Java implementation are already living in distinct repos and have
distinct versioning schemes. The main thing that they have in common is
the JIRA instance (while the C++ Parquet implementation mostly relies on
Arrow's GH issue tracker), but is that really important?

> parquet-cpp depends only on arrow-core that does not have to depend on
> parquet-cpp.

That is true.

> Other components like
> arrow-dataset and pyarrow can depend on parquet-cpp just like they depend
> on orc externally.

Ideally yes. In practice there are two problems:
1) it creates a circular dependency between *repositories*.
2) the C++ Arrow Datasets component is not built independently; it is an
optional component when building Arrow C++. So we would also have a
chicken-and-egg problem when building Arrow C++ and Parquet C++.

> I realize that would be work to make it happen, but the current location of
> the parquet-cpp codebase is a big trade-off of prioritizing quick iteration
> on the C++ implementations over iteration on the format.

Having recently worked on a format addition and its respective
implementations (in Java and C++), I haven't found the current setup
more difficult to work with for Parquet C++ than it was for Parquet
Java. Admittedly I'm biased, being a heavy contributor to Arrow C++,
but I'm curious why the current situation would be detrimental to
iteration on the format.

Regards

Antoine.

