> Hmm... I'm not sure I understand your point here. The Parquet spec and
> the Java implementation are already living in distinct repos and have
> distinct versioning schemes. The main thing that they share in common is
> the JIRA instance (while the C++ Parquet implementation mostly relies on
> Arrow's GH issue tracker), but is that really important?
It is not a problem that they are in separate repos. The problem is the friction created because it makes access control difficult and creates confusion about governance. This thread, "[DISCUSS] Parquet C++ under which PMC?", is a clear example of it:
https://lists.apache.org/thread/128wv5cwv51scm8vdfn1g9gskw717qyt

All I'm suggesting is that if the inconvenience this creates around unclear governance discussions is greater than the convenience of being in the same repo, we should revisit that decision. It would be less inconvenient today to have parquet-cpp in its own repo than it was at the time.

As discussed, that code was moved into the arrow repo for convenience:
https://lists.apache.org/thread/gkvbm6yyly1r4cg3f6xtnqkjz6ogn6o2

To take an excerpt of that original decision:

4) The Parquet and Arrow C++ communities will collaborate to provide development workflows to enable contributors working exclusively on the Parquet core functionality to be able to work unencumbered with unnecessary build or test dependencies from the rest of the Arrow codebase. Note that parquet-cpp already builds a significant portion of Apache Arrow en route to creating its libraries

5) The Parquet community can create scripts to "cut" Parquet C++ releases by packaging up the appropriate components and ensuring that they can be built and installed independently as now

The alternative is to live up to the part where we agreed that the two communities collaborate on making it easy for the Parquet community to govern its code base in the arrow repo. Would you agree?
On Thu, May 16, 2024 at 1:00 AM Micah Kornfield <emkornfi...@gmail.com> wrote:

> From my perspective I agree, that I don't think there is benefit of moving
> parquet C++ out of arrow given what it would actually cost to make clean
> boundaries. I also don't think it will hurt iteration speed.
>
> I think the main challenge could be in compatibility testing, but Arrow has
> solved this between implementations that live in different repositories so
> I think the same solutions could apply for Parquet.
>
> On Thu, May 16, 2024 at 12:57 AM Antoine Pitrou <anto...@python.org>
> wrote:
>
> > On Tue, 14 May 2024 10:22:37 -0700
> > Julien Le Dem <jul...@apache.org> wrote:
> > > 1. I think we should make it easy for people contributing to the C++
> > > codebase. (which is why I voted for the move at the time)
> > > 2. If merging repos removes the need to deal with the circular dependency
> > > between repos issue for the C++ code bases, it does it at the expense of
> > > making it easy to evolve the parquet spec and the java and c++
> > > implementations together.
> >
> > Hmm... I'm not sure I understand your point here. The Parquet spec and
> > the Java implementation are already living in distinct repos and have
> > distinct versioning schemes. The main thing that they share in common is
> > the JIRA instance (while the C++ Parquet implementation mostly relies on
> > Arrow's GH issue tracker), but is that really important?
> >
> > > parquet-cpp depends only on arrow-core that does not have to depend on
> > > parquet-cpp.
> >
> > That is true.
> >
> > > Other components like
> > > arrow-dataset and pyarrow can depend on parquet-cpp just like they depend
> > > on orc externally.
> >
> > Ideally yes. In practice there are two problems:
> > 1) it creates a circular dependency between *repositories*.
> > 2) the C++ Arrow Datasets component is not built independently, it is an
> > optional component when building Arrow C++. So we would also have a
> > chicken-and-egg problem when building Arrow C++ and Parquet C++.
> >
> > > I realize that would be work to make it happen, but the current location of
> > > the parquet-cpp codebase is a big trade-off of prioritizing quick iteration
> > > on the C++ implementations over iteration on the format.
> >
> > Having recently worked on a format addition and its respective
> > implementations (in Java and C++), I haven't found the current setup
> > more difficult to work with for Parquet C++ than it was for Parquet
> > Java. Admittedly I'm biased, being a heavy contributor to Arrow C++,
> > but I'm curious why the current situation would be detrimental to
> > iteration on the format.
> >
> > Regards
> >
> > Antoine.