hi Donald,

This would make things worse, not better. Code changes routinely involve changes to the build system, so you could be talking about having to make changes to 2 or 3 git repositories as the result of a single new feature or bug fix. There isn't really a cross-repo CI solution available.

I've seen some approaches to the monorepo problem using multiple git repositories, such as

https://github.com/twosigma/git-meta

Until something like this has first-class support from the GitHub platform and its CI services (Travis CI, Appveyor), I don't think it will work for us.

- Wes

On Sun, Jul 29, 2018 at 10:54 PM, Donald E. Foss <donald.f...@gmail.com> wrote:
> Could this work if each module gets configured as a sub-git repo? The top-level
> build tool goes into each sub-repo and picks the correct release version to test.
> Tests in Python depend on the cpp sub-repo to ensure the API still passes.
>
> This should be the best of both worlds, if sub-repos are a supported option.
>
> --Donald E. Foss
>
> On Sun, Jul 29, 2018, 10:44 PM Deepak Majeti <majeti.dee...@gmail.com>
> wrote:
>
>> I dislike the current build system complications as well.
>>
>> However, in my opinion, combining the code bases will severely impact the
>> progress of the parquet-cpp project and implicitly the progress of the
>> entire parquet project.
>> Combining would have made much more sense if parquet-cpp were a mature
>> project and codebase. But parquet-cpp (and the entire parquet project) is
>> evolving continuously with new features being added including bloom
>> filters, column encryption, and indexes.
>>
>> If the two code bases are merged, it will be much more difficult to contribute
>> to the parquet-cpp project since now Arrow bindings have to be supported as
>> well. Please correct me if I am wrong here.
>>
>> Of the two evils, I think handling the build system and packaging
>> duplication is much more manageable since they are quite stable at this
>> point.
>>
>> Regarding "* API changes cause awkward release coordination issues between
>> Arrow and Parquet". Can we make minor releases for parquet-cpp (with the API
>> changes needed) as and when Arrow is released?
>>
>> Regarding "we maintain Arrow conversion code in parquet-cpp for
>> converting between Arrow columnar memory format and Parquet".
>> Can this be moved to the Arrow project, exposing the more stable low-level
>> APIs in parquet-cpp?
>>
>> I am also curious if the Arrow and Parquet Java implementations have
>> similar API compatibility issues.
>>
>>
>> On Sat, Jul 28, 2018 at 7:45 PM Wes McKinney <wesmck...@gmail.com> wrote:
>>
>> > hi folks,
>> >
>> > We've been struggling for quite some time with the development
>> > workflow between the Arrow and Parquet C++ (and Python) codebases.
>> >
>> > To explain the root issues:
>> >
>> > * parquet-cpp depends on "platform code" in Apache Arrow; this
>> > includes file interfaces, memory management, miscellaneous algorithms
>> > (e.g. dictionary encoding), etc. Note that before this "platform"
>> > dependency was introduced, there was significant duplicated code
>> > between these codebases and incompatible abstract interfaces for
>> > things like files
>> >
>> > * we maintain Arrow conversion code in parquet-cpp for converting
>> > between Arrow columnar memory format and Parquet
>> >
>> > * we maintain Python bindings for parquet-cpp + Arrow interop in
>> > Apache Arrow. This introduces a circular dependency into our CI.
>> >
>> > * Substantial portions of our CMake build system and related tooling
>> > are duplicated between the Arrow and Parquet repos
>> >
>> > * API changes cause awkward release coordination issues between Arrow
>> > and Parquet
>> >
>> > I believe the best way to remedy the situation is to adopt a
>> > "Community over Code" approach and find a way for the Parquet and
>> > Arrow C++ development communities to operate out of the same code
>> > repository, i.e. the apache/arrow git repository.
>> >
>> > This would bring major benefits:
>> >
>> > * Shared CMake build infrastructure, developer tools, and CI
>> > infrastructure (Parquet is already being built as a dependency in
>> > Arrow's CI systems)
>> >
>> > * Shared packaging and release management infrastructure
>> >
>> > * Reduce / eliminate problems due to API changes (where we currently
>> > introduce breakage into our CI workflow when there is a breaking /
>> > incompatible change)
>> >
>> > * Arrow releases would include a coordinated snapshot of the Parquet
>> > implementation as it stands
>> >
>> > Continuing with the status quo has become unsatisfactory to me and as
>> > a result I've become less motivated to work on the parquet-cpp
>> > codebase.
>> >
>> > The only Parquet C++ committer who is not an Arrow committer is Deepak
>> > Majeti. I think the issue of commit privileges could be resolved
>> > without too much difficulty or time.
>> >
>> > I also think that, if it is deemed truly necessary, the Apache Parquet
>> > community could create release scripts to cut a minimal versioned
>> > Apache Parquet C++ release.
>> >
>> > I know that some people are wary of monorepos and megaprojects, but as
>> > an example TensorFlow is at least 10 times as large a project in
>> > terms of LOC and number of different platform components, and it
>> > seems to be getting along just fine. I think we should be able to work
>> > together as a community to function just as well.
>> >
>> > Interested in the opinions of others, and any other ideas for
>> > practical solutions to the above problems.
>> >
>> > Thanks,
>> > Wes
>> >
>>
>>
>> --
>> regards,
>> Deepak Majeti
>>