I'm not going to comment on the design of the parquet-cpp module and whether it is “closer” to parquet or arrow.
But I do think Wes’s proposal is consistent with Apache policy. PMCs make releases and govern communities; they don’t exist to manage code bases, except as a means to the end of creating releases of known provenance. The Parquet PMC can continue to make parquet-cpp releases, and to end-users those releases will look the same as they do today, even if the code for those releases were to move to a different git repo in the ASF.

Julian

> On Jul 30, 2018, at 3:05 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>
> hi Deepak
>
> On Mon, Jul 30, 2018 at 5:18 PM, Deepak Majeti <majeti.dee...@gmail.com> wrote:
>> @Wes
>> My observation is that most of the parquet-cpp contributors you listed that
>> overlap with the Arrow community mainly contribute to the Arrow
>> bindings (parquet::arrow layer)/platform API changes in the parquet-cpp
>> repo. Very few of them review/contribute patches to the parquet-cpp core.
>>
>
> So, what are you saying exactly, that some contributions or
> contributors to Apache Parquet matter more than others? I don't
> follow.
>
> As a result of these individuals' efforts, the parquet-cpp libraries
> are being installed well over 100,000 times per month on a single
> install path (Python) alone.
>
>> I believe improvements to the parquet-cpp core will be negatively impacted
>> since merging the parquet-cpp and arrow-cpp repos will increase the barrier
>> of entry to new contributors interested in the parquet-cpp core. The
>> current extensions to the parquet-cpp core related to bloom filters and
>> column encryption are all being done by first-time contributors.
>
> I don't understand why this would "increase the barrier of entry".
> Could you explain?
>
> It is true that there would be more code in the codebase, but the
> build and test procedure would be no more complex.
> If anything, community productivity will be improved by having a more
> cohesive / centralized development platform (large amounts of code that
> Parquet depends on are in Apache Arrow already).
>
>>
>> If you believe there will be new interest in the parquet-cpp core with the
>> mono-repo approach, I am all up for it.
>
> Yes, I believe that this change will result in more and higher quality
> code review to Parquet core changes and general improvements to
> developer productivity across the board. Developer productivity is
> what this is all about.
>
> - Wes
>
>>
>> On Mon, Jul 30, 2018 at 12:18 AM Philipp Moritz <pcmor...@gmail.com> wrote:
>>
>>> I do not claim to have insight into parquet-cpp development. However, from
>>> our experience developing Ray, I can say that the monorepo approach (for
>>> Ray) has improved things a lot. Before we tried various schemes to split
>>> the project into multiple repos, but the build system and test
>>> infrastructure duplications and overhead from synchronizing changes slowed
>>> development down significantly (and fixing bugs that touch the subrepos and
>>> the main repo is inconvenient).
>>>
>>> Also the decision to put arrow and parquet-cpp into a common repo is
>>> independent of how tightly coupled the two projects are (and there could be
>>> a matrix entry in travis which tests that PRs keep them decoupled, or
>>> rather that they both just depend on a small common "base"). Google and
>>> Facebook demonstrate such independence by having many many projects in the
>>> same repo of course. It would be great if the open source community would
>>> move more into this direction too I think.
>>>
>>> Best,
>>> Philipp.
>>>
>>> On Sun, Jul 29, 2018 at 8:54 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>>>
>>>> hi Donald,
>>>>
>>>> This would make things worse, not better.
>>>> Code changes routinely involve changes to the build system, and so you
>>>> could be talking about having to make changes to 2 or 3 git repositories
>>>> as the result of a single new feature or bug fix. There isn't really a
>>>> cross-repo CI solution available.
>>>>
>>>> I've seen some approaches to the monorepo problem using multiple git
>>>> repositories, such as
>>>>
>>>> https://github.com/twosigma/git-meta
>>>>
>>>> Until something like this has first-class support from the GitHub
>>>> platform and its CI services (Travis CI, AppVeyor), I don't think it
>>>> will work for us.
>>>>
>>>> - Wes
>>>>
>>>> On Sun, Jul 29, 2018 at 10:54 PM, Donald E. Foss <donald.f...@gmail.com> wrote:
>>>>> Could this work if each module were configured as a git sub-repo? A
>>>>> top-level build tool would go into each sub-repo and pick the correct
>>>>> release version to test. The Python tests would depend on the cpp
>>>>> sub-repo to ensure the API still passes.
>>>>>
>>>>> This should be the best of both worlds, if sub-repos are a supported
>>>>> option.
>>>>>
>>>>> --Donald E. Foss
>>>>>
>>>>> On Sun, Jul 29, 2018, 10:44 PM Deepak Majeti <majeti.dee...@gmail.com> wrote:
>>>>>
>>>>>> I dislike the current build system complications as well.
>>>>>>
>>>>>> However, in my opinion, combining the code bases will severely impact
>>>>>> the progress of the parquet-cpp project and implicitly the progress of
>>>>>> the entire parquet project.
>>>>>> Combining would have made much more sense if parquet-cpp were a mature
>>>>>> project and codebase. But parquet-cpp (and the entire parquet project)
>>>>>> is evolving continuously with new features being added including bloom
>>>>>> filters, column encryption, and indexes.
>>>>>>
>>>>>> If the two code bases are merged, it will be much more difficult to
>>>>>> contribute to the parquet-cpp project since now Arrow bindings have to
>>>>>> be supported as well. Please correct me if I am wrong here.
>>>>>>
>>>>>> Of the two evils, I think handling the build system and packaging
>>>>>> duplication is much more manageable since they are quite stable at
>>>>>> this point.
>>>>>>
>>>>>> Regarding "* API changes cause awkward release coordination issues
>>>>>> between Arrow and Parquet". Can we make minor releases for parquet-cpp
>>>>>> (with API changes needed) as and when Arrow is released?
>>>>>>
>>>>>> Regarding "we maintain Arrow conversion code in parquet-cpp for
>>>>>> converting between Arrow columnar memory format and Parquet". Can this
>>>>>> be moved to the Arrow project and expose the more stable low-level APIs
>>>>>> in parquet-cpp?
>>>>>>
>>>>>> I am also curious if the Arrow and Parquet Java implementations have
>>>>>> similar API compatibility issues.
>>>>>>
>>>>>>
>>>>>> On Sat, Jul 28, 2018 at 7:45 PM Wes McKinney <wesmck...@gmail.com> wrote:
>>>>>>
>>>>>>> hi folks,
>>>>>>>
>>>>>>> We've been struggling for quite some time with the development
>>>>>>> workflow between the Arrow and Parquet C++ (and Python) codebases.
>>>>>>>
>>>>>>> To explain the root issues:
>>>>>>>
>>>>>>> * parquet-cpp depends on "platform code" in Apache Arrow; this
>>>>>>> includes file interfaces, memory management, miscellaneous algorithms
>>>>>>> (e.g. dictionary encoding), etc. Note that before this "platform"
>>>>>>> dependency was introduced, there was significant duplicated code
>>>>>>> between these codebases and incompatible abstract interfaces for
>>>>>>> things like files
>>>>>>>
>>>>>>> * we maintain Arrow conversion code in parquet-cpp for converting
>>>>>>> between Arrow columnar memory format and Parquet
>>>>>>>
>>>>>>> * we maintain Python bindings for parquet-cpp + Arrow interop in
>>>>>>> Apache Arrow. This introduces a circular dependency into our CI.
>>>>>>>
>>>>>>> * Substantial portions of our CMake build system and related tooling
>>>>>>> are duplicated between the Arrow and Parquet repos
>>>>>>>
>>>>>>> * API changes cause awkward release coordination issues between Arrow
>>>>>>> and Parquet
>>>>>>>
>>>>>>> I believe the best way to remedy the situation is to adopt a
>>>>>>> "Community over Code" approach and find a way for the Parquet and
>>>>>>> Arrow C++ development communities to operate out of the same code
>>>>>>> repository, i.e. the apache/arrow git repository.
>>>>>>>
>>>>>>> This would bring major benefits:
>>>>>>>
>>>>>>> * Shared CMake build infrastructure, developer tools, and CI
>>>>>>> infrastructure (Parquet is already being built as a dependency in
>>>>>>> Arrow's CI systems)
>>>>>>>
>>>>>>> * Share packaging and release management infrastructure
>>>>>>>
>>>>>>> * Reduce / eliminate problems due to API changes (where we currently
>>>>>>> introduce breakage into our CI workflow when there is a breaking /
>>>>>>> incompatible change)
>>>>>>>
>>>>>>> * Arrow releases would include a coordinated snapshot of the Parquet
>>>>>>> implementation as it stands
>>>>>>>
>>>>>>> Continuing with the status quo has become unsatisfactory to me and as
>>>>>>> a result I've become less motivated to work on the parquet-cpp
>>>>>>> codebase.
>>>>>>>
>>>>>>> The only Parquet C++ committer who is not an Arrow committer is Deepak
>>>>>>> Majeti. I think the issue of commit privileges could be resolved
>>>>>>> without too much difficulty or time.
>>>>>>>
>>>>>>> I also think the Apache Parquet community could create release
>>>>>>> scripts to cut a minimal versioned Apache Parquet C++ release if that
>>>>>>> is deemed truly necessary.
>>>>>>>
>>>>>>> I know that some people are wary of monorepos and megaprojects, but as
>>>>>>> an example TensorFlow is at least 10 times as large a project in
>>>>>>> terms of LOC and number of different platform components, and it
>>>>>>> seems to be getting along just fine. I think we should be able to work
>>>>>>> together as a community to function just as well.
>>>>>>>
>>>>>>> Interested in the opinions of others, and any other ideas for
>>>>>>> practical solutions to the above problems.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Wes
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> regards,
>>>>>> Deepak Majeti
>>
>> --
>> regards,
>> Deepak Majeti