Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Wes McKinney Sun, 29 Jul 2018 20:55:48 -0700

hi Donald,

This would make things worse, not better. Code changes routinely
involve changes to the build system, and so you could be talking about
having to make changes to 2 or 3 git repositories as the result of a
single new feature or bug fix. There isn't really a cross-repo CI
solution available.


I've seen some approaches to the monorepo problem using multiple git
repositories, such as

https://github.com/twosigma/git-meta

Until something like this has first class support by the GitHub
platform and its CI services (Travis CI, Appveyor), I don't think it
will work for us.

- Wes

On Sun, Jul 29, 2018 at 10:54 PM, Donald E. Foss <[email protected]> wrote:
> Could this work if each module were configured as a git sub-repo? A top-level
> build tool would go into each sub-repo and pick the correct release version
> to test. The Python tests would depend on the cpp sub-repo to ensure the API
> still passes.
>
> This should be the best of both worlds, if sub-repos are a supported option.
>
> --Donald E. Foss
>
> On Sun, Jul 29, 2018, 10:44 PM Deepak Majeti <[email protected]>
> wrote:
>
>> I dislike the current build system complications as well.
>>
>> However, in my opinion, combining the code bases will severely impact the
>> progress of the parquet-cpp project and implicitly the progress of the
>> entire parquet project.
>> Combining would make much more sense if parquet-cpp were a mature
>> project and codebase. But parquet-cpp (and the entire parquet project) is
>> evolving continuously, with new features being added, including bloom
>> filters, column encryption, and indexes.
>>
>> If the two code bases are merged, it will be much more difficult to
>> contribute to the parquet-cpp project, since Arrow bindings would then have
>> to be supported as well. Please correct me if I am wrong here.
>>
>> Of the two evils, I think handling the build system and packaging
>> duplication is much more manageable, since they are quite stable at this
>> point.
>>
>> Regarding "* API changes cause awkward release coordination issues between
>> Arrow and Parquet". Can we make minor releases for parquet-cpp (with API
>> changes needed) as and when Arrow is released?
>>
>> Regarding "we maintain Arrow conversion code in parquet-cpp for
>> converting between Arrow columnar memory format and Parquet". Can this be
>> moved to the Arrow project, with parquet-cpp exposing the more stable
>> low-level APIs?
>>
>> I am also curious if the Arrow and Parquet Java implementations have
>> similar API compatibility issues.
>>
>>
>> On Sat, Jul 28, 2018 at 7:45 PM Wes McKinney <[email protected]> wrote:
>>
>> > hi folks,
>> >
>> > We've been struggling for quite some time with the development
>> > workflow between the Arrow and Parquet C++ (and Python) codebases.
>> >
>> > To explain the root issues:
>> >
>> > * parquet-cpp depends on "platform code" in Apache Arrow; this
>> > includes file interfaces, memory management, miscellaneous algorithms
>> > (e.g. dictionary encoding), etc. Note that before this "platform"
>> > dependency was introduced, there was significant duplicated code
>> > between these codebases and incompatible abstract interfaces for
>> > things like files
>> >
>> > * we maintain Arrow conversion code in parquet-cpp for converting
>> > between Arrow columnar memory format and Parquet
>> >
>> > * we maintain Python bindings for parquet-cpp + Arrow interop in
>> > Apache Arrow. This introduces a circular dependency into our CI.
>> >
>> > * Substantial portions of our CMake build system and related tooling
>> > are duplicated between the Arrow and Parquet repos
>> >
>> > * API changes cause awkward release coordination issues between Arrow
>> > and Parquet
>> >
>> > I believe the best way to remedy the situation is to adopt a
>> > "Community over Code" approach and find a way for the Parquet and
>> > Arrow C++ development communities to operate out of the same code
>> > repository, i.e. the apache/arrow git repository.
>> >
>> > This would bring major benefits:
>> >
>> > * Shared CMake build infrastructure, developer tools, and CI
>> > infrastructure (Parquet is already being built as a dependency in
>> > Arrow's CI systems)
>> >
>> > * Share packaging and release management infrastructure
>> >
>> > * Reduce / eliminate problems due to API changes (where we currently
>> > introduce breakage into our CI workflow when there is a breaking /
>> > incompatible change)
>> >
>> > * Arrow releases would include a coordinated snapshot of the Parquet
>> > implementation as it stands
>> >
>> > Continuing with the status quo has become unsatisfactory to me and as
>> > a result I've become less motivated to work on the parquet-cpp
>> > codebase.
>> >
>> > The only Parquet C++ committer who is not an Arrow committer is Deepak
>> > Majeti. I think the issue of commit privileges could be resolved
>> > without too much difficulty or time.
>> >
>> > I also think that the Apache Parquet community could create release
>> > scripts to cut a minimal versioned Apache Parquet C++ release, if that
>> > is deemed truly necessary.
>> >
>> > I know that some people are wary of monorepos and megaprojects, but as
>> > an example TensorFlow is at least 10 times as large a project in
>> > terms of LOC and number of different platform components, and it
>> > seems to be getting along just fine. I think we should be able to work
>> > together as a community to function just as well.
>> >
>> > Interested in the opinions of others, and any other ideas for
>> > practical solutions to the above problems.
>> >
>> > Thanks,
>> > Wes
>> >
>>
>>
>> --
>> regards,
>> Deepak Majeti
>>
