hi Donald,

This would make things worse, not better. Code changes routinely
involve changes to the build system, so you could be talking about
having to make changes to 2 or 3 git repositories as the result of a
single new feature or bug fix. There isn't really a cross-repo CI
solution available.

I've seen some approaches to the monorepo problem using multiple git
repositories, such as

https://github.com/twosigma/git-meta

Until something like this has first-class support from the GitHub
platform and its CI services (Travis CI, AppVeyor), I don't think it
will work for us.

- Wes

On Sun, Jul 29, 2018 at 10:54 PM, Donald E. Foss <donald.f...@gmail.com> wrote:
> Could this work if each module were configured as a git sub-repo? A
> top-level build tool would go into each sub-repo and pick the correct
> release version to test. The Python tests would depend on the cpp
> sub-repo to ensure the API still passes.
>
> This should be the best of both worlds, if sub-repos are a supported option.
>
> --Donald E. Foss
>
> On Sun, Jul 29, 2018, 10:44 PM Deepak Majeti <majeti.dee...@gmail.com>
> wrote:
>
>> I dislike the current build system complications as well.
>>
>> However, in my opinion, combining the code bases will severely impact the
>> progress of the parquet-cpp project and implicitly the progress of the
>> entire parquet project.
>> Combining would make much more sense if parquet-cpp were a mature
>> project and codebase. But parquet-cpp (and the entire parquet project) is
>> evolving continuously, with new features being added including bloom
>> filters, column encryption, and indexes.
>>
>> If the two code bases are merged, it will be much more difficult to
>> contribute to the parquet-cpp project, since Arrow bindings will then
>> have to be supported as well. Please correct me if I am wrong here.
>>
>> Of the two evils, I think handling the build system and packaging
>> duplication is much more manageable since they are quite stable at this
>> point.
>>
>> Regarding "* API changes cause awkward release coordination issues between
>> Arrow and Parquet". Can we make minor releases for parquet-cpp (with API
>> changes needed) as and when Arrow is released?
>>
>> Regarding "we maintain Arrow conversion code in parquet-cpp for
>> converting between Arrow columnar memory format and Parquet". Can this be
>> moved to the Arrow project, with parquet-cpp exposing the more stable
>> low-level APIs?
>>
>> I am also curious if the Arrow and Parquet Java implementations have
>> similar API compatibility issues.
>>
>>
>> On Sat, Jul 28, 2018 at 7:45 PM Wes McKinney <wesmck...@gmail.com> wrote:
>>
>> > hi folks,
>> >
>> > We've been struggling for quite some time with the development
>> > workflow between the Arrow and Parquet C++ (and Python) codebases.
>> >
>> > To explain the root issues:
>> >
>> > * parquet-cpp depends on "platform code" in Apache Arrow; this
>> > includes file interfaces, memory management, miscellaneous algorithms
>> > (e.g. dictionary encoding), etc. Note that before this "platform"
>> > dependency was introduced, there was significant duplicated code
>> > between these codebases and incompatible abstract interfaces for
>> > things like files
>> >
>> > * we maintain Arrow conversion code in parquet-cpp for converting
>> > between Arrow columnar memory format and Parquet
>> >
>> > * we maintain Python bindings for parquet-cpp + Arrow interop in
>> > Apache Arrow. This introduces a circular dependency into our CI.
>> >
>> > * Substantial portions of our CMake build system and related tooling
>> > are duplicated between the Arrow and Parquet repos
>> >
>> > * API changes cause awkward release coordination issues between Arrow
>> > and Parquet
>> >
>> > I believe the best way to remedy the situation is to adopt a
>> > "Community over Code" approach and find a way for the Parquet and
>> > Arrow C++ development communities to operate out of the same code
>> > repository, i.e. the apache/arrow git repository.
>> >
>> > This would bring major benefits:
>> >
>> > * Shared CMake build infrastructure, developer tools, and CI
>> > infrastructure (Parquet is already being built as a dependency in
>> > Arrow's CI systems)
>> >
>> > * Share packaging and release management infrastructure
>> >
>> > * Reduce / eliminate problems due to API changes (where we currently
>> > introduce breakage into our CI workflow when there is a breaking /
>> > incompatible change)
>> >
>> > * Arrow releases would include a coordinated snapshot of the Parquet
>> > implementation as it stands
>> >
>> > Continuing with the status quo has become unsatisfactory to me and as
>> > a result I've become less motivated to work on the parquet-cpp
>> > codebase.
>> >
>> > The only Parquet C++ committer who is not an Arrow committer is Deepak
>> > Majeti. I think the issue of commit privileges could be resolved
>> > without too much difficulty or time.
>> >
>> > I also think that, if it is deemed truly necessary, the Apache
>> > Parquet community could create release scripts to cut a minimal
>> > versioned Apache Parquet C++ release.
>> >
>> > I know that some people are wary of monorepos and megaprojects, but as
>> > an example TensorFlow is at least 10 times as large a project in
>> > terms of LOCs and number of different platform components, and it
>> > seems to be getting along just fine. I think we should be able to work
>> > together as a community to function just as well.
>> >
>> > Interested in the opinions of others, and any other ideas for
>> > practical solutions to the above problems.
>> >
>> > Thanks,
>> > Wes
>> >
>>
>>
>> --
>> regards,
>> Deepak Majeti
>>
