Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Wes McKinney Mon, 30 Jul 2018 15:06:02 -0700

hi Deepak

On Mon, Jul 30, 2018 at 5:18 PM, Deepak Majeti <[email protected]> wrote:
> @Wes
> My observation is that most of the parquet-cpp contributors you listed that
> overlap with the Arrow community mainly contribute to the Arrow
> bindings(parquet::arrow layer)/platform API changes in the parquet-cpp
> repo. Very few of them review/contribute patches to the parquet-cpp core.
>


So, what are you saying exactly, that some contributions or
contributors to Apache Parquet matter more than others? I don't
follow.

As a result of these individuals' efforts, the parquet-cpp libraries
are being installed well over 100,000 times per month on a single
install path (Python) alone.

> I believe improvements to the parquet-cpp core will be negatively impacted
> since merging the parquet-cpp and arrow-cpp repos will increase the barrier
> of entry to new contributors interested in the parquet-cpp core. The
> current extensions to the parquet-cpp core related to bloom-filters, and
> column encryption are all being done by first-time contributors.

I don't understand why this would "increase the barrier of entry".
Could you explain?

It is true that there would be more code in the codebase, but the
build and test procedure would be no more complex. If anything,
community productivity will be improved by having a more cohesive /
centralized development platform (large amounts of code that Parquet
depends on are in Apache Arrow already).

>
> If you believe there will be new interest in the parquet-cpp core with the
> mono-repo approach, I am all up for it.

Yes, I believe that this change will result in more and higher quality
code review to Parquet core changes and general improvements to
developer productivity across the board. Developer productivity is
what this is all about.

- Wes

>
>
> On Mon, Jul 30, 2018 at 12:18 AM Philipp Moritz <[email protected]> wrote:
>
>> I do not claim to have insight into parquet-cpp development. However, from
>> our experience developing Ray, I can say that the monorepo approach (for
>> Ray) has improved things a lot. Before that, we tried various schemes to split
>> the project into multiple repos, but the build system and test
>> infrastructure duplications and overhead from synchronizing changes slowed
>> development down significantly (and fixing bugs that touch the subrepos and
>> the main repo is inconvenient).
>>
>> Also the decision to put arrow and parquet-cpp into a common repo is
>> independent of how tightly coupled the two projects are (and there could be
>> a matrix entry in travis which tests that PRs keep them decoupled, or
>> rather that they both just depend on a small common "base"). Google and
>> Facebook demonstrate such independence by having many many projects in the
>> same repo of course. It would be great if the open source community would
>> move more into this direction too I think.
>>
>> Best,
>> Philipp.
>>
>> On Sun, Jul 29, 2018 at 8:54 PM, Wes McKinney <[email protected]> wrote:
>>
>> > hi Donald,
>> >
>> > This would make things worse, not better. Code changes routinely
>> > involve changes to the build system, and so you could be talking about
>> > having to make changes to 2 or 3 git repositories as the result of a
>> > single new feature or bug fix. There isn't really a cross-repo CI
>> > solution available.
>> >
>> > I've seen some approaches to the monorepo problem using multiple git
>> > repositories, such as
>> >
>> > https://github.com/twosigma/git-meta
>> >
>> > Until something like this has first class support by the GitHub
>> > platform and its CI services (Travis CI, Appveyor), I don't think it
>> > will work for us.
>> >
>> > - Wes
>> >
>> > On Sun, Jul 29, 2018 at 10:54 PM, Donald E. Foss <[email protected]>
>> > wrote:
>> > > Could this work if each module were configured as a git sub-repo? The
>> > > top-level build tool would go into each sub-repo and pick the correct
>> > > release version to test. The Python tests would depend on the cpp
>> > > sub-repo to ensure the API tests still pass.
>> > >
>> > > This should be the best of both worlds, if sub-repos are a supported
>> > > option.
>> > >
>> > > --Donald E. Foss
>> > >
>> > > On Sun, Jul 29, 2018, 10:44 PM Deepak Majeti <[email protected]>
>> > > wrote:
>> > >
>> > >> I dislike the current build system complications as well.
>> > >>
>> > >> However, in my opinion, combining the code bases will severely impact
>> > >> the progress of the parquet-cpp project and implicitly the progress of
>> > >> the entire parquet project.
>> > >> Combining would make much more sense if parquet-cpp were a mature
>> > >> project and codebase. But parquet-cpp (and the entire parquet project)
>> > >> is evolving continuously, with new features being added, including
>> > >> bloom filters, column encryption, and indexes.
>> > >>
>> > >> If the two code bases are merged, it will be much more difficult to
>> > >> contribute to the parquet-cpp project, since Arrow bindings will then
>> > >> have to be supported as well. Please correct me if I am wrong here.
>> > >>
>> > >> Of the two evils, I think handling the build system and packaging
>> > >> duplication is much more manageable, since they are quite stable at
>> > >> this point.
>> > >>
>> > >> Regarding "* API changes cause awkward release coordination issues
>> > >> between Arrow and Parquet". Can we make minor releases for parquet-cpp
>> > >> (with API changes needed) as and when Arrow is released?
>> > >>
>> > >> Regarding "we maintain Arrow conversion code in parquet-cpp for
>> > >> converting between Arrow columnar memory format and Parquet". Can this
>> > >> be moved to the Arrow project, with parquet-cpp exposing the more
>> > >> stable low-level APIs?
>> > >>
>> > >> I am also curious if the Arrow and Parquet Java implementations have
>> > >> similar API compatibility issues.
>> > >>
>> > >>
>> > >> On Sat, Jul 28, 2018 at 7:45 PM Wes McKinney <[email protected]>
>> > wrote:
>> > >>
>> > >> > hi folks,
>> > >> >
>> > >> > We've been struggling for quite some time with the development
>> > >> > workflow between the Arrow and Parquet C++ (and Python) codebases.
>> > >> >
>> > >> > To explain the root issues:
>> > >> >
>> > >> > * parquet-cpp depends on "platform code" in Apache Arrow; this
>> > >> > includes file interfaces, memory management, miscellaneous
>> > >> > algorithms (e.g. dictionary encoding), etc. Note that before this
>> > >> > "platform" dependency was introduced, there was significant
>> > >> > duplicated code between these codebases and incompatible abstract
>> > >> > interfaces for things like files
>> > >> >
>> > >> > * we maintain Arrow conversion code in parquet-cpp for converting
>> > >> > between Arrow columnar memory format and Parquet
>> > >> >
>> > >> > * we maintain Python bindings for parquet-cpp + Arrow interop in
>> > >> > Apache Arrow. This introduces a circular dependency into our CI.
>> > >> >
>> > >> > * Substantial portions of our CMake build system and related tooling
>> > >> > are duplicated between the Arrow and Parquet repos
>> > >> >
>> > >> > * API changes cause awkward release coordination issues between
>> > >> > Arrow and Parquet
>> > >> >
>> > >> > I believe the best way to remedy the situation is to adopt a
>> > >> > "Community over Code" approach and find a way for the Parquet and
>> > >> > Arrow C++ development communities to operate out of the same code
>> > >> > repository, i.e. the apache/arrow git repository.
>> > >> >
>> > >> > This would bring major benefits:
>> > >> >
>> > >> > * Shared CMake build infrastructure, developer tools, and CI
>> > >> > infrastructure (Parquet is already being built as a dependency in
>> > >> > Arrow's CI systems)
>> > >> >
>> > >> > * Share packaging and release management infrastructure
>> > >> >
>> > >> > * Reduce / eliminate problems due to API changes (where we currently
>> > >> > introduce breakage into our CI workflow when there is a breaking /
>> > >> > incompatible change)
>> > >> >
>> > >> > * Arrow releases would include a coordinated snapshot of the Parquet
>> > >> > implementation as it stands
>> > >> >
>> > >> > Continuing with the status quo has become unsatisfactory to me, and
>> > >> > as a result I've become less motivated to work on the parquet-cpp
>> > >> > codebase.
>> > >> >
>> > >> > The only Parquet C++ committer who is not an Arrow committer is
>> > >> > Deepak Majeti. I think the issue of commit privileges could be
>> > >> > resolved without too much difficulty or time.
>> > >> >
>> > >> > I also think that, if it is deemed truly necessary, the Apache
>> > >> > Parquet community could create release scripts to cut a minimal
>> > >> > versioned Apache Parquet C++ release.
>> > >> >
>> > >> > I know that some people are wary of monorepos and megaprojects, but
>> > >> > as an example, TensorFlow is at least 10 times as large a project in
>> > >> > terms of LOCs and number of different platform components, and it
>> > >> > seems to be getting along just fine. I think we should be able to
>> > >> > work together as a community to function just as well.
>> > >> >
>> > >> > Interested in the opinions of others, and any other ideas for
>> > >> > practical solutions to the above problems.
>> > >> >
>> > >> > Thanks,
>> > >> > Wes
>> > >> >
>> > >>
>> > >>
>> > >> --
>> > >> regards,
>> > >> Deepak Majeti
>> > >>
>> >
>>
>
>
> --
> regards,
> Deepak Majeti
