hi,

On Mon, Jul 30, 2018 at 6:52 PM, Deepak Majeti <majeti.dee...@gmail.com> wrote:
> Wes,
>
> I definitely appreciate and do see the impact of contributions made by
> everyone. I made this statement not to rate any contributions but solely to
> support my concern.
> The contribution barrier is higher simply because of the increased code,
> build, and test dependencies. If the community has less interest in a
> certain component (parquet-cpp core in this case), it becomes very hard to
> make big changes.

This is a FUD-based argument rather than a fact-based one. If there
are committers in Arrow (via Parquet) who approve changes, why would
they not be merged? The community will be incentivized to make sure
that developers are productive and able to work efficiently on the
part of the project that is relevant to them. parquet-cpp developers
are already building most of Arrow's C++ codebase en route to
development; by building with a single build system, development
environments will be simpler to manage in general.
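
To make the existing coupling concrete, here is a minimal sketch of how
the two libraries already interleave when a parquet-cpp user reads a
Parquet file into an Arrow table through the parquet::arrow layer (the
file name is made up, and exact signatures may vary between releases):

    // Illustrative sketch only: a plain parquet-cpp workflow already
    // mixes arrow:: and parquet:: APIs, so most of Arrow's C++ platform
    // code gets built and linked anyway.
    #include <memory>

    #include <arrow/io/file.h>      // arrow::io::ReadableFile
    #include <arrow/memory_pool.h>  // arrow::default_memory_pool()
    #include <arrow/table.h>        // arrow::Table
    #include <parquet/arrow/reader.h>
    #include <parquet/exception.h>  // PARQUET_THROW_NOT_OK

    int main() {
      // Open the file with Arrow's I/O layer ("platform code").
      std::shared_ptr<arrow::io::ReadableFile> infile;
      PARQUET_THROW_NOT_OK(
          arrow::io::ReadableFile::Open("example.parquet", &infile));

      // Hand it to parquet-cpp's Arrow binding layer.
      std::unique_ptr<parquet::arrow::FileReader> reader;
      PARQUET_THROW_NOT_OK(parquet::arrow::OpenFile(
          infile, arrow::default_memory_pool(), &reader));

      // Materialize the file as an Arrow table.
      std::shared_ptr<arrow::Table> table;
      PARQUET_THROW_NOT_OK(reader->ReadTable(&table));
      return 0;
    }

Nothing about that workflow would change under a monorepo; only the
number of build systems and repositories to juggle would.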

On the subject of code velocity: Arrow has a diverse community and
nearly 200 unique contributors at this point. Parquet has about 30.
The Arrow codebase history starts on February 5, 2016. Since then:

* Arrow has had 2055 patches
* parquet-cpp has had 425

So the patch volume is about 5x as high on average. This does not look
like a project that is struggling to merge patches. We are all
invested in the success of Parquet, and changing the structure of the
code to help the community work more productively would not change
that.

> The community will be less willing to accept large
> changes that require multiple rounds of patches for stability and API
> convergence. Our contributions to Libhdfs++ in the HDFS community took a
> significant amount of time for the very same reason.

Please don't use bad experiences from another open source community as
leverage in this discussion. I'm sorry that things didn't go the way
you wanted in Apache Hadoop, but this is a distinct community that
happens to operate under a similar open governance model.

After spending significant time thinking about it, I unfortunately
think that the next-best option after a monorepo structure would be for
the Arrow community to _fork_ the parquet-cpp codebase and go our
separate ways. Beyond these two options, I fail to see a pragmatic
solution to the problems we've been having.

- Wes

>
>
>
> On Mon, Jul 30, 2018 at 6:05 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
>> hi Deepak
>>
>> On Mon, Jul 30, 2018 at 5:18 PM, Deepak Majeti <majeti.dee...@gmail.com>
>> wrote:
>> > @Wes
>> > My observation is that most of the parquet-cpp contributors you listed
>> > that overlap with the Arrow community mainly contribute to the Arrow
>> > bindings (parquet::arrow layer)/platform API changes in the parquet-cpp
>> > repo. Very few of them review/contribute patches to the parquet-cpp core.
>> >
>>
>> So, what are you saying exactly, that some contributions or
>> contributors to Apache Parquet matter more than others? I don't
>> follow.
>>
>> As a result of these individuals' efforts, the parquet-cpp libraries
>> are being installed well over 100,000 times per month on a single
>> install path (Python) alone.
>>
>
>> > I believe improvements to the parquet-cpp core will be negatively
>> > impacted since merging the parquet-cpp and arrow-cpp repos will increase
>> > the barrier of entry to new contributors interested in the parquet-cpp
>> > core. The current extensions to the parquet-cpp core related to bloom
>> > filters and column encryption are all being done by first-time
>> > contributors.
>>
>> I don't understand why this would "increase the barrier of entry".
>> Could you explain?
>>
>> It is true that there would be more code in the codebase, but the
>> build and test procedure would be no more complex. If anything,
>> community productivity will be improved by having a more cohesive /
>> centralized development platform (large amounts of code that Parquet
>> depends on are in Apache Arrow already).
>>
>> >
>> > If you believe there will be new interest in the parquet-cpp core with
>> > the mono-repo approach, I am all up for it.
>>
>> Yes, I believe that this change will result in more and higher-quality
>> code review for Parquet core changes and general improvements to
>> developer productivity across the board. Developer productivity is
>> what this is all about.
>>
>> - Wes
>>
>> >
>> >
>> > On Mon, Jul 30, 2018 at 12:18 AM Philipp Moritz <pcmor...@gmail.com>
>> > wrote:
>> >
>> >> I do not claim to have insight into parquet-cpp development. However,
>> >> from our experience developing Ray, I can say that the monorepo approach
>> >> (for Ray) has improved things a lot. Before, we tried various schemes to
>> >> split the project into multiple repos, but the build system and test
>> >> infrastructure duplication and the overhead from synchronizing changes
>> >> slowed development down significantly (and fixing bugs that touch the
>> >> subrepos and the main repo is inconvenient).
>> >>
>> >> Also, the decision to put arrow and parquet-cpp into a common repo is
>> >> independent of how tightly coupled the two projects are (and there could
>> >> be a matrix entry in Travis which tests that PRs keep them decoupled, or
>> >> rather that they both just depend on a small common "base"). Google and
>> >> Facebook demonstrate such independence by having many, many projects in
>> >> the same repo, of course. It would be great if the open source community
>> >> would move more in this direction too, I think.
>> >>
>> >> Best,
>> >> Philipp.
>> >>
>> >> On Sun, Jul 29, 2018 at 8:54 PM, Wes McKinney <wesmck...@gmail.com>
>> >> wrote:
>> >>
>> >> > hi Donald,
>> >> >
>> >> > This would make things worse, not better. Code changes routinely
>> >> > involve changes to the build system, and so you could be talking about
>> >> > having to make changes to 2 or 3 git repositories as the result of a
>> >> > single new feature or bug fix. There isn't really a cross-repo CI
>> >> > solution available.
>> >> >
>> >> > I've seen some approaches to the monorepo problem using multiple git
>> >> > repositories, such as
>> >> >
>> >> > https://github.com/twosigma/git-meta
>> >> >
>> >> > Until something like this has first-class support from the GitHub
>> >> > platform and its CI services (Travis CI, Appveyor), I don't think it
>> >> > will work for us.
>> >> >
>> >> > - Wes
>> >> >
>> >> > On Sun, Jul 29, 2018 at 10:54 PM, Donald E. Foss <donald.f...@gmail.com>
>> >> > wrote:
>> >> > > Could this work as each module gets configured as sub-git repos? Top
>> >> > > level build tools go into each sub-repo and pick the correct release
>> >> > > version to test. Tests in Python are dependent on the cpp sub-repo to
>> >> > > ensure the API still passes.
>> >> > >
>> >> > > This should be the best of both worlds, if sub-repos are a supported
>> >> > > option.
>> >> > >
>> >> > > --Donald E. Foss
>> >> > >
>> >> > > On Sun, Jul 29, 2018, 10:44 PM Deepak Majeti <majeti.dee...@gmail.com>
>> >> > > wrote:
>> >> > >
>> >> > >> I dislike the current build system complications as well.
>> >> > >>
>> >> > >> However, in my opinion, combining the code bases will severely
>> >> > >> impact the progress of the parquet-cpp project and implicitly the
>> >> > >> progress of the entire parquet project.
>> >> > >> Combining would have made much more sense if parquet-cpp were a
>> >> > >> mature project and codebase. But parquet-cpp (and the entire parquet
>> >> > >> project) is evolving continuously, with new features being added
>> >> > >> including bloom filters, column encryption, and indexes.
>> >> > >>
>> >> > >> If the two code bases are merged, it will be much more difficult to
>> >> > >> contribute to the parquet-cpp project since the Arrow bindings will
>> >> > >> then have to be supported as well. Please correct me if I am wrong
>> >> > >> here.
>> >> > >>
>> >> > >> Out of the two evils, I think handling the build system and
>> >> > >> packaging duplication is much more manageable since they are quite
>> >> > >> stable at this point.
>> >> > >>
>> >> > >> Regarding "* API changes cause awkward release coordination issues
>> >> > >> between Arrow and Parquet": can we make minor releases for
>> >> > >> parquet-cpp (with API changes needed) as and when Arrow is released?
>> >> > >>
>> >> > >> Regarding "we maintain Arrow conversion code in parquet-cpp for
>> >> > >> converting between the Arrow columnar memory format and Parquet":
>> >> > >> can this be moved to the Arrow project, with parquet-cpp exposing
>> >> > >> the more stable low-level APIs?
>> >> > >>
>> >> > >> I am also curious if the Arrow and Parquet Java implementations have
>> >> > >> similar API compatibility issues.
>> >> > >>
>> >> > >>
>> >> > >> On Sat, Jul 28, 2018 at 7:45 PM Wes McKinney <wesmck...@gmail.com>
>> >> > >> wrote:
>> >> > >>
>> >> > >> > hi folks,
>> >> > >> >
>> >> > >> > We've been struggling for quite some time with the development
>> >> > >> > workflow between the Arrow and Parquet C++ (and Python) codebases.
>> >> > >> >
>> >> > >> > To explain the root issues:
>> >> > >> >
>> >> > >> > * parquet-cpp depends on "platform code" in Apache Arrow; this
>> >> > >> > includes file interfaces, memory management, miscellaneous
>> >> > >> > algorithms (e.g. dictionary encoding), etc. Note that before this
>> >> > >> > "platform" dependency was introduced, there was significant
>> >> > >> > duplicated code between these codebases and incompatible abstract
>> >> > >> > interfaces for things like files
>> >> > >> >
>> >> > >> > * we maintain Arrow conversion code in parquet-cpp for converting
>> >> > >> > between the Arrow columnar memory format and Parquet
>> >> > >> >
>> >> > >> > * we maintain Python bindings for parquet-cpp + Arrow interop in
>> >> > >> > Apache Arrow. This introduces a circular dependency into our CI.
>> >> > >> >
>> >> > >> > * Substantial portions of our CMake build system and related
>> >> > >> > tooling are duplicated between the Arrow and Parquet repos
>> >> > >> >
>> >> > >> > * API changes cause awkward release coordination issues between
>> >> > >> > Arrow and Parquet
>> >> > >> >
>> >> > >> > I believe the best way to remedy the situation is to adopt a
>> >> > >> > "Community over Code" approach and find a way for the Parquet and
>> >> > >> > Arrow C++ development communities to operate out of the same code
>> >> > >> > repository, i.e. the apache/arrow git repository.
>> >> > >> >
>> >> > >> > This would bring major benefits:
>> >> > >> >
>> >> > >> > * Shared CMake build infrastructure, developer tools, and CI
>> >> > >> > infrastructure (Parquet is already being built as a dependency in
>> >> > >> > Arrow's CI systems)
>> >> > >> >
>> >> > >> > * Shared packaging and release management infrastructure
>> >> > >> >
>> >> > >> > * Reduce / eliminate problems due to API changes (where we
>> >> > >> > currently introduce breakage into our CI workflow when there is a
>> >> > >> > breaking / incompatible change)
>> >> > >> >
>> >> > >> > * Arrow releases would include a coordinated snapshot of the
>> >> > >> > Parquet implementation as it stands
>> >> > >> >
>> >> > >> > Continuing with the status quo has become unsatisfactory to me,
>> >> > >> > and as a result I've become less motivated to work on the
>> >> > >> > parquet-cpp codebase.
>> >> > >> >
>> >> > >> > The only Parquet C++ committer who is not an Arrow committer is
>> >> > >> > Deepak Majeti. I think the issue of commit privileges could be
>> >> > >> > resolved without too much difficulty or time.
>> >> > >> >
>> >> > >> > I also think that, if it is deemed truly necessary, the Apache
>> >> > >> > Parquet community could create release scripts to cut a minimal
>> >> > >> > versioned Apache Parquet C++ release.
>> >> > >> >
>> >> > >> > I know that some people are wary of monorepos and megaprojects,
>> >> > >> > but as an example, TensorFlow is at least 10 times as large a
>> >> > >> > project in terms of LOC and number of different platform
>> >> > >> > components, and it seems to be getting along just fine. I think we
>> >> > >> > should be able to work together as a community to function just as
>> >> > >> > well.
>> >> > >> >
>> >> > >> > Interested in the opinions of others, and any other ideas for
>> >> > >> > practical solutions to the above problems.
>> >> > >> >
>> >> > >> > Thanks,
>> >> > >> > Wes
>> >> > >> >
>> >> > >>
>> >> > >>
>> >> > >> --
>> >> > >> regards,
>> >> > >> Deepak Majeti
>> >> > >>
>> >> >
>> >>
>> >
>> >
>> > --
>> > regards,
>> > Deepak Majeti
>>
>
>
> --
> regards,
> Deepak Majeti
