> If you still strongly feel that the only way forward is to clone the
> parquet-cpp repo and part ways, I will withdraw my concern. Having two
> parquet-cpp repos is in no way a better approach.
Yes, indeed. In my view, the next best option after a monorepo is to
fork. That would obviously be a bad outcome for the community. It
doesn't look like I will be able to convince you that a monorepo is a
good idea; what I would ask instead is that you be willing to give it
a shot, and if it turns out the way you're describing (which I don't
think it will), then I suggest that we fork at that point.

- Wes

On Tue, Jul 31, 2018 at 2:14 PM, Deepak Majeti <majeti.dee...@gmail.com> wrote:
> Wes,
>
> Unfortunately, I cannot show you any practical, fact-based problems with a
> non-existent Arrow-Parquet monorepo.
> Bringing in related Apache community experiences is more meaningful than
> how monorepos work at Google and other big organizations.
> We solely depend on volunteers and cannot hire full-time developers.
> You are very well aware of how difficult it has been to find more
> contributors and maintainers for Arrow. parquet-cpp already has a low
> contribution rate to its core components.
>
> We should ensure that new volunteers who want to contribute bug
> fixes/features spend the least amount of time figuring out the
> project repo. We can never come up with an automated build system that
> caters to every possible environment.
> My only concern is whether the monorepo will make it harder for new
> developers to work on the parquet-cpp core just due to the additional
> code, build, and test dependencies.
> I am not saying that the Arrow community/committers will be less
> cooperative.
> I just don't think the monorepo structure will be sustainable in an
> open source community unless there are long-term vested interests. We
> can't predict that.
>
> The current circular dependency problems between Arrow and Parquet are a
> major problem for the community, and solving them is important.
>
> The current Arrow adapter code for Parquet should live in the arrow repo.
> That will remove a majority of the dependency issues.
> Joshua's work would not have been blocked in parquet-cpp if that adapter
> was in the arrow repo. This will be similar to the ORC adapter.
>
> The platform API code is pretty stable at this point. Minor future
> changes to this code should not be the main reason to combine the arrow
> and parquet repos.
>
> "*I question whether it's worth the community's time long term to wear
> ourselves out defining custom "ports" / virtual interfaces in each
> library to plug components together rather than utilizing common
> platform APIs.*"
>
> My answer to your question above would be "yes". Modularity/separation is
> very important in an open source community where the priorities of
> contributors are often short term.
> Retention is low, and therefore the acquisition costs should be low as
> well. That is the "community over code" approach, in my view. Minor code
> duplication is not a deal breaker.
> ORC, Parquet, Arrow, etc. are all different components in the big data
> space serving their own functions.
>
> If you still strongly feel that the only way forward is to clone the
> parquet-cpp repo and part ways, I will withdraw my concern. Having two
> parquet-cpp repos is in no way a better approach.
>
> On Tue, Jul 31, 2018 at 10:28 AM Wes McKinney <wesmck...@gmail.com> wrote:
>
>> @Antoine
>>
>> > By the way, one concern with the monorepo approach: it would slightly
>> > increase Arrow CI times (which are already too large).
>>
>> A typical CI run in Arrow takes about 45 minutes:
>> https://travis-ci.org/apache/arrow/builds/410119750
>>
>> A Parquet run takes about 28 minutes:
>> https://travis-ci.org/apache/parquet-cpp/builds/410147208
>>
>> Inevitably we will need to create some kind of bot to run certain
>> builds on demand based on commit / PR metadata or on request.
>>
>> The slowest build in Arrow (the Arrow C++/Python one) could be
>> made substantially shorter by moving some of the slower parts (like
>> the Python ASV benchmarks) from being tested every commit to nightly
>> or on demand. Using ASAN instead of valgrind in Travis would also
>> improve build times (the valgrind build could be moved to a nightly,
>> exhaustive test run).
>>
>> - Wes
>>
>> On Mon, Jul 30, 2018 at 10:54 PM, Wes McKinney <wesmck...@gmail.com>
>> wrote:
>> >> I would like to point out that arrow's use of orc is a great example
>> >> of how it would be possible to manage parquet-cpp as a separate
>> >> codebase. That gives me hope that the projects could be managed
>> >> separately some day.
>> >
>> > Well, I don't know that ORC is the best example. The ORC C++ codebase
>> > features several areas of duplicated logic which could be replaced by
>> > components from the Arrow platform for better platform-wide
>> > interoperability:
>> >
>> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/OrcFile.hh#L37
>> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/Int128.hh
>> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/MemoryPool.hh
>> > https://github.com/apache/orc/blob/master/c%2B%2B/src/io/InputStream.hh
>> > https://github.com/apache/orc/blob/master/c%2B%2B/src/io/OutputStream.hh
>> >
>> > ORC's use of symbols from Protocol Buffers was actually a cause of
>> > bugs that we had to fix in Arrow's build system to prevent them from
>> > leaking to third-party linkers when statically linked (ORC is only
>> > available for static linking at the moment, AFAIK).
>> >
>> > I question whether it's worth the community's time long term to wear
>> > ourselves out defining custom "ports" / virtual interfaces in each
>> > library to plug components together rather than utilizing common
>> > platform APIs.
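[Editor's note: the per-commit vs. nightly split proposed above can be
sketched as a Travis CI config fragment. This is an illustration only;
the `ARROW_TRAVIS_ASAN` / `ARROW_TRAVIS_VALGRIND` variable names are
hypothetical, not Arrow's actual build configuration.]

```yaml
# Sketch (assumed variable names): run the fast ASAN build on every
# push/PR, and move the slow valgrind build to Travis's cron trigger.
matrix:
  include:
    - name: "C++ / Python with ASAN (every commit)"
      env: ARROW_TRAVIS_ASAN=1
    - name: "C++ with valgrind (nightly cron only)"
      env: ARROW_TRAVIS_VALGRIND=1
      if: type = cron  # conditional build: only cron-scheduled runs
```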
>> >
>> > - Wes
>> >
>> > On Mon, Jul 30, 2018 at 10:45 PM, Joshua Storck
>> > <joshuasto...@gmail.com> wrote:
>> >> Your point about the constraints of the ASF release process is well
>> >> taken, and as a developer who's trying to work in the current
>> >> environment I would be much happier if the codebases were merged. The
>> >> main issues I worry about when you put codebases like these together
>> >> are:
>> >>
>> >> 1. The delineation of APIs becomes blurred and the code becomes too
>> >> coupled
>> >> 2. Release of artifacts that are lower in the dependency tree is
>> >> delayed by artifacts higher in the dependency tree
>> >>
>> >> If the project/release management is structured well and someone
>> >> keeps an eye on the coupling, then I don't have any concerns.
>> >>
>> >> I would like to point out that arrow's use of orc is a great example
>> >> of how it would be possible to manage parquet-cpp as a separate
>> >> codebase. That gives me hope that the projects could be managed
>> >> separately some day.
>> >>
>> >> On Mon, Jul 30, 2018 at 10:23 PM Wes McKinney <wesmck...@gmail.com>
>> >> wrote:
>> >>
>> >>> hi Josh,
>> >>>
>> >>> > I can imagine use cases for parquet that don't involve arrow and
>> >>> > tying them together seems like the wrong choice.
>> >>>
>> >>> Apache is "Community over Code"; right now it's the same people
>> >>> building these projects -- my argument (which I think you agree with?)
>> >>> is that we should work more closely together until the community grows
>> >>> large enough to support a larger-scope process than we have now. As
>> >>> you've seen, our process isn't serving developers of these projects.
>> >>>
>> >>> > I also think build tooling should be pulled into its own codebase.
>> >>>
>> >>> I don't see how this can possibly be practical taking into
>> >>> consideration the constraints imposed by the combination of the GitHub
>> >>> platform and the ASF release process.
>> >>> I'm all for being idealistic,
>> >>> but right now we need to be practical. Unless we can devise a
>> >>> practical procedure that can accommodate at least 1 patch per day
>> >>> which may touch both code and build system simultaneously without
>> >>> being a hindrance to contributor or maintainer, I don't see how we can
>> >>> move forward.
>> >>>
>> >>> > That being said, I think it makes sense to merge the codebases in
>> >>> > the short term with the express purpose of separating them in the
>> >>> > near term.
>> >>>
>> >>> I would agree, but only if separation can be demonstrated to be
>> >>> practical and to result in net improvements in productivity and
>> >>> community growth. I think experience has clearly demonstrated that
>> >>> the current separation is impractical and is causing problems.
>> >>>
>> >>> Per Julian's and Ted's comments, I think we need to consider
>> >>> development process and ASF releases separately. My argument is as
>> >>> follows:
>> >>>
>> >>> * Monorepo for development (for practicality)
>> >>> * Releases structured according to the desires of the PMCs
>> >>>
>> >>> - Wes
>> >>>
>> >>> On Mon, Jul 30, 2018 at 9:31 PM, Joshua Storck
>> >>> <joshuasto...@gmail.com> wrote:
>> >>> > I recently worked on an issue that had to be implemented in
>> >>> > parquet-cpp (ARROW-1644, ARROW-1599) but required changes in arrow
>> >>> > (ARROW-2585, ARROW-2586). I found the circular dependencies
>> >>> > confusing and hard to work with. For example, I still have a PR
>> >>> > open in parquet-cpp (created on May 10) because of a PR that it
>> >>> > depended on in arrow that was only recently merged. I couldn't even
>> >>> > address any CI issues in the PR because the change in arrow was not
>> >>> > yet in master. In a separate PR, I changed the run_clang_format.py
>> >>> > script in the arrow project only to find out later that there was
>> >>> > an exact copy of it in parquet-cpp.
>> >>> >
>> >>> > However, I don't think merging the codebases makes sense in the
>> >>> > long term. I can imagine use cases for parquet that don't involve
>> >>> > arrow, and tying them together seems like the wrong choice. There
>> >>> > will be other formats that arrow needs to support that will be kept
>> >>> > separate (e.g., ORC), so I don't see why parquet should be special.
>> >>> > I also think build tooling should be pulled into its own codebase.
>> >>> > GNU has had a long history of developing open source C/C++ projects
>> >>> > that way and created projects like autoconf/automake/make to
>> >>> > support them. I don't think CI is a good counter-example since
>> >>> > there have been lots of successful open source projects that have
>> >>> > used nightly build systems that pinned versions of dependent
>> >>> > software.
>> >>> >
>> >>> > That being said, I think it makes sense to merge the codebases in
>> >>> > the short term with the express purpose of separating them in the
>> >>> > near term. My reasoning is as follows. By putting the codebases
>> >>> > together, you can more easily delineate the boundaries between the
>> >>> > APIs with a single PR. Second, it will force the build tooling to
>> >>> > converge instead of diverge, which has already happened. Once the
>> >>> > boundaries and tooling have been sorted out, it should be easy to
>> >>> > separate them back into their own codebases.
>> >>> >
>> >>> > If the codebases are merged, I would ask that the C++ codebase for
>> >>> > arrow be separated from the other languages. Looking at it from the
>> >>> > perspective of a parquet-cpp library user, having a dependency on
>> >>> > Java is a large tax to pay if you don't need it. For example, there
>> >>> > were 25 JIRAs in the 0.10.0 release of arrow, many of which were
>> >>> > holding up the release.
>> >>> > I hope that
>> >>> > seems like a reasonable compromise, and I think it will help reduce
>> >>> > the complexity of the build/release tooling.
>> >>> >
>> >>> > On Mon, Jul 30, 2018 at 8:50 PM Ted Dunning <ted.dunn...@gmail.com>
>> >>> > wrote:
>> >>> >
>> >>> >> On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney <wesmck...@gmail.com>
>> >>> >> wrote:
>> >>> >>
>> >>> >> > > The community will be less willing to accept large
>> >>> >> > > changes that require multiple rounds of patches for stability
>> >>> >> > > and API convergence. Our contributions to Libhdfs++ in the
>> >>> >> > > HDFS community took a significantly long time for the very
>> >>> >> > > same reason.
>> >>> >> >
>> >>> >> > Please don't use bad experiences from another open source
>> >>> >> > community as leverage in this discussion. I'm sorry that things
>> >>> >> > didn't go the way you wanted in Apache Hadoop, but this is a
>> >>> >> > distinct community which happens to operate under a similar open
>> >>> >> > governance model.
>> >>> >>
>> >>> >> There are some more radical, community-building options as well.
>> >>> >> Take the Subversion project as a precedent. With Subversion, any
>> >>> >> Apache committer can request and receive a commit bit on some
>> >>> >> large fraction of Subversion.
>> >>> >>
>> >>> >> So why not take this a bit further and give every Parquet
>> >>> >> committer a commit bit in Arrow? Or even make them first-class
>> >>> >> committers in Arrow? Possibly even make it policy that every
>> >>> >> Parquet committer who asks will be given committer status in
>> >>> >> Arrow.
>> >>> >>
>> >>> >> That relieves a lot of the social anxiety here. Parquet committers
>> >>> >> don't have to worry at that point whether their patches will get
>> >>> >> merged; they can just merge them. Arrow shouldn't worry much about
>> >>> >> inviting in the Parquet committers.
After all, Arrow already depends a lot on parquet so >> why not >> >>> >> invite them in? >> >>> >> >> >>> >> > > > -- > regards, > Deepak Majeti