Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Wes McKinney Tue, 31 Jul 2018 12:17:53 -0700

> The current Arrow adaptor code for parquet should live in the arrow repo. 
> That will remove a majority of the dependency issues. Joshua's work would not 
> have been blocked in parquet-cpp if that adapter was in the arrow repo.  This 
> will be similar to the ORC adaptor.


This has been suggested before, but I don't see how it would alleviate
any issues because of the significant dependencies on other parts of
the Arrow codebase. What you are proposing is:

- (Arrow) arrow platform
- (Parquet) parquet core
- (Arrow) arrow columnar-parquet adapter interface
- (Arrow) Python bindings

To make this work, somehow Arrow core / libarrow would have to be
built before invoking the Parquet core part of the build system. You
would need to pass dependent targets across different CMake build
systems; I don't know if it's possible (I spent some time looking into
it earlier this year). This is what I meant by the lack of a "concrete
and actionable plan". The only thing that would really work would be
for the Parquet core to be "included" in the Arrow build system
somehow rather than using ExternalProject. Currently Parquet builds
Arrow using ExternalProject, and Parquet is unknown to the Arrow build
system because it's only depended upon by the Python bindings.
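
As a rough illustration of the ordering problem described above, here is a minimal Python sketch (not part of either build system) that models the four proposed layers as a dependency graph and counts how often the build would have to cross from one repository to the other. The component names and the `COMPONENTS` table are hypothetical labels invented for this example; only the layering itself comes from the list above.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical component names mapped to (owning repo, dependencies).
# The layering mirrors the four-item list above; nothing here is read
# from the real Arrow or Parquet CMake files.
COMPONENTS = {
    "arrow_platform": ("arrow", []),
    "parquet_core": ("parquet", ["arrow_platform"]),
    "arrow_parquet_adapter": ("arrow", ["arrow_platform", "parquet_core"]),
    "python_bindings": ("arrow", ["arrow_parquet_adapter"]),
}


def build_order(components):
    """Return the components in an order that satisfies every dependency."""
    graph = {name: set(deps) for name, (_, deps) in components.items()}
    return list(TopologicalSorter(graph).static_order())


def repo_switches(components, order):
    """Count how many times consecutive build steps belong to different repos."""
    repos = [components[name][0] for name in order]
    return sum(1 for a, b in zip(repos, repos[1:]) if a != b)


order = build_order(COMPONENTS)
switches = repo_switches(COMPONENTS, order)
print(order)
print("repo boundary crossings:", switches)
```

Under these assumptions the only valid order starts in Arrow, moves to Parquet, and then returns to Arrow, which is the hand-off between separate CMake build systems that the paragraph above calls out.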

And even if a solution could be devised, it would not wholly resolve
the CI workflow issues.

You could make Parquet completely independent of the Arrow codebase,
but at that point there is little reason to maintain a relationship
between the projects or their communities. We have spent a great deal
of effort refactoring the two projects to enable as much code sharing
as there is now.

- Wes

On Tue, Jul 31, 2018 at 2:29 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>> If you still strongly feel that the only way forward is to clone the 
>> parquet-cpp repo and part ways, I will withdraw my concern. Having two 
>> parquet-cpp repos is in no way a better approach.
>
> Yes, indeed. In my view, the next best option after a monorepo is to
> fork. That would obviously be a bad outcome for the community.
>
> It doesn't look like I will be able to convince you that a monorepo is
> a good idea; what I would ask instead is that you be willing to give
> it a shot, and if it turns out in the way you're describing (which I
> don't think it will) then I suggest that we fork at that point.
>
> - Wes
>
> On Tue, Jul 31, 2018 at 2:14 PM, Deepak Majeti <majeti.dee...@gmail.com> 
> wrote:
>> Wes,
>>
>> Unfortunately, I cannot show you any practical fact-based problems of a
>> non-existent Arrow-Parquet mono-repo.
>> Bringing in related Apache community experiences is more meaningful than
>> describing how mono-repos work at Google and other big organizations.
>> We solely depend on volunteers and cannot hire full-time developers.
>> You are very well aware of how difficult it has been to find more
>> contributors and maintainers for Arrow. parquet-cpp already has a low
>> contribution rate to its core components.
>>
>> We should aim to ensure that new volunteers who want to contribute
>> bug fixes or features spend as little time as possible figuring out
>> the project repo. We can never come up with an automated build system that
>> caters to every possible environment.
>> My only concern is if the mono-repo will make it harder for new developers
>> to work on parquet-cpp core just due to the additional code, build and test
>> dependencies.
>> I am not saying that the Arrow community/committers will be less
>> co-operative.
>> I just don't think the mono-repo structure model will be sustainable in an
>> open source community unless there are long-term vested interests. We can't
>> predict that.
>>
>> The current circular dependency between Arrow and Parquet is a major
>> problem for the community, and solving it is important.
>>
>> The current Arrow adaptor code for parquet should live in the arrow repo.
>> That will remove a majority of the dependency issues.
>> Joshua's work would not have been blocked in parquet-cpp if that adapter
>> was in the arrow repo.  This will be similar to the ORC adaptor.
>>
>> The platform API code is pretty stable at this point. Minor future changes
>> to this code should not be the main reason to combine the Arrow and
>> Parquet repos.
>>
>> "I question whether it's worth the community's time long term to wear
>> ourselves out defining custom "ports" / virtual interfaces in each library
>> to plug components together rather than utilizing common platform APIs."
>>
>> My answer to your question below would be "Yes". Modularity/separation is
>> very important in an open source community where priorities of contributors
>> are often short term.
>> Retention is low, and therefore acquisition costs should be low as
>> well. In my view, this is the community-over-code approach. Minor code
>> duplication is not a deal breaker.
>> ORC, Parquet, Arrow, etc. are all different components in the big data
>> space serving their own functions.
>>
>> If you still strongly feel that the only way forward is to clone the
>> parquet-cpp repo and part ways, I will withdraw my concern. Having two
>> parquet-cpp repos is in no way a better approach.
>>
>>
>>
>>
>> On Tue, Jul 31, 2018 at 10:28 AM Wes McKinney <wesmck...@gmail.com> wrote:
>>
>>> @Antoine
>>>
>>> > By the way, one concern with the monorepo approach: it would slightly
>>> > increase Arrow CI times (which are already too large).
>>>
>>> A typical CI run in Arrow is taking about 45 minutes:
>>> https://travis-ci.org/apache/arrow/builds/410119750
>>>
>>> A Parquet run takes about 28 minutes:
>>> https://travis-ci.org/apache/parquet-cpp/builds/410147208
>>>
>>> Inevitably we will need to create some kind of bot to run certain
>>> builds on-demand based on commit / PR metadata or on request.
>>>
>>> The slowest build in Arrow (the Arrow C++/Python one) could be
>>> made substantially shorter by moving some of the slower parts (like
>>> the Python ASV benchmarks) from every-commit testing to nightly
>>> or on-demand runs. Using ASAN instead of valgrind in Travis would also
>>> improve build times (the valgrind build could be moved to a nightly
>>> exhaustive test run).
>>>
>>> - Wes
>>>
>>> On Mon, Jul 30, 2018 at 10:54 PM, Wes McKinney <wesmck...@gmail.com>
>>> wrote:
>>> >> I would like to point out that arrow's use of orc is a great example of
>>> >> how it would be possible to manage parquet-cpp as a separate codebase.
>>> >> That gives me hope that the projects could be managed separately some day.
>>> >
>>> > Well, I don't know that ORC is the best example. The ORC C++ codebase
>>> > features several areas of duplicated logic which could be replaced by
>>> > components from the Arrow platform for better platform-wide
>>> > interoperability:
>>> >
>>> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/OrcFile.hh#L37
>>> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/Int128.hh
>>> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/MemoryPool.hh
>>> > https://github.com/apache/orc/blob/master/c%2B%2B/src/io/InputStream.hh
>>> > https://github.com/apache/orc/blob/master/c%2B%2B/src/io/OutputStream.hh
>>> >
>>> > ORC's use of symbols from Protocol Buffers was actually a cause of
>>> > bugs that we had to fix in Arrow's build system to prevent them from
>>> > leaking to third party linkers when statically linked (ORC is only
>>> > available for static linking at the moment AFAIK).
>>> >
>>> > I question whether it's worth the community's time long term to wear
>>> > ourselves out defining custom "ports" / virtual interfaces in each
>>> > library to plug components together rather than utilizing common
>>> > platform APIs.
>>> >
>>> > - Wes
>>> >
>>> > On Mon, Jul 30, 2018 at 10:45 PM, Joshua Storck <joshuasto...@gmail.com>
>>> > wrote:
>>> >> Your point about the constraints of the ASF release process is well
>>> >> taken, and as a developer who's trying to work in the current
>>> >> environment I would be much happier if the codebases were merged. The
>>> >> main issues I worry about when you put codebases like these together are:
>>> >>
>>> >> 1. The delineation of APIs becomes blurred and the code becomes too
>>> >> coupled
>>> >> 2. Releases of artifacts that are lower in the dependency tree are
>>> >> delayed by artifacts higher in the dependency tree
>>> >>
>>> >> If the project/release management is structured well and someone keeps
>>> >> an eye on the coupling, then I don't have any concerns.
>>> >>
>>> >> I would like to point out that arrow's use of orc is a great example of
>>> >> how it would be possible to manage parquet-cpp as a separate codebase.
>>> >> That gives me hope that the projects could be managed separately some day.
>>> >>
>>> >> On Mon, Jul 30, 2018 at 10:23 PM Wes McKinney <wesmck...@gmail.com>
>>> >> wrote:
>>> >>
>>> >>> hi Josh,
>>> >>>
>>> >>> > I can imagine use cases for parquet that don't involve arrow and
>>> >>> > tying them together seems like the wrong choice.
>>> >>>
>>> >>> Apache is "Community over Code"; right now it's the same people
>>> >>> building these projects -- my argument (which I think you agree with?)
>>> >>> is that we should work more closely together until the community grows
>>> >>> large enough to support a larger-scope process than we have now. As
>>> >>> you've seen, our process isn't serving developers of these projects.
>>> >>>
>>> >>> > I also think build tooling should be pulled into its own codebase.
>>> >>>
>>> >>> I don't see how this can possibly be practical taking into
>>> >>> consideration the constraints imposed by the combination of the GitHub
>>> >>> platform and the ASF release process. I'm all for being idealistic,
>>> >>> but right now we need to be practical. Unless we can devise a
>>> >>> practical procedure that can accommodate at least 1 patch per day
>>> >>> which may touch both code and build system simultaneously without
>>> >>> being a hindrance to contributor or maintainer, I don't see how we can
>>> >>> move forward.
>>> >>>
>>> >>> > That being said, I think it makes sense to merge the codebases in the
>>> >>> > short term with the express purpose of separating them in the near
>>> >>> > term.
>>> >>>
>>> >>> I would agree but only if separation can be demonstrated to be
>>> >>> practical and result in net improvements in productivity and community
>>> >>> growth. I think experience has clearly demonstrated that the current
>>> >>> separation is impractical, and is causing problems.
>>> >>>
>>> >>> Per Julian's and Ted's comments, I think we need to consider
>>> >>> development process and ASF releases separately. My argument is as
>>> >>> follows:
>>> >>>
>>> >>> * Monorepo for development (for practicality)
>>> >>> * Releases structured according to the desires of the PMCs
>>> >>>
>>> >>> - Wes
>>> >>>
>>> >>> On Mon, Jul 30, 2018 at 9:31 PM, Joshua Storck <joshuasto...@gmail.com>
>>> >>> wrote:
>>> >>> > I recently worked on an issue that had to be implemented in
>>> >>> > parquet-cpp (ARROW-1644, ARROW-1599) but required changes in arrow
>>> >>> > (ARROW-2585, ARROW-2586). I found the circular dependencies confusing
>>> >>> > and hard to work with. For example, I still have a PR open in
>>> >>> > parquet-cpp (created on May 10) because of a PR that it depended on in
>>> >>> > arrow that was recently merged. I couldn't even address any CI issues
>>> >>> > in the PR because the change in arrow was not yet in master. In a
>>> >>> > separate PR, I changed the run_clang_format.py script in the arrow
>>> >>> > project only to find out later that there was an exact copy of it in
>>> >>> > parquet-cpp.
>>> >>> >
>>> >>> > However, I don't think merging the codebases makes sense in the long
>>> >>> > term. I can imagine use cases for parquet that don't involve arrow and
>>> >>> > tying them together seems like the wrong choice. There will be other
>>> >>> > formats that arrow needs to support that will be kept separate (e.g.
>>> >>> > Orc), so I don't see why parquet should be special. I also think build
>>> >>> > tooling should be pulled into its own codebase. GNU has had a long
>>> >>> > history of developing open source C/C++ projects that way and made
>>> >>> > projects like autoconf/automake/make to support them. I don't think CI
>>> >>> > is a good counter-example since there have been lots of successful open
>>> >>> > source projects that have used nightly build systems that pinned
>>> >>> > versions of dependent software.
>>> >>> >
>>> >>> > That being said, I think it makes sense to merge the codebases in the
>>> >>> > short term with the express purpose of separating them in the near
>>> >>> > term. My reasoning is as follows. First, by putting the codebases
>>> >>> > together, you can more easily delineate the boundaries between the APIs
>>> >>> > with a single PR. Second, it will force the build tooling to converge
>>> >>> > instead of continuing to diverge, as has already happened. Once the
>>> >>> > boundaries and tooling have been sorted out, it should be easy to
>>> >>> > separate them back into their own codebases.
>>> >>> >
>>> >>> > If the codebases are merged, I would ask that the C++ codebases for
>>> >>> > arrow be separated from other languages. Looking at it from the
>>> >>> > perspective of a parquet-cpp library user, having a dependency on Java
>>> >>> > is a large tax to pay if you don't need it. For example, there were 25
>>> >>> > JIRAs in the 0.10.0 release of arrow, many of which were holding up the
>>> >>> > release. I hope that seems like a reasonable compromise, and I think it
>>> >>> > will help reduce the complexity of the build/release tooling.
>>> >>> >
>>> >>> >
>>> >>> > On Mon, Jul 30, 2018 at 8:50 PM Ted Dunning <ted.dunn...@gmail.com>
>>> >>> > wrote:
>>> >>> >
>>> >>> >> On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney <wesmck...@gmail.com>
>>> >>> >> wrote:
>>> >>> >>
>>> >>> >> >
>>> >>> >> > > The community will be less willing to accept large
>>> >>> >> > > changes that require multiple rounds of patches for stability
>>> >>> >> > > and API convergence. Our contributions to Libhdfs++ in the HDFS
>>> >>> >> > > community took a significantly long time for the very same
>>> >>> >> > > reason.
>>> >>> >> >
>>> >>> >> > Please don't use bad experiences from another open source
>>> >>> >> > community as leverage in this discussion. I'm sorry that things
>>> >>> >> > didn't go the way you wanted in Apache Hadoop but this is a
>>> >>> >> > distinct community which happens to operate under a similar open
>>> >>> >> > governance model.
>>> >>> >>
>>> >>> >>
>>> >>> >> There are some more radical and community building options as well.
>>> >>> >> Take the subversion project as a precedent. With subversion, any
>>> >>> >> Apache committer can request and receive a commit bit on some large
>>> >>> >> fraction of subversion.
>>> >>> >>
>>> >>> >> So why not take this a bit further and give every parquet committer
>>> >>> >> a commit bit in Arrow? Or even make them be first class committers
>>> >>> >> in Arrow? Possibly even make it policy that every Parquet committer
>>> >>> >> who asks will be given committer status in Arrow.
>>> >>> >>
>>> >>> >> That relieves a lot of the social anxiety here. Parquet committers
>>> >>> >> can't be worried at that point whether their patches will get
>>> >>> >> merged; they can just merge them. Arrow shouldn't worry much about
>>> >>> >> inviting in the Parquet committers. After all, Arrow already depends
>>> >>> >> a lot on parquet so why not invite them in?
>>> >>> >>
>>> >>>
>>>
>>
>>
>> --
>> regards,
>> Deepak Majeti
