Do other people have opinions? I would like to undertake this work in the near future (the next 8-10 weeks); I would be OK with taking responsibility for the primary codebase surgery.
Some logistical questions:

* We have a handful of pull requests in flight in parquet-cpp that would need to be resolved / merged
* We should probably cut a status-quo cpp-1.5.0 release, with future releases cut out of the new structure
* Management of shared commit rights (I can discuss with the Arrow PMC; I believe that approving any committer who has actively maintained parquet-cpp should be a reasonable approach per Ted's comments)

If working more closely together proves not to work out after some period of time, I will be fully supportive of a fork or something like it.

Thanks,
Wes

On Wed, Aug 1, 2018 at 3:39 PM, Wes McKinney <wesmck...@gmail.com> wrote: > Thanks Tim. > > Indeed, it's not very simple. Just today Antoine cleaned up some > platform code intending to improve the performance of bit-packing in > Parquet writes, and we ended up with 2 interdependent PRs > > * https://github.com/apache/parquet-cpp/pull/483 > * https://github.com/apache/arrow/pull/2355 > > Changes that impact the Python interface to Parquet are even more complex. > > Adding options to Arrow's CMake build system to only build > Parquet-related code and dependencies (in a monorepo framework) would > not be difficult, and would amount to writing "make parquet". > > See e.g. https://stackoverflow.com/a/17201375. The desired commands to > build and install the Parquet core libraries and their dependencies > would be: > > ninja parquet && ninja install > > - Wes > > On Wed, Aug 1, 2018 at 2:34 PM, Tim Armstrong > <tarmstr...@cloudera.com.invalid> wrote: >> I don't have a direct stake in this beyond wanting to see Parquet be >> successful, but I thought I'd give my two cents. >> >> For me, the thing that makes the biggest difference in contributing to a >> new codebase is the number of steps in the workflow for writing, testing, >> posting and iterating on a commit, and also the number of opportunities for >> missteps.
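The "ninja parquet" workflow mentioned earlier in the thread could be wired up with an umbrella target in Arrow's CMake. A minimal sketch (the `ARROW_PARQUET` option and the `parquet_shared`/`parquet_static` target names are illustrative assumptions, not the actual build files):

```cmake
# Opt-in flag: build only the Parquet core and its platform dependencies.
option(ARROW_PARQUET "Build the Parquet core libraries" OFF)

if(ARROW_PARQUET)
  # Parquet sources live in the monorepo alongside the Arrow platform code.
  add_subdirectory(src/parquet)

  # Umbrella target so that `ninja parquet` builds the Parquet libraries
  # (and, transitively, the platform code they link against) and nothing else.
  add_custom_target(parquet)
  add_dependencies(parquet parquet_shared parquet_static)
endif()
```

With something along these lines, `cmake -GNinja -DARROW_PARQUET=ON .. && ninja parquet` would be the entire build step for a Parquet-only contributor.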
The size of the repo and build/test times matter but are >> secondary so long as the workflow is simple and reliable. >> >> I don't really know what the current state of things is, but it sounds like >> it's not as simple as check out -> build -> test if you're doing a >> cross-repo change. Circular dependencies are a real headache. >> >> On Tue, Jul 31, 2018 at 2:44 PM, Wes McKinney <wesmck...@gmail.com> wrote: >> >>> hi, >>> >>> On Tue, Jul 31, 2018 at 4:56 PM, Deepak Majeti <majeti.dee...@gmail.com> >>> wrote: >>> > I think the circular dependency can be broken if we build a new library >>> for >>> > the platform code. This will also make it easy for other projects such as >>> > ORC to use it. >>> > I also remember your proposal a while ago of having a separate project >>> for >>> > the platform code. That project can live in the arrow repo. However, one >>> > has to clone the entire apache arrow repo but can just build the platform >>> > code. This will be temporary until we can find a new home for it. >>> > >>> > The dependency will look like: >>> > libarrow(arrow core / bindings) <- libparquet (parquet core) <- >>> > libplatform(platform api) >>> > >>> > CI workflow will clone the arrow project twice, once for the platform >>> > library and once for the arrow-core/bindings library. >>> >>> This seems like an interesting proposal; the best place to work toward >>> this goal (if it is even possible; the build system interactions and >>> ASF release management are the hard problems) is to have all of the >>> code in a single repository. ORC could already be using Arrow if it >>> wanted, but the ORC contributors aren't active in Arrow. >>> >>> > >>> > There is no doubt that the collaborations between the Arrow and Parquet >>> > communities so far have been very successful. >>> > The reason to maintain this relationship moving forward is to continue to >>> > reap the mutual benefits. >>> > We should continue to take advantage of sharing code as well. 
However, I >>> > don't see any code sharing opportunities between arrow-core and the >>> > parquet-core. Both have different functions. >>> >>> I think you mean the Arrow columnar format. The Arrow columnar format >>> is only one part of a project that has become quite large already >>> (https://www.slideshare.net/wesm/apache-arrow-crosslanguage-development- >>> platform-for-inmemory-data-105427919). >>> >>> > >>> > We are at a point where the parquet-cpp public API is pretty stable. We >>> > already passed that difficult stage. My take on arrow and parquet is to >>> > keep them nimble while we can. >>> >>> I believe that parquet-core still has significant progress ahead of it. We >>> have done little work on asynchronous IO and concurrency, which would >>> yield both improved read and write throughput. This aligns well with >>> other concurrency and async-IO work planned in the Arrow platform. I >>> believe that more development will happen on parquet-core once the >>> development process issues are resolved by having a single codebase, >>> single build system, and a single CI framework. >>> >>> I have some gripes about design decisions made early in parquet-cpp, >>> like the use of C++ exceptions. So while "stability" is a reasonable >>> goal, I think we should still be open to making significant changes in >>> the interest of long-term progress. >>> >>> Having now worked on these projects for more than 2 and a half years, >>> and having been the most frequent contributor to both codebases, I'm sadly far >>> past the "breaking point" and not willing to continue contributing in >>> a significant way to parquet-cpp if the projects remain structured >>> as they are now. It's hampering progress and not serving the >>> community. >>> >>> - Wes >>> >>> > >>> > >>> > >>> > >>> > On Tue, Jul 31, 2018 at 3:17 PM Wes McKinney <wesmck...@gmail.com> >>> wrote: >>> > >>> >> > The current Arrow adaptor code for parquet should live in the arrow
That will remove a majority of the dependency issues. Joshua's >>> work >>> >> would not have been blocked in parquet-cpp if that adapter was in the >>> arrow >>> >> repo. This will be similar to the ORC adaptor. >>> >> >>> >> This has been suggested before, but I don't see how it would alleviate >>> >> any issues because of the significant dependencies on other parts of >>> >> the Arrow codebase. What you are proposing is: >>> >> >>> >> - (Arrow) arrow platform >>> >> - (Parquet) parquet core >>> >> - (Arrow) arrow columnar-parquet adapter interface >>> >> - (Arrow) Python bindings >>> >> >>> >> To make this work, somehow Arrow core / libarrow would have to be >>> >> built before invoking the Parquet core part of the build system. You >>> >> would need to pass dependent targets across different CMake build >>> >> systems; I don't know if it's possible (I spent some time looking into >>> >> it earlier this year). This is what I meant by the lack of a "concrete >>> >> and actionable plan". The only thing that would really work would be >>> >> for the Parquet core to be "included" in the Arrow build system >>> >> somehow rather than using ExternalProject. Currently Parquet builds >>> >> Arrow using ExternalProject, and Parquet is unknown to the Arrow build >>> >> system because it's only depended upon by the Python bindings. >>> >> >>> >> And even if a solution could be devised, it would not wholly resolve >>> >> the CI workflow issues. >>> >> >>> >> You could make Parquet completely independent of the Arrow codebase, >>> >> but at that point there is little reason to maintain a relationship >>> >> between the projects or their communities. We have spent a great deal >>> >> of effort refactoring the two projects to enable as much code sharing >>> >> as there is now. 
>>> >> >>> >> - Wes >>> >> >>> >> On Tue, Jul 31, 2018 at 2:29 PM, Wes McKinney <wesmck...@gmail.com> >>> wrote: >>> >> >> If you still strongly feel that the only way forward is to clone the >>> >> parquet-cpp repo and part ways, I will withdraw my concern. Having two >>> >> parquet-cpp repos is in no way a better approach. >>> >> > >>> >> > Yes, indeed. In my view, the next best option after a monorepo is to >>> >> > fork. That would obviously be a bad outcome for the community. >>> >> > >>> >> > It doesn't look like I will be able to convince you that a monorepo is >>> >> > a good idea; what I would ask instead is that you be willing to give >>> >> > it a shot, and if it turns out the way you're describing (which I >>> >> > don't think it will) then I suggest that we fork at that point. >>> >> > >>> >> > - Wes >>> >> > >>> >> > On Tue, Jul 31, 2018 at 2:14 PM, Deepak Majeti < >>> majeti.dee...@gmail.com> >>> >> wrote: >>> >> >> Wes, >>> >> >> >>> >> >> Unfortunately, I cannot show you any practical fact-based problems >>> of a >>> >> >> non-existent Arrow-Parquet mono-repo. >>> >> >> Bringing in related Apache community experiences is more meaningful >>> >> than >>> >> >> how mono-repos work at Google and other big organizations. >>> >> >> We depend solely on volunteers and cannot hire full-time developers. >>> >> >> You are very well aware of how difficult it has been to find more >>> >> >> contributors and maintainers for Arrow. parquet-cpp already has a low >>> >> >> contribution rate to its core components. >>> >> >> >>> >> >> We should aim to ensure that new volunteers who want to contribute >>> >> >> bug-fixes/features spend the least amount of time figuring >>> out >>> >> >> the project repo. We can never come up with an automated build system >>> >> that >>> >> >> caters to every possible environment.
>>> >> >> My only concern is that the mono-repo will make it harder for new >>> >> developers >>> >> >> to work on the parquet-cpp core just due to the additional code, build >>> and >>> >> test >>> >> >> dependencies. >>> >> >> I am not saying that the Arrow community/committers will be less >>> >> >> co-operative. >>> >> >> I just don't think the mono-repo structure will be sustainable >>> in >>> >> an >>> >> >> open source community unless there are long-term vested interests. We >>> >> can't >>> >> >> predict that. >>> >> >> >>> >> >> The current circular dependency problems between Arrow and Parquet >>> are a >>> >> >> major problem for the community, and fixing them is important. >>> >> >> >>> >> >> The current Arrow adaptor code for parquet should live in the arrow >>> >> repo. >>> >> >> That will remove a majority of the dependency issues. >>> >> >> Joshua's work would not have been blocked in parquet-cpp if that >>> adapter >>> >> >> was in the arrow repo. This will be similar to the ORC adaptor. >>> >> >> >>> >> >> The platform API code is pretty stable at this point. Minor changes >>> in >>> >> the >>> >> >> future to this code should not be the main reason to combine the >>> arrow >>> >> and >>> >> >> parquet repos. >>> >> >> >>> >> >> " >>> >> >> *I question whether it's worth the community's time long term to >>> wear* >>> >> >> >>> >> >> >>> >> >> *ourselves out defining custom "ports" / virtual interfaces in >>> >> each library >>> >> >> to plug components together rather than utilizing common platform >>> APIs.*" >>> >> >> >>> >> >> My answer to your question below would be "Yes". >>> Modularity/separation >>> >> is >>> >> >> very important in an open source community where priorities of >>> >> contributors >>> >> >> are often short-term. >>> >> >> Retention is low and therefore the acquisition costs should be >>> low >>> >> as >>> >> >> well. To me, this is the community-over-code approach. Minor >>> >> code >>> >> >> duplication is not a deal breaker.
>>> >> >> ORC, Parquet, Arrow, etc. are all different components in the big >>> data >>> >> >> space serving their own functions. >>> >> >> >>> >> >> If you still strongly feel that the only way forward is to clone the >>> >> >> parquet-cpp repo and part ways, I will withdraw my concern. Having >>> two >>> >> >> parquet-cpp repos is in no way a better approach. >>> >> >> >>> >> >> >>> >> >> >>> >> >> >>> >> >> On Tue, Jul 31, 2018 at 10:28 AM Wes McKinney <wesmck...@gmail.com> >>> >> wrote: >>> >> >> >>> >> >>> @Antoine >>> >> >>> >>> >> >>> > By the way, one concern with the monorepo approach: it would >>> slightly >>> >> >>> increase Arrow CI times (which are already too long). >>> >> >>> >>> >> >>> A typical CI run in Arrow takes about 45 minutes: >>> >> >>> https://travis-ci.org/apache/arrow/builds/410119750 >>> >> >>> >>> >> >>> A Parquet run takes about 28 minutes: >>> >> >>> https://travis-ci.org/apache/parquet-cpp/builds/410147208 >>> >> >>> >>> >> >>> Inevitably we will need to create some kind of bot to run certain >>> >> >>> builds on demand based on commit / PR metadata or on request. >>> >> >>> >>> >> >>> The slowest build in Arrow (the Arrow C++/Python one) could be >>> >> >>> made substantially shorter by moving some of the slower parts (like >>> >> >>> the Python ASV benchmarks) from being tested on every commit to nightly >>> >> >>> or on demand. Using ASAN instead of valgrind in Travis would also >>> >> >>> improve build times (the valgrind build could be moved to a nightly >>> >> >>> exhaustive test run). >>> >> >>> >>> >> >>> - Wes >>> >> >>> >>> >> >>> On Mon, Jul 30, 2018 at 10:54 PM, Wes McKinney <wesmck...@gmail.com >>> > >>> >> >>> wrote: >>> >> >>> >> I would like to point out that arrow's use of orc is a great >>> >> example of >>> >> >>> how it would be possible to manage parquet-cpp as a separate >>> codebase. >>> >> That >>> >> >>> gives me hope that the projects could be managed separately some >>> day.
>>> >> >>> > >>> >> >>> > Well, I don't know that ORC is the best example. The ORC C++ >>> codebase >>> >> >>> > features several areas of duplicated logic which could be >>> replaced by >>> >> >>> > components from the Arrow platform for better platform-wide >>> >> >>> > interoperability: >>> >> >>> > >>> >> >>> > >>> >> >>> >>> >> https://github.com/apache/orc/blob/master/c%2B%2B/include/ >>> orc/OrcFile.hh#L37 >>> >> >>> > >>> >> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/Int128.hh >>> >> >>> > >>> >> >>> >>> >> https://github.com/apache/orc/blob/master/c%2B%2B/include/ >>> orc/MemoryPool.hh >>> >> >>> > >>> >> https://github.com/apache/orc/blob/master/c%2B%2B/src/io/InputStream.hh >>> >> >>> > >>> >> https://github.com/apache/orc/blob/master/c%2B%2B/src/io/ >>> OutputStream.hh >>> >> >>> > >>> >> >>> > ORC's use of symbols from Protocol Buffers was actually a cause of >>> >> >>> > bugs that we had to fix in Arrow's build system to prevent them >>> from >>> >> >>> > leaking to third-party linkers when statically linked (ORC is only >>> >> >>> > available for static linking at the moment AFAIK). >>> >> >>> > >>> >> >>> > I question whether it's worth the community's time long term to >>> wear >>> >> >>> > ourselves out defining custom "ports" / virtual interfaces in each >>> >> >>> > library to plug components together rather than utilizing common >>> >> >>> > platform APIs. >>> >> >>> > >>> >> >>> > - Wes >>> >> >>> > >>> >> >>> > On Mon, Jul 30, 2018 at 10:45 PM, Joshua Storck < >>> >> joshuasto...@gmail.com> >>> >> >>> wrote: >>> >> >>> >> Your point about the constraints of the ASF release process is >>> well >>> >> >>> >> taken, and as a developer who's trying to work in the current >>> >> >>> environment I >>> >> >>> >> would be much happier if the codebases were merged. The main >>> issues >>> >> I >>> >> >>> worry >>> >> >>> >> about when you put codebases like these together are: >>> >> >>> >> >>> >> >>> >> 1.
The delineation of APIs becomes blurred and the code becomes >>> too >>> >> >>> coupled >>> >> >>> >> 2. Releases of artifacts that are lower in the dependency tree are >>> >> >>> delayed >>> >> >>> >> by artifacts higher in the dependency tree >>> >> >>> >> >>> >> >>> >> If the project/release management is structured well and someone >>> >> keeps >>> >> >>> an >>> >> >>> >> eye on the coupling, then I don't have any concerns. >>> >> >>> >> >>> >> >>> >> I would like to point out that arrow's use of orc is a great >>> >> example of >>> >> >>> how >>> >> >>> >> it would be possible to manage parquet-cpp as a separate >>> codebase. >>> >> That >>> >> >>> >> gives me hope that the projects could be managed separately some >>> >> day. >>> >> >>> >> >>> >> >>> >> On Mon, Jul 30, 2018 at 10:23 PM Wes McKinney < >>> wesmck...@gmail.com> >>> >> >>> wrote: >>> >> >>> >> >>> >> >>> >>> hi Josh, >>> >> >>> >>> >>> >> >>> >>> > I can imagine use cases for parquet that don't involve arrow >>> and >>> >> >>> tying >>> >> >>> >>> them together seems like the wrong choice. >>> >> >>> >>> >>> >> >>> >>> Apache is "Community over Code"; right now it's the same people >>> >> >>> >>> building these projects -- my argument (which I think you agree >>> >> with?) >>> >> >>> >>> is that we should work more closely together until the community >>> >> grows >>> >> >>> >>> large enough to support a larger-scope process than we have now. >>> As >>> >> >>> >>> you've seen, our process isn't serving developers of these >>> >> projects. >>> >> >>> >>> >>> >> >>> >>> > I also think build tooling should be pulled into its own >>> >> codebase. >>> >> >>> >>> >>> >> >>> >>> I don't see how this can possibly be practical taking into >>> >> >>> >>> consideration the constraints imposed by the combination of the >>> >> GitHub >>> >> >>> >>> platform and the ASF release process. I'm all for being >>> idealistic, >>> >> >>> >>> but right now we need to be practical.
Unless we can devise a >>> >> >>> >>> practical procedure that can accommodate at least 1 patch per >>> day >>> >> >>> >>> which may touch both code and build system simultaneously >>> without >>> >> >>> >>> being a hindrance to contributor or maintainer, I don't see how >>> we >>> >> can >>> >> >>> >>> move forward. >>> >> >>> >>> >>> >> >>> >>> > That being said, I think it makes sense to merge the codebases >>> >> in the >>> >> >>> >>> short term with the express purpose of separating them in the >>> near >>> >> >>> term. >>> >> >>> >>> >>> >> >>> >>> I would agree but only if separation can be demonstrated to be >>> >> >>> >>> practical and result in net improvements in productivity and >>> >> community >>> >> >>> >>> growth. I think experience has clearly demonstrated that the >>> >> current >>> >> >>> >>> separation is impractical, and is causing problems. >>> >> >>> >>> >>> >> >>> >>> Per Julian's and Ted's comments, I think we need to consider >>> >> >>> >>> development process and ASF releases separately. My argument is >>> as >>> >> >>> >>> follows: >>> >> >>> >>> >>> >> >>> >>> * Monorepo for development (for practicality) >>> >> >>> >>> * Releases structured according to the desires of the PMCs >>> >> >>> >>> >>> >> >>> >>> - Wes >>> >> >>> >>> >>> >> >>> >>> On Mon, Jul 30, 2018 at 9:31 PM, Joshua Storck < >>> >> joshuasto...@gmail.com >>> >> >>> > >>> >> >>> >>> wrote: >>> >> >>> >>> > I recently worked on an issue that had to be implemented in >>> >> >>> parquet-cpp >>> >> >>> >>> > (ARROW-1644, ARROW-1599) but required changes in arrow >>> >> (ARROW-2585, >>> >> >>> >>> > ARROW-2586). I found the circular dependencies confusing and >>> >> hard to >>> >> >>> work >>> >> >>> >>> > with. For example, I still have a PR open in parquet-cpp >>> >> (created on >>> >> >>> May >>> >> >>> >>> > 10) because of a PR that it depended on in arrow that was >>> >> recently >>> >> >>> >>> merged. 
>>> >> >>> >>> > I couldn't even address any CI issues in the PR because the >>> >> change in >>> >> >>> >>> arrow >>> >> >>> >>> > was not yet in master. In a separate PR, I changed the >>> >> >>> >>> run_clang_format.py >>> >> >>> >>> > script in the arrow project only to find out later that there >>> >> was an >>> >> >>> >>> exact >>> >> >>> >>> > copy of it in parquet-cpp. >>> >> >>> >>> > >>> >> >>> >>> > However, I don't think merging the codebases makes sense in >>> the >>> >> long >>> >> >>> >>> term. >>> >> >>> >>> > I can imagine use cases for parquet that don't involve arrow >>> and >>> >> >>> tying >>> >> >>> >>> them >>> >> >>> >>> > together seems like the wrong choice. There will be other >>> formats >>> >> >>> that >>> >> >>> >>> > arrow needs to support that will be kept separate (e.g. - >>> Orc), >>> >> so I >>> >> >>> >>> don't >>> >> >>> >>> > see why parquet should be special. I also think build tooling >>> >> should >>> >> >>> be >>> >> >>> >>> > pulled into its own codebase. GNU has had a long history of >>> >> >>> developing >>> >> >>> >>> open >>> >> >>> >>> > source C/C++ projects that way and made projects like >>> >> >>> >>> > autoconf/automake/make to support them. I don't think CI is a >>> >> good >>> >> >>> >>> > counter-example since there have been lots of successful open >>> >> source >>> >> >>> >>> > projects that have used nightly build systems that pinned >>> >> versions of >>> >> >>> >>> > dependent software. >>> >> >>> >>> > >>> >> >>> >>> > That being said, I think it makes sense to merge the codebases >>> >> in the >>> >> >>> >>> short >>> >> >>> >>> > term with the express purpose of separating them in the near >>> >> term. >>> >> >>> My >>> >> >>> >>> > reasoning is as follows. By putting the codebases together, >>> you >>> >> can >>> >> >>> more >>> >> >>> >>> > easily delineate the boundaries between the API's with a >>> single >>> >> PR. 
>>> >> >>> >>> Second, >>> >> >>> >>> > it will force the build tooling to converge instead of >>> diverge, >>> >> >>> which has >>> >> >>> >>> > already happened. Once the boundaries and tooling have been >>> >> sorted >>> >> >>> out, >>> >> >>> >>> it >>> >> >>> >>> > should be easy to separate them back into their own codebases. >>> >> >>> >>> > >>> >> >>> >>> > If the codebases are merged, I would ask that the C++ >>> codebases >>> >> for >>> >> >>> arrow >>> >> >>> >>> > be separated from other languages. Looking at it from the >>> >> >>> perspective of >>> >> >>> >>> a >>> >> >>> >>> > parquet-cpp library user, having a dependency on Java is a >>> large >>> >> tax >>> >> >>> to >>> >> >>> >>> pay >>> >> >>> >>> > if you don't need it. For example, there were 25 JIRA's in the >>> >> 0.10.0 >>> >> >>> >>> > release of arrow, many of which were holding up the release. I >>> >> hope >>> >> >>> that >>> >> >>> >>> > seems like a reasonable compromise, and I think it will help >>> >> reduce >>> >> >>> the >>> >> >>> >>> > complexity of the build/release tooling. >>> >> >>> >>> > >>> >> >>> >>> > >>> >> >>> >>> > On Mon, Jul 30, 2018 at 8:50 PM Ted Dunning < >>> >> ted.dunn...@gmail.com> >>> >> >>> >>> wrote: >>> >> >>> >>> > >>> >> >>> >>> >> On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney < >>> >> wesmck...@gmail.com> >>> >> >>> >>> wrote: >>> >> >>> >>> >> >>> >> >>> >>> >> > >>> >> >>> >>> >> > > The community will be less willing to accept large >>> >> >>> >>> >> > > changes that require multiple rounds of patches for >>> >> stability >>> >> >>> and >>> >> >>> >>> API >>> >> >>> >>> >> > > convergence. Our contributions to Libhdfs++ in the HDFS >>> >> >>> community >>> >> >>> >>> took >>> >> >>> >>> >> a >>> >> >>> >>> >> > > significantly long time for the very same reason. >>> >> >>> >>> >> > >>> >> >>> >>> >> > Please don't use bad experiences from another open source >>> >> >>> community as >>> >> >>> >>> >> > leverage in this discussion. 
I'm sorry that things didn't >>> go >>> >> the >>> >> >>> way >>> >> >>> >>> >> > you wanted in Apache Hadoop but this is a distinct >>> community >>> >> which >>> >> >>> >>> >> > happens to operate under a similar open governance model. >>> >> >>> >>> >> >>> >> >>> >>> >> >>> >> >>> >>> >> There are some more radical and community building options as >>> >> well. >>> >> >>> Take >>> >> >>> >>> >> the subversion project as a precedent. With subversion, any >>> >> Apache >>> >> >>> >>> >> committer can request and receive a commit bit on some large >>> >> >>> fraction of >>> >> >>> >>> >> subversion. >>> >> >>> >>> >> >>> >> >>> >>> >> So why not take this a bit further and give every parquet >>> >> committer >>> >> >>> a >>> >> >>> >>> >> commit bit in Arrow? Or even make them be first class >>> >> committers in >>> >> >>> >>> Arrow? >>> >> >>> >>> >> Possibly even make it policy that every Parquet committer who >>> >> asks >>> >> >>> will >>> >> >>> >>> be >>> >> >>> >>> >> given committer status in Arrow. >>> >> >>> >>> >> >>> >> >>> >>> >> That relieves a lot of the social anxiety here. Parquet >>> >> committers >>> >> >>> >>> can't be >>> >> >>> >>> >> worried at that point whether their patches will get merged; >>> >> they >>> >> >>> can >>> >> >>> >>> just >>> >> >>> >>> >> merge them. Arrow shouldn't worry much about inviting in the >>> >> >>> Parquet >>> >> >>> >>> >> committers. After all, Arrow already depends a lot on >>> parquet so >>> >> >>> why not >>> >> >>> >>> >> invite them in? >>> >> >>> >>> >> >>> >> >>> >>> >>> >> >>> >>> >> >> >>> >> >> >>> >> >> -- >>> >> >> regards, >>> >> >> Deepak Majeti >>> >> >>> > >>> > >>> > -- >>> > regards, >>> > Deepak Majeti >>>