Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Ryan Blue Tue, 07 Aug 2018 09:13:54 -0700

I don't have an opinion here, but could someone send a summary of what is
decided to the dev list once there is consensus? This is a long thread for
parts of the project I don't work on, so I haven't followed it very closely.


On Tue, Aug 7, 2018 at 8:22 AM Wes McKinney <wesmck...@gmail.com> wrote:

> > It will be difficult to track parquet-cpp changes if they get mixed with
> > Arrow changes. Will we establish some guidelines for filing Parquet JIRAs?
> > Can we enforce that parquet-cpp changes will not be committed without a
> > corresponding Parquet JIRA?
>
> I think we would use the following policy:
>
> * use PARQUET-XXX for issues relating to Parquet core
> * use ARROW-XXX for issues relating to Arrow's consumption of Parquet
> core (e.g. changes that are in parquet/arrow right now)
>
> We've already been dealing with annoyances relating to issues
> straddling the two projects (debugging an issue on Arrow side to find
> that it has to be fixed on Parquet side); this would make things
> simpler for us
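
As a rough sketch of how the PARQUET-XXX / ARROW-XXX policy above could be
checked mechanically in a merged repository: the Python below picks the
expected JIRA project from the paths a change touches and compares it with the
commit title. The directory prefixes and the "PROJECT-123: summary" title
format are assumptions for illustration only; nothing in this thread fixes
them.

```python
import re

# Hypothetical monorepo layout. These prefixes are assumptions made for this
# sketch, not paths confirmed anywhere in the discussion.
ADAPTER_PREFIX = "cpp/src/parquet/arrow/"  # Arrow's consumption of Parquet core
PARQUET_PREFIX = "cpp/src/parquet/"        # Parquet core

# Assumed commit title format: "PARQUET-123: summary" or "ARROW-123: summary".
TITLE_RE = re.compile(r"^(PARQUET|ARROW)-\d+: ")


def expected_project(changed_paths):
    """Return the JIRA project a change should be filed under."""
    touches_core = any(
        path.startswith(PARQUET_PREFIX) and not path.startswith(ADAPTER_PREFIX)
        for path in changed_paths
    )
    return "PARQUET" if touches_core else "ARROW"


def title_matches_policy(title, changed_paths):
    """True if the commit title carries the JIRA prefix the policy expects."""
    match = TITLE_RE.match(title)
    return bool(match) and match.group(1) == expected_project(changed_paths)
```

A merge script could call title_matches_policy() before committing and refuse
a Parquet core change that lacks a PARQUET-XXX title, which is the enforcement
asked about above.
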
>
> > I would also like to keep changes to parquet-cpp on a separate commit to
> > simplify forking later (if needed) and be able to maintain the commit
> > history.  I don't know if it's possible to squash parquet-cpp commits and
> > arrow commits separately before merging.
>
> This seems rather onerous for both contributors and maintainers and
> not in line with the goal of improving productivity. In the event that
> we fork I see it as a traumatic event for the community. If it does
> happen, then we can write a script (using git filter-branch and other
> such tools) to extract commits related to the forked code.
>
> - Wes
>
> On Tue, Aug 7, 2018 at 10:37 AM, Deepak Majeti <majeti.dee...@gmail.com>
> wrote:
> > I have a few more logistical questions to add.
> >
> > It will be difficult to track parquet-cpp changes if they get mixed with
> > Arrow changes. Will we establish some guidelines for filing Parquet JIRAs?
> > Can we enforce that parquet-cpp changes will not be committed without a
> > corresponding Parquet JIRA?
> >
> > I would also like to keep changes to parquet-cpp on a separate commit to
> > simplify forking later (if needed) and be able to maintain the commit
> > history.  I don't know if it's possible to squash parquet-cpp commits and
> > arrow commits separately before merging.
> >
> >
> > On Tue, Aug 7, 2018 at 8:57 AM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> >> Do other people have opinions? I would like to undertake this work in
> >> the near future (the next 8-10 weeks); I would be OK with taking
> >> responsibility for the primary codebase surgery.
> >>
> >> Some logistical questions:
> >>
> >> * We have a handful of pull requests in flight in parquet-cpp that
> >> would need to be resolved / merged
> >> * We should probably cut a status-quo cpp-1.5.0 release, with future
> >> releases cut out of the new structure
> >> * Management of shared commit rights (I can discuss with the Arrow
> >> PMC; I believe that approving any committer who has actively
> >> maintained parquet-cpp should be a reasonable approach per Ted's
> >> comments)
> >>
> >> If working more closely together proves to not be working out after
> >> some period of time, I will be fully supportive of a fork or something
> >> like it
> >>
> >> Thanks,
> >> Wes
> >>
> >> On Wed, Aug 1, 2018 at 3:39 PM, Wes McKinney <wesmck...@gmail.com>
> wrote:
> >> > Thanks Tim.
> >> >
> >> > Indeed, it's not very simple. Just today Antoine cleaned up some
> >> > platform code intending to improve the performance of bit-packing in
> >> > Parquet writes, and we ended up with two interdependent PRs:
> >> >
> >> > * https://github.com/apache/parquet-cpp/pull/483
> >> > * https://github.com/apache/arrow/pull/2355
> >> >
> >> > Changes that impact the Python interface to Parquet are even more
> >> > complex.
> >> >
> >> > Adding options to Arrow's CMake build system to only build
> >> > Parquet-related code and dependencies (in a monorepo framework) would
> >> > not be difficult, and would amount to writing "make parquet".
> >> >
> >> > See e.g. https://stackoverflow.com/a/17201375. The desired commands to
> >> > build and install the Parquet core libraries and their dependencies
> >> > would be:
> >> >
> >> > ninja parquet && ninja install
> >> >
> >> > - Wes
> >> >
> >> > On Wed, Aug 1, 2018 at 2:34 PM, Tim Armstrong
> >> > <tarmstr...@cloudera.com.invalid> wrote:
> >> >> I don't have a direct stake in this beyond wanting to see Parquet be
> >> >> successful, but I thought I'd give my two cents.
> >> >>
> >> >> For me, the thing that makes the biggest difference in contributing to
> >> >> a new codebase is the number of steps in the workflow for writing,
> >> >> testing, posting and iterating on a commit and also the number of
> >> >> opportunities for missteps. The size of the repo and build/test times
> >> >> matter but are secondary so long as the workflow is simple and reliable.
> >> >>
> >> >> I don't really know what the current state of things is, but it sounds
> >> >> like it's not as simple as check out -> build -> test if you're doing a
> >> >> cross-repo change. Circular dependencies are a real headache.
> >> >>
> >> >> On Tue, Jul 31, 2018 at 2:44 PM, Wes McKinney <wesmck...@gmail.com>
> >> wrote:
> >> >>
> >> >>> hi,
> >> >>>
> >> >>> On Tue, Jul 31, 2018 at 4:56 PM, Deepak Majeti <
> >> majeti.dee...@gmail.com>
> >> >>> wrote:
> >> >>> > I think the circular dependency can be broken if we build a new
> >> >>> > library for the platform code. This will also make it easy for other
> >> >>> > projects such as ORC to use it.
> >> >>> > I also remember your proposal a while ago of having a separate
> >> >>> > project for the platform code.  That project can live in the arrow
> >> >>> > repo. However, one has to clone the entire apache arrow repo but can
> >> >>> > just build the platform code. This will be temporary until we can
> >> >>> > find a new home for it.
> >> >>> >
> >> >>> > The dependency will look like:
> >> >>> > libarrow(arrow core / bindings) <- libparquet (parquet core) <-
> >> >>> > libplatform(platform api)
> >> >>> >
> >> >>> > CI workflow will clone the arrow project twice, once for the platform
> >> >>> > library and once for the arrow-core/bindings library.
> >> >>>
> >> >>> This seems like an interesting proposal; the best place to work toward
> >> >>> this goal (if it is even possible; the build system interactions and
> >> >>> ASF release management are the hard problems) is to have all of the
> >> >>> code in a single repository. ORC could already be using Arrow if it
> >> >>> wanted, but the ORC contributors aren't active in Arrow.
> >> >>>
> >> >>> >
> >> >>> > There is no doubt that the collaborations between the Arrow and
> >> >>> > Parquet communities so far have been very successful.
> >> >>> > The reason to maintain this relationship moving forward is to
> >> >>> > continue to reap the mutual benefits.
> >> >>> > We should continue to take advantage of sharing code as well.
> >> >>> > However, I don't see any code sharing opportunities between
> >> >>> > arrow-core and the parquet-core. Both have different functions.
> >> >>>
> >> >>> I think you mean the Arrow columnar format. The Arrow columnar format
> >> >>> is only one part of a project that has become quite large already
> >> >>> (https://www.slideshare.net/wesm/apache-arrow-crosslanguage-development-platform-for-inmemory-data-105427919).
> >> >>>
> >> >>> >
> >> >>> > We are at a point where the parquet-cpp public API is pretty stable.
> >> >>> > We already passed that difficult stage. My take on arrow and parquet
> >> >>> > is to keep them nimble since we can.
> >> >>>
> >> >>> I believe that parquet-core has progress to make yet ahead of it. We
> >> >>> have done little work in asynchronous IO and concurrency which would
> >> >>> yield both improved read and write throughput. This aligns well with
> >> >>> other concurrency and async-IO work planned in the Arrow platform. I
> >> >>> believe that more development will happen on parquet-core once the
> >> >>> development process issues are resolved by having a single codebase,
> >> >>> single build system, and a single CI framework.
> >> >>>
> >> >>> I have some gripes about design decisions made early in parquet-cpp,
> >> >>> like the use of C++ exceptions. So while "stability" is a reasonable
> >> >>> goal I think we should still be open to making significant changes in
> >> >>> the interest of long term progress.
> >> >>>
> >> >>> Having now worked on these projects for more than two and a half years,
> >> >>> and being the most frequent contributor to both codebases, I'm sadly
> >> >>> far past the "breaking point" and not willing to continue contributing
> >> >>> in a significant way to parquet-cpp if the projects remain structured
> >> >>> as they are now. It's hampering progress and not serving the
> >> >>> community.
> >> >>>
> >> >>> - Wes
> >> >>>
> >> >>> >
> >> >>> >
> >> >>> >
> >> >>> >
> >> >>> > On Tue, Jul 31, 2018 at 3:17 PM Wes McKinney <wesmck...@gmail.com
> >
> >> >>> wrote:
> >> >>> >
> >> >>> >> > The current Arrow adaptor code for parquet should live in the
> >> >>> >> > arrow repo. That will remove a majority of the dependency issues.
> >> >>> >> > Joshua's work would not have been blocked in parquet-cpp if that
> >> >>> >> > adapter was in the arrow repo.  This will be similar to the ORC
> >> >>> >> > adaptor.
> >> >>> >>
> >> >>> >> This has been suggested before, but I don't see how it would
> >> >>> >> alleviate any issues because of the significant dependencies on
> >> >>> >> other parts of the Arrow codebase. What you are proposing is:
> >> >>> >>
> >> >>> >> - (Arrow) arrow platform
> >> >>> >> - (Parquet) parquet core
> >> >>> >> - (Arrow) arrow columnar-parquet adapter interface
> >> >>> >> - (Arrow) Python bindings
> >> >>> >>
> >> >>> >> To make this work, somehow Arrow core / libarrow would have to be
> >> >>> >> built before invoking the Parquet core part of the build system.
> >> >>> >> You would need to pass dependent targets across different CMake
> >> >>> >> build systems; I don't know if it's possible (I spent some time
> >> >>> >> looking into it earlier this year). This is what I meant by the
> >> >>> >> lack of a "concrete and actionable plan". The only thing that would
> >> >>> >> really work would be for the Parquet core to be "included" in the
> >> >>> >> Arrow build system somehow rather than using ExternalProject.
> >> >>> >> Currently Parquet builds Arrow using ExternalProject, and Parquet is
> >> >>> >> unknown to the Arrow build system because it's only depended upon by
> >> >>> >> the Python bindings.
> >> >>> >>
> >> >>> >> And even if a solution could be devised, it would not wholly resolve
> >> >>> >> the CI workflow issues.
> >> >>> >>
> >> >>> >> You could make Parquet completely independent of the Arrow codebase,
> >> >>> >> but at that point there is little reason to maintain a relationship
> >> >>> >> between the projects or their communities. We have spent a great
> >> >>> >> deal of effort refactoring the two projects to enable as much code
> >> >>> >> sharing as there is now.
> >> >>> >>
> >> >>> >> - Wes
> >> >>> >>
> >> >>> >> On Tue, Jul 31, 2018 at 2:29 PM, Wes McKinney <
> wesmck...@gmail.com>
> >> >>> wrote:
> >> >>> >> >> If you still strongly feel that the only way forward is to clone
> >> >>> >> >> the parquet-cpp repo and part ways, I will withdraw my concern.
> >> >>> >> >> Having two parquet-cpp repos is in no way a better approach.
> >> >>> >> >
> >> >>> >> > Yes, indeed. In my view, the next best option after a monorepo is
> >> >>> >> > to fork. That would obviously be a bad outcome for the community.
> >> >>> >> >
> >> >>> >> > It doesn't look like I will be able to convince you that a
> >> >>> >> > monorepo is a good idea; what I would ask instead is that you be
> >> >>> >> > willing to give it a shot, and if it turns out in the way you're
> >> >>> >> > describing (which I don't think it will) then I suggest that we
> >> >>> >> > fork at that point.
> >> >>> >> >
> >> >>> >> > - Wes
> >> >>> >> >
> >> >>> >> > On Tue, Jul 31, 2018 at 2:14 PM, Deepak Majeti <
> >> >>> majeti.dee...@gmail.com>
> >> >>> >> wrote:
> >> >>> >> >> Wes,
> >> >>> >> >>
> >> >>> >> >> Unfortunately, I cannot show you any practical fact-based
> >> >>> >> >> problems of a non-existent Arrow-Parquet mono-repo.
> >> >>> >> >> Bringing in related Apache community experiences is more
> >> >>> >> >> meaningful than how mono-repos work at Google and other big
> >> >>> >> >> organizations.
> >> >>> >> >> We solely depend on volunteers and cannot hire full-time
> >> >>> >> >> developers.
> >> >>> >> >> You are very well aware of how difficult it has been to find more
> >> >>> >> >> contributors and maintainers for Arrow. parquet-cpp already has a
> >> >>> >> >> low contribution rate to its core components.
> >> >>> >> >>
> >> >>> >> >> We should aim to ensure that new volunteers who want to
> >> >>> >> >> contribute bug-fixes/features spend the least amount of time
> >> >>> >> >> figuring out the project repo. We can never come up with an
> >> >>> >> >> automated build system that caters to every possible environment.
> >> >>> >> >> My only concern is whether the mono-repo will make it harder for
> >> >>> >> >> new developers to work on parquet-cpp core just due to the
> >> >>> >> >> additional code, build and test dependencies.
> >> >>> >> >> I am not saying that the Arrow community/committers will be less
> >> >>> >> >> co-operative.
> >> >>> >> >> I just don't think the mono-repo structure model will be
> >> >>> >> >> sustainable in an open source community unless there are
> >> >>> >> >> long-term vested interests. We can't predict that.
> >> >>> >> >>
> >> >>> >> >> The current circular dependency between Arrow and Parquet is a
> >> >>> >> >> major problem for the community, and it is important to resolve.
> >> >>> >> >>
> >> >>> >> >> The current Arrow adaptor code for parquet should live in the
> >> >>> >> >> arrow repo. That will remove a majority of the dependency issues.
> >> >>> >> >> Joshua's work would not have been blocked in parquet-cpp if that
> >> >>> >> >> adapter was in the arrow repo.  This will be similar to the ORC
> >> >>> >> >> adaptor.
> >> >>> >> >>
> >> >>> >> >> The platform API code is pretty stable at this point. Minor
> >> >>> >> >> changes in the future to this code should not be the main reason
> >> >>> >> >> to combine the arrow and parquet repos.
> >> >>> >> >>
> >> >>> >> >> "I question whether it's worth the community's time long term to
> >> >>> >> >> wear ourselves out defining custom "ports" / virtual interfaces
> >> >>> >> >> in each library to plug components together rather than
> >> >>> >> >> utilizing common platform APIs."
> >> >>> >> >>
> >> >>> >> >> My answer to your question below would be "Yes".
> >> >>> >> >> Modularity/separation is very important in an open source
> >> >>> >> >> community where priorities of contributors are often short term.
> >> >>> >> >> The retention is low and therefore the acquisition costs should
> >> >>> >> >> be low as well. This is the community over code approach
> >> >>> >> >> according to me. Minor code duplication is not a deal breaker.
> >> >>> >> >> ORC, Parquet, Arrow, etc. are all different components in the big
> >> >>> >> >> data space serving their own functions.
> >> >>> >> >>
> >> >>> >> >> If you still strongly feel that the only way forward is to clone
> >> >>> >> >> the parquet-cpp repo and part ways, I will withdraw my concern.
> >> >>> >> >> Having two parquet-cpp repos is in no way a better approach.
> >> >>> >> >>
> >> >>> >> >>
> >> >>> >> >>
> >> >>> >> >>
> >> >>> >> >> On Tue, Jul 31, 2018 at 10:28 AM Wes McKinney <
> >> wesmck...@gmail.com>
> >> >>> >> wrote:
> >> >>> >> >>
> >> >>> >> >>> @Antoine
> >> >>> >> >>>
> >> >>> >> >>> > By the way, one concern with the monorepo approach: it would
> >> >>> >> >>> > slightly increase Arrow CI times (which are already too large).
> >> >>> >> >>>
> >> >>> >> >>> A typical CI run in Arrow is taking about 45 minutes:
> >> >>> >> >>> https://travis-ci.org/apache/arrow/builds/410119750
> >> >>> >> >>>
> >> >>> >> >>> A Parquet run takes about 28 minutes:
> >> >>> >> >>> https://travis-ci.org/apache/parquet-cpp/builds/410147208
> >> >>> >> >>>
> >> >>> >> >>> Inevitably we will need to create some kind of bot to run certain
> >> >>> >> >>> builds on-demand based on commit / PR metadata or on request.
> >> >>> >> >>>
> >> >>> >> >>> The slowest build in Arrow (the Arrow C++/Python one) could be
> >> >>> >> >>> made substantially shorter by moving some of the slower parts
> >> >>> >> >>> (like the Python ASV benchmarks) from being tested every-commit
> >> >>> >> >>> to nightly or on demand. Using ASAN instead of valgrind in Travis
> >> >>> >> >>> would also improve build times (the valgrind build could be moved
> >> >>> >> >>> to a nightly exhaustive test run).
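
A minimal sketch of the path-based build selection idea described above, in
Python. The job names, path prefixes, and the nightly-only set are all
hypothetical; a real bot would read the changed paths from the PR metadata.

```python
# Hypothetical mapping from CI job name to the path prefixes that trigger it.
# None of these names or paths come from the thread; they only illustrate
# the idea of selecting builds from what a change touches.
JOB_TRIGGERS = {
    "cpp": ("cpp/",),
    "python": ("cpp/", "python/"),  # Python bindings also depend on the C++ code
    "java": ("java/",),
}

# Slow jobs that would run only in the nightly exhaustive pass.
NIGHTLY_ONLY = {"asv-benchmarks", "valgrind"}


def jobs_for_change(changed_paths, nightly=False):
    """Return the sorted list of CI jobs to run for a set of changed paths."""
    if nightly:
        # The nightly run is exhaustive: every job, including the slow ones.
        return sorted(set(JOB_TRIGGERS) | NIGHTLY_ONLY)
    return sorted(
        job
        for job, prefixes in JOB_TRIGGERS.items()
        if any(path.startswith(prefixes) for path in changed_paths)
    )
```

With this shape, a Java-only patch would skip the C++/Python build entirely,
and the valgrind and benchmark jobs would stay out of the per-commit path.
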
> >> >>> >> >>>
> >> >>> >> >>> - Wes
> >> >>> >> >>>
> >> >>> >> >>> On Mon, Jul 30, 2018 at 10:54 PM, Wes McKinney <
> >> wesmck...@gmail.com
> >> >>> >
> >> >>> >> >>> wrote:
> >> >>> >> >>> >> I would like to point out that arrow's use of orc is a great
> >> >>> >> >>> >> example of how it would be possible to manage parquet-cpp as a
> >> >>> >> >>> >> separate codebase. That gives me hope that the projects could
> >> >>> >> >>> >> be managed separately some day.
> >> >>> >> >>> >
> >> >>> >> >>> > Well, I don't know that ORC is the best example. The ORC C++
> >> >>> >> >>> > codebase features several areas of duplicated logic which could
> >> >>> >> >>> > be replaced by components from the Arrow platform for better
> >> >>> >> >>> > platform-wide interoperability:
> >> >>> >> >>> >
> >> >>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/OrcFile.hh#L37
> >> >>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/Int128.hh
> >> >>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/MemoryPool.hh
> >> >>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/src/io/InputStream.hh
> >> >>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/src/io/OutputStream.hh
> >> >>> >> >>> >
> >> >>> >> >>> > ORC's use of symbols from Protocol Buffers was actually a cause
> >> >>> >> >>> > of bugs that we had to fix in Arrow's build system to prevent
> >> >>> >> >>> > them from leaking to third party linkers when statically linked
> >> >>> >> >>> > (ORC is only available for static linking at the moment AFAIK).
> >> >>> >> >>> >
> >> >>> >> >>> > I question whether it's worth the community's time long term to
> >> >>> >> >>> > wear ourselves out defining custom "ports" / virtual interfaces
> >> >>> >> >>> > in each library to plug components together rather than
> >> >>> >> >>> > utilizing common platform APIs.
> >> >>> >> >>> >
> >> >>> >> >>> > - Wes
> >> >>> >> >>> >
> >> >>> >> >>> > On Mon, Jul 30, 2018 at 10:45 PM, Joshua Storck <
> >> >>> >> joshuasto...@gmail.com>
> >> >>> >> >>> wrote:
> >> >>> >> >>> >> Your point about the constraints of the ASF release process is
> >> >>> >> >>> >> well taken, and as a developer who's trying to work in the
> >> >>> >> >>> >> current environment I would be much happier if the codebases
> >> >>> >> >>> >> were merged. The main issues I worry about when you put
> >> >>> >> >>> >> codebases like these together are:
> >> >>> >> >>> >>
> >> >>> >> >>> >> 1. The delineation of APIs becomes blurred and the code becomes
> >> >>> >> >>> >> too coupled
> >> >>> >> >>> >> 2. Release of artifacts that are lower in the dependency tree
> >> >>> >> >>> >> is delayed by artifacts higher in the dependency tree
> >> >>> >> >>> >>
> >> >>> >> >>> >> If the project/release management is structured well and
> >> >>> >> >>> >> someone keeps an eye on the coupling, then I don't have any
> >> >>> >> >>> >> concerns.
> >> >>> >> >>> >>
> >> >>> >> >>> >> I would like to point out that arrow's use of orc is a great
> >> >>> >> >>> >> example of how it would be possible to manage parquet-cpp as a
> >> >>> >> >>> >> separate codebase. That gives me hope that the projects could
> >> >>> >> >>> >> be managed separately some day.
> >> >>> >> >>> >>
> >> >>> >> >>> >> On Mon, Jul 30, 2018 at 10:23 PM Wes McKinney <
> >> >>> wesmck...@gmail.com>
> >> >>> >> >>> wrote:
> >> >>> >> >>> >>
> >> >>> >> >>> >>> hi Josh,
> >> >>> >> >>> >>>
> >> >>> >> >>> >>> > I can imagine use cases for parquet that don't involve arrow
> >> >>> >> >>> >>> > and tying them together seems like the wrong choice.
> >> >>> >> >>> >>>
> >> >>> >> >>> >>> Apache is "Community over Code"; right now it's the same people
> >> >>> >> >>> >>> building these projects -- my argument (which I think you agree
> >> >>> >> >>> >>> with?) is that we should work more closely together until the
> >> >>> >> >>> >>> community grows large enough to support larger-scope process
> >> >>> >> >>> >>> than we have now. As you've seen, our process isn't serving
> >> >>> >> >>> >>> developers of these projects.
> >> >>> >> >>> >>>
> >> >>> >> >>> >>> > I also think build tooling should be pulled into its own
> >> >>> >> >>> >>> > codebase.
> >> >>> >> >>> >>>
> >> >>> >> >>> >>> I don't see how this can possibly be practical taking into
> >> >>> >> >>> >>> consideration the constraints imposed by the combination of the
> >> >>> >> >>> >>> GitHub platform and the ASF release process. I'm all for being
> >> >>> >> >>> >>> idealistic, but right now we need to be practical. Unless we
> >> >>> >> >>> >>> can devise a practical procedure that can accommodate at least
> >> >>> >> >>> >>> 1 patch per day which may touch both code and build system
> >> >>> >> >>> >>> simultaneously without being a hindrance to contributor or
> >> >>> >> >>> >>> maintainer, I don't see how we can move forward.
> >> >>> >> >>> >>>
> >> >>> >> >>> >>> > That being said, I think it makes sense to merge the codebases
> >> >>> >> >>> >>> > in the short term with the express purpose of separating them
> >> >>> >> >>> >>> > in the near term.
> >> >>> >> >>> >>>
> >> >>> >> >>> >>> I would agree but only if separation can be demonstrated to be
> >> >>> >> >>> >>> practical and result in net improvements in productivity and
> >> >>> >> >>> >>> community growth. I think experience has clearly demonstrated
> >> >>> >> >>> >>> that the current separation is impractical, and is causing
> >> >>> >> >>> >>> problems.
> >> >>> >> >>> >>>
> >> >>> >> >>> >>> Per Julian's and Ted's comments, I think we need to consider
> >> >>> >> >>> >>> development process and ASF releases separately. My argument is
> >> >>> >> >>> >>> as follows:
> >> >>> >> >>> >>>
> >> >>> >> >>> >>> * Monorepo for development (for practicality)
> >> >>> >> >>> >>> * Releases structured according to the desires of the PMCs
> >> >>> >> >>> >>>
> >> >>> >> >>> >>> - Wes
> >> >>> >> >>> >>>
> >> >>> >> >>> >>> On Mon, Jul 30, 2018 at 9:31 PM, Joshua Storck <
> >> >>> >> joshuasto...@gmail.com
> >> >>> >> >>> >
> >> >>> >> >>> >>> wrote:
> >> >>> >> >>> >>> > I recently worked on an issue that had to be implemented in
> >> >>> >> >>> >>> > parquet-cpp (ARROW-1644, ARROW-1599) but required changes in
> >> >>> >> >>> >>> > arrow (ARROW-2585, ARROW-2586). I found the circular
> >> >>> >> >>> >>> > dependencies confusing and hard to work with. For example, I
> >> >>> >> >>> >>> > still have a PR open in parquet-cpp (created on May 10)
> >> >>> >> >>> >>> > because of a PR that it depended on in arrow that was recently
> >> >>> >> >>> >>> > merged. I couldn't even address any CI issues in the PR
> >> >>> >> >>> >>> > because the change in arrow was not yet in master. In a
> >> >>> >> >>> >>> > separate PR, I changed the run_clang_format.py script in the
> >> >>> >> >>> >>> > arrow project only to find out later that there was an exact
> >> >>> >> >>> >>> > copy of it in parquet-cpp.
> >> >>> >> >>> >>> >
> >> >>> >> >>> >>> > However, I don't think merging the codebases makes sense in
> >> >>> >> >>> >>> > the long term. I can imagine use cases for parquet that don't
> >> >>> >> >>> >>> > involve arrow and tying them together seems like the wrong
> >> >>> >> >>> >>> > choice. There will be other formats that arrow needs to
> >> >>> >> >>> >>> > support that will be kept separate (e.g. ORC), so I don't see
> >> >>> >> >>> >>> > why parquet should be special. I also think build tooling
> >> >>> >> >>> >>> > should be pulled into its own codebase. GNU has had a long
> >> >>> >> >>> >>> > history of developing open source C/C++ projects that way and
> >> >>> >> >>> >>> > made projects like autoconf/automake/make to support them. I
> >> >>> >> >>> >>> > don't think CI is a good counter-example since there have been
> >> >>> >> >>> >>> > lots of successful open source projects that have used nightly
> >> >>> >> >>> >>> > build systems that pinned versions of dependent software.
> >> >>> >> >>> >>> >
> >> >>> >> >>> >>> > That being said, I think it makes sense to merge the codebases
> >> >>> >> >>> >>> > in the short term with the express purpose of separating them
> >> >>> >> >>> >>> > in the near term. My reasoning is as follows. First, by
> >> >>> >> >>> >>> > putting the codebases together, you can more easily delineate
> >> >>> >> >>> >>> > the boundaries between the APIs with a single PR. Second, it
> >> >>> >> >>> >>> > will force the build tooling to converge instead of diverge,
> >> >>> >> >>> >>> > which has already happened. Once the boundaries and tooling
> >> >>> >> >>> >>> > have been sorted out, it should be easy to separate them back
> >> >>> >> >>> >>> > into their own codebases.
> >> >>> >> >>> >>> >
> >> >>> >> >>> >>> > If the codebases are merged, I would ask that the C++
> >> >>> >> >>> >>> > codebases for arrow be separated from other languages. Looking
> >> >>> >> >>> >>> > at it from the perspective of a parquet-cpp library user,
> >> >>> >> >>> >>> > having a dependency on Java is a large tax to pay if you don't
> >> >>> >> >>> >>> > need it. For example, there were 25 JIRAs in the 0.10.0
> >> >>> >> >>> >>> > release of arrow, many of which were holding up the release. I
> >> >>> >> >>> >>> > hope that seems like a reasonable compromise, and I think it
> >> >>> >> >>> >>> > will help reduce the complexity of the build/release tooling.
> >> >>> >> >>> >>> >
> >> >>> >> >>> >>> >
> >> >>> >> >>> >>> > On Mon, Jul 30, 2018 at 8:50 PM Ted Dunning <
> >> >>> >> ted.dunn...@gmail.com>
> >> >>> >> >>> >>> wrote:
> >> >>> >> >>> >>> >
> >> >>> >> >>> >>> >> On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney <
> >> >>> >> wesmck...@gmail.com>
> >> >>> >> >>> >>> wrote:
> >> >>> >> >>> >>> >>
> >> >>> >> >>> >>> >> >
> >> >>> >> >>> >>> >> > > The community will be less willing to accept large
> >> >>> >> >>> >>> >> > > changes that require multiple rounds of patches for
> >> >>> >> >>> >>> >> > > stability and API convergence. Our contributions to
> >> >>> >> >>> >>> >> > > Libhdfs++ in the HDFS community took a significantly
> >> >>> >> >>> >>> >> > > long time for the very same reason.
> >> >>> >> >>> >>> >> >
> >> >>> >> >>> >>> >> > Please don't use bad experiences from another open source
> >> >>> >> >>> >>> >> > community as leverage in this discussion. I'm sorry that
> >> >>> >> >>> >>> >> > things didn't go the way you wanted in Apache Hadoop but
> >> >>> >> >>> >>> >> > this is a distinct community which happens to operate under
> >> >>> >> >>> >>> >> > a similar open governance model.
> >> >>> >> >>> >>> >>
> >> >>> >> >>> >>> >>
> >> >>> >> >>> >>> >> There are some more radical and community building options as
> >> >>> >> >>> >>> >> well. Take the subversion project as a precedent. With
> >> >>> >> >>> >>> >> subversion, any Apache committer can request and receive a
> >> >>> >> >>> >>> >> commit bit on some large fraction of subversion.
> >> >>> >> >>> >>> >>
> >> >>> >> >>> >>> >> So why not take this a bit further and give every parquet
> >> >>> >> >>> >>> >> committer a commit bit in Arrow? Or even make them be first
> >> >>> >> >>> >>> >> class committers in Arrow? Possibly even make it policy that
> >> >>> >> >>> >>> >> every Parquet committer who asks will be given committer
> >> >>> >> >>> >>> >> status in Arrow.
> >> >>> >> >>> >>> >>
> >> >>> >> >>> >>> >> That relieves a lot of the social anxiety here. Parquet
> >> >>> >> >>> >>> >> committers can't be worried at that point whether their
> >> >>> >> >>> >>> >> patches will get merged; they can just merge them.  Arrow
> >> >>> >> >>> >>> >> shouldn't worry much about inviting in the Parquet
> >> >>> >> >>> >>> >> committers. After all, Arrow already depends a lot on parquet
> >> >>> >> >>> >>> >> so why not invite them in?
> >> >>> >> >>> >>> >>
> >> >>> >> >>> >>>
> >> >>> >> >>>
> >> >>> >> >>
> >> >>> >> >>
> >> >>> >> >> --
> >> >>> >> >> regards,
> >> >>> >> >> Deepak Majeti
> >> >>> >>
> >> >>> >
> >> >>> >
> >> >>> > --
> >> >>> > regards,
> >> >>> > Deepak Majeti
> >> >>>
> >>
> >
> >
> > --
> > regards,
> > Deepak Majeti
>


-- 
Ryan Blue
Software Engineer
Netflix
