Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Uwe L. Korn Sun, 19 Aug 2018 05:37:53 -0700

Back from vacation, I also want to finally raise my voice.

With the current state of the Parquet<->Arrow development, I see a benefit in 
merging the code base for now, but not necessarily forever.


Parquet C++ is the main code base of an artefact for which an Arrow C++ adapter
is built and which uses some of the more standard-library features of Arrow. It
is also the go-to place where the same toolchain and CI setup are used. Here we
also directly apply all improvements that we make in Arrow itself. These are
the points that make it special in comparison to other tools providing Arrow
adapters like Turbodbc.

Thus, I think that the current move to merge the code bases is ok for me. I 
must say that I'm not 100% certain that this is the best move but currently I 
lack better alternatives. As previously mentioned, we should take extra care 
that we can still do separate releases and also provide a path for a future 
where we split parquet-cpp into its own project/repository again.
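
As a rough sketch of what keeping that path open could look like (an
illustration only, not existing tooling in either repository): if commit
subjects carry the PARQUET-XXX / ARROW-XXX prefixes proposed further down in
this thread, commits can be grouped per project for separate release notes or
a later split. The function name and the sample subjects are made up.

```python
import re
from collections import defaultdict

# Hypothetical helper; nothing with this name exists in Arrow or parquet-cpp.
# It assumes every commit subject starts with its JIRA key, e.g. "PARQUET-123: ...".
JIRA_KEY = re.compile(r"^(ARROW|PARQUET)-(\d+)\b")


def group_by_project(subjects):
    """Group commit subject lines by the JIRA project named in their prefix."""
    groups = defaultdict(list)
    for subject in subjects:
        subject = subject.strip()
        match = JIRA_KEY.match(subject)
        # Commits without a recognised key are collected separately so that
        # they can be reviewed by hand instead of being silently dropped.
        project = match.group(1) if match else "UNTAGGED"
        groups[project].append(subject)
    return dict(groups)


# Made-up subjects, only to show the call.
sample = [
    "PARQUET-1: Example change in Parquet core",
    "ARROW-2: Example change in the Arrow adapter",
    "Fix typo in README",
]
grouped = group_by_project(sample)
```

The "UNTAGGED" bucket is the interesting one: the fewer commits end up there,
the easier separate releases or a later split should be.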

An important point that we should keep in mind (and the reason I was a bit
concerned the previous times this discussion was raised) is that we have to be
careful not to pull everything that touches Arrow into the Arrow repository.
Having separate repositories for projects, each with its own release cycle, is
for me still the aim for the long term. I expect that there will be many more
projects that use Arrow's I/O libraries and emit Arrow structures. These
libraries should also be usable in Python/C++/Ruby/R/… and will hopefully not
all be developed by the same core group of Arrow/Parquet developers we have
currently. For this to function really well, we will need a more stable API in
Arrow as well as a good set of build tooling that other libraries can build
upon when using Arrow functionality. In addition to being stable, the API must
also provide a good UX in the abstraction layers where the Arrow functions are
provided, so that high-performance applications are not high-maintenance due to
frequent API changes in Arrow. That said, this is currently a wish for the
future. We are currently building and iterating heavily on these APIs to form a
good basis for future developments. Thus the repo merge will hopefully improve
the development speed so that we have to spend less time on toolchain
maintenance and can focus on the user-facing APIs.

Uwe

On Tue, Aug 7, 2018, at 10:45 PM, Wes McKinney wrote:
> Thanks Ryan, will do. The people I'd still like to hear from are:
> 
> * Phillip Cloud
> * Uwe Korn
> 
> As ASF contributors we are responsible to both be pragmatic as well as
> act in the best interests of the community's health and productivity.
> 
> 
> 
> On Tue, Aug 7, 2018 at 12:12 PM, Ryan Blue <rb...@netflix.com.invalid> wrote:
> > I don't have an opinion here, but could someone send a summary of what is
> > decided to the dev list once there is consensus? This is a long thread for
> > parts of the project I don't work on, so I haven't followed it very closely.
> >
> > On Tue, Aug 7, 2018 at 8:22 AM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> >> > It will be difficult to track parquet-cpp changes if they get mixed with
> >> > Arrow changes. Will we establish some guidelines for filing Parquet JIRAs?
> >> > Can we enforce that parquet-cpp changes will not be committed without a
> >> > corresponding Parquet JIRA?
> >>
> >> I think we would use the following policy:
> >>
> >> * use PARQUET-XXX for issues relating to Parquet core
> >> * use ARROW-XXX for issues relating to Arrow's consumption of Parquet
> >> core (e.g. changes that are in parquet/arrow right now)
> >>
> >> We've already been dealing with annoyances relating to issues
> >> straddling the two projects (debugging an issue on Arrow side to find
> >> that it has to be fixed on Parquet side); this would make things
> >> simpler for us
> >>
> >> > I would also like to keep changes to parquet-cpp on a separate commit to
> >> > simplify forking later (if needed) and be able to maintain the commit
> >> > history.  I don't know if it's possible to squash parquet-cpp commits and
> >> > arrow commits separately before merging.
> >>
> >> This seems rather onerous for both contributors and maintainers and
> >> not in line with the goal of improving productivity. In the event that
> >> we fork I see it as a traumatic event for the community. If it does
> >> happen, then we can write a script (using git filter-branch and other
> >> such tools) to extract commits related to the forked code.
> >>
> >> - Wes
> >>
> >> On Tue, Aug 7, 2018 at 10:37 AM, Deepak Majeti <majeti.dee...@gmail.com>
> >> wrote:
> >> > I have a few more logistical questions to add.
> >> >
> >> > It will be difficult to track parquet-cpp changes if they get mixed with
> >> > Arrow changes. Will we establish some guidelines for filing Parquet JIRAs?
> >> > Can we enforce that parquet-cpp changes will not be committed without a
> >> > corresponding Parquet JIRA?
> >> >
> >> > I would also like to keep changes to parquet-cpp on a separate commit to
> >> > simplify forking later (if needed) and be able to maintain the commit
> >> > history.  I don't know if it's possible to squash parquet-cpp commits and
> >> > arrow commits separately before merging.
> >> >
> >> >
> >> > On Tue, Aug 7, 2018 at 8:57 AM Wes McKinney <wesmck...@gmail.com> wrote:
> >> >
> >> >> Do other people have opinions? I would like to undertake this work in
> >> >> the near future (the next 8-10 weeks); I would be OK with taking
> >> >> responsibility for the primary codebase surgery.
> >> >>
> >> >> Some logistical questions:
> >> >>
> >> >> * We have a handful of pull requests in flight in parquet-cpp that
> >> >> would need to be resolved / merged
> >> >> * We should probably cut a status-quo cpp-1.5.0 release, with future
> >> >> releases cut out of the new structure
> >> >> * Management of shared commit rights (I can discuss with the Arrow
> >> >> PMC; I believe that approving any committer who has actively
> >> >> maintained parquet-cpp should be a reasonable approach per Ted's
> >> >> comments)
> >> >>
> >> >> If working more closely together proves to not be working out after
> >> >> some period of time, I will be fully supportive of a fork or something
> >> >> like it
> >> >>
> >> >> Thanks,
> >> >> Wes
> >> >>
> >> >> On Wed, Aug 1, 2018 at 3:39 PM, Wes McKinney <wesmck...@gmail.com>
> >> wrote:
> >> >> > Thanks Tim.
> >> >> >
> >> >> > Indeed, it's not very simple. Just today Antoine cleaned up some
> >> >> > platform code intending to improve the performance of bit-packing in
> >> >> > Parquet writes, and we ended up with 2 interdependent PRs:
> >> >> >
> >> >> > * https://github.com/apache/parquet-cpp/pull/483
> >> >> > * https://github.com/apache/arrow/pull/2355
> >> >> >
> >> >> > Changes that impact the Python interface to Parquet are even more complex.
> >> >> >
> >> >> > Adding options to Arrow's CMake build system to only build
> >> >> > Parquet-related code and dependencies (in a monorepo framework) would
> >> >> > not be difficult, and amount to writing "make parquet".
> >> >> >
> >> >> > See e.g. https://stackoverflow.com/a/17201375. The desired commands to
> >> >> > build and install the Parquet core libraries and their dependencies
> >> >> > would be:
> >> >> >
> >> >> > ninja parquet && ninja install
> >> >> >
> >> >> > - Wes
> >> >> >
> >> >> > On Wed, Aug 1, 2018 at 2:34 PM, Tim Armstrong
> >> >> > <tarmstr...@cloudera.com.invalid> wrote:
> >> >> >> I don't have a direct stake in this beyond wanting to see Parquet be
> >> >> >> successful, but I thought I'd give my two cents.
> >> >> >>
> >> >> >> For me, the thing that makes the biggest difference in contributing to a
> >> >> >> new codebase is the number of steps in the workflow for writing, testing,
> >> >> >> posting and iterating on a commit and also the number of opportunities for
> >> >> >> missteps. The size of the repo and build/test times matter but are
> >> >> >> secondary so long as the workflow is simple and reliable.
> >> >> >>
> >> >> >> I don't really know what the current state of things is, but it sounds like
> >> >> >> it's not as simple as check out -> build -> test if you're doing a
> >> >> >> cross-repo change. Circular dependencies are a real headache.
> >> >> >>
> >> >> >> On Tue, Jul 31, 2018 at 2:44 PM, Wes McKinney <wesmck...@gmail.com>
> >> >> wrote:
> >> >> >>
> >> >> >>> hi,
> >> >> >>>
> >> >> >>> On Tue, Jul 31, 2018 at 4:56 PM, Deepak Majeti <
> >> >> majeti.dee...@gmail.com>
> >> >> >>> wrote:
> >> >> >>> > I think the circular dependency can be broken if we build a new
> >> >> library
> >> >> >>> for
> >> >> >>> > the platform code. This will also make it easy for other projects
> >> >> such as
> >> >> >>> > ORC to use it.
> >> >> >>> > I also remember your proposal a while ago of having a separate
> >> >> project
> >> >> >>> for
> >> >> >>> > the platform code.  That project can live in the arrow repo.
> >> >> However, one
> >> >> >>> > has to clone the entire apache arrow repo but can just build the
> >> >> platform
> >> >> >>> > code. This will be temporary until we can find a new home for it.
> >> >> >>> >
> >> >> >>> > The dependency will look like:
> >> >> >>> > libarrow(arrow core / bindings) <- libparquet (parquet core) <-
> >> >> >>> > libplatform(platform api)
> >> >> >>> >
> >> >> >>> > CI workflow will clone the arrow project twice, once for the
> >> platform
> >> >> >>> > library and once for the arrow-core/bindings library.
> >> >> >>>
> >> >> >>> This seems like an interesting proposal; the best place to work
> >> toward
> >> >> >>> this goal (if it is even possible; the build system interactions and
> >> >> >>> ASF release management are the hard problems) is to have all of the
> >> >> >>> code in a single repository. ORC could already be using Arrow if it
> >> >> >>> wanted, but the ORC contributors aren't active in Arrow.
> >> >> >>>
> >> >> >>> >
> >> >> >>> > There is no doubt that the collaborations between the Arrow and
> >> >> Parquet
> >> >> >>> > communities so far have been very successful.
> >> >> >>> > The reason to maintain this relationship moving forward is to
> >> >> continue to
> >> >> >>> > reap the mutual benefits.
> >> >> >>> > We should continue to take advantage of sharing code as well.
> >> >> However, I
> >> >> >>> > don't see any code sharing opportunities between arrow-core and
> >> the
> >> >> >>> > parquet-core. Both have different functions.
> >> >> >>>
> >> >> >>> I think you mean the Arrow columnar format. The Arrow columnar format
> >> >> >>> is only one part of a project that has become quite large already
> >> >> >>> (https://www.slideshare.net/wesm/apache-arrow-crosslanguage-development-platform-for-inmemory-data-105427919).
> >> >> >>>
> >> >> >>> >
> >> >> >>> > We are at a point where the parquet-cpp public API is pretty
> >> stable.
> >> >> We
> >> >> >>> > already passed that difficult stage. My take at arrow and parquet
> >> is
> >> >> to
> >> >> >>> > keep them nimble since we can.
> >> >> >>>
> >> >> >>> I believe that parquet-core has progress to make yet ahead of it. We
> >> >> >>> have done little work in asynchronous IO and concurrency which would
> >> >> >>> yield both improved read and write throughput. This aligns well with
> >> >> >>> other concurrency and async-IO work planned in the Arrow platform. I
> >> >> >>> believe that more development will happen on parquet-core once the
> >> >> >>> development process issues are resolved by having a single codebase,
> >> >> >>> single build system, and a single CI framework.
> >> >> >>>
> >> >> >>> I have some gripes about design decisions made early in parquet-cpp,
> >> >> >>> like the use of C++ exceptions. So while "stability" is a reasonable
> >> >> >>> goal I think we should still be open to making significant changes
> >> in
> >> >> >>> the interest of long term progress.
> >> >> >>>
> >> >> >>> Having now worked on these projects for more than 2 and a half years
> >> >> >>> and being the most frequent contributor to both codebases, I'm sadly far
> >> >> >>> past the "breaking point" and not willing to continue contributing in
> >> >> >>> a significant way to parquet-cpp if the projects remain structured
> >> >> >>> as they are now. It's hampering progress and not serving the
> >> >> >>> community.
> >> >> >>>
> >> >> >>> - Wes
> >> >> >>>
> >> >> >>> >
> >> >> >>> >
> >> >> >>> >
> >> >> >>> >
> >> >> >>> > On Tue, Jul 31, 2018 at 3:17 PM Wes McKinney <wesmck...@gmail.com
> >> >
> >> >> >>> wrote:
> >> >> >>> >
> >> >> >>> >> > The current Arrow adaptor code for parquet should live in the
> >> >> arrow
> >> >> >>> >> repo. That will remove a majority of the dependency issues.
> >> Joshua's
> >> >> >>> work
> >> >> >>> >> would not have been blocked in parquet-cpp if that adapter was in
> >> >> the
> >> >> >>> arrow
> >> >> >>> >> repo.  This will be similar to the ORC adaptor.
> >> >> >>> >>
> >> >> >>> >> This has been suggested before, but I don't see how it would
> >> >> alleviate
> >> >> >>> >> any issues because of the significant dependencies on other
> >> parts of
> >> >> >>> >> the Arrow codebase. What you are proposing is:
> >> >> >>> >>
> >> >> >>> >> - (Arrow) arrow platform
> >> >> >>> >> - (Parquet) parquet core
> >> >> >>> >> - (Arrow) arrow columnar-parquet adapter interface
> >> >> >>> >> - (Arrow) Python bindings
> >> >> >>> >>
> >> >> >>> >> To make this work, somehow Arrow core / libarrow would have to be
> >> >> >>> >> built before invoking the Parquet core part of the build system.
> >> You
> >> >> >>> >> would need to pass dependent targets across different CMake build
> >> >> >>> >> systems; I don't know if it's possible (I spent some time looking
> >> >> into
> >> >> >>> >> it earlier this year). This is what I meant by the lack of a
> >> >> "concrete
> >> >> >>> >> and actionable plan". The only thing that would really work
> >> would be
> >> >> >>> >> for the Parquet core to be "included" in the Arrow build system
> >> >> >>> >> somehow rather than using ExternalProject. Currently Parquet
> >> builds
> >> >> >>> >> Arrow using ExternalProject, and Parquet is unknown to the Arrow
> >> >> build
> >> >> >>> >> system because it's only depended upon by the Python bindings.
> >> >> >>> >>
> >> >> >>> >> And even if a solution could be devised, it would not wholly
> >> resolve
> >> >> >>> >> the CI workflow issues.
> >> >> >>> >>
> >> >> >>> >> You could make Parquet completely independent of the Arrow
> >> codebase,
> >> >> >>> >> but at that point there is little reason to maintain a
> >> relationship
> >> >> >>> >> between the projects or their communities. We have spent a great
> >> >> deal
> >> >> >>> >> of effort refactoring the two projects to enable as much code
> >> >> sharing
> >> >> >>> >> as there is now.
> >> >> >>> >>
> >> >> >>> >> - Wes
> >> >> >>> >>
> >> >> >>> >> On Tue, Jul 31, 2018 at 2:29 PM, Wes McKinney <
> >> wesmck...@gmail.com>
> >> >> >>> wrote:
> >> >> >>> >> >> If you still strongly feel that the only way forward is to
> >> clone
> >> >> the
> >> >> >>> >> parquet-cpp repo and part ways, I will withdraw my concern.
> >> Having
> >> >> two
> >> >> >>> >> parquet-cpp repos is no way a better approach.
> >> >> >>> >> >
> >> >> >>> >> > Yes, indeed. In my view, the next best option after a monorepo
> >> is
> >> >> to
> >> >> >>> >> > fork. That would obviously be a bad outcome for the community.
> >> >> >>> >> >
> >> >> >>> >> > It doesn't look like I will be able to convince you that a
> >> >> monorepo is
> >> >> >>> >> > a good idea; what I would ask instead is that you be willing to
> >> >> give
> >> >> >>> >> > it a shot, and if it turns out in the way you're describing
> >> >> (which I
> >> >> >>> >> > don't think it will) then I suggest that we fork at that point.
> >> >> >>> >> >
> >> >> >>> >> > - Wes
> >> >> >>> >> >
> >> >> >>> >> > On Tue, Jul 31, 2018 at 2:14 PM, Deepak Majeti <
> >> >> >>> majeti.dee...@gmail.com>
> >> >> >>> >> wrote:
> >> >> >>> >> >> Wes,
> >> >> >>> >> >>
> >> >> >>> >> >> Unfortunately, I cannot show you any practical fact-based
> >> >> problems
> >> >> >>> of a
> >> >> >>> >> >> non-existent Arrow-Parquet mono-repo.
> >> >> >>> >> >> Bringing in related Apache community experiences is more meaningful than
> >> >> >>> >> >> how mono-repos work at Google and other big organizations.
> >> >> >>> >> >> We solely depend on volunteers and cannot hire full-time
> >> >> developers.
> >> >> >>> >> >> You are very well aware of how difficult it has been to find
> >> more
> >> >> >>> >> >> contributors and maintainers for Arrow. parquet-cpp already
> >> has
> >> >> a low
> >> >> >>> >> >> contribution rate to its core components.
> >> >> >>> >> >>
> >> >> >>> >> >> We should target to ensure that new volunteers who want to
> >> >> contribute
> >> >> >>> >> >> bug-fixes/features should spend the least amount of time in
> >> >> figuring
> >> >> >>> out
> >> >> >>> >> >> the project repo. We can never come up with an automated build
> >> >> system
> >> >> >>> >> that
> >> >> >>> >> >> caters to every possible environment.
> >> >> >>> >> >> My only concern is if the mono-repo will make it harder for
> >> new
> >> >> >>> >> developers
> >> >> >>> >> >> to work on parquet-cpp core just due to the additional code,
> >> >> build
> >> >> >>> and
> >> >> >>> >> test
> >> >> >>> >> >> dependencies.
> >> >> >>> >> >> I am not saying that the Arrow community/committers will be
> >> less
> >> >> >>> >> >> co-operative.
> >> >> >>> >> >> I just don't think the mono-repo structure model will be
> >> >> sustainable
> >> >> >>> in
> >> >> >>> >> an
> >> >> >>> >> >> open source community unless there are long-term vested
> >> >> interests. We
> >> >> >>> >> can't
> >> >> >>> >> >> predict that.
> >> >> >>> >> >>
> >> >> >>> >> >> The current circular dependency problem between Arrow and Parquet is a
> >> >> >>> >> >> major problem for the community and solving it is important.
> >> >> >>> >> >>
> >> >> >>> >> >> The current Arrow adaptor code for parquet should live in the
> >> >> arrow
> >> >> >>> >> repo.
> >> >> >>> >> >> That will remove a majority of the dependency issues.
> >> >> >>> >> >> Joshua's work would not have been blocked in parquet-cpp if
> >> that
> >> >> >>> adapter
> >> >> >>> >> >> was in the arrow repo.  This will be similar to the ORC
> >> adaptor.
> >> >> >>> >> >>
> >> >> >>> >> >> The platform API code is pretty stable at this point. Minor
> >> >> changes
> >> >> >>> in
> >> >> >>> >> the
> >> >> >>> >> >> future to this code should not be the main reason to combine
> >> the
> >> >> >>> arrow
> >> >> >>> >> >> parquet repos.
> >> >> >>> >> >>
> >> >> >>> >> >> "*I question whether it's worth the community's time long term to wear
> >> >> >>> >> >> ourselves out defining custom "ports" / virtual interfaces in each library
> >> >> >>> >> >> to plug components together rather than utilizing common platform APIs.*"
> >> >> >>> >> >>
> >> >> >>> >> >> My answer to your question below would be "Yes".
> >> >> >>> Modularity/separation
> >> >> >>> >> is
> >> >> >>> >> >> very important in an open source community where priorities of
> >> >> >>> >> contributors
> >> >> >>> >> >> are often short term.
> >> >> >>> >> >> The retention is low and therefore the acquisition costs
> >> should
> >> >> be
> >> >> >>> low
> >> >> >>> >> as
> >> >> >>> >> >> well. This is the community over code approach according to
> >> me.
> >> >> Minor
> >> >> >>> >> code
> >> >> >>> >> >> duplication is not a deal breaker.
> >> >> >>> >> >> ORC, Parquet, Arrow, etc. are all different components in the
> >> big
> >> >> >>> data
> >> >> >>> >> >> space serving their own functions.
> >> >> >>> >> >>
> >> >> >>> >> >> If you still strongly feel that the only way forward is to
> >> clone
> >> >> the
> >> >> >>> >> >> parquet-cpp repo and part ways, I will withdraw my concern.
> >> >> Having
> >> >> >>> two
> >> >> >>> >> >> parquet-cpp repos is no way a better approach.
> >> >> >>> >> >>
> >> >> >>> >> >>
> >> >> >>> >> >>
> >> >> >>> >> >>
> >> >> >>> >> >> On Tue, Jul 31, 2018 at 10:28 AM Wes McKinney <
> >> >> wesmck...@gmail.com>
> >> >> >>> >> wrote:
> >> >> >>> >> >>
> >> >> >>> >> >>> @Antoine
> >> >> >>> >> >>>
> >> >> >>> >> >>> > By the way, one concern with the monorepo approach: it
> >> would
> >> >> >>> slightly
> >> >> >>> >> >>> increase Arrow CI times (which are already too large).
> >> >> >>> >> >>>
> >> >> >>> >> >>> A typical CI run in Arrow is taking about 45 minutes:
> >> >> >>> >> >>> https://travis-ci.org/apache/arrow/builds/410119750
> >> >> >>> >> >>>
> >> >> >>> >> >>> A Parquet run takes about 28 minutes:
> >> >> >>> >> >>> https://travis-ci.org/apache/parquet-cpp/builds/410147208
> >> >> >>> >> >>>
> >> >> >>> >> >>> Inevitably we will need to create some kind of bot to run
> >> >> certain
> >> >> >>> >> >>> builds on-demand based on commit / PR metadata or on request.
> >> >> >>> >> >>>
> >> >> >>> >> >>> The slowest build in Arrow (the Arrow C++/Python one) could be
> >> >> >>> >> >>> made substantially shorter by moving some of the slower parts (like
> >> >> >>> >> >>> the Python ASV benchmarks) from being tested every-commit to nightly
> >> >> >>> >> >>> or on demand. Using ASAN instead of valgrind in Travis would also
> >> >> >>> >> >>> improve build times (the valgrind build could be moved to a nightly
> >> >> >>> >> >>> exhaustive test run).
> >> >> >>> >> >>>
> >> >> >>> >> >>> - Wes
> >> >> >>> >> >>>
> >> >> >>> >> >>> On Mon, Jul 30, 2018 at 10:54 PM, Wes McKinney <
> >> >> wesmck...@gmail.com
> >> >> >>> >
> >> >> >>> >> >>> wrote:
> >> >> >>> >> >>> >> I would like to point out that arrow's use of orc is a
> >> great
> >> >> >>> >> example of
> >> >> >>> >> >>> how it would be possible to manage parquet-cpp as a separate
> >> >> >>> codebase.
> >> >> >>> >> That
> >> >> >>> >> >>> gives me hope that the projects could be managed separately
> >> some
> >> >> >>> day.
> >> >> >>> >> >>> >
> >> >> >>> >> >>> > Well, I don't know that ORC is the best example. The ORC
> >> C++
> >> >> >>> codebase
> >> >> >>> >> >>> > features several areas of duplicated logic which could be
> >> >> >>> replaced by
> >> >> >>> >> >>> > components from the Arrow platform for better platform-wide
> >> >> >>> >> >>> > interoperability:
> >> >> >>> >> >>> >
> >> >> >>> >> >>> >
> >> >> >>> >> >>>
> >> >> >>> >> https://github.com/apache/orc/blob/master/c%2B%2B/include/
> >> >> >>> orc/OrcFile.hh#L37
> >> >> >>> >> >>> >
> >> >> >>> >>
> >> >> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/Int128.hh
> >> >> >>> >> >>> >
> >> >> >>> >> >>>
> >> >> >>> >> https://github.com/apache/orc/blob/master/c%2B%2B/include/
> >> >> >>> orc/MemoryPool.hh
> >> >> >>> >> >>> >
> >> >> >>> >>
> >> >> https://github.com/apache/orc/blob/master/c%2B%2B/src/io/InputStream.hh
> >> >> >>> >> >>> >
> >> >> >>> >> https://github.com/apache/orc/blob/master/c%2B%2B/src/io/
> >> >> >>> OutputStream.hh
> >> >> >>> >> >>> >
> >> >> >>> >> >>> > ORC's use of symbols from Protocol Buffers was actually a
> >> >> cause of
> >> >> >>> >> >>> > bugs that we had to fix in Arrow's build system to prevent
> >> >> them
> >> >> >>> from
> >> >> >>> >> >>> > leaking to third party linkers when statically linked (ORC
> >> is
> >> >> only
> >> >> >>> >> >>> > available for static linking at the moment AFAIK).
> >> >> >>> >> >>> >
> >> >> >>> >> >>> > I question whether it's worth the community's time long
> >> term
> >> >> to
> >> >> >>> wear
> >> >> >>> >> >>> > ourselves out defining custom "ports" / virtual interfaces
> >> in
> >> >> each
> >> >> >>> >> >>> > library to plug components together rather than utilizing
> >> >> common
> >> >> >>> >> >>> > platform APIs.
> >> >> >>> >> >>> >
> >> >> >>> >> >>> > - Wes
> >> >> >>> >> >>> >
> >> >> >>> >> >>> > On Mon, Jul 30, 2018 at 10:45 PM, Joshua Storck <
> >> >> >>> >> joshuasto...@gmail.com>
> >> >> >>> >> >>> wrote:
> >> >> >>> >> >>> >> Your point about the constraints of the ASF release process is well
> >> >> >>> >> >>> >> taken, and as a developer who's trying to work in the current environment I
> >> >> >>> >> >>> >> would be much happier if the codebases were merged. The main issues I worry
> >> >> >>> >> >>> >> about when you put codebases like these together are:
> >> >> >>> >> >>> >>
> >> >> >>> >> >>> >> 1. The delineation of APIs becomes blurred and the code becomes too coupled
> >> >> >>> >> >>> >> 2. Release of artifacts that are lower in the dependency tree is delayed
> >> >> >>> >> >>> >> by artifacts higher in the dependency tree
> >> >> >>> >> >>> >>
> >> >> >>> >> >>> >> If the project/release management is structured well and
> >> >> someone
> >> >> >>> >> keeps
> >> >> >>> >> >>> an
> >> >> >>> >> >>> >> eye on the coupling, then I don't have any concerns.
> >> >> >>> >> >>> >>
> >> >> >>> >> >>> >> I would like to point out that arrow's use of orc is a
> >> great
> >> >> >>> >> example of
> >> >> >>> >> >>> how
> >> >> >>> >> >>> >> it would be possible to manage parquet-cpp as a separate
> >> >> >>> codebase.
> >> >> >>> >> That
> >> >> >>> >> >>> >> gives me hope that the projects could be managed
> >> separately
> >> >> some
> >> >> >>> >> day.
> >> >> >>> >> >>> >>
> >> >> >>> >> >>> >> On Mon, Jul 30, 2018 at 10:23 PM Wes McKinney <
> >> >> >>> wesmck...@gmail.com>
> >> >> >>> >> >>> wrote:
> >> >> >>> >> >>> >>
> >> >> >>> >> >>> >>> hi Josh,
> >> >> >>> >> >>> >>>
> >> >> >>> >> >>> >>> > I can imagine use cases for parquet that don't involve
> >> >> arrow
> >> >> >>> and
> >> >> >>> >> >>> tying
> >> >> >>> >> >>> >>> them together seems like the wrong choice.
> >> >> >>> >> >>> >>>
> >> >> >>> >> >>> >>> Apache is "Community over Code"; right now it's the same
> >> >> people
> >> >> >>> >> >>> >>> building these projects -- my argument (which I think you
> >> >> agree
> >> >> >>> >> with?)
> >> >> >>> >> >>> >>> is that we should work more closely together until the
> >> >> community
> >> >> >>> >> grows
> >> >> >>> >> >>> >>> large enough to support larger-scope process than we have
> >> >> now.
> >> >> >>> As
> >> >> >>> >> >>> >>> you've seen, our process isn't serving developers of
> >> these
> >> >> >>> >> projects.
> >> >> >>> >> >>> >>>
> >> >> >>> >> >>> >>> > I also think build tooling should be pulled into its
> >> own
> >> >> >>> >> codebase.
> >> >> >>> >> >>> >>>
> >> >> >>> >> >>> >>> I don't see how this can possibly be practical taking
> >> into
> >> >> >>> >> >>> >>> consideration the constraints imposed by the combination
> >> of
> >> >> the
> >> >> >>> >> GitHub
> >> >> >>> >> >>> >>> platform and the ASF release process. I'm all for being
> >> >> >>> idealistic,
> >> >> >>> >> >>> >>> but right now we need to be practical. Unless we can
> >> devise
> >> >> a
> >> >> >>> >> >>> >>> practical procedure that can accommodate at least 1 patch
> >> >> per
> >> >> >>> day
> >> >> >>> >> >>> >>> which may touch both code and build system simultaneously
> >> >> >>> without
> >> >> >>> >> >>> >>> being a hindrance to contributor or maintainer, I don't
> >> see
> >> >> how
> >> >> >>> we
> >> >> >>> >> can
> >> >> >>> >> >>> >>> move forward.
> >> >> >>> >> >>> >>>
> >> >> >>> >> >>> >>> > That being said, I think it makes sense to merge the
> >> >> codebases
> >> >> >>> >> in the
> >> >> >>> >> >>> >>> short term with the express purpose of separating them in
> >> >> the
> >> >> >>> near
> >> >> >>> >> >>> term.
> >> >> >>> >> >>> >>>
> >> >> >>> >> >>> >>> I would agree but only if separation can be demonstrated
> >> to
> >> >> be
> >> >> >>> >> >>> >>> practical and result in net improvements in productivity
> >> and
> >> >> >>> >> community
> >> >> >>> >> >>> >>> growth. I think experience has clearly demonstrated that
> >> the
> >> >> >>> >> current
> >> >> >>> >> >>> >>> separation is impractical, and is causing problems.
> >> >> >>> >> >>> >>>
> >> >> >>> >> >>> >>> Per Julian's and Ted's comments, I think we need to
> >> consider
> >> >> >>> >> >>> >>> development process and ASF releases separately. My
> >> >> argument is
> >> >> >>> as
> >> >> >>> >> >>> >>> follows:
> >> >> >>> >> >>> >>>
> >> >> >>> >> >>> >>> * Monorepo for development (for practicality)
> >> >> >>> >> >>> >>> * Releases structured according to the desires of the
> >> PMCs
> >> >> >>> >> >>> >>>
> >> >> >>> >> >>> >>> - Wes
> >> >> >>> >> >>> >>>
> >> >> >>> >> >>> >>> On Mon, Jul 30, 2018 at 9:31 PM, Joshua Storck <
> >> >> >>> >> joshuasto...@gmail.com
> >> >> >>> >> >>> >
> >> >> >>> >> >>> >>> wrote:
> >> >> >>> >> >>> >>> > I recently worked on an issue that had to be
> >> implemented
> >> >> in
> >> >> >>> >> >>> parquet-cpp
> >> >> >>> >> >>> >>> > (ARROW-1644, ARROW-1599) but required changes in arrow
> >> >> >>> >> (ARROW-2585,
> >> >> >>> >> >>> >>> > ARROW-2586). I found the circular dependencies
> >> confusing
> >> >> and
> >> >> >>> >> hard to
> >> >> >>> >> >>> work
> >> >> >>> >> >>> >>> > with. For example, I still have a PR open in
> >> parquet-cpp
> >> >> >>> >> (created on
> >> >> >>> >> >>> May
> >> >> >>> >> >>> >>> > 10) because of a PR that it depended on in arrow that
> >> was
> >> >> >>> >> recently
> >> >> >>> >> >>> >>> merged.
> >> >> >>> >> >>> >>> > I couldn't even address any CI issues in the PR because
> >> >> the
> >> >> >>> >> change in
> >> >> >>> >> >>> >>> arrow
> >> >> >>> >> >>> >>> > was not yet in master. In a separate PR, I changed the
> >> >> >>> >> >>> >>> run_clang_format.py
> >> >> >>> >> >>> >>> > script in the arrow project only to find out later that
> >> >> there
> >> >> >>> >> was an
> >> >> >>> >> >>> >>> exact
> >> >> >>> >> >>> >>> > copy of it in parquet-cpp.
> >> >> >>> >> >>> >>> >
> >> >> >>> >> >>> >>> > However, I don't think merging the codebases makes
> >> sense
> >> >> in
> >> >> >>> the
> >> >> >>> >> long
> >> >> >>> >> >>> >>> term.
> >> >> >>> >> >>> >>> > I can imagine use cases for parquet that don't involve
> >> >> arrow
> >> >> >>> and
> >> >> >>> >> >>> tying
> >> >> >>> >> >>> >>> them
> >> >> >>> >> >>> >>> > together seems like the wrong choice. There will be
> >> other
> >> >> >>> formats
> >> >> >>> >> >>> that
> >> >> >>> >> >>> >>> > arrow needs to support that will be kept separate
> >> (e.g. -
> >> >> >>> Orc),
> >> >> >>> >> so I
> >> >> >>> >> >>> >>> don't
> >> >> >>> >> >>> >>> > see why parquet should be special. I also think build
> >> >> tooling
> >> >> >>> >> should
> >> >> >>> >> >>> be
> >> >> >>> >> >>> >>> > pulled into its own codebase. GNU has had a long
> >> history
> >> >> of
> >> >> >>> >> >>> developing
> >> >> >>> >> >>> >>> open
> >> >> >>> >> >>> >>> > source C/C++ projects that way and made projects like
> >> >> >>> >> >>> >>> > autoconf/automake/make to support them. I don't think
> >> CI
> >> >> is a
> >> >> >>> >> good
> >> >> >>> >> >>> >>> > counter-example since there have been lots of
> >> successful
> >> >> open
> >> >> >>> >> source
> >> >> >>> >> >>> >>> > projects that have used nightly build systems that
> >> pinned
> >> >> >>> >> versions of
> >> >> >>> >> >>> >>> > dependent software.
> >> >> >>> >> >>> >>> >
> >> >> >>> >> >>> >>> > That being said, I think it makes sense to merge the codebases in the
> >> >> >>> >> >>> >>> > short term with the express purpose of separating them in the near
> >> >> >>> >> >>> >>> > term. My reasoning is as follows. By putting the codebases together,
> >> >> >>> >> >>> >>> > you can more easily delineate the boundaries between the APIs with a
> >> >> >>> >> >>> >>> > single PR. Second, it will force the build tooling to converge instead
> >> >> >>> >> >>> >>> > of diverge, which has already happened. Once the boundaries and tooling
> >> >> >>> >> >>> >>> > have been sorted out, it should be easy to separate them back into
> >> >> >>> >> >>> >>> > their own codebases.
> >> >> >>> >> >>> >>> >
> >> >> >>> >> >>> >>> > If the codebases are merged, I would ask that the C++ codebases for
> >> >> >>> >> >>> >>> > arrow be separated from other languages. Looking at it from the
> >> >> >>> >> >>> >>> > perspective of a parquet-cpp library user, having a dependency on Java
> >> >> >>> >> >>> >>> > is a large tax to pay if you don't need it. For example, there were 25
> >> >> >>> >> >>> >>> > JIRAs in the 0.10.0 release of arrow, many of which were holding up the
> >> >> >>> >> >>> >>> > release. I hope that seems like a reasonable compromise, and I think it
> >> >> >>> >> >>> >>> > will help reduce the complexity of the build/release tooling.
> >> >> >>> >> >>> >>> >
> >> >> >>> >> >>> >>> >
> >> >> >>> >> >>> >>> > On Mon, Jul 30, 2018 at 8:50 PM Ted Dunning <ted.dunn...@gmail.com> wrote:
> >> >> >>> >> >>> >>> >
> >> >> >>> >> >>> >>> >> On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >> >> >>> >> >>> >>> >>
> >> >> >>> >> >>> >>> >> >
> >> >> >>> >> >>> >>> >> > > The community will be less willing to accept large changes that
> >> >> >>> >> >>> >>> >> > > require multiple rounds of patches for stability and API
> >> >> >>> >> >>> >>> >> > > convergence. Our contributions to Libhdfs++ in the HDFS community
> >> >> >>> >> >>> >>> >> > > took a significantly long time for the very same reason.
> >> >> >>> >> >>> >>> >> >
> >> >> >>> >> >>> >>> >> > Please don't use bad experiences from another open source community
> >> >> >>> >> >>> >>> >> > as leverage in this discussion. I'm sorry that things didn't go the
> >> >> >>> >> >>> >>> >> > way you wanted in Apache Hadoop but this is a distinct community
> >> >> >>> >> >>> >>> >> > which happens to operate under a similar open governance model.
> >> >> >>> >> >>> >>> >>
> >> >> >>> >> >>> >>> >>
> >> >> >>> >> >>> >>> >> There are some more radical and community building options as well.
> >> >> >>> >> >>> >>> >> Take the subversion project as a precedent. With subversion, any
> >> >> >>> >> >>> >>> >> Apache committer can request and receive a commit bit on some large
> >> >> >>> >> >>> >>> >> fraction of subversion.
> >> >> >>> >> >>> >>> >>
> >> >> >>> >> >>> >>> >> So why not take this a bit further and give every parquet committer a
> >> >> >>> >> >>> >>> >> commit bit in Arrow? Or even make them be first class committers in
> >> >> >>> >> >>> >>> >> Arrow? Possibly even make it policy that every Parquet committer who
> >> >> >>> >> >>> >>> >> asks will be given committer status in Arrow.
> >> >> >>> >> >>> >>> >>
> >> >> >>> >> >>> >>> >> That relieves a lot of the social anxiety here. Parquet committers
> >> >> >>> >> >>> >>> >> can't be worried at that point whether their patches will get merged;
> >> >> >>> >> >>> >>> >> they can just merge them. Arrow shouldn't worry much about inviting in
> >> >> >>> >> >>> >>> >> the Parquet committers. After all, Arrow already depends a lot on
> >> >> >>> >> >>> >>> >> parquet so why not invite them in?
> >> >> >>> >> >>> >>> >>
> >> >> >>> >> >>> >>>
> >> >> >>> >> >>>
> >> >> >>> >> >>
> >> >> >>> >> >>
> >> >> >>> >> >> --
> >> >> >>> >> >> regards,
> >> >> >>> >> >> Deepak Majeti
> >> >> >>> >>
> >> >> >>> >
> >> >> >>> >
> >> >> >>> > --
> >> >> >>> > regards,
> >> >> >>> > Deepak Majeti
> >> >> >>>
> >> >>
> >> >
> >> >
> >> > --
> >> > regards,
> >> > Deepak Majeti
> >>
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
