Responding to Todd's e-mail:

1) Open source release model

My expectation is that this library would release about once a month,
with occasional faster releases for critical fixes.

2) Governance/review model

Beyond having centralized code reviews, it's hard to predict how the
governance would play out. Different OSS projects approach their
planning / design / review processes differently, so work on a common
need may require more negotiation than the prior "unilateral" process
within a single project.

I think it would speak well of our communities to commit, as part of
this collaboration, to the success of the "consumer" projects. So if
the Arrow or Parquet communities were contemplating a change that
might impact Kudu, for example, it would be in our best interest to be
careful and communicate proactively.

This all makes sense. From the Arrow and Parquet perspective, we would
not add much testing burden because our continuous integration suites
do not take long to run.

3) Pre-commit/test mechanics

One thing that would help is community-maintained Dockerfiles/Docker
images (or equivalent) to assist with validation and testing for
developers.
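
As a loose illustration (the image names and the in-image build script
here are hypothetical, not real artifacts of any of our projects), a
small helper like this could let a developer validate a local
common-library checkout against a consumer project without installing
that project's toolchain:

    #!/usr/bin/env python
    # validate.py -- run a consumer project's test suite against a
    # local checkout of the common library, inside that project's
    # community-maintained CI image. Image names and the in-image
    # entry point are placeholders.
    import subprocess
    import sys

    CONSUMER_IMAGES = {
        "kudu": "apache/kudu-dev:latest",
        "impala": "apache/impala-dev:latest",
        "arrow": "apache/arrow-dev:latest",
    }

    def validate(consumer, common_src):
        image = CONSUMER_IMAGES[consumer]
        # Mount the common-library checkout read-only and invoke the
        # consumer's build-and-test entry point against it.
        cmd = [
            "docker", "run", "--rm",
            "-v", "%s:/common:ro" % common_src,
            image,
            "/ci/build-and-test.sh", "--common-src=/common",
        ]
        return subprocess.call(cmd)

    if __name__ == "__main__":
        sys.exit(validate(sys.argv[1], sys.argv[2]))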

I am happy to comply with a pre-commit testing protocol that works for
the Kudu and Impala teams.

4) Integration mechanics for breaking changes

> One option is that each "user" of the libraries manually "rolls" to new 
> versions when they feel like it, but there's still now a case where a common 
> change "pushes work onto" the consumers to update call sites, etc.

Breaking API changes will create extra work, because any automated
testing that we create will not be able to validate the common-library
patch on its own -- consumer call sites will not build until they are
updated to match. Perhaps we can configure a manual way (in Jenkins,
say) to test two patches together.
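
As a very rough sketch of that idea (the Gerrit URLs, change refs, and
build commands below are all hypothetical), a parameterized Jenkins
job could take one change ref from each project and build them as a
pair:

    #!/usr/bin/env python
    # test_pair.py -- hypothetical sketch of a parameterized Jenkins
    # job that builds a common-library patch together with the
    # consumer patch that adapts to it. Hosts, refs, and build
    # commands are placeholders.
    import subprocess

    def sh(cmd, cwd=None):
        subprocess.check_call(cmd, shell=True, cwd=cwd)

    def test_pair(common_ref, consumer_ref):
        # Fetch the proposed common-library change.
        sh("git clone https://gerrit.example.org/common common")
        sh("git fetch origin %s && git checkout FETCH_HEAD" % common_ref,
           cwd="common")
        # Fetch the matching consumer change (e.g. Kudu's call-site
        # updates).
        sh("git clone https://gerrit.example.org/kudu kudu")
        sh("git fetch origin %s && git checkout FETCH_HEAD" % consumer_ref,
           cwd="kudu")
        # Build the common library first, then build and test the
        # consumer against that exact checkout.
        sh("cmake . && make -j8 && make DESTDIR=$PWD/install install",
           cwd="common")
        sh("cmake -DCOMMON_ROOT=../common/install . && make -j8 && ctest",
           cwd="kudu")

    if __name__ == "__main__":
        test_pair("refs/changes/12/3412/2", "refs/changes/34/5634/1")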

In the event that a community member has a patch containing an API
break that impacts a project to which they do not contribute, there
should be some expectation that they either work with the affected
project on a coordinated patch or obtain its +1 to merge, even though
a follow-up patch may be required if the roll-forward in the consumer
project exposes bugs in the common library. There may be situations
like:

* Kudu changes API in $COMMON that impacts Arrow
* Arrow says +1, we will roll forward $COMMON later
* Patch merged
* Arrow rolls forward, discovers bug caused by patch in $COMMON
* Arrow proposes patch to $COMMON
* ...

This is the worst-case scenario, of course, but I actually think it is
a good one, because it would indicate that the unit testing in $COMMON
needs to be improved. Unit testing in the common library would
therefore take on more of a "defensive" quality than it has today.

In any case, I'm keen to move forward with a concrete plan if we can
reach consensus on the particulars.

Thanks
Wes

On Sun, Feb 26, 2017 at 10:18 PM, Leif Walsh <leif.wa...@gmail.com> wrote:
> I also support the idea of creating an "apache commons modern c++" style
> library, maybe tailored toward the needs of columnar data processing
> tools.  I think APR is the wrong project but I think that *style* of
> project is the right direction to aim for.
>
> I agree this adds test and release process complexity across products but I
> think the benefits of a shared, well-tested library outweigh that, and
> creating such test infrastructure will have long-term benefits as well.
>
> I'd be happy to lend a hand wherever it's needed.
>
> On Sun, Feb 26, 2017 at 4:03 PM Todd Lipcon <t...@cloudera.com> wrote:
>
>> Hey folks,
>>
>> As Henry mentioned, Impala is starting to share more code with Kudu (most
>> notably our RPC system, but that pulls in a fair bit of utility code as
>> well), so we've been chatting periodically offline about the best way to do
>> this. Having more projects potentially interested in collaborating is
>> definitely welcome, though I think does also increase the complexity of
>> whatever solution we come up with.
>>
>> I think the potential benefits of collaboration are fairly self-evident, so
>> I'll focus on my concerns here, which somewhat echo Henry's.
>>
>> 1) Open source release model
>>
>> The ASF is very much against having projects which do not do releases. So,
>> if we were to create some new ASF project to hold this code, we'd be
>> expected to do frequent releases thereof. Wes volunteered above to lead
>> frequent releases, but we actually need at least 3 PMC members to vote on
>> each release, and given people can come and go, we'd probably need at least
>> 5-8 people who are actively committed to helping with the release process
>> of this "commons" project.
>>
>> Unlike our existing projects, which seem to release every 2-3 months, if
>> that, I think this one would have to release _much_ more frequently, if we
>> expect downstream projects to depend on released versions rather than just
>> pulling in some recent (or even trunk) git hash. Since the ASF requires the
>> normal voting period and process for every release, I don't think we could
>> do something like have "daily automatic releases", etc.
>>
>> We could probably campaign the ASF membership to treat this project
>> differently, either as (a) a repository of code that never releases, in
>> which case the "downstream" projects are responsible for vetting IP, etc,
>> as part of their own release processes, or (b) a project which does
>> automatic releases voted upon by robots. I'm guessing that (a) is more
>> palatable from an IP perspective, and also from the perspective of the
>> downstream projects.
>>
>>
>> 2) Governance/review model
>>
>> The more projects there are sharing this common code, the more difficult it
>> is to know whether a change would break something, or even whether a change
>> is considered desirable for all of the projects. I don't want to get into
>> some world where any change to a central library requires a multi-week
>> proposal/design-doc/review across 3+ different groups of committers, all of
>> whom may have different near-term priorities. On the other hand, it would
>> be pretty frustrating if the week before we're trying to cut a Kudu release
>> branch, someone in another community decides to make a potentially
>> destabilizing change to the RPC library.
>>
>>
>> 3) Pre-commit/test mechanics
>>
>> Semi-related to the above: we currently feel pretty confident when we make
>> a change to a central library like kudu/util/thread.cc that nothing broke
>> because we run the full suite of Kudu tests. Of course the central
>> libraries have some unit test coverage, but I wouldn't be confident with
>> any sort of model where shared code can change without verification by a
>> larger suite of tests.
>>
>> On the other hand, I also don't want to move to a model where any change to
>> shared code requires a 6+-hour precommit spanning several projects, each of
>> which may have its own set of potentially-flaky pre-commit tests, etc. I
>> can imagine that if an Arrow developer made some change to "thread.cc" and
>> saw that TabletServerStressTest failed their precommit, they'd have no idea
>> how to triage it, etc. That could be a strong disincentive to continued
>> innovation in these areas of common code, which we'll need a good way to
>> avoid.
>>
>> I think some of the above could be ameliorated with really good
>> infrastructure -- eg on a test failure, automatically re-run the failed
>> test on both pre-patch and post-patch, do a t-test to check statistical
>> significance in flakiness level, etc. But, that's a lot of infrastructure
>> that doesn't currently exist.
>>
>>
>> 4) Integration mechanics for breaking changes
>>
>> Currently these common libraries are treated as components of monolithic
>> projects. That means it's no extra overhead for us to make some kind of
>> change which breaks an API in src/kudu/util/ and at the same time updates
>> all call sites. The internal libraries have no semblance of API
>> compatibility guarantees, etc, and adding one is not without cost.
>>
>> Before sharing code, we should figure out how exactly we'll manage the
>> cases where we want to make some change in a common library that breaks an
>> API used by other projects, given there's no way to make an atomic commit
>> across many repositories. One option is that each "user" of the libraries
>> manually "rolls" to new versions when they feel like it, but there's still
>> now a case where a common change "pushes work onto" the consumers to update
>> call sites, etc.
>>
>> Admittedly, the number of breaking API changes in these common libraries is
>> relatively small, but it would still be good to understand how we would
>> plan to manage them.
>>
>> -Todd
>>
>> On Sun, Feb 26, 2017 at 10:12 AM, Wes McKinney <wesmck...@gmail.com>
>> wrote:
>>
>> > hi Henry,
>> >
>> > Thank you for these comments.
>> >
>> > I think having a kind of "Apache Commons for [Modern] C++" would be an
>> > ideal (though perhaps initially more labor intensive) solution.
>> > There's code in Arrow that I would move into this project if it
>> > existed. I am happy to help make this happen if there is interest from
>> > the Kudu and Impala communities. I am not sure logistically what would
>> > be the most expedient way to establish the project, whether as an ASF
>> > Incubator project or possibly as a new TLP that could be created by
>> > spinning IP out of Apache Kudu.
>> >
>> > I'm interested to hear the opinions of others, and possible next steps.
>> >
>> > Thanks
>> > Wes
>> >
>> > On Sun, Feb 26, 2017 at 2:12 AM, Henry Robinson <he...@apache.org>
>> wrote:
>> > > Thanks for bringing this up, Wes.
>> > >
>> > > On 25 February 2017 at 14:18, Wes McKinney <wesmck...@gmail.com>
>> wrote:
>> > >
>> > >> Dear Apache Kudu and Apache Impala (incubating) communities,
>> > >>
>> > >> (I'm not sure the best way to have a cross-list discussion, so I
>> > >> apologize if this does not work well)
>> > >>
>> > >> On the recent Apache Parquet sync call, we discussed C++ code sharing
>> > >> between the codebases in Apache Arrow and Apache Parquet, and
>> > >> opportunities for more code sharing with Kudu and Impala as well.
>> > >>
>> > >> As context
>> > >>
>> > >> * We have an RC out for the 1.0.0 release of apache-parquet-cpp, the
>> > >> first C++ release within Apache Parquet. I got involved with this
>> > >> project a little over a year ago and was faced with the unpleasant
>> > >> decision to copy and paste a significant amount of code out of
>> > >> Impala's codebase to bootstrap the project.
>> > >>
>> > >> * In parallel, we began the Apache Arrow project, which is designed to
>> > >> be a complementary library for file formats (like Parquet), storage
>> > >> engines (like Kudu), and compute engines (like Impala and pandas).
>> > >>
>> > >> * As Arrow and parquet-cpp matured, an increasing amount of code
>> > >> overlap crept in around buffer memory management and IO
>> > >> interfaces. We recently decided in PARQUET-818
>> > >> (https://github.com/apache/parquet-cpp/commit/
>> > >> 2154e873d5aa7280314189a2683fb1e12a590c02)
>> > >> to remove some of the obvious code overlap in Parquet and make
>> > >> libarrow.a/so a hard compile and link-time dependency for
>> > >> libparquet.a/so.
>> > >>
>> > >> * There is still quite a bit of code in parquet-cpp that would better
>> > >> fit in Arrow: SIMD hash utilities, RLE encoding, dictionary encoding,
>> > >> compression, bit utilities, and so forth. Much of this code originated
>> > >> from Impala
>> > >>
>> > >> This brings me to a next set of points:
>> > >>
>> > >> * parquet-cpp contains quite a bit of code that was extracted from
>> > >> Impala. This is mostly self-contained in
>> > >> https://github.com/apache/parquet-cpp/tree/master/src/parquet/util
>> > >>
>> > >> * My understanding is that Kudu extracted certain computational
>> > >> utilities from Impala in its early days, but these tools have likely
>> > >> diverged as the needs of the projects have evolved.
>> > >>
>> > >> Since all of these projects are quite different in their end goals
>> > >> (runtime systems vs. libraries), touching code that is tightly coupled
>> > >> to either Kudu or Impala's runtimes is probably not worth discussing.
>> > >> However, I think there is a strong basis for collaboration on
>> > >> computational utilities and vectorized array processing. Some obvious
>> > >> areas that come to mind:
>> > >>
>> > >> * SIMD utilities (for hashing or processing of preallocated contiguous
>> > >> memory)
>> > >> * Array encoding utilities: RLE / Dictionary, etc.
>> > >> * Bit manipulation (packing and unpacking, e.g. Daniel Lemire
>> > >> contributed a patch to parquet-cpp around this)
>> > >> * Date and time utilities
>> > >> * Compression utilities
>> > >>
>> > >
>> > > Between Kudu and Impala (at least) there are many more opportunities
>> for
>> > > sharing. Threads, logging, metrics, concurrent primitives - the list is
>> > > quite long.
>> > >
>> > >
>> > >>
>> > >> I hope the benefits are obvious: consolidating efforts on unit
>> > >> testing, benchmarking, performance optimizations, continuous
>> > >> integration, and platform compatibility.
>> > >>
>> > >> Logistically speaking, one possible avenue might be to use Apache
>> > >> Arrow as the place to assemble this code. Its thirdparty toolchain is
>> > >> small, and it builds and installs fast. It is intended as a library to
>> > >> have its headers used and linked against other applications. (As an
>> > >> aside, I'm very interested in building optional support for Arrow
>> > >> columnar messages into the kudu client).
>> > >>
>> > >
>> > > In principle I'm in favour of code sharing, and it seems very much in
>> > > keeping with the Apache way. However, practically speaking I'm of the
>> > > opinion that it only makes sense to house shared support code in a
>> > > separate, dedicated project.
>> > >
>> > > Embedding the shared libraries in, e.g., Arrow naturally limits the
>> scope
>> > > of sharing to utilities that Arrow is interested in. It would make no
>> > sense
>> > > to add a threading library to Arrow if it was never used natively.
>> > Muddying
>> > > the waters of the project's charter seems likely to lead to user, and
>> > > developer, confusion. Similarly, we should not necessarily couple
>> Arrow's
>> > > design goals to those it inherits from Kudu and Impala's source code.
>> > >
>> > > I think I'd rather see a new Apache project than re-use a current one
>> for
>> > > two independent purposes.
>> > >
>> > >
>> > >>
>> > >> The downside of code sharing, which may have prevented it so far, are
>> > >> the logistics of coordinating ASF release cycles and keeping build
>> > >> toolchains in sync. It's taken us the past year to stabilize the
>> > >> design of Arrow for its intended use cases, so at this point if we
>> > >> went down this road I would be OK with helping the community commit to
>> > >> a regular release cadence that would be faster than Impala, Kudu, and
>> > >> Parquet's respective release cadences. Since members of the Kudu and
>> > >> Impala PMC are also on the Arrow PMC, I trust we would be able to
>> > >> collaborate to each other's mutual benefit and success.
>> > >>
>> > >> Note that Arrow does not throw C++ exceptions and similarly follows
>> > >> the Google C++ style guide to the same extent as Kudu and Impala.
>> > >>
>> > >> If this is something that either the Kudu or Impala communities would
>> > >> like to pursue in earnest, I would be happy to work with you on next
>> > >> steps. I would suggest that we start with something small so that we
>> > >> could address the necessary build toolchain changes, and develop a
>> > >> workflow for moving around code and tests, a protocol for code reviews
>> > >> (e.g. Gerrit), and coordinating ASF releases.
>> > >>
>> > >
>> > > I think, if I'm reading this correctly, that you're assuming
>> integration
>> > > with the 'downstream' projects (e.g. Impala and Kudu) would be done via
>> > > their toolchains. For something as fast moving as utility code - and
>> > > critical, where you want the latency between adding a fix and including
>> > it
>> > > in your build to be ~0 - that's a non-starter to me, at least with how
>> > the
>> > > toolchains are currently realised.
>> > >
>> > > I'd rather have the source code directly imported into Impala's tree -
>> > > whether by git submodule or other mechanism. That way the coupling is
>> > > looser, and we can move more quickly. I think that's important to other
>> > > projects as well.
>> > >
>> > > Henry
>> > >
>> > >
>> > >
>> > >>
>> > >> Let me know what you think.
>> > >>
>> > >> best
>> > >> Wes
>> > >>
>> >
>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
> --
> Cheers,
> Leif
