Julian, are you proposing the Arrow project ship two artifacts, arrow-common and arrow, where arrow depends on arrow-common?

On Mon, Feb 27, 2017 at 11:51 Julian Hyde <jh...@apache.org> wrote:
> “Commons” projects are often problematic. It is difficult to tell what is in scope and out of scope. If the scope is drawn too wide, there is a real problem of orphaned features, because people contribute one feature and then disappear.
>
> Let’s remember the Apache mantra: community over code. If you create a sustainable community, the code will get looked after. Would this project form a new community, or just a new piece of code? As I read the current proposal, it would be the intersection of some existing communities, not a new community.
>
> I think it would take a considerable effort to create a new project and community around the idea of “C++ commons” (or is it “database-related C++ commons”?). I think you already have such a community, to a first approximation, in the Arrow project, because Kudu and Impala developers are already part of the Arrow community. There’s no reason why Arrow cannot contain new modules that have different release schedules than the rest of Arrow. As a TLP, releases are less burdensome, and can happen in a little over 3 days if the component is kept stable.
>
> Lastly, the code is fungible. It can be marked “experimental” within Arrow and moved to another project, or into a new project, as it matures. The Apache license and the ASF CLA make this very easy. We are doing something like this in Calcite: the Avatica sub-project [1] has a community that intersects with Calcite’s, is disconnected at a code level, and may over time evolve into a separate project. In the meantime, being part of an established project is helpful, because there are PMC members to vote.
> Julian
>
> [1] https://calcite.apache.org/avatica/
>
> > On Feb 27, 2017, at 6:41 AM, Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > Responding to Todd's e-mail:
> >
> > 1) Open source release model
> >
> > My expectation is that this library would release about once a month, with occasional faster releases for critical fixes.
> >
> > 2) Governance/review model
> >
> > Beyond having centralized code reviews, it's hard to predict how the governance would play out. I understand that OSS projects behave differently in their planning / design / review process, so work on a common need may require more of a negotiation than the prior "unilateral" process.
> >
> > I think it says something for our communities that we would make a commitment in our collaboration on this to the success of the "consumer" projects. So if the Arrow or Parquet communities were contemplating a change that might impact Kudu, for example, it would be in our best interest to be careful and communicate proactively.
> >
> > This all makes sense. From an Arrow and Parquet perspective, we do not add very much testing burden because our continuous integration suites do not take long to run.
> >
> > 3) Pre-commit/test mechanics
> >
> > One thing that would help would be community-maintained Dockerfiles/Docker images (or equivalent) to assist with validation and testing for developers.
> >
> > I am happy to comply with a pre-commit testing protocol that works for the Kudu and Impala teams.
> >
> > 4) Integration mechanics for breaking changes
> >
> >> One option is that each "user" of the libraries manually "rolls" to new versions when they feel like it, but there's still now a case where a common change "pushes work onto" the consumers to update call sites, etc.
> >
> > Breaking API changes will create extra work, because any automated testing that we create will not be able to validate the patch to the common library. Perhaps we can configure a manual way (in Jenkins, say) to test two patches together.
> >
> > In the event that a community member has a patch containing an API break that impacts a project that they are not a contributor to, there should be some expectation to either work with the affected project on a coordinated patch or obtain their +1 to merge the patch, even though it may require a follow-up patch if the roll-forward in the consumer project exposes bugs in the common library. There may be situations like:
> >
> > * Kudu changes API in $COMMON that impacts Arrow
> > * Arrow says +1, we will roll forward $COMMON later
> > * Patch merged
> > * Arrow rolls forward, discovers bug caused by patch in $COMMON
> > * Arrow proposes patch to $COMMON
> > * ...
> >
> > This is the worst-case scenario, of course, but I actually think it is good because it would indicate that the unit testing in $COMMON needs to be improved. Unit testing in the common library, therefore, would take on more of a "defensive" quality than it has currently.
> >
> > In any case, I'm keen to move forward with coming up with a concrete plan if we can reach consensus on the particulars.
> >
> > Thanks
> > Wes
> >
> > On Sun, Feb 26, 2017 at 10:18 PM, Leif Walsh <leif.wa...@gmail.com> wrote:
> >> I also support the idea of creating an "apache commons modern c++" style library, maybe tailored toward the needs of columnar data processing tools. I think APR is the wrong project but I think that *style* of project is the right direction to aim.
> >>
> >> I agree this adds test and release process complexity across products, but I think the benefits of a shared, well-tested library outweigh that, and creating such test infrastructure will have long-term benefits as well.
> >>
> >> I'd be happy to lend a hand wherever it's needed.
> >>
> >> On Sun, Feb 26, 2017 at 4:03 PM Todd Lipcon <t...@cloudera.com> wrote:
> >>
> >>> Hey folks,
> >>>
> >>> As Henry mentioned, Impala is starting to share more code with Kudu (most notably our RPC system, but that pulls in a fair bit of utility code as well), so we've been chatting periodically offline about the best way to do this. Having more projects potentially interested in collaborating is definitely welcome, though I think it does also increase the complexity of whatever solution we come up with.
> >>>
> >>> I think the potential benefits of collaboration are fairly self-evident, so I'll focus on my concerns here, which somewhat echo Henry's.
> >>>
> >>> 1) Open source release model
> >>>
> >>> The ASF is very much against having projects which do not do releases. So, if we were to create some new ASF project to hold this code, we'd be expected to do frequent releases thereof. Wes volunteered above to lead frequent releases, but we actually need at least 3 PMC members to vote on each release, and given people can come and go, we'd probably need at least 5-8 people who are actively committed to helping with the release process of this "commons" project.
> >>>
> >>> Unlike our existing projects, which seem to release every 2-3 months, if that, I think this one would have to release _much_ more frequently, if we expect downstream projects to depend on released versions rather than just pulling in some recent (or even trunk) git hash. Since the ASF requires the normal voting period and process for every release, I don't think we could do something like have "daily automatic releases", etc.
> >>>
> >>> We could probably campaign the ASF membership to treat this project differently, either as (a) a repository of code that never releases, in which case the "downstream" projects are responsible for vetting IP, etc., as part of their own release processes, or (b) a project which does automatic releases voted upon by robots. I'm guessing that (a) is more palatable from an IP perspective, and also from the perspective of the downstream projects.
> >>>
> >>> 2) Governance/review model
> >>>
> >>> The more projects there are sharing this common code, the more difficult it is to know whether a change would break something, or even whether a change is considered desirable for all of the projects. I don't want to get into some world where any change to a central library requires a multi-week proposal/design-doc/review across 3+ different groups of committers, all of whom may have different near-term priorities. On the other hand, it would be pretty frustrating if, the week before we're trying to cut a Kudu release branch, someone in another community decides to make a potentially destabilizing change to the RPC library.
> >>>
> >>> 3) Pre-commit/test mechanics
> >>>
> >>> Semi-related to the above: we currently feel pretty confident when we make a change to a central library like kudu/util/thread.cc that nothing broke, because we run the full suite of Kudu tests. Of course the central libraries have some unit test coverage, but I wouldn't be confident with any sort of model where shared code can change without verification by a larger suite of tests.
> >>>
> >>> On the other hand, I also don't want to move to a model where any change to shared code requires a 6+-hour precommit spanning several projects, each of which may have its own set of potentially-flaky pre-commit tests, etc.
> >>> I can imagine that if an Arrow developer made some change to "thread.cc" and saw that TabletServerStressTest failed their precommit, they'd have no idea how to triage it, etc. That could be a strong disincentive to continued innovation in these areas of common code, which we'll need a good way to avoid.
> >>>
> >>> I think some of the above could be ameliorated with really good infrastructure -- e.g., on a test failure, automatically re-run the failed test on both pre-patch and post-patch code, do a t-test to check for a statistically significant change in flakiness level, etc. But that's a lot of infrastructure that doesn't currently exist.
> >>>
> >>> 4) Integration mechanics for breaking changes
> >>>
> >>> Currently these common libraries are treated as components of monolithic projects. That means it's no extra overhead for us to make some kind of change which breaks an API in src/kudu/util/ and at the same time updates all call sites. The internal libraries have no semblance of API compatibility guarantees, etc., and adding one is not without cost.
> >>>
> >>> Before sharing code, we should figure out how exactly we'll manage the cases where we want to make some change in a common library that breaks an API used by other projects, given there's no way to make an atomic commit across many repositories. One option is that each "user" of the libraries manually "rolls" to new versions when they feel like it, but there's still now a case where a common change "pushes work onto" the consumers to update call sites, etc.
> >>>
> >>> Admittedly, the number of breaking API changes in these common libraries is relatively small, but it would still be good to understand how we would plan to manage them.
> >>>
> >>> -Todd
> >>>
> >>> On Sun, Feb 26, 2017 at 10:12 AM, Wes McKinney <wesmck...@gmail.com> wrote:
> >>>
> >>>> hi Henry,
> >>>>
> >>>> Thank you for these comments.
> >>>>
> >>>> I think having a kind of "Apache Commons for [Modern] C++" would be an ideal (though perhaps initially more labor-intensive) solution. There's code in Arrow that I would move into this project if it existed. I am happy to help make this happen if there is interest from the Kudu and Impala communities. I am not sure logistically what would be the most expedient way to establish the project, whether as an ASF Incubator project or possibly as a new TLP that could be created by spinning IP out of Apache Kudu.
> >>>>
> >>>> I'm interested to hear the opinions of others, and possible next steps.
> >>>>
> >>>> Thanks
> >>>> Wes
> >>>>
> >>>> On Sun, Feb 26, 2017 at 2:12 AM, Henry Robinson <he...@apache.org> wrote:
> >>>>> Thanks for bringing this up, Wes.
> >>>>>
> >>>>> On 25 February 2017 at 14:18, Wes McKinney <wesmck...@gmail.com> wrote:
> >>>>>
> >>>>>> Dear Apache Kudu and Apache Impala (incubating) communities,
> >>>>>>
> >>>>>> (I'm not sure the best way to have a cross-list discussion, so I apologize if this does not work well.)
> >>>>>>
> >>>>>> On the recent Apache Parquet sync call, we discussed C++ code sharing between the codebases in Apache Arrow and Apache Parquet, and opportunities for more code sharing with Kudu and Impala as well.
> >>>>>>
> >>>>>> As context:
> >>>>>>
> >>>>>> * We have an RC out for the 1.0.0 release of apache-parquet-cpp, the first C++ release within Apache Parquet. I got involved with this project a little over a year ago and was faced with the unpleasant decision to copy and paste a significant amount of code out of Impala's codebase to bootstrap the project.
> >>>>>>
> >>>>>> * In parallel, we began the Apache Arrow project, which is designed to be a complementary library for file formats (like Parquet), storage engines (like Kudu), and compute engines (like Impala and pandas).
> >>>>>>
> >>>>>> * As Arrow and parquet-cpp matured, an increasing amount of code overlap crept in surrounding buffer memory management and IO interfaces. We recently decided in PARQUET-818 (https://github.com/apache/parquet-cpp/commit/2154e873d5aa7280314189a2683fb1e12a590c02) to remove some of the obvious code overlap in Parquet and make libarrow.a/so a hard compile- and link-time dependency for libparquet.a/so.
> >>>>>>
> >>>>>> * There is still quite a bit of code in parquet-cpp that would better fit in Arrow: SIMD hash utilities, RLE encoding, dictionary encoding, compression, bit utilities, and so forth. Much of this code originated from Impala.
> >>>>>>
> >>>>>> This brings me to a next set of points:
> >>>>>>
> >>>>>> * parquet-cpp contains quite a bit of code that was extracted from Impala. This is mostly self-contained in https://github.com/apache/parquet-cpp/tree/master/src/parquet/util
> >>>>>>
> >>>>>> * My understanding is that Kudu extracted certain computational utilities from Impala in its early days, but these tools have likely diverged as the needs of the projects have evolved.
> >>>>>>
> >>>>>> Since all of these projects are quite different in their end goals (runtime systems vs. libraries), touching code that is tightly coupled to either Kudu's or Impala's runtime is probably not worth discussing. However, I think there is a strong basis for collaboration on computational utilities and vectorized array processing.
> >>>>>> Some obvious areas that come to mind:
> >>>>>>
> >>>>>> * SIMD utilities (for hashing or processing of preallocated contiguous memory)
> >>>>>> * Array encoding utilities: RLE / dictionary, etc.
> >>>>>> * Bit manipulation (packing and unpacking; e.g., Daniel Lemire contributed a patch to parquet-cpp around this)
> >>>>>> * Date and time utilities
> >>>>>> * Compression utilities
> >>>>>
> >>>>> Between Kudu and Impala (at least) there are many more opportunities for sharing. Threads, logging, metrics, concurrency primitives -- the list is quite long.
> >>>>>
> >>>>>> I hope the benefits are obvious: consolidating efforts on unit testing, benchmarking, performance optimizations, continuous integration, and platform compatibility.
> >>>>>>
> >>>>>> Logistically speaking, one possible avenue might be to use Apache Arrow as the place to assemble this code. Its third-party toolchain is small, and it builds and installs fast. It is intended as a library whose headers are used by, and linked against, other applications. (As an aside, I'm very interested in building optional support for Arrow columnar messages into the Kudu client.)
> >>>>>
> >>>>> In principle I'm in favour of code sharing, and it seems very much in keeping with the Apache way. However, practically speaking, I'm of the opinion that it only makes sense to house shared support code in a separate, dedicated project.
> >>>>>
> >>>>> Embedding the shared libraries in, e.g., Arrow naturally limits the scope of sharing to utilities that Arrow is interested in. It would make no sense to add a threading library to Arrow if it was never used natively. Muddying the waters of the project's charter seems likely to lead to user, and developer, confusion.
> >>>>> Similarly, we should not necessarily couple Arrow's design goals to those it inherits from Kudu's and Impala's source code.
> >>>>>
> >>>>> I think I'd rather see a new Apache project than re-use a current one for two independent purposes.
> >>>>>
> >>>>>> The downside of code sharing, which may have prevented it so far, is the logistics of coordinating ASF release cycles and keeping build toolchains in sync. It's taken us the past year to stabilize the design of Arrow for its intended use cases, so at this point, if we went down this road, I would be OK with helping the community commit to a regular release cadence that would be faster than Impala's, Kudu's, and Parquet's respective release cadences. Since members of the Kudu and Impala PMCs are also on the Arrow PMC, I trust we would be able to collaborate to each other's mutual benefit and success.
> >>>>>>
> >>>>>> Note that Arrow does not throw C++ exceptions and follows the Google C++ style guide to the same extent as Kudu and Impala do.
> >>>>>>
> >>>>>> If this is something that either the Kudu or Impala communities would like to pursue in earnest, I would be happy to work with you on next steps. I would suggest that we start with something small, so that we could address the necessary build toolchain changes, and develop a workflow for moving around code and tests, a protocol for code reviews (e.g. Gerrit), and a process for coordinating ASF releases.
> >>>>>
> >>>>> I think, if I'm reading this correctly, that you're assuming integration with the 'downstream' projects (e.g. Impala and Kudu) would be done via their toolchains.
> >>>>> For something as fast-moving as utility code -- and critical, where you want the latency between adding a fix and including it in your build to be ~0 -- that's a non-starter to me, at least with how the toolchains are currently realised.
> >>>>>
> >>>>> I'd rather have the source code directly imported into Impala's tree -- whether by git submodule or some other mechanism. That way the coupling is looser, and we can move more quickly. I think that's important to other projects as well.
> >>>>>
> >>>>> Henry
> >>>>>
> >>>>>> Let me know what you think.
> >>>>>>
> >>>>>> best
> >>>>>> Wes
> >>>
> >>> --
> >>> Todd Lipcon
> >>> Software Engineer, Cloudera
> >>
> >> --
> >> Cheers,
> >> Leif

--
Cheers,
Leif