Julian, are you proposing the Arrow project ship two artifacts, arrow-common and arrow, where arrow depends on arrow-common?

On Mon, Feb 27, 2017 at 11:51 Julian Hyde <jh...@apache.org> wrote:
> “Commons” projects are often problematic. It is difficult to tell what is in scope and out of scope. If the scope is drawn too wide, there is a real problem of orphaned features, because people contribute one feature and then disappear.
>
> Let’s remember the Apache mantra: community over code. If you create a sustainable community, the code will get looked after. Would this project form a new community, or just a new piece of code? As I read the current proposal, it would be the intersection of some existing communities, not a new community.
>
> I think it would take a considerable effort to create a new project and community around the idea of “C++ commons” (or is it “database-related C++ commons”?). I think you already have such a community, to a first approximation, in the Arrow project, because Kudu and Impala developers are already part of the Arrow community. There’s no reason why Arrow cannot contain new modules that have different release schedules than the rest of Arrow. As a TLP, releases are less burdensome, and can happen in a little over 3 days if the component is kept stable.
>
> Lastly, the code is fungible. It can be marked “experimental” within Arrow and moved to another project, or into a new project, as it matures. The Apache license and the ASF CLA make this very easy. We are doing something like this in Calcite: the Avatica sub-project [1] has a community that intersects with Calcite’s, is disconnected at a code level, and may over time evolve into a separate project. In the meantime, being part of an established project is helpful, because there are PMC members to vote.
> Julian
>
> [1] https://calcite.apache.org/avatica/
>
> > On Feb 27, 2017, at 6:41 AM, Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > Responding to Todd's e-mail:
> >
> > 1) Open source release model
> >
> > My expectation is that this library would release about once a month, with occasional faster releases for critical fixes.
> >
> > 2) Governance/review model
> >
> > Beyond having centralized code reviews, it's hard to predict how the governance would play out. I understand that OSS projects behave differently in their planning / design / review process, so work on a common need may require more of a negotiation than the prior "unilateral" process.
> >
> > I think it says something for our communities that we would make a commitment in our collaboration on this to the success of the "consumer" projects. So if the Arrow or Parquet communities were contemplating a change that might impact Kudu, for example, it would be in our best interest to be careful and communicate proactively.
> >
> > This all makes sense. From an Arrow and Parquet perspective, we do not add very much testing burden because our continuous integration suites do not take long to run.
> >
> > 3) Pre-commit/test mechanics
> >
> > One thing that would help would be community-maintained Dockerfiles/Docker images (or equivalent) to assist with validation and testing for developers.
> >
> > I am happy to comply with a pre-commit testing protocol that works for the Kudu and Impala teams.
> >
> > 4) Integration mechanics for breaking changes
> >
> >> One option is that each "user" of the libraries manually "rolls" to new versions when they feel like it, but there's still now a case where a common change "pushes work onto" the consumers to update call sites, etc.
> >
> > Breaking API changes will create extra work, because any automated testing that we create will not be able to validate the patch to the common library. Perhaps we can configure a manual way (in Jenkins, say) to test two patches together.
> >
> > In the event that a community member has a patch containing an API break that impacts a project that they are not a contributor to, there should be some expectation to either work with the affected project on a coordinated patch or obtain their +1 to merge the patch, even though it may require a follow-up patch if the roll-forward in the consumer project exposes bugs in the common library. There may be situations like:
> >
> > * Kudu changes API in $COMMON that impacts Arrow
> > * Arrow says +1, we will roll forward $COMMON later
> > * Patch merged
> > * Arrow rolls forward, discovers bug caused by patch in $COMMON
> > * Arrow proposes patch to $COMMON
> > * ...
> >
> > This is the worst-case scenario, of course, but I actually think it is good because it would indicate that the unit testing in $COMMON needs to be improved. Unit testing in the common library, therefore, would take on more of a "defensive" quality than it has currently.
> >
> > In any case, I'm keen to move forward with coming up with a concrete plan if we can reach consensus on the particulars.
> >
> > Thanks
> > Wes
> >
> > On Sun, Feb 26, 2017 at 10:18 PM, Leif Walsh <leif.wa...@gmail.com> wrote:
> >> I also support the idea of creating an "apache commons modern c++" style library, maybe tailored toward the needs of columnar data processing tools. I think APR is the wrong project but I think that *style* of project is the right direction to aim.
> >>
> >> I agree this adds test and release process complexity across products, but I think the benefits of a shared, well-tested library outweigh that, and creating such test infrastructure will have long-term benefits as well.
> >>
> >> I'd be happy to lend a hand wherever it's needed.
> >>
> >> On Sun, Feb 26, 2017 at 4:03 PM Todd Lipcon <t...@cloudera.com> wrote:
> >>
> >>> Hey folks,
> >>>
> >>> As Henry mentioned, Impala is starting to share more code with Kudu (most notably our RPC system, but that pulls in a fair bit of utility code as well), so we've been chatting periodically offline about the best way to do this. Having more projects potentially interested in collaborating is definitely welcome, though I think it does also increase the complexity of whatever solution we come up with.
> >>>
> >>> I think the potential benefits of collaboration are fairly self-evident, so I'll focus on my concerns here, which somewhat echo Henry's.
> >>>
> >>> 1) Open source release model
> >>>
> >>> The ASF is very much against having projects which do not do releases. So, if we were to create some new ASF project to hold this code, we'd be expected to do frequent releases thereof. Wes volunteered above to lead frequent releases, but we actually need at least 3 PMC members to vote on each release, and given people can come and go, we'd probably need at least 5-8 people who are actively committed to helping with the release process of this "commons" project.
> >>>
> >>> Unlike our existing projects, which seem to release every 2-3 months, if that, I think this one would have to release _much_ more frequently, if we expect downstream projects to depend on released versions rather than just pulling in some recent (or even trunk) git hash. Since the ASF requires the normal voting period and process for every release, I don't think we could do something like have "daily automatic releases", etc.
> >>>
> >>> We could probably campaign the ASF membership to treat this project differently, either as (a) a repository of code that never releases, in which case the "downstream" projects are responsible for vetting IP, etc., as part of their own release processes, or (b) a project which does automatic releases voted upon by robots. I'm guessing that (a) is more palatable from an IP perspective, and also from the perspective of the downstream projects.
> >>>
> >>> 2) Governance/review model
> >>>
> >>> The more projects there are sharing this common code, the more difficult it is to know whether a change would break something, or even whether a change is considered desirable for all of the projects. I don't want to get into some world where any change to a central library requires a multi-week proposal/design-doc/review across 3+ different groups of committers, all of whom may have different near-term priorities. On the other hand, it would be pretty frustrating if, the week before we're trying to cut a Kudu release branch, someone in another community decides to make a potentially destabilizing change to the RPC library.
> >>>
> >>> 3) Pre-commit/test mechanics
> >>>
> >>> Semi-related to the above: we currently feel pretty confident when we make a change to a central library like kudu/util/thread.cc that nothing broke, because we run the full suite of Kudu tests. Of course the central libraries have some unit test coverage, but I wouldn't be confident with any sort of model where shared code can change without verification by a larger suite of tests.
> >>>
> >>> On the other hand, I also don't want to move to a model where any change to shared code requires a 6+-hour precommit spanning several projects, each of which may have its own set of potentially-flaky pre-commit tests, etc.
> >>> I can imagine that if an Arrow developer made some change to "thread.cc" and saw that TabletServerStressTest failed their precommit, they'd have no idea how to triage it, etc. That could be a strong disincentive to continued innovation in these areas of common code, which we'll need a good way to avoid.
> >>>
> >>> I think some of the above could be ameliorated with really good infrastructure -- e.g., on a test failure, automatically re-run the failed test on both pre-patch and post-patch code, do a t-test to check for a statistically significant change in flakiness level, etc. But that's a lot of infrastructure that doesn't currently exist.
> >>>
> >>> 4) Integration mechanics for breaking changes
> >>>
> >>> Currently these common libraries are treated as components of monolithic projects. That means it's no extra overhead for us to make some kind of change which breaks an API in src/kudu/util/ and at the same time updates all call sites. The internal libraries have no semblance of API compatibility guarantees, etc., and adding one is not without cost.
> >>>
> >>> Before sharing code, we should figure out how exactly we'll manage the cases where we want to make some change in a common library that breaks an API used by other projects, given there's no way to make an atomic commit across many repositories. One option is that each "user" of the libraries manually "rolls" to new versions when they feel like it, but there's still now a case where a common change "pushes work onto" the consumers to update call sites, etc.
> >>>
> >>> Admittedly, the number of breaking API changes in these common libraries is relatively small, but it would still be good to understand how we would plan to manage them.
> >>>
> >>> -Todd
> >>>
> >>> On Sun, Feb 26, 2017 at 10:12 AM, Wes McKinney <wesmck...@gmail.com> wrote:
> >>>
> >>>> hi Henry,
> >>>>
> >>>> Thank you for these comments.
> >>>>
> >>>> I think having a kind of "Apache Commons for [Modern] C++" would be an ideal (though perhaps initially more labor-intensive) solution. There's code in Arrow that I would move into this project if it existed. I am happy to help make this happen if there is interest from the Kudu and Impala communities. I am not sure logistically what would be the most expedient way to establish the project, whether as an ASF Incubator project or possibly as a new TLP that could be created by spinning IP out of Apache Kudu.
> >>>>
> >>>> I'm interested to hear the opinions of others, and possible next steps.
> >>>>
> >>>> Thanks
> >>>> Wes
> >>>>
> >>>> On Sun, Feb 26, 2017 at 2:12 AM, Henry Robinson <he...@apache.org> wrote:
> >>>>> Thanks for bringing this up, Wes.
> >>>>>
> >>>>> On 25 February 2017 at 14:18, Wes McKinney <wesmck...@gmail.com> wrote:
> >>>>>
> >>>>>> Dear Apache Kudu and Apache Impala (incubating) communities,
> >>>>>>
> >>>>>> (I'm not sure the best way to have a cross-list discussion, so I apologize if this does not work well.)
> >>>>>>
> >>>>>> On the recent Apache Parquet sync call, we discussed C++ code sharing between the codebases in Apache Arrow and Apache Parquet, and opportunities for more code sharing with Kudu and Impala as well.
> >>>>>>
> >>>>>> As context:
> >>>>>>
> >>>>>> * We have an RC out for the 1.0.0 release of apache-parquet-cpp, the first C++ release within Apache Parquet. I got involved with this project a little over a year ago and was faced with the unpleasant decision to copy and paste a significant amount of code out of Impala's codebase to bootstrap the project.
> >>>>>>
> >>>>>> * In parallel, we began the Apache Arrow project, which is designed to be a complementary library for file formats (like Parquet), storage engines (like Kudu), and compute engines (like Impala and pandas).
> >>>>>>
> >>>>>> * As Arrow and parquet-cpp matured, an increasing amount of code overlap crept in surrounding buffer memory management and IO interfaces. We recently decided in PARQUET-818 (https://github.com/apache/parquet-cpp/commit/2154e873d5aa7280314189a2683fb1e12a590c02) to remove some of the obvious code overlap in Parquet and make libarrow.a/so a hard compile- and link-time dependency for libparquet.a/so.
> >>>>>>
> >>>>>> * There is still quite a bit of code in parquet-cpp that would better fit in Arrow: SIMD hash utilities, RLE encoding, dictionary encoding, compression, bit utilities, and so forth. Much of this code originated from Impala.
> >>>>>>
> >>>>>> This brings me to a next set of points:
> >>>>>>
> >>>>>> * parquet-cpp contains quite a bit of code that was extracted from Impala. This is mostly self-contained in https://github.com/apache/parquet-cpp/tree/master/src/parquet/util
> >>>>>>
> >>>>>> * My understanding is that Kudu extracted certain computational utilities from Impala in its early days, but these tools have likely diverged as the needs of the projects have evolved.
> >>>>>>
> >>>>>> Since all of these projects are quite different in their end goals (runtime systems vs. libraries), touching code that is tightly coupled to either Kudu's or Impala's runtime is probably not worth discussing. However, I think there is a strong basis for collaboration on computational utilities and vectorized array processing.
> >>>>>> Some obvious areas that come to mind:
> >>>>>>
> >>>>>> * SIMD utilities (for hashing or processing of preallocated contiguous memory)
> >>>>>> * Array encoding utilities: RLE / dictionary, etc.
> >>>>>> * Bit manipulation (packing and unpacking; e.g., Daniel Lemire contributed a patch to parquet-cpp around this)
> >>>>>> * Date and time utilities
> >>>>>> * Compression utilities
> >>>>>
> >>>>> Between Kudu and Impala (at least) there are many more opportunities for sharing. Threads, logging, metrics, concurrency primitives -- the list is quite long.
> >>>>>
> >>>>>> I hope the benefits are obvious: consolidating efforts on unit testing, benchmarking, performance optimizations, continuous integration, and platform compatibility.
> >>>>>>
> >>>>>> Logistically speaking, one possible avenue might be to use Apache Arrow as the place to assemble this code. Its third-party toolchain is small, and it builds and installs fast. It is intended as a library whose headers are used by, and linked against, other applications. (As an aside, I'm very interested in building optional support for Arrow columnar messages into the Kudu client.)
> >>>>>
> >>>>> In principle I'm in favour of code sharing, and it seems very much in keeping with the Apache way. However, practically speaking, I'm of the opinion that it only makes sense to house shared support code in a separate, dedicated project.
> >>>>>
> >>>>> Embedding the shared libraries in, e.g., Arrow naturally limits the scope of sharing to utilities that Arrow is interested in. It would make no sense to add a threading library to Arrow if it was never used natively. Muddying the waters of the project's charter seems likely to lead to user, and developer, confusion.
> >>>>> Similarly, we should not necessarily couple Arrow's design goals to those it inherits from Kudu's and Impala's source code.
> >>>>>
> >>>>> I think I'd rather see a new Apache project than re-use a current one for two independent purposes.
> >>>>>
> >>>>>> The downside of code sharing, which may have prevented it so far, is the logistics of coordinating ASF release cycles and keeping build toolchains in sync. It's taken us the past year to stabilize the design of Arrow for its intended use cases, so at this point, if we went down this road, I would be OK with helping the community commit to a regular release cadence that would be faster than Impala's, Kudu's, and Parquet's respective release cadences. Since members of the Kudu and Impala PMCs are also on the Arrow PMC, I trust we would be able to collaborate to each other's mutual benefit and success.
> >>>>>>
> >>>>>> Note that Arrow does not throw C++ exceptions and follows the Google C++ style guide to the same extent as Kudu and Impala do.
> >>>>>>
> >>>>>> If this is something that either the Kudu or Impala communities would like to pursue in earnest, I would be happy to work with you on next steps. I would suggest that we start with something small, so that we could address the necessary build toolchain changes, and develop a workflow for moving around code and tests, a protocol for code reviews (e.g. Gerrit), and a process for coordinating ASF releases.
> >>>>>
> >>>>> I think, if I'm reading this correctly, that you're assuming integration with the 'downstream' projects (e.g. Impala and Kudu) would be done via their toolchains.
> >>>>> For something as fast-moving as utility code -- and critical, where you want the latency between adding a fix and including it in your build to be ~0 -- that's a non-starter to me, at least with how the toolchains are currently realised.
> >>>>>
> >>>>> I'd rather have the source code directly imported into Impala's tree -- whether by git submodule or some other mechanism. That way the coupling is looser, and we can move more quickly. I think that's important to other projects as well.
> >>>>>
> >>>>> Henry
> >>>>>
> >>>>>> Let me know what you think.
> >>>>>>
> >>>>>> best
> >>>>>> Wes
> >>>
> >>> --
> >>> Todd Lipcon
> >>> Software Engineer, Cloudera
> >>
> >> --
> >> Cheers,
> >> Leif

--
Cheers,
Leif