Re: Toolchain - versioning dependencies with the same version number

2017-02-27 Thread Henry Robinson
Yes, it would force re-downloading. At my office, downloading a toolchain
takes a matter of seconds, so I'm not sure the cost is that great.
And if it turned out to be problematic, one could always change the
toolchain directory for different branches. Having something locally that
set IMPALA_TOOLCHAIN_DIR=${IMPALA_HOME}/${IMPALA_TOOLCHAIN_BUILD_ID}/ would
work.
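
A minimal sketch of that per-branch override, assuming only that the
bootstrap scripts read IMPALA_TOOLCHAIN_DIR from the environment (the path
layout is just the example from the e-mail above, not a committed
convention):

    import os

    # Key the toolchain directory on the build ID so that switching
    # branches never clobbers a previously downloaded toolchain.
    impala_home = os.environ["IMPALA_HOME"]
    build_id = os.environ["IMPALA_TOOLCHAIN_BUILD_ID"]
    os.environ["IMPALA_TOOLCHAIN_DIR"] = os.path.join(impala_home, build_id)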

However, I wouldn't want to force that behaviour into the toolchain
scripts, because of the garbage-collection problem it would raise: it
wouldn't be clear when to delete old toolchains programmatically.

On 27 February 2017 at 20:51, Tim Armstrong  wrote:




-- 
Henry Robinson
Software Engineer
Cloudera
415-994-6679


Re: Toolchain - versioning dependencies with the same version number

2017-02-27 Thread Tim Armstrong
Maybe I'm misunderstanding, but wouldn't that force re-downloading of the
entire toolchain every time a developer switches between branches with
different build IDs?

I know some developers do that frequently, e.g. to try and reproduce bugs
on older versions or backport patches.

I agree it would be good to fix this, since I've run into this problem
before; I'm just not quite sure what the best solution is. In the other
case where I had this issue, with LLVM, I changed the version number (by
appending noasserts- to it), but that's really just a hack.

-Tim

On Mon, Feb 27, 2017 at 4:35 PM, Henry Robinson  wrote:



Re: Toolchain - versioning dependencies with the same version number

2017-02-27 Thread Henry Robinson
As Matt said, I have a patch that implements build ID-based versioning at
https://gerrit.cloudera.org/#/c/6166/2.

Does anyone want to take a look? If we could get this in soon, it would help
smooth over the LZ4 change, which is going in shortly.

On 27 February 2017 at 14:21, Henry Robinson  wrote:




-- 
Henry Robinson
Software Engineer
Cloudera
415-994-6679


Re: Toolchain - versioning dependencies with the same version number

2017-02-27 Thread Henry Robinson
I agree that that might be useful, and that it's a separately addressable
problem.

On 27 February 2017 at 14:18, Matthew Jacobs  wrote:




-- 
Henry Robinson
Software Engineer
Cloudera
415-994-6679


Re: Toolchain - versioning dependencies with the same version number

2017-02-27 Thread Matthew Jacobs
Just catching up on this e-mail, though I had seen your code reviews,
and I think this approach makes sense. An additional concern is
how to identify how a toolchain package was built; AFAIK this is
tricky now if only the 'toolchain ID' is known. Before I saw this
e-mail I was thinking about this problem (which I think we can address
separately), and that we might want to write the native-toolchain git
hash with every toolchain build so that the exact build scripts are
associated with those build artifacts. I filed
https://issues.cloudera.org/browse/IMPALA-5002 for this related
problem.
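
A minimal sketch of the build-side half of that idea, with an assumed
metadata file name (the actual mechanism is whatever IMPALA-5002 settles
on):

    import os
    import subprocess

    def write_build_metadata(artifact_dir):
        # Record the native-toolchain git hash next to the build
        # artifacts, so a binary package can be traced back to the exact
        # build scripts that produced it.
        git_hash = subprocess.check_output(
            ["git", "rev-parse", "HEAD"]).decode("ascii").strip()
        with open(os.path.join(artifact_dir, "GIT_HASH"), "w") as f:
            f.write(git_hash + "\n")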

On Sat, Feb 25, 2017 at 10:22 PM, Henry Robinson  wrote:
> As written, the toolchain apparently can't deal with build flags
> changing while the dependency version remains the same.
>
> LZ4 has never (afaict) been built with optimization enabled. I have a
> commit that enables -O3, but that continues to produce artifacts for
> lz4-1.7.5 with no version change. This is a problem because bootstrapping
> the toolchain will fail to pick up the new binaries: the previously
> downloaded version is still in the local cache, and won't be overwritten,
> since the version hasn't changed.
>
> I think the simplest way to fix this is to write the toolchain build ID to
> the dependency version file (that's in the local cache only) when it's
> downloaded. If that ID changes, the dependency will be re-downloaded.
>
> This has the disadvantage that any bump in IMPALA_TOOLCHAIN_BUILD_ID will
> invalidate all dependencies, and bin/bootstrap_toolchain.py will
> re-download all of them. My feeling is that that cost is better than trying
> to individually determine whether a dependency has changed between
> toolchain builds.
>
> Any thoughts on whether this is the right way to go?
>
> Henry
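
A sketch of the check Henry proposes, with assumed file and function
names - the real implementation is the Gerrit patch linked earlier in
the thread, not this code:

    import os

    def needs_redownload(package_dir, package_version, build_id):
        # The locally cached version file records both the package
        # version and the toolchain build ID at download time, so a
        # build-ID bump invalidates the cache even when the package
        # version (e.g. lz4-1.7.5) is unchanged.
        version_file = os.path.join(package_dir, "VERSION")
        if not os.path.exists(version_file):
            return True
        with open(version_file) as f:
            cached = f.read().strip()
        return cached != "%s %s" % (package_version, build_id)

    def record_download(package_dir, package_version, build_id):
        # Written after a successful download, so the next bootstrap run
        # can compare against the current IMPALA_TOOLCHAIN_BUILD_ID.
        with open(os.path.join(package_dir, "VERSION"), "w") as f:
            f.write("%s %s\n" % (package_version, build_id))

Bumping IMPALA_TOOLCHAIN_BUILD_ID changes the expected string for every
cached package, which is exactly the "invalidate all dependencies" cost
Henry describes.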


[Toolchain-CR] Add a script to build Kudu using existing toolchain artifacts

2017-02-27 Thread Matthew Jacobs (Code Review)
Matthew Jacobs has abandoned this change.

Change subject: Add a script to build Kudu using existing toolchain artifacts
..


Abandoned

Moving to the new project.

-- 
To view, visit http://gerrit.cloudera.org:8080/6014

Gerrit-MessageType: abandon
Gerrit-Change-Id: I237580e1545033467a92285ca8bb8db1cf8c804e
Gerrit-PatchSet: 1
Gerrit-Project: Toolchain
Gerrit-Branch: master
Gerrit-Owner: Matthew Jacobs 
Gerrit-Reviewer: Henry Robinson 


[Toolchain-CR] Add a script to build Kudu using existing toolchain artifacts

2017-02-27 Thread Henry Robinson (Code Review)
Henry Robinson has posted comments on this change.

Change subject: Add a script to build Kudu using existing toolchain artifacts
..


Patch Set 1:

Matt - could you move this to the native-toolchain project please?

-- 
To view, visit http://gerrit.cloudera.org:8080/6014

Gerrit-MessageType: comment
Gerrit-Change-Id: I237580e1545033467a92285ca8bb8db1cf8c804e
Gerrit-PatchSet: 1
Gerrit-Project: Toolchain
Gerrit-Branch: master
Gerrit-Owner: Matthew Jacobs 
Gerrit-Reviewer: Henry Robinson 
Gerrit-HasComments: No


[Toolchain-CR] IMPALA-4983: Compile LZ4 in release mode

2017-02-27 Thread Henry Robinson (Code Review)
Henry Robinson has abandoned this change.

Change subject: IMPALA-4983: Compile LZ4 in release mode
..


Abandoned

Committed to native-toolchain project.

-- 
To view, visit http://gerrit.cloudera.org:8080/6145

Gerrit-MessageType: abandon
Gerrit-Change-Id: I8bd113822dfc4df2d76c4393c4b3b3550066dd18
Gerrit-PatchSet: 1
Gerrit-Project: Toolchain
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Matthew Jacobs 


Heads up: restarting Gerrit in a few minutes

2017-02-27 Thread Henry Robinson



Re: status-benchmark.cc compilation time

2017-02-27 Thread Tim Armstrong
I think for status-benchmark.cc we should just reduce the unrolling - I
don't see a valid reason to unroll a loop that many times unless you're
just testing the compiler. No reason we can't unroll the loop, say, 10
times, run that 100 times, and get an equally valid result.

Todd's suggestion about just running the benchmark for a couple of
iterations is a reasonable idea, although I think it depends on whether
the benchmarks are once-off experiments (in which case it seems ok to let
them bit-rot) or are actually likely to be reused.

I think if we're going to more actively maintain benchmarks we should also
consider more proactively disabling or removing once-off benchmarks that
aren't likely to be reused.

On Thu, Feb 23, 2017 at 10:26 AM, Henry Robinson  wrote:

> I think the main problem I want to avoid is paying the cost of linking,
> which is expensive for Impala as it often generates multi-hundred-MB
> binaries per benchmark or test.
>
> Building the benchmarks during GVO seems the best solution to that to me.
>
> On 23 February 2017 at 10:23, Todd Lipcon  wrote:
>
> > One thing we've found useful in Kudu to prevent bitrot of benchmarks is to
> > actually use gtest and gflags for the benchmark programs.
> >
> > We set some flag like --benchmark_num_rows or --benchmark_num_iterations
> > with a default that's low enough to only run for a second or two, and run
> > it as part of our normal test suite. Rarely catches any bugs, but serves to
> > make sure that the code keeps working. Then, when a developer wants to
> > actually test a change for performance, they can run it with
> > --num_iterations=.
> >
> > Doesn't help the weird case of status-benchmark where *compiling* takes 10
> > minutes... but I think the manual unrolling of 1000 status calls in there
> > is probably unrealistic anyway regarding how the different options perform
> > in a whole-program setting.
> >
> > -Todd
> >
> > On Thu, Feb 23, 2017 at 10:20 AM, Zachary Amsden 
> > wrote:
> >
> > > Yes.  If you take a look at the benchmark, you'll notice the JNI call to
> > > initialize the frontend doesn't even have the right signature anymore.
> > > That's one easy way to bitrot while still compiling.
> > >
> > > Even fixing that isn't enough to get it off the ground.
> > >
> > >  - Zach
> > >
> > > On Tue, Feb 21, 2017 at 11:44 AM, Henry Robinson 
> > > wrote:
> > >
> > > > Did you run . bin/set-classpath.sh before running expr-benchmark?
> > > >
> > > > On 21 February 2017 at 11:30, Zachary Amsden  wrote:
> > > >
> > > > > Unfortunately some of the benchmarks have actually bit-rotted.  For
> > > > > example, expr-benchmark compiles but immediately throws JNI exceptions.
> > > > >
> > > > > On Tue, Feb 21, 2017 at 10:55 AM, Marcel Kornacker
> > > > >  wrote:
> > > > >
> > > > > > I'm also in favor of not compiling it on the standard commandline.
> > > > > >
> > > > > > However, I'm very much against allowing the benchmarks to bitrot. As
> > > > > > was pointed out, those benchmarks can be valuable tools during
> > > > > > development, and keeping them in working order shouldn't really impact
> > > > > > the development process.
> > > > > >
> > > > > > In other words, let's compile them as part of gvo.
> > > > > >
> > > > > > On Tue, Feb 21, 2017 at 10:50 AM, Alex Behm
> > > > > >  wrote:
> > > > > > > +1 for not compiling the benchmarks in -notests
> > > > > > >
> > > > > > > On Mon, Feb 20, 2017 at 7:55 PM, Jim Apple
> > > > > > >  wrote:
> > > > > > >
> > > > > > >> > On which note, would anyone object if we disabled benchmark
> > > > > > >> > compilation by default when building the BE tests? I mean
> > > > > > >> > separating out -notests into -notests and -build_benchmarks
> > > > > > >> > (the latter false by default).
> > > > > > >>
> > > > > > >> I think this is a great idea.
> > > > > > >>
> > > > > > >> > I don't mind if the benchmarks bitrot as a result, because we
> > > > > > >> > don't run them regularly or pay attention to their output except
> > > > > > >> > when developing a feature. Of course, maybe an 'exhaustive' run
> > > > > > >> > should build the benchmarks as well just to keep us honest, but
> > > > > > >> > I'd be happy if 95% of Jenkins builds didn't bother.
> > > > > > >>
> > > > > > >> The pre-merge (aka GVM aka GVO) testing builds
> > > > > > >> http://jenkins.impala.io:8080/job/all-build-options, which builds
> > > > > > >> without the "-notests" flag.
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Henry Robinson
> > > > Software Engineer
> > > > Cloudera
> > > > 415-994-6679
> > > >
> > >
> >
> >
> >
> > --
> > Todd Lipcon
> > Software Engineer, Cloudera
> >
>
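
Todd's flag-gated benchmark pattern above can be sketched briefly. This
is an illustrative reconstruction in Python (argparse standing in for
gflags, a plain script run standing in for gtest); the flag name and
default are assumptions, not Kudu's actual code:

    import argparse
    import time

    def run_benchmark(num_iterations):
        # Stand-in for the operation under test.
        total = 0
        start = time.time()
        for i in range(num_iterations):
            total += i * i
        return total, time.time() - start

    def main():
        parser = argparse.ArgumentParser()
        # The default is low enough to finish in well under a second, so
        # running this as an ordinary test keeps the benchmark compiling
        # and working instead of bit-rotting.
        parser.add_argument("--benchmark_num_iterations", type=int,
                            default=1000)
        args = parser.parse_args()
        _, elapsed = run_benchmark(args.benchmark_num_iterations)
        print("iterations=%d elapsed=%.3fs"
              % (args.benchmark_num_iterations, elapsed))

    if __name__ == "__main__":
        main()

Run with no arguments it doubles as a quick smoke test; run with a large
--benchmark_num_iterations it becomes an actual measurement.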

Re: [DISCUSS] C++ code sharing amongst Apache {Arrow, Kudu, Impala, Parquet}

2017-02-27 Thread Leif Walsh
Julian, are you proposing the Arrow project ship two artifacts,
arrow-common and arrow, where arrow depends on arrow-common?
On Mon, Feb 27, 2017 at 11:51 Julian Hyde  wrote:


Re: [DISCUSS] C++ code sharing amongst Apache {Arrow, Kudu, Impala, Parquet}

2017-02-27 Thread Julian Hyde
“Commons” projects are often problematic. It is difficult to tell what is in 
scope and out of scope. If the scope is drawn too wide, there is a real problem 
of orphaned features, because people contribute one feature and then disappear.

Let’s remember the Apache mantra: community over code. If you create a 
sustainable community, the code will get looked after. Would this project form 
a new community, or just a new piece of code? As I read the current proposal, 
it would be the intersection of some existing communities, not a new community.

I think it would take a considerable effort to create a new project and 
community around the idea of “c++ commons” (or is it “database-related c++ 
commons”?). I think you already have such a community, to a first 
approximation, in the Arrow project, because Kudu and Impala developers are 
already part of the Arrow community. There’s no reason why Arrow cannot contain 
new modules that have different release schedules than the rest of Arrow. As a 
TLP, releases are less burdensome, and can happen in a little over 3 days if 
the component is kept stable.

Lastly, the code is fungible. It can be marked “experimental” within Arrow and 
moved to another project, or into a new project, as it matures. The Apache 
license and the ASF CLA makes this very easy. We are doing something like this 
in Calcite: the Avatica sub-project [1] has a community that intersects with 
Calcite’s, is disconnected at a code level, and may over time evolve into a 
separate project. In the mean time, being part of an established project is 
helpful, because there are PMC members to vote.

Julian

[1] https://calcite.apache.org/avatica/ 

> On Feb 27, 2017, at 6:41 AM, Wes McKinney  wrote:

Re: [DISCUSS] C++ code sharing amongst Apache {Arrow, Kudu, Impala, Parquet}

2017-02-27 Thread Wes McKinney
Responding to Todd's e-mail:

1) Open source release model

My expectation is that this library would release about once a month,
with occasional faster releases for critical fixes.

2) Governance/review model

Beyond having centralized code reviews, it's hard to predict how the
governance would play out. I understand that OSS projects behave
differently in their planning / design / review process, so work on a
common need may require more of a negotiation than the prior
"unilateral" process.

I think it says something for our communities that we would make a
commitment in our collaboration on this to the success of the
"consumer" projects. So if the Arrow or Parquet communities were
contemplating a change that might impact Kudu, for example, it would
be in our best interest to be careful and communicate proactively.

This all makes sense. From an Arrow and Parquet perspective, we do not
add very much testing burden because our continuous integration suites
do not take long to run.

3) Pre-commit/test mechanics

One thing that would help would be community-maintained
Dockerfiles/Docker images (or equivalent) to assist with validation
and testing for developers.

I am happy to comply with a pre-commit testing protocol that works for
the Kudu and Impala teams.

4) Integration mechanics for breaking changes

> One option is that each "user" of the libraries manually "rolls" to new 
> versions when they feel like it, but there's still now a case where a common 
> change "pushes work onto" the consumers to update call sites, etc.

Breaking API changes will create extra work, because any automated
testing that we create will not be able to validate the patch to the
common library. Perhaps we can configure a manual way (in Jenkins,
say) to test two patches together.

In the event that a community member has a patch containing an API
break that impacts a project that they are not a contributor for,
there should be some expectation to either work with the affected
project on a coordinated patch or obtain their +1 to merge the patch
even though it may require a follow-up patch if the roll-forward
in the consumer project exposes bugs in the common library. There may
be situations like:

* Kudu changes API in $COMMON that impacts Arrow
* Arrow says +1, we will roll forward $COMMON later
* Patch merged
* Arrow rolls forward, discovers bug caused by patch in $COMMON
* Arrow proposes patch to $COMMON
* ...

This is the worst case scenario, of course, but I actually think it is
good because it would indicate that the unit testing in $COMMON needs
to be improved. Unit testing in the common library, therefore, would
take on more of a "defensive" quality than currently.

In any case, I'm keen to move forward to coming up with a concrete
plan if we can reach consensus on the particulars.

Thanks
Wes

On Sun, Feb 26, 2017 at 10:18 PM, Leif Walsh  wrote:
> I also support the idea of creating an "apache commons modern c++" style
> library, maybe tailored toward the needs of columnar data processing
> tools.  I think APR is the wrong project but I think that *style* of
> project is the right direction to aim.
>
> I agree this adds test and release process complexity across products but I
> think the benefits of a shared, well-tested library outweigh that, and
> creating such test infrastructure will have long-term benefits as well.
>
> I'd be happy to lend a hand wherever it's needed.
>
> On Sun, Feb 26, 2017 at 4:03 PM Todd Lipcon  wrote:
>
>> Hey folks,
>>
>> As Henry mentioned, Impala is starting to share more code with Kudu (most
>> notably our RPC system, but that pulls in a fair bit of utility code as
>> well), so we've been chatting periodically offline about the best way to do
>> this. Having more projects potentially interested in collaborating is
>> definitely welcome, though I think does also increase the complexity of
>> whatever solution we come up with.
>>
>> I think the potential benefits of collaboration are fairly self-evident, so
>> I'll focus on my concerns here, which somewhat echo Henry's.
>>
>> 1) Open source release model
>>
>> The ASF is very much against having projects which do not do releases. So,
>> if we were to create some new ASF project to hold this code, we'd be
>> expected to do frequent releases thereof. Wes volunteered above to lead
>> frequent releases, but we actually need at least 3 PMC members to vote on
>> each release, and given people can come and go, we'd probably need at least
>> 5-8 people who are actively committed to helping with the release process
>> of this "commons" project.
>>
>> Unlike our existing projects, which seem to release every 2-3 months, if
>> that, I think this one would have to release _much_ more frequently, if we
>> expect downstream projects to depend on released versions rather than just
>> pulling in some recent (or even trunk) git hash. Since the ASF requires the
>> normal voting period and process for every