I didn't have time for this, but it just bit me. We definitely have Guava on the API surface of runner support code in ways that get incompatibly shaded. I will probably start "1a" by making a shaded library org.apache.beam:vendored-guava and starting to use it. It sounds like there is generally unanimous support for that much, anyhow.
Kenn On Wed, Dec 13, 2017 at 8:31 AM, Aljoscha Krettek <aljos...@apache.org> wrote: > Thanks Ismaël for bringing up this discussion again! > > I would be in favour of 1) and more specifically of 1a) > > Aljoscha > > > On 12. Dec 2017, at 18:56, Lukasz Cwik <lc...@google.com> wrote: > > You can always run tests on post shaded artifacts instead of the compiled > classes, it just requires us to change our maven surefire / gradle test > configurations but it is true that most IDEs would behave better with a > dependency jar unless you delegate all the build/test actions to the build > system and then it won't matter. > > On Mon, Dec 11, 2017 at 9:05 PM, Kenneth Knowles <k...@google.com> wrote: > >> There's also, with additional overhead, >> >> 1a) A relocated and shipped package for each thing we want to relocate. I >> think this has also been tried outside Beam... >> >> Pros: >> * all the pros of 1) plus no bloat beyond what is necessary >> Cons: >> * abandons whitelist approach for public deps, reverting to blacklist >> approach for trouble things like guava, so a bit less principled >> >> For both 1) and 1a) I would add: >> >> Pros: >> * clearly readable dependency since code will `import >> org.apache.beam.private.guava21` and IDEs will understand it is a >> distinct lilbrary >> * can run tests on unpackaged classes, as long as deps are shaded or >> provided as jars >> * no mysterious action at a distance from inherited configuration >> Cons: >> * need to adjust imports >> >> Kenn >> >> On Mon, Dec 11, 2017 at 9:57 AM, Lukasz Cwik <lc...@google.com> wrote: >> >>> I would suggest that either we use: >>> 1) A common deps package containing shaded dependencies allows for >>> Pros >>> * doesn't require the user to build an uber jar >>> Risks >>> * dependencies package will keep growing even if something is or isn't >>> needed by all of Apache Beam leading to a large jar anyways negating any >>> space savings >>> >>> 2) Shade within each module to a common location like >>> org.apache.beam.relocated.guava.... >>> Pros >>> * you only get the shaded dependencies of the things that are required >>> * its one less dependency for users to manage >>> Risks >>> * requires an uber jar to be built to get the space savings (either by a >>> user or a distribution of Apache Beam) otherwise we negate any space >>> savings. >>> >>> If we either use a common relocation scheme or a dependencies jar then >>> each relocation should specifically contain the version number of the >>> package because we would like to allow for us to be using two different >>> versions of the same library. >>> >>> For the common deps package approach, should we check in code where the >>> imports contain the relocated location (e.g. import >>> org.apache.beam.guava.20.0.com.google.common.collect.ImmutableList)? >>> >>> >>> On Mon, Dec 11, 2017 at 8:47 AM, Jean-Baptiste Onofré <j...@nanthrax.net> >>> wrote: >>> >>>> Thanks for bringing that back. >>>> >>>> Indeed guava is shaded in different uber-jar. Maybe we can have a >>>> common deps module that we include once (but the user will have to >>>> explicitly define the dep) ? >>>> >>>> Basically, what do you propose for protobuf (unfortunately, I don't see >>>> an obvious) ? >>>> >>>> Regards >>>> JB >>>> >>>> >>>> On 12/11/2017 05:35 PM, Ismaël Mejía wrote: >>>> >>>>> Hello, I wanted to bring back this subject because I think we should >>>>> take action on this and at least first have a shaded version of guava. >>>>> I was playing with a toy project and I did the procedure we use to >>>>> submit jars to a Hadoop cluster via Flink/Spark which involves >>>>> creating an uber jar and I realized that the size of the jar was way >>>>> bigger than I expected, and the fact that we shade guava in every >>>>> module contributes to this. I found guava shaded on: >>>>> >>>>> sdks/java/core >>>>> runners/core-construction-java >>>>> runners/core-java >>>>> model/job-management >>>>> runners/spark >>>>> sdks/java/io/hadoop-file-system >>>>> sdks/java/io/kafka >>>>> >>>>> This means at least 6 times more of the size it should which counts in >>>>> around 15MB more (2.4MB*6 deps) of extra weight that we can simply >>>>> reduce by using a shaded version of the library. >>>>> >>>>> I add this point to the previous ones mentioned during the discussion >>>>> because this goes against the end user experience and affects us all >>>>> (devs/users). >>>>> >>>>> Another question is if we should shade (and how) protocol buffers >>>>> because now with the portability work we are exposing it closer to the >>>>> end users. I say this because I also found an issue while running a >>>>> job on YARN with the spark runner because hadoop-common includes >>>>> protobuf-java 2 and I had to explicitly provide protocol-buffers 3 as >>>>> a dependency to be able to use triggers (note the Spark runner >>>>> translates them using some method from runners/core-java). Since >>>>> hadoop-common is provided in the cluster with the older version of >>>>> protobuf, I am afraid that this will bite us in the future. >>>>> >>>>> Ismaël >>>>> >>>>> ps. There is already a JIRA for that shading for protobuf on >>>>> hadoop-common but this is not coming until version 3 is out. >>>>> https://issues.apache.org/jira/browse/HADOOP-13136 >>>>> >>>>> ps2. Extra curious situation is to see that the dataflow-runner ends >>>>> up having guava shaded twice via its shaded version on >>>>> core-construction-java. >>>>> >>>>> ps3. Of course this message means a de-facto +1 at least to do it for >>>>> guava and evaluate it for other libs. >>>>> >>>>> >>>>> On Tue, Oct 17, 2017 at 7:29 PM, Lukasz Cwik <lc...@google.com.invalid> >>>>> wrote: >>>>> >>>>>> An issue to call out is how to deal with our generated code (.avro and >>>>>> .proto) as I don't believe those plugins allow you to generate code >>>>>> using a >>>>>> shaded package prefix on imports. >>>>>> >>>>>> On Tue, Oct 17, 2017 at 10:28 AM, Thomas Groh < >>>>>> tg...@google.com.invalid> >>>>>> wrote: >>>>>> >>>>>> +1 to the goal. I'm hugely in favor of not doing the same shading work >>>>>>> every time for dependencies we know we'll use. >>>>>>> >>>>>>> This also means that if we end up pulling in transitive dependencies >>>>>>> we >>>>>>> don't want in any particular module we can avoid having to adjust our >>>>>>> repackaging strategy for that module - which I have run into >>>>>>> face-first in >>>>>>> the past. >>>>>>> >>>>>>> On Tue, Oct 17, 2017 at 9:48 AM, Kenneth Knowles < >>>>>>> k...@google.com.invalid> >>>>>>> wrote: >>>>>>> >>>>>>> Hi all, >>>>>>>> >>>>>>>> Shading is a big part of how we keep our dependencies sane in Beam. >>>>>>>> But >>>>>>>> downsides: shading is super slow, causes massive jar bloat, and >>>>>>>> kind of >>>>>>>> hard to get right because artifacts and namespaces are not 1-to-1. >>>>>>>> >>>>>>>> I know that some communities distribute their own shaded >>>>>>>> distributions of >>>>>>>> dependencies. I had a thought about doing something similar that I >>>>>>>> wanted >>>>>>>> to throw out there for people to poke holes in. >>>>>>>> >>>>>>>> To set the scene, here is how I view shading: >>>>>>>> >>>>>>>> - A module has public dependencies and private dependencies. >>>>>>>> - Public deps are used for data interchange; users must share >>>>>>>> these >>>>>>>> >>>>>>> deps. >>>>>>> >>>>>>>> - Private deps are just functionality and can be hidden (in our >>>>>>>> case, >>>>>>>> relocated + bundled) >>>>>>>> - It isn't necessarily that simple, because public and private >>>>>>>> deps >>>>>>>> >>>>>>> might >>>>>>> >>>>>>>> interact in higher-order ways ("public" is contagious) >>>>>>>> >>>>>>>> Shading is an implementation detail of expressing these >>>>>>>> characteristics. >>>>>>>> >>>>>>> We >>>>>>> >>>>>>>> use shading selectively because of its downsides I mentioned above. >>>>>>>> >>>>>>>> But what about this idea: Introduce shaded deps as a single separate >>>>>>>> artifact. >>>>>>>> >>>>>>>> - sdks/java/private-deps: bundled uber jar with relocated >>>>>>>> versions of >>>>>>>> everything we want to shade >>>>>>>> >>>>>>>> - sdks/java/core and sdks/java/harness: no relocation or bundling >>>>>>>> - >>>>>>>> depends on `beam-sdks-java-private-deps` and imports like >>>>>>>> `org.apache.beam.sdk.private.com.google.common` directly (this is >>>>>>>> what >>>>>>>> they >>>>>>>> are rewritten to >>>>>>>> >>>>>>>> Some benefits >>>>>>>> >>>>>>>> - much faster builds of other modules >>>>>>>> - only one shaded uber jar >>>>>>>> - rare/no rebuilds of the uber jar >>>>>>>> - can use maven enforcer to forbid imports like com.google.common >>>>>>>> - configuration all in one place >>>>>>>> - no automated rewriting of our real code, which has led to some >>>>>>>> major >>>>>>>> confusion >>>>>>>> - easy to implement incrementally >>>>>>>> >>>>>>>> Downsides: >>>>>>>> >>>>>>>> - plenty of effort work to get there >>>>>>>> - unclear how many different such deps modules we need; sharing >>>>>>>> them >>>>>>>> >>>>>>> could >>>>>>> >>>>>>>> get weird >>>>>>>> - if we hit a roadblock, we will have committed a lot of time >>>>>>>> >>>>>>>> Just something I was musing as I spent another evening waiting for >>>>>>>> slow >>>>>>>> builds to try to confirm changes to brittle poms. >>>>>>>> >>>>>>>> Kenn >>>>>>>> >>>>>>>> >>>>>>> >>>> -- >>>> Jean-Baptiste Onofré >>>> jbono...@apache.org >>>> http://blog.nanthrax.net >>>> Talend - http://www.talend.com >>>> >>> >>> >> > >