Nice. This sounds like a great idea (or two?) and goes along with what I just started for futures.
Guava: filed https://issues.apache.org/jira/browse/BEAM-3606 and assigned to Ismaël :-) and converted my futures thing to a subtask. Specific things for our micro guava: - checkArgumentNotNull can throw the right exception - our own Optional because Java's is not Serializable - futures combinators since many are missing, especially Java's don't do exceptions right Protobuf: didn't file an issue because I'm not sure I was wondering if pre-shading works. We really need to get rid of it from public APIs in a 100% reliable way. It is also a problem for Dataflow. I was wondering if one approach is to pre-shade gcpio-protobuf-java, gcpio-protobuf-java-util, gcpio-grpc-java, etc. Anything that needs to take a Message object. (and do the same for beam-model-protobuf-java since the model bits have to depend on each other but nothing else). Kenn On Thu, Feb 1, 2018 at 1:56 AM, Ismaël Mejía <ieme...@gmail.com> wrote: > Huge +1 to get rid of Guava! > > This solves annoying dependency issues for some IOs and allow us to > get rid of the shading that makes current jars bigger than they should > be. > > We can create our own 'micro guava' package with some classes for > things that are hard to migrate, or that we prefer to still have like > the check* methods for example. Given the size of the task we should > probably divide it into subtasks, more important is to get rid of it > for 'sdks/java/core'. We can then attack other areas afterwards. > > Other important idea would be to get rid of Protobuf in public APIs > like GCPIO and to better shade it from leaking into the runners. An > unexpected side effect of this is a leak of netty via gRPC/protobuf > that is byting us for the Spark runner, but well that's worth a > different discussion. > > > On Thu, Feb 1, 2018 at 10:08 AM, Romain Manni-Bucau > <rmannibu...@gmail.com> wrote: > > a map of list is fine and not a challenge we'll face long I hope ;) > > > > > > Romain Manni-Bucau > > @rmannibucau | Blog | Old Blog | Github | LinkedIn > > > > 2018-02-01 9:40 GMT+01:00 Reuven Lax <re...@google.com>: > >> > >> Not sure we'll be able to replace them all. Things like guava Table and > >> Multimap don't have great replacements in Java8. > >> > >> On Wed, Jan 31, 2018 at 10:11 PM, Jean-Baptiste Onofré <j...@nanthrax.net > > > >> wrote: > >>> > >>> +1, it was on my TODO for a while waiting the Java8 update. > >>> > >>> Regards > >>> JB > >>> > >>> On 02/01/2018 06:56 AM, Romain Manni-Bucau wrote: > >>> > Why not dropping guava for all beam codebase? With java 8 it is quite > >>> > easy to do > >>> > it and avoid a bunch of conflicts. Did it in 2 projects with quite a > >>> > good result. > >>> > > >>> > Le 1 févr. 2018 06:50, "Lukasz Cwik" <lc...@google.com > >>> > <mailto:lc...@google.com>> a écrit : > >>> > > >>> > Make sure to include the guava version in the artifact name so > that > >>> > we can > >>> > have multiple vendored versions. > >>> > > >>> > On Wed, Jan 31, 2018 at 9:16 PM, Kenneth Knowles <k...@google.com > >>> > <mailto:k...@google.com>> wrote: > >>> > > >>> > I didn't have time for this, but it just bit me. We > definitely > >>> > have > >>> > Guava on the API surface of runner support code in ways that > >>> > get > >>> > incompatibly shaded. I will probably start "1a" by making a > >>> > shaded > >>> > library org.apache.beam:vendored-guava and starting to use > it. > >>> > It sounds > >>> > like there is generally unanimous support for that much, > >>> > anyhow. > >>> > > >>> > Kenn > >>> > > >>> > On Wed, Dec 13, 2017 at 8:31 AM, Aljoscha Krettek > >>> > <aljos...@apache.org > >>> > <mailto:aljos...@apache.org>> wrote: > >>> > > >>> > Thanks Ismaël for bringing up this discussion again! > >>> > > >>> > I would be in favour of 1) and more specifically of 1a) > >>> > > >>> > Aljoscha > >>> > > >>> > > >>> >> On 12. Dec 2017, at 18:56, Lukasz Cwik < > lc...@google.com > >>> >> <mailto:lc...@google.com>> wrote: > >>> >> > >>> >> You can always run tests on post shaded artifacts > instead > >>> >> of the > >>> >> compiled classes, it just requires us to change our > maven > >>> >> surefire > >>> >> / gradle test configurations but it is true that most > IDEs > >>> >> would > >>> >> behave better with a dependency jar unless you delegate > >>> >> all the > >>> >> build/test actions to the build system and then it won't > >>> >> matter. > >>> >> > >>> >> On Mon, Dec 11, 2017 at 9:05 PM, Kenneth Knowles > >>> >> <k...@google.com > >>> >> <mailto:k...@google.com>> wrote: > >>> >> > >>> >> There's also, with additional overhead, > >>> >> > >>> >> 1a) A relocated and shipped package for each thing > we > >>> >> want to > >>> >> relocate. I think this has also been tried outside > >>> >> Beam... > >>> >> > >>> >> Pros: > >>> >> * all the pros of 1) plus no bloat beyond what is > >>> >> necessary > >>> >> Cons: > >>> >> * abandons whitelist approach for public deps, > >>> >> reverting to > >>> >> blacklist approach for trouble things like guava, > so a > >>> >> bit > >>> >> less principled > >>> >> > >>> >> For both 1) and 1a) I would add: > >>> >> > >>> >> Pros: > >>> >> * clearly readable dependency since code will > `import > >>> >> org.apache.beam.private.guava21` and IDEs will > >>> >> understand it > >>> >> is a distinct lilbrary > >>> >> * can run tests on unpackaged classes, as long as > deps > >>> >> are > >>> >> shaded or provided as jars > >>> >> * no mysterious action at a distance from inherited > >>> >> configuration > >>> >> Cons: > >>> >> * need to adjust imports > >>> >> > >>> >> Kenn > >>> >> > >>> >> On Mon, Dec 11, 2017 at 9:57 AM, Lukasz Cwik > >>> >> <lc...@google.com > >>> >> <mailto:lc...@google.com>> wrote: > >>> >> > >>> >> I would suggest that either we use: > >>> >> 1) A common deps package containing shaded > >>> >> dependencies > >>> >> allows for > >>> >> Pros > >>> >> * doesn't require the user to build an uber jar > >>> >> Risks > >>> >> * dependencies package will keep growing even if > >>> >> something > >>> >> is or isn't needed by all of Apache Beam leading > >>> >> to a > >>> >> large jar anyways negating any space savings > >>> >> > >>> >> 2) Shade within each module to a common location > >>> >> like > >>> >> org.apache.beam.relocated.guava.... > >>> >> Pros > >>> >> * you only get the shaded dependencies of the > >>> >> things that > >>> >> are required > >>> >> * its one less dependency for users to manage > >>> >> Risks > >>> >> * requires an uber jar to be built to get the > >>> >> space > >>> >> savings (either by a user or a distribution of > >>> >> Apache > >>> >> Beam) otherwise we negate any space savings. > >>> >> > >>> >> If we either use a common relocation scheme or a > >>> >> dependencies jar then each relocation should > >>> >> specifically > >>> >> contain the version number of the package > because > >>> >> we would > >>> >> like to allow for us to be using two different > >>> >> versions of > >>> >> the same library. > >>> >> > >>> >> For the common deps package approach, should we > >>> >> check in > >>> >> code where the imports contain the relocated > >>> >> location > >>> >> (e.g. import org.apache.beam.guava.20.0.com > >>> >> > >>> >> <http://org.apache.beam.guava.20.0.com/>.google.common. > collect.ImmutableList)? > >>> >> > >>> >> > >>> >> On Mon, Dec 11, 2017 at 8:47 AM, Jean-Baptiste > >>> >> Onofré > >>> >> <j...@nanthrax.net <mailto:j...@nanthrax.net>> > wrote: > >>> >> > >>> >> Thanks for bringing that back. > >>> >> > >>> >> Indeed guava is shaded in different > uber-jar. > >>> >> Maybe we > >>> >> can have a common deps module that we > include > >>> >> once > >>> >> (but the user will have to explicitly define > >>> >> the dep) ? > >>> >> > >>> >> Basically, what do you propose for protobuf > >>> >> (unfortunately, I don't see an obvious) ? > >>> >> > >>> >> Regards > >>> >> JB > >>> >> > >>> >> > >>> >> On 12/11/2017 05:35 PM, Ismaël Mejía wrote: > >>> >> > >>> >> Hello, I wanted to bring back this > subject > >>> >> because > >>> >> I think we should > >>> >> take action on this and at least first > >>> >> have a > >>> >> shaded version of guava. > >>> >> I was playing with a toy project and I > did > >>> >> the > >>> >> procedure we use to > >>> >> submit jars to a Hadoop cluster via > >>> >> Flink/Spark > >>> >> which involves > >>> >> creating an uber jar and I realized that > >>> >> the size > >>> >> of the jar was way > >>> >> bigger than I expected, and the fact > that > >>> >> we shade > >>> >> guava in every > >>> >> module contributes to this. I found > guava > >>> >> shaded on: > >>> >> > >>> >> sdks/java/core > >>> >> runners/core-construction-java > >>> >> runners/core-java > >>> >> model/job-management > >>> >> runners/spark > >>> >> sdks/java/io/hadoop-file-system > >>> >> sdks/java/io/kafka > >>> >> > >>> >> This means at least 6 times more of the > >>> >> size it > >>> >> should which counts in > >>> >> around 15MB more (2.4MB*6 deps) of extra > >>> >> weight > >>> >> that we can simply > >>> >> reduce by using a shaded version of the > >>> >> library. > >>> >> > >>> >> I add this point to the previous ones > >>> >> mentioned > >>> >> during the discussion > >>> >> because this goes against the end user > >>> >> experience > >>> >> and affects us all > >>> >> (devs/users). > >>> >> > >>> >> Another question is if we should shade > >>> >> (and how) > >>> >> protocol buffers > >>> >> because now with the portability work we > >>> >> are > >>> >> exposing it closer to the > >>> >> end users. I say this because I also > found > >>> >> an > >>> >> issue while running a > >>> >> job on YARN with the spark runner > because > >>> >> hadoop-common includes > >>> >> protobuf-java 2 and I had to explicitly > >>> >> provide > >>> >> protocol-buffers 3 as > >>> >> a dependency to be able to use triggers > >>> >> (note the > >>> >> Spark runner > >>> >> translates them using some method from > >>> >> runners/core-java). Since > >>> >> hadoop-common is provided in the cluster > >>> >> with the > >>> >> older version of > >>> >> protobuf, I am afraid that this will > bite > >>> >> us in > >>> >> the future. > >>> >> > >>> >> Ismaël > >>> >> > >>> >> ps. There is already a JIRA for that > >>> >> shading for > >>> >> protobuf on > >>> >> hadoop-common but this is not coming > until > >>> >> version > >>> >> 3 is out. > >>> >> > >>> >> https://issues.apache.org/jira/browse/HADOOP-13136 > >>> >> > >>> >> <https://issues.apache.org/jira/browse/HADOOP-13136> > >>> >> > >>> >> ps2. Extra curious situation is to see > >>> >> that the > >>> >> dataflow-runner ends > >>> >> up having guava shaded twice via its > >>> >> shaded version on > >>> >> core-construction-java. > >>> >> > >>> >> ps3. Of course this message means a > >>> >> de-facto +1 at > >>> >> least to do it for > >>> >> guava and evaluate it for other libs. > >>> >> > >>> >> > >>> >> On Tue, Oct 17, 2017 at 7:29 PM, Lukasz > >>> >> Cwik > >>> >> <lc...@google.com.invalid > >>> >> <mailto:lc...@google.com.invalid>> > wrote: > >>> >> > >>> >> An issue to call out is how to deal > >>> >> with our > >>> >> generated code (.avro and > >>> >> .proto) as I don't believe those > >>> >> plugins allow > >>> >> you to generate code using a > >>> >> shaded package prefix on imports. > >>> >> > >>> >> On Tue, Oct 17, 2017 at 10:28 AM, > >>> >> Thomas Groh > >>> >> <tg...@google.com.invalid > >>> >> <mailto:tg...@google.com.invalid>> > >>> >> wrote: > >>> >> > >>> >> +1 to the goal. I'm hugely in > >>> >> favor of not > >>> >> doing the same shading work > >>> >> every time for dependencies we > >>> >> know we'll use. > >>> >> > >>> >> This also means that if we end > up > >>> >> pulling > >>> >> in transitive dependencies we > >>> >> don't want in any particular > >>> >> module we can > >>> >> avoid having to adjust our > >>> >> repackaging strategy for that > >>> >> module - > >>> >> which I have run into face-first > >>> >> in > >>> >> the past. > >>> >> > >>> >> On Tue, Oct 17, 2017 at 9:48 AM, > >>> >> Kenneth > >>> >> Knowles <k...@google.com.invalid > >>> >> <mailto:k...@google.com.invalid > >> > >>> >> wrote: > >>> >> > >>> >> Hi all, > >>> >> > >>> >> Shading is a big part of how > >>> >> we keep > >>> >> our dependencies sane in > Beam. > >>> >> But > >>> >> downsides: shading is super > >>> >> slow, > >>> >> causes massive jar bloat, > and > >>> >> kind of > >>> >> hard to get right because > >>> >> artifacts > >>> >> and namespaces are not > 1-to-1. > >>> >> > >>> >> I know that some communities > >>> >> distribute their own shaded > >>> >> distributions of > >>> >> dependencies. I had a > thought > >>> >> about > >>> >> doing something similar > that I > >>> >> wanted > >>> >> to throw out there for > people > >>> >> to poke > >>> >> holes in. > >>> >> > >>> >> To set the scene, here is > how > >>> >> I view > >>> >> shading: > >>> >> > >>> >> - A module has public > >>> >> dependencies > >>> >> and private dependencies. > >>> >> - Public deps are used for > >>> >> data > >>> >> interchange; users must > share > >>> >> these > >>> >> > >>> >> deps. > >>> >> > >>> >> - Private deps are just > >>> >> functionality and can be > >>> >> hidden (in > >>> >> our case, > >>> >> relocated + bundled) > >>> >> - It isn't necessarily > that > >>> >> simple, > >>> >> because public and private > >>> >> deps > >>> >> > >>> >> might > >>> >> > >>> >> interact in higher-order > ways > >>> >> ("public" is contagious) > >>> >> > >>> >> Shading is an implementation > >>> >> detail of > >>> >> expressing these > >>> >> characteristics. > >>> >> > >>> >> We > >>> >> > >>> >> use shading selectively > >>> >> because of its > >>> >> downsides I mentioned above. > >>> >> > >>> >> But what about this idea: > >>> >> Introduce > >>> >> shaded deps as a single > >>> >> separate > >>> >> artifact. > >>> >> > >>> >> - sdks/java/private-deps: > >>> >> bundled > >>> >> uber jar with relocated > >>> >> versions of > >>> >> everything we want to shade > >>> >> > >>> >> - sdks/java/core and > >>> >> sdks/java/harness: no > >>> >> relocation or > >>> >> bundling - > >>> >> depends on > >>> >> > `beam-sdks-java-private-deps` > >>> >> and > >>> >> imports like > >>> >> > >>> >> `org.apache.beam.sdk.private.com.google.common` > >>> >> directly (this is what > >>> >> they > >>> >> are rewritten to > >>> >> > >>> >> Some benefits > >>> >> > >>> >> - much faster builds of > >>> >> other modules > >>> >> - only one shaded uber jar > >>> >> - rare/no rebuilds of the > >>> >> uber jar > >>> >> - can use maven enforcer > to > >>> >> forbid > >>> >> imports like > com.google.common > >>> >> - configuration all in one > >>> >> place > >>> >> - no automated rewriting > of > >>> >> our real > >>> >> code, which has led to some > >>> >> major > >>> >> confusion > >>> >> - easy to implement > >>> >> incrementally > >>> >> > >>> >> Downsides: > >>> >> > >>> >> - plenty of effort work to > >>> >> get there > >>> >> - unclear how many > different > >>> >> such > >>> >> deps modules we need; > sharing > >>> >> them > >>> >> > >>> >> could > >>> >> > >>> >> get weird > >>> >> - if we hit a roadblock, > we > >>> >> will > >>> >> have committed a lot of time > >>> >> > >>> >> Just something I was musing > as > >>> >> I spent > >>> >> another evening waiting for > >>> >> slow > >>> >> builds to try to confirm > >>> >> changes to > >>> >> brittle poms. > >>> >> > >>> >> Kenn > >>> >> > >>> >> > >>> >> > >>> >> -- > >>> >> Jean-Baptiste Onofré > >>> >> jbono...@apache.org > >>> >> <mailto:jbono...@apache.org> > >>> >> http://blog.nanthrax.net > >>> >> <http://blog.nanthrax.net/> > >>> >> Talend - http://www.talend.com > >>> >> <http://www.talend.com/> > >>> >> > >>> >> > >>> >> > >>> >> > >>> > > >>> > > >>> > > >>> > >>> -- > >>> Jean-Baptiste Onofré > >>> jbono...@apache.org > >>> http://blog.nanthrax.net > >>> Talend - http://www.talend.com > >> > >> > > >