I agree with you about the advantages of shading, but I also believe that shading needs to be used only where necessary.
Guava is a good example for shading, since it's popular and ends up on your classpath somehow, more than once, unless you shade. I was wondering what "com.google.thirdparty" is and why it's shaded.

From my experience, it's best to "push" the shading downstream as much as possible, meaning that if a runner provides a shaded artifact, it will do so without shaded dependencies (or with as few as possible). One idea could be to "tie" the versions of dependencies between runners and the SDK, so that the runner ends up shading according to the engine it runs on.

Having shaded and non-shaded artifacts could work as well, though this is risky in the sense that the user might end up with runtime collisions if the SDK/runner/engine are not aligned. Though this is something that could always eventually happen...

On Mon, Jul 25, 2016 at 6:36 PM Lukasz Cwik <[email protected]> wrote:

> I believe that shading is a net win because many larger projects have
> hundreds of transitive dependencies and making sure that you can use a
> complex library like Beam with another complex library like Spark or Hadoop
> quickly becomes untenable without something like shading due to version
> compatibility issues. I also believe that shading does simplify the getting
> started experience for many users since we would only need to expose the
> dependencies which cross our API boundaries.
>
> It does come at the cost of dealing with libraries that don't honor API
> boundaries (e.g. reflection, serialization, code generation libraries) and
> finding either effective workarounds or increasing the API surface of what
> is not shaded. Which is all extra work for Beam maintainers.
>
> It's not impossible to have a large project work with another large project
> but it is also equally difficult since we give up a lot of version
> compatibility freedom.
>
> This does not mean we can't have two artifacts, one shaded and one not but
> if we were to have both, would this hurt portability between runners?
>
> If you have experience maintaining another project with or without shading
> like tech, would love to hear it as well.
>
> On Mon, Jul 25, 2016 at 11:09 AM, Amit Sela <[email protected]> wrote:
>
> > I wanted to raise the issue of the SDK providing a shaded jar for
> > dependency.
> >
> > AFAIK it's generally considered bad practice, though eventually it's
> > trade-offs. I can definitely understand why here
> > <https://github.com/apache/incubator-beam/blob/master/sdks/java/core/pom.xml#L196>
> > we shade Guava for example - so the user can use their desired Guava
> > version - but on the other hand, runners eventually have to
> > serialize/deserialize the SDK's classes sometimes.
> > For example: Using "Top.largest(10)" to get top 10 results, uses the SDK's
> > BoundedHeap, which is backed by a ReverseList which is currently not
> > supported by Kryo (I've submitted a PR), but even if you write-up a
> > ReverseListSerializer, in order to register the class you have to
> > explicitly state its *repackaged* name... (I know you can use Coders to
> > shuffle bytes around and so avoid Kryo serializing classes)
> >
> > I'm not saying that shading is completely wrong in some cases, but I would
> > like to know more about the considerations made, and let's not forget that
> > some runners (Spark for example) shade also... How risky is it for Beam to
> > provide such shaded artifacts? What/How should we inform our users about
> > it? i.e., Elastic published this
> > <https://www.elastic.co/blog/to-shade-or-not-to-shade> about their choice
> > for (not) shading.
> >
> > Thanks,
> > Amit
