I agree with you about the advantages of shading, but I also believe that
shading should be used only where necessary.

Guava is a good example of something worth shading: it's popular and
somehow ends up on your classpath more than once unless you shade.
I was wondering, though, what "com.google.thirdparty" is and why it's shaded.

From my experience, it's best to "push" the shading downstream as much as
possible, meaning that if a runner provides a shaded artifact, it should do
so without shaded dependencies (or with as few as possible).
One idea could be to "tie" the versions of dependencies between the runners
and the SDK, so that each runner ends up shading according to the engine it
runs on.

Having both shaded and non-shaded artifacts could work as well, though it is
risky in the sense that the user might end up with runtime collisions if
the SDK/runner/engine are not aligned. Then again, that is something that
could always happen eventually...
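
To make the repackaged-name problem from my original mail concrete, here is
a minimal sketch (plain Java, no Kryo or shade plugin involved) of how a
relocation rule changes the fully-qualified name a runner would have to
register. The "org.example.repackaged" prefix and the relocate() helper are
made up for illustration, the Lists$ReverseList class name is my assumption
about where ReverseList lives, and the prefix matching is a simplification
of what a real relocation does.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class RelocationSketch {
    // Hypothetical relocation rules: original package prefix -> shaded prefix.
    private static final Map<String, String> RELOCATIONS = new LinkedHashMap<>();
    static {
        RELOCATIONS.put("com.google.common", "org.example.repackaged.com.google.common");
    }

    // Returns the name a class would have after relocation, or the
    // unchanged name if no rule applies to it.
    static String relocate(String className) {
        for (Map.Entry<String, String> rule : RELOCATIONS.entrySet()) {
            String pattern = rule.getKey();
            if (className.equals(pattern) || className.startsWith(pattern + ".")) {
                return rule.getValue() + className.substring(pattern.length());
            }
        }
        return className;
    }

    public static void main(String[] args) {
        // The name user code refers to before shading...
        String original = "com.google.common.collect.Lists$ReverseList";
        // ...and the name that actually exists inside the shaded jar.
        System.out.println(original + " -> " + relocate(original));
        // Classes outside the relocated packages keep their names.
        System.out.println(relocate("java.util.ArrayList"));
    }
}
```

Anything that looks classes up by name (serializer registration, reflection)
has to use the second form once the SDK jar is shaded, which is why the
repackaged name leaks into runner code.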

On Mon, Jul 25, 2016 at 6:36 PM Lukasz Cwik <[email protected]>
wrote:

> I believe that shading is a net win because many larger projects have
> hundreds of transitive dependencies and making sure that you can use a
> complex library like Beam with another complex library like Spark or Hadoop
> quickly becomes untenable without something like shading due to version
> compatibility issues. I also believe that shading does simplify the getting
> started experience for many users since we would only need to expose the
> dependencies which cross our API boundaries.
>
> It does come at the cost of dealing with libraries that don't honor API
> boundaries (e.g. reflection, serialization, code generation libraries) and
> finding either effective workarounds or increasing the API surface of what
> is not shaded. Which is all extra work for Beam maintainers.
>
> It's not impossible to have a large project work with another large project
> but it is also equally difficult since we give up a lot of version
> compatibility freedom.
>
> This does not mean we can't have two artifacts, one shaded and one not, but
> if we were to have both, would this hurt portability between runners?
> If you have experience maintaining another project with or without
> shading-like tech, would love to hear it as well.
>
> On Mon, Jul 25, 2016 at 11:09 AM, Amit Sela <[email protected]> wrote:
>
> > I wanted to raise the issue of the SDK providing a shaded jar for
> > dependency.
> >
> > AFAIK it's generally considered bad practice, though in the end it's a
> > matter of trade-offs. I can definitely understand why here
> > <https://github.com/apache/incubator-beam/blob/master/sdks/java/core/pom.xml#L196>
> > we shade Guava, for example, so that users can use their desired Guava
> > version, but on the other hand, runners sometimes have to
> > serialize/deserialize the SDK's classes.
> > For example: using "Top.largest(10)" to get the top 10 results uses the
> > SDK's BoundedHeap, which is backed by a ReverseList, which is currently
> > not supported by Kryo (I've submitted a PR). But even if you write a
> > ReverseListSerializer, in order to register the class you have to
> > explicitly state its *repackaged* name... (I know you can use Coders to
> > shuffle bytes around and so avoid Kryo serializing classes.)
> >
> > I'm not saying that shading is completely wrong in some cases, but I
> > would like to know more about the considerations made, and let's not
> > forget that some runners (Spark, for example) shade as well... How risky
> > is it for Beam to provide such shaded artifacts? What should we tell our
> > users about it, and how? E.g., Elastic published this
> > <https://www.elastic.co/blog/to-shade-or-not-to-shade> about their choice
> > to (not) shade.
> >
> > Thanks,
> > Amit
> >
>
