Re: [Discuss] Beam SDK (Java) providing a shaded jar as a dependency

Kenneth Knowles Mon, 25 Jul 2016 10:58:40 -0700

The way I see this issue is that shading is a tool that we are using to
imperfectly implement two kinds of dependencies:


1. Private dependencies, which are implementation details.
2. Public dependencies, which are observable through the API surface.

The SDK and user are required to share the same instance (not just version)
of a public dependency (like joda-time). The SDK and user should not be
able to know if they are using the same instance of a private dependency -
not only should different versions be a non-issue, but even global data
should not be shared. As Luke said, reflection can push things from private
to public very easily.

That is a bit of an idealized view. If the language & tool stack understood
these concepts then shading would be a non-issue, debuggers would be happy,
etc.

With maven-shade-plugin, #1 is implemented via repackaging & bundling,
while #2 is implemented by doing nothing. I don't find the Elasticsearch
rationale applicable here, but I'll be the first to admit that implementing
these via shading has many drawbacks: redundant configuration that can be
error-prone, breaks for split packages, corner cases around reflection,
large jars, unit tests run pre-shading, ...
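
To make the two cases concrete, here is a minimal sketch of what repackaging
does to class names. It is a toy model for illustration, not how
maven-shade-plugin is implemented; the class RelocationSketch, its methods,
and the prefix "org.example.sdk.repackaged" are all hypothetical names chosen
for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of package relocation; NOT the maven-shade-plugin implementation. */
public class RelocationSketch {
    // Hypothetical prefix, chosen for illustration only.
    static final String SHADED_PREFIX = "org.example.sdk.repackaged.";

    // Original package -> relocated package, for private dependencies only.
    private final Map<String, String> relocations = new LinkedHashMap<>();

    /** Marks a package as a private dependency that gets repackaged. */
    public RelocationSketch relocatePrivate(String pkg) {
        relocations.put(pkg, SHADED_PREFIX + pkg);
        return this;
    }

    /** Returns the class name as it would appear inside the shaded jar. */
    public String shadedName(String className) {
        for (Map.Entry<String, String> e : relocations.entrySet()) {
            String pkg = e.getKey();
            if (className.startsWith(pkg + ".")) {
                // Private dependency: the package prefix is rewritten.
                return e.getValue() + className.substring(pkg.length());
            }
        }
        // Public dependency: "doing nothing", the name is shared with the user.
        return className;
    }

    public static void main(String[] args) {
        RelocationSketch shade =
            new RelocationSketch().relocatePrivate("com.google.common");
        System.out.println(shade.shadedName("com.google.common.collect.Lists"));
        System.out.println(shade.shadedName("org.joda.time.Instant"));
    }
}
```

In this model the Guava class gets a new name while the joda-time class keeps
its own, which is also why anything that refers to a private class by name
(such as the Kryo registration Amit mentions) has to use the repackaged name.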

So if I reframe what you are saying, I hear a proposal to go
public-by-default and to only treat things as private dependencies if we
really see a problem.

I tend to favor private-by-default, to keep a tight grip on what our API
surface really is, but I also have no problem just giving up when a library
is not "shadeable". I could also be convinced that there is such an
impedance mismatch between the goal and the tooling that public-by-default
is a more practical route.

Kenn

On Mon, Jul 25, 2016 at 9:04 AM, Amit Sela <[email protected]> wrote:

> I agree with you about the advantages of shading, but I also believe that
> shading needs to be used only where necessary.
>
> Guava is a good example for shading, since it's popular and somehow ends up
> on your classpath more than once unless you shade.
> I was wondering what's "com.google.thirdparty" and why it's shaded.
>
> From my experience, it's best to "push" the shading downstream as much as
> possible, meaning that if a runner provides a shaded artifact, it will do
> so without shaded dependencies (or with as few as possible).
> One idea could be to "tie" the versions of dependencies between runners and
> the SDK - the runner ends up shading according to the engine it runs on.
>
> Having shaded and non-shaded artifacts could work as well, though this is
> risky in the sense that the user might end up having runtime collisions if
> the SDK/runner/engine are not aligned. Though this is something that could
> always eventually happen...
>
> On Mon, Jul 25, 2016 at 6:36 PM Lukasz Cwik <[email protected]>
> wrote:
>
> > I believe that shading is a net win because many larger projects have
> > hundreds of transitive dependencies and making sure that you can use a
> > complex library like Beam with another complex library like Spark or
> > Hadoop quickly becomes untenable without something like shading due to
> > version compatibility issues. I also believe that shading does simplify
> > the getting started experience for many users since we would only need to
> > expose the dependencies which cross our API boundaries.
> >
> > It does come at the cost of dealing with libraries that don't honor API
> > boundaries (e.g. reflection, serialization, code generation libraries)
> > and finding either effective workarounds or increasing the API surface of
> > what is not shaded, which is all extra work for Beam maintainers.
> >
> > It's not impossible to have a large project work with another large
> > project, but it is also equally difficult since we give up a lot of
> > version compatibility freedom.
> >
> > This does not mean we can't have two artifacts, one shaded and one not,
> > but if we were to have both, would this hurt portability between runners?
> > If you have experience maintaining another project with or without
> > shading-like tech, I would love to hear it as well.
> >
> > On Mon, Jul 25, 2016 at 11:09 AM, Amit Sela <[email protected]> wrote:
> >
> > > I wanted to raise the issue of the SDK providing a shaded jar as a
> > > dependency.
> > >
> > > AFAIK it's generally considered bad practice, though ultimately it comes
> > > down to trade-offs. I can definitely understand why here
> > > <https://github.com/apache/incubator-beam/blob/master/sdks/java/core/pom.xml#L196>
> > > we shade Guava, for example, so that users can use their desired Guava
> > > version. On the other hand, runners sometimes have to
> > > serialize/deserialize the SDK's classes.
> > > For example: using "Top.largest(10)" to get the top 10 results uses the
> > > SDK's BoundedHeap, which is backed by a ReverseList, which is currently
> > > not supported by Kryo (I've submitted a PR). But even if you write up a
> > > ReverseListSerializer, in order to register the class you have to
> > > explicitly state its *repackaged* name... (I know you can use Coders to
> > > shuffle bytes around and so avoid Kryo serializing classes.)
> > >
> > > I'm not saying that shading is completely wrong in some cases, but I
> > > would like to know more about the considerations made, and let's not
> > > forget that some runners (Spark, for example) shade as well... How risky
> > > is it for Beam to provide such shaded artifacts? What should we tell our
> > > users about it, and how? For example, Elastic published this
> > > <https://www.elastic.co/blog/to-shade-or-not-to-shade> about their choice
> > > for (not) shading.
> > >
> > > Thanks,
> > > Amit
> > >
> >
>
