Re: [Discuss] Beam SDK (Java) providing a shaded jar as a dependency

Dan Halperin Mon, 25 Jul 2016 12:09:46 -0700

A few reactions:

* Keeping tight control over your module's API surface is *critical* for
your users. For example, Hadoop added a public dependency on an old version
of Guava many years ago and it has really hurt the community ever since.
Google "hadoop guava pain" is pretty instructive, and this issue is now 3
years old without resolution:
https://issues.apache.org/jira/browse/HADOOP-10101


* Seems like Elastic wants to put all the hard work on their customer
rather than bother thinking about their API service. Assuming that the
user's POM is so simple that it probably only depends on your package (as
in their blog post) does your users a real disservice. For Beam, we should
set the bar very high.

* Kryo or other serialization libraries needing to be aware of full Guava
path names -- this seems like a bug in the assumptions being made by Kryo
or the user of Kryo. This is well solved by using the public API surface.
     -> Nothing in BoundedHeap is actually declared as a ReverseList ; so
Kryo should be able to just use the List serializer, shouldn't it?
     -> This is what ListCoder does -- we don't have an ArrayListCoder and
an ImmutableListCoder and a LinkedListCoder.
  If the user actually depended on a more concrete type than List, they
should include it in the declared types. Reflection is often harmful.

* Multiple artifacts -- strongly opposed. I think it would be very
confusing for our users if we had shaded and unshaded versions of Beam on
maven central.

Dan


On Mon, Jul 25, 2016 at 10:58 AM, Kenneth Knowles <[email protected]>
wrote:

> The way I see this issue is that shading is a tool that we are using to
> imperfectly implement two kinds of dependencies:
>
> 1. Private dependencies, which are implementation details.
> 2. Public dependencies, which are observable through the API surface.
>
> The SDK and user are required to share the same instance (not just version)
> of a public dependency (like joda-time). The SDK and user should not be
> able to know if they are using the same instance of a private dependency -
> not only should different versions be a non-issue, but even global data
> should not be shared. As Luke said, reflection can push things from private
> to public very easily.
>
> That is a bit of an idealized view. If the language & tool stack understood
> these concepts then shading would be a non-issue, debuggers would be happy,
> etc.
>
> With maven-shade-plugin, #1 is implemented via repackaging & bundling,
> while #2 is implemented by doing nothing. I don't find the ES rationale
> applicable here, but I'll be the first to admit that implementing these via
> shading has many drawbacks: redundant configuration that can be
> error-prone, breaks for split packages, corner cases around reflection,
> large jars, unit tests run pre-shading, ...
>
> So if I reframe what you are saying, I hear a proposal to go
> public-by-default and to only treat things as private dependencies if we
> really see a problem.
>
> I tend to favor private-by-default, to keep a tight grip on what our API
> surface really is, but I also have no problem just giving up when a library
> is not "shadeable". I could also be convinced that there is such an
> impedance mismatch between the goal and the tooling that public-by-default
> is a more practical route.
>
> Kenn
>
> On Mon, Jul 25, 2016 at 9:04 AM, Amit Sela <[email protected]> wrote:
>
> > I agree with you about the advantages of shading, but I also believe that
> > shading needs to be used only where necessary.
> >
> > Guava is a good example for shading, since it's popular and ends-up in
> your
> > classpath somehow.. more than once.. unless you shade.
> > I was wondering what's "com.google.thirdparty" and why it's shaded.
> >
> > From my experience, it's best to "push" the shading downstream as much as
> > possible, meaning that if a runner provides a shaded artifact, it will do
> > so without shaded dependencies (or as little as possible).
> > One idea could be to "tie" the versions of dependencies between runners
> and
> > the SDK - the runner ends up shading according to the engine it runs on.
> >
> > Having shaded and non-shaded artifacts could work as well, though this is
> > risky in the sense that the user might end-up having runtime collisions
> if
> > the SDK/runner/engine are not aligned. Though this is something that
> could
> > always eventually happen...
> >
> > On Mon, Jul 25, 2016 at 6:36 PM Lukasz Cwik <[email protected]>
> > wrote:
> >
> > > I believe that shading is a net win because many larger projects have
> > > hundreds of transitive dependencies and making sure that you can use a
> > > complex library like Beam with another complex library like Spark or
> > Hadoop
> > > quickly becomes untenable without something like shading due to version
> > > compatibility issues. I also believe that shading does simplify the
> > getting
> > > started experience for many users since we would only need to expose
> the
> > > dependencies which cross our API boundaries.
> > >
> > > It does come at the cost of dealing with libraries that don't honor API
> > > boundaries (e.g. reflection, serialization, code generation libraries)
> > and
> > > finding either effective workarounds or increasing the API surface of
> > what
> > > is not shaded. Which is all extra work for Beam maintainers.
> > >
> > > Its not impossible to have a large project work with another large
> > project
> > > but it is also equally difficult since we give up a lot of version
> > > compatibility freedom.
> > >
> > > This does not mean we can't have two artifacts, one shaded and one not
> > but
> > > if we were to have both, would this hurt portability between runners?
> > > If you have experience maintaining another project with or without
> > shading
> > > like tech, would love to hear it as well.
> > >
> > > On Mon, Jul 25, 2016 at 11:09 AM, Amit Sela <[email protected]>
> > wrote:
> > >
> > > > I wanted to raise the issue of the SDK providing a shaded jar for
> > > > dependency.
> > > >
> > > > AFAIK it's generally considered bad practice, though eventually it's
> > > > trade-offs. I can definitely understand why here
> > > > <
> > > >
> > >
> >
> https://github.com/apache/incubator-beam/blob/master/sdks/java/core/pom.xml#L196
> > > > >
> > > > we
> > > > shade Guava for example - so the user can use it's desired Guava
> > version
> > > -
> > > > but on the other hand, runners eventually have to
> serialize/deserialize
> > > the
> > > > SDK's classes sometimes.
> > > > For example: Using "Top.largest(10)" to get top 10 results, uses the
> > > SDK's
> > > > BoundedHeap, which is backed by a ReverseList which is currently not
> > > > supported by Kryo (I've submitted a PR), but even if you write-up a
> > > > ReverseListSerializer, in order to register the class you have to
> > > > explicitly state it's *repackaged* name... (I know you can use Coders
> > to
> > > > shuffle bytes around and so avoid Kryo serializing classes)
> > > >
> > > > I'm not saying that shading is completely wrong in some cases, but I
> > > would
> > > > like to know more about the considerations made, and let's not forget
> > > that
> > > > some runners (Spark for example) shade also... How risky is it for
> Beam
> > > to
> > > > provide such shaded artifacts ? What/How should we inform our users
> > about
> > > > it ? i.e., Elastic published this
> > > > <https://www.elastic.co/blog/to-shade-or-not-to-shade> about their
> > > choice
> > > > for (not) shading.
> > > >
> > > > Thanks,
> > > > Amit
> > > >
> > >
> >
>

Re: [Discuss] Beam SDK (Java) providing a shaded jar as a dependency

Reply via email to