+1 overall.

I like Luke's idea of testing and publishing BOMs for common platform
configurations, including different versions of deps. It generalizes the
approach Ismaël describes for KafkaIO.

A special case is when libraries and services exist in only one place and
are entirely controlled by a vendor, for example the GCP BOM. Especially
when a vendor is infamous for making changes that are incompatible even
with other versions of their own libraries and services, the best path for
users is probably to use their BOM to get versions that are expected to
work together. It is sort of OK (but not great) to just do blind upgrades to
their latest recommendations. This does not apply to any product that is
hosted as OSS or is cross-vendor.
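
For example, in Gradle a user pulling GCP clients directly can import the
vendor's BOM and omit individual client versions (a minimal sketch; the BOM
version shown is just illustrative):

  dependencies {
    // The vendor BOM pins a set of GCP client versions that are expected
    // to work together.
    implementation platform('com.google.cloud:libraries-bom:13.4.0')
    // No explicit version here: it is resolved from the BOM.
    implementation 'com.google.cloud:google-cloud-storage'
  }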

Dropping support for a version of Hadoop or Kafka is similar to dropping
support for Python 2. It is a good point that it could use a VOTE. This
will make sure that people don't just miss the wiki page about sensitive
upgrades. I think it would be helpful to have an idea of what these key
deps are. I don't think all deps need this level of formality, and some
deps deserve even more extreme treatment, like the Flink runner. Hadoop is
potentially important enough that we would ship multiple artifacts for it
if that really became necessary.

Kenn

On Thu, Oct 22, 2020 at 8:34 AM Luke Cwik <[email protected]> wrote:

> Traditionally I have been pushing for as many deps as possible to use the
> same version across all the Beam modules (the purpose of the list of deps
> in BeamModulePlugin.groovy) to simplify dependency convergence.
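>
> As a rough, abridged sketch of that pattern (not the literal contents of
> the file), the idea is a single version list that all modules reference:
>
>   // One place defines the versions...
>   def guava_version = "25.1-jre"
>   def hadoop_version = "2.10.1"
>
>   // ...and modules refer to shared coordinates such as library.java.guava
>   // instead of declaring their own versions.
>   project.ext.library = [
>       java: [
>           guava        : "com.google.guava:guava:$guava_version",
>           hadoop_common: "org.apache.hadoop:hadoop-common:$hadoop_version",
>       ],
>   ]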
>
> One solution is to test and publish BOMs for the various common platform
> configurations.
> e.g.
> Beam + Spark + Hadoop2 + ...
> Beam + Dataflow + GCP + ...
>
> Then users can choose which BOM they want, and anything outside the set of
> BOMs that we supply is up to them to figure out how to support, which is
> what they are forced to do now.
>
> Using multiple BOMs allows us to work around restrictions that a certain
> platform choice imposes (e.g. if Spark is incompatible with Guava 29, then
> we publish a BOM that pins Guava to a version Spark is compatible with).
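>
> As a sketch of how that would look for users (the BOM artifact name here
> is hypothetical, not something we publish today):
>
>   dependencies {
>     // Hypothetical platform BOM pinning Beam + Spark + Hadoop2 versions
>     // that were tested together.
>     implementation platform('org.apache.beam:beam-bom-spark-hadoop2:2.25.0')
>     // Versions below would be resolved from the BOM.
>     implementation 'org.apache.beam:beam-sdks-java-core'
>     implementation 'org.apache.beam:beam-runners-spark'
>   }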
>
> Doing this requires us to:
> * test the BOMs for the configurations they are expected to work with
> (more Jenkins runs)
> * maintain multiple version lists
> * write code which is compatible across multiple versions of a dependency
> and not just a single point version (similar to the work done in KafkaIO).
>
> On Thu, Oct 22, 2020 at 7:25 AM Ismaël Mejía <[email protected]> wrote:
>
>> I have seen ongoing work on upgrading dependencies. This is a great task,
>> needed for the health of the project and its IO connectors, but I am a bit
>> worried about the impact of these upgrades on existing users. We should be
>> aware that we support old versions of the clients for valid reasons. If we
>> update the version of a client we should ensure that it still interacts
>> correctly with existing users and runtime systems. Basically we need two
>> conditions:
>>
>> 1. We cannot update dependencies without considering how they are
>> currently used.
>> 2. We must avoid upgrading to a non-stable or non-LTS dependency version.
>>
>> For (1), in a recent thread Piotr brought up some issues about updating
>> Hadoop dependencies to version 3. This surprised me because the whole Big
>> Data ecosystem is just catching up with Hadoop 3 (Flink does not even
>> release artifacts for it yet, and Spark started on version 3 only some
>> months ago), which means that most of our users still need us to guarantee
>> compatibility with Hadoop 2.x dependencies.
>>
>> The Hadoop dependencies are mostly 'provided', so a way to achieve this is
>> by creating new test configurations that guarantee backwards (or forwards)
>> compatibility by providing the respective versions. This is similar to
>> what we do currently in KafkaIO, which builds against version 1.0.0 by
>> default but tests compatibility with 2.1.0 by providing the right
>> dependencies too.
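>>
>> As a minimal Gradle sketch of that pattern (the configuration and task
>> names here are made up, and the versions are only illustrative):
>>
>>   configurations {
>>     // Same dependencies as the normal test runtime, but with the Hadoop
>>     // client forced to the version we want to check compatibility with.
>>     hadoop3TestClasspath {
>>       extendsFrom testRuntimeClasspath
>>       resolutionStrategy.force 'org.apache.hadoop:hadoop-common:3.2.1'
>>     }
>>   }
>>
>>   // Reuses the compiled tests and swaps the runtime classpath.
>>   task hadoop3CompatTest(type: Test) {
>>     testClassesDirs = sourceSets.test.output.classesDirs
>>     classpath = sourceSets.main.output + sourceSets.test.output +
>>         configurations.hadoop3TestClasspath
>>   }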
>>
>> The same thread also discusses upgrading to the latest version, 3.3.x, but
>> per (2) we should not consider upgrades to non-stable versions; the current
>> stable version of Hadoop is 3.2.1.  https://hadoop.apache.org/docs/stable/
>>
>> I also saw a recent upgrade of SolrIO to version 8, which may affect some
>> users of previous versions, with no discussion about it on the mailing
>> lists and no backwards compatibility guarantees.
>> https://github.com/apache/beam/pull/13027
>>
>> In the Solr case I think this update probably makes more sense, since Solr
>> 5.x is deprecated and fewer people would probably be impacted, but it
>> still would have been good to discuss this on user@.
>>
>> I don't know how we can find a good equilibrium between maintainers and
>> users deciding on those upgrades without adding much overhead. Should we
>> maybe have a VOTE for the most sensitive dependencies, or just assume this
>> is a criterion for the maintainers? I am afraid we may end up with
>> incompatible changes due to lack of awareness and for not much in return,
>> but at the same time I wonder if it makes sense to add the extra work of
>> discussion for minor dependencies where this matters less.
>>
>> Should we maybe document the sensitive dependency upgrades (the recent
>> thread on the Avro upgrade comes to mind too)? Or should we have the same
>> criteria for all? Other ideas?
>>
>
