Thanks all for the comments and suggestions. We want to close this thread and start implementing the new policy based on the discussion:
1. Stop assigning JIRAs to the first person listed in the dependency owners files <https://github.com/apache/beam/tree/master/ownership>. Instead, cc the people on the owner list.

2. We will be creating JIRAs for upgrading individual dependencies, not for upgrading to specific versions of those dependencies. For example, if a given dependency X is three minor versions or a year behind, we will create a JIRA for upgrading it, but the specific version to upgrade to has to be determined by the Beam community. The community might choose to close a JIRA if there are known issues with the available recent releases. The tool will reopen such a closed JIRA to inform owners if Beam reaches the JIRA's 'fix version' or if three new versions of the dependency have been released since the JIRA was closed.

Thank you.

Regards,
Yifan

On Wed, Sep 5, 2018 at 2:14 PM Yifan Zou <yifan...@google.com> wrote:

> +1 on the jira "fix version".
> The release frequency of dependencies varies, so using information such as versions from the JIRA closing date to reopen the issues might not be appropriate. We could check the fix version first and, if specified, reopen the issue in that version's release cycle; if not, follow Cham's proposal (2).
>
> On Wed, Sep 5, 2018 at 1:59 PM Chamikara Jayalath <chamik...@google.com> wrote:
>
>> On Wed, Sep 5, 2018 at 12:50 PM Tim Robertson <timrobertson...@gmail.com> wrote:
>>
>>> Thank you Cham, and everyone for contributing.
>>>
>>> Sorry for the slow reply to a thread I started, but I've been swamped on non-Beam projects.
>>>
>>>> KafkaIO's policy of 'let the user decide exact version at runtime' has been quite useful so far. How feasible is that for other connectors?
>>>
>>> I presume shimming might be needed in a few places but it's certainly something we might want to explore more. I'll look into KafkaIO.
>>>
>>> On Cham's proposal:
>>>
>>> (1) +0.5.
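Stepping back to point 2 of the policy announced at the top of this thread, the reopen rule could be sketched roughly as follows. This is a hedged model only: the function name, parameters, and data shapes are hypothetical and need not match the actual Beam dependency tool.

```python
def should_reopen(fix_version, current_beam_release, new_releases_since_close):
    """Sketch of the reopen rule: a closed upgrade JIRA is reopened when the
    Beam release cycle reaches the JIRA's 'fix version', or when three or
    more new releases of the dependency have appeared since the close.
    All names here are assumptions, not the real tool's API."""
    if fix_version is not None and current_beam_release == fix_version:
        return True
    return len(new_releases_since_close) >= 3
```

For example, a JIRA closed with fix version '2.9.0' would be reopened once the '2.9.0' Beam release cycle starts, and a JIRA with no fix version would be reopened only after three new dependency releases.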
>>> We can always then opt to either assign or take ownership of an issue, although I am also happy to stick with the owners model - it prompted me to investigate and resulted in this thread.
>>>
>>> (2) I think this makes sense.
>>> A bot informing us that we're falling behind versions is immensely useful, as long as we can link issues to others which might have a wider discussion (remember many dependencies need to be treated together, such as "Support Hadoop 3.0.x" or "Support HBase 2.x"). Would it make sense to let owners use the JIRA "fix versions" field to put in a future release to inform the bot when it should start alerting again?
>>
>> I think this makes sense. Setting a "fix version" will be especially useful for dependency changes that result in API changes that have to be postponed till the next major version of Beam.
>>
>> On grouping, I believe we already group JIRAs into tasks and sub-tasks based on group ids of dependencies. I suppose it will not be too hard to close multiple sub-tasks with the same reasoning.
>>
>>> On Wed, Sep 5, 2018 at 3:18 AM Yifan Zou <yifan...@google.com> wrote:
>>>
>>>> Thanks Cham for putting this together. Also, after modifying the dependency tool based on the policy above, we will close all existing JIRA issues to prevent creating duplicate bugs and stop pushing assignees to upgrade dependencies with old bugs.
>>>>
>>>> Please let us know if you have any comments on the revised policy in Cham's email.
>>>>
>>>> Thanks all.
>>>>
>>>> Regards,
>>>> Yifan Zou
>>>>
>>>> On Tue, Sep 4, 2018 at 5:35 PM Chamikara Jayalath <chamik...@google.com> wrote:
>>>>
>>>>> Based on this email thread and offline feedback from several folks, the current concerns regarding dependency upgrade policy and tooling seem to be the following:
>>>>>
>>>>> (1) We have to be careful when upgrading dependencies.
>>>>> For example, we should not create JIRAs for upgrading to dependency versions that have known issues.
>>>>>
>>>>> (2) The dependency owners list can get stale. Somebody who is interested in upgrading a dependency today might not be interested in the same task in six months. The responsibility of upgrading a dependency should lie with the community instead of pre-identified owner(s).
>>>>>
>>>>> On the other hand, we do not want Beam to fall significantly behind when it comes to dependencies. We should upgrade dependencies whenever it makes sense. This allows us to offer a more up-to-date system and to make things easy for users that deploy Beam along with other systems.
>>>>>
>>>>> I discussed these issues with Yifan and we would like to suggest the following changes to the current policy and tooling that might help alleviate some of the concerns.
>>>>>
>>>>> (1) Instead of a dependency "owners" list, we will maintain an "interested parties" list. When we create a JIRA for a dependency, we will not assign it to an owner; rather, we will CC all the folks who mentioned that they would be interested in receiving updates related to that dependency. The hope is that some of the interested parties will also put forward the effort to upgrade the dependencies they are interested in, but the responsibility of upgrading dependencies lies with the community as a whole.
>>>>>
>>>>> (2) We will be creating JIRAs for upgrading individual dependencies, not for upgrading to specific versions of those dependencies. For example, if a given dependency X is three minor versions or a year behind, we will create a JIRA for upgrading it, but the specific version to upgrade to has to be determined by the Beam community. The community might choose to close a JIRA if there are known issues with the available recent releases.
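The "three minor versions or a year behind" trigger Cham describes just above could be sketched like this. It is a hedged model under stated assumptions: versions are represented as (major, minor, patch) tuples, and the function name and exact thresholds are illustrative rather than the real tool's logic.

```python
from datetime import date

def needs_upgrade_jira(pinned, latest, pinned_release_date, today):
    """Sketch of the staleness trigger: file an upgrade JIRA when the latest
    release is at least three minor versions ahead of Beam's pin, or the
    pinned release is more than a year old. Assumed model, not the actual
    Beam dependency tool."""
    if latest[0] > pinned[0]:
        minor_gap = 3  # in this sketch, a new major version always qualifies
    else:
        minor_gap = latest[1] - pinned[1]
    a_year_behind = (today - pinned_release_date).days > 365
    return minor_gap >= 3 or a_year_behind
```

So a pin of 1.0.0 against a latest release of 1.3.0 would be flagged, while 1.0.0 against 1.1.0 released three months ago would not.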
>>>>> The tool may reopen such a closed JIRA in the future if new information becomes available (for example, three new versions have been released since the JIRA was closed).
>>>>>
>>>>> Thoughts?
>>>>>
>>>>> Thanks,
>>>>> Cham
>>>>>
>>>>> On Tue, Aug 28, 2018 at 1:51 PM Chamikara Jayalath <chamik...@google.com> wrote:
>>>>>
>>>>>> On Tue, Aug 28, 2018 at 12:05 PM Thomas Weise <t...@apache.org> wrote:
>>>>>>
>>>>>>> I think there is an invalid assumption being made in this discussion, which is that most projects comply with semantic versioning. The reality in the open source big data space is unfortunately quite different. Ismaël has characterized the situation well, and HBase isn't an exception. Another indicator of the scale of the problem is the extensive amount of shading used in Beam and other projects. It wouldn't be necessary if semver compliance were something we could rely on.
>>>>>>>
>>>>>>> Our recent Flink upgrade broke user(s). And we noticed a backward-incompatible Flink change that affected the portable Flink runner even between patch releases.
>>>>>>>
>>>>>>> Many projects (including Beam) guarantee compatibility only for a subset of the public API. Sometimes a REST API is not covered, sometimes not-strictly-internal protocols change, and so on, all of which can break users despite the public API remaining "compatible". As much as I would love to rely on the version number to tell me whether an upgrade is safe or not, that's not practically possible.
>>>>>>>
>>>>>>> Furthermore, we need to proceed with caution when forcing upgrades on users that host the target systems.
>>>>>>> To stay with the Flink example, moving Beam from 1.4 to 1.5 is actually a major change to some, because they now have to upgrade their Flink clusters/deployments to be able to use the new version of Beam.
>>>>>>>
>>>>>>> Upgrades need to be done with caution and may require extensive verification beyond what our automation provides. I think the Spark change from 1.x to 2.x and also the JDK 1.8 change were good examples; they provided the community a window to give feedback and influence the change.
>>>>>>
>>>>>> Thanks for the clarification.
>>>>>>
>>>>>> The current policy indeed requests caution and explicit checks when upgrading all dependencies (including minor and patch versions), but the language might have to be updated to emphasize your concerns.
>>>>>>
>>>>>> Here's the current text:
>>>>>>
>>>>>> "Beam releases adhere to <https://beam.apache.org/get-started/downloads/> semantic versioning. Hence, community members should take care when updating dependencies. Minor version updates to dependencies should be backwards compatible in most cases. Some updates to dependencies though may result in backwards incompatible API or functionality changes to Beam. PR reviewers and committers should take care to detect any dependency updates that could potentially introduce backwards incompatible changes to Beam before merging, and PRs that update dependencies should include a statement regarding this verification in the form of a PR comment. Dependency updates that result in backwards incompatible changes to non-experimental features of Beam should be held till the next major version release of Beam.
>>>>>> Any exceptions to this policy should only occur in extreme cases (for example, due to a security vulnerability of an existing dependency that is only fixed in a subsequent major version) and should be discussed in the Beam dev list. Note that backwards incompatible changes to experimental features may be introduced in a minor version release."
>>>>>>
>>>>>> Also, are there any other steps we can take to make sure that Beam dependencies are not too old while offering a stable system? Note that having a lot of legacy dependencies that do not get upgraded regularly can also result in user pain, and in Beam being unusable for certain users who run into dependency conflicts when using Beam along with other systems (which will increase the amount of shading/vendoring we have to do).
>>>>>>
>>>>>> Please note that the current tooling does not force upgrades or automatically upgrade dependencies. It simply creates JIRAs that can be closed with a reason if needed. For the Python SDK, though, we have version ranges in place for most dependencies [1], so these dependencies get updated automatically according to the corresponding ranges.
>>>>>>
>>>>>> [1] https://github.com/apache/beam/blob/master/sdks/python/setup.py#L103
>>>>>>
>>>>>> Thanks,
>>>>>> Cham
>>>>>>
>>>>>>> Thanks,
>>>>>>> Thomas
>>>>>>>
>>>>>>> On Tue, Aug 28, 2018 at 11:29 AM Raghu Angadi <rang...@google.com> wrote:
>>>>>>>
>>>>>>>> Thanks for the IO versioning summary.
>>>>>>>> KafkaIO's policy of 'let the user decide exact version at runtime' has been quite useful so far. How feasible is that for other connectors?
>>>>>>>>
>>>>>>>> Also, KafkaIO does not limit itself to the minimum features available across all the supported versions. Some of the features (e.g. server-side timestamps) are disabled based on the runtime Kafka version.
>>>>>>>> The unit tests currently run with a single recent version. Integration tests could certainly use multiple versions. With some more effort in writing tests, we could run multiple versions of the unit tests as well.
>>>>>>>>
>>>>>>>> Raghu.
>>>>>>>>
>>>>>>>>> IO versioning
>>>>>>>>> * Elasticsearch. We delayed the move to version 6 until we heard of more active users needing it (more deployments). We support 2.x and 5.x (but 2.x recently went EOL). Support for 6.x is in progress.
>>>>>>>>> * SolrIO: the stable version is 7.x, the LTS is 6.x. We support only 5.x because most big data distributions still use 5.x (however, 5.x has been EOL).
>>>>>>>>> * KafkaIO uses version 1.x, but Kafka recently moved to 2.x; however, most deployments of Kafka use versions earlier than 1.x. This module uses a single version with the kafka client as a provided dependency, and so far it works (but we don't have multi-version tests).
>>>>>>>>
>>>>>>>> On Tue, Aug 28, 2018 at 8:38 AM Ismaël Mejía <ieme...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I think we should refine the strategy on dependencies discussed recently. Sorry to come late with this (I did not follow the previous discussion closely), but the current approach is clearly not in line with the industry reality (at least not for IO connectors + Hadoop + Spark/Flink use).
>>>>>>>>>
>>>>>>>>> A really proactive approach to dependency updates is a good practice for the core dependencies we have, e.g. Guava, Bytebuddy, Avro, Protobuf, etc., and of course for the case of cloud-based IOs, e.g. GCS, BigQuery, AWS S3, etc.
>>>>>>>>> However, when we talk about self-hosted data sources or processing systems, this gets more complicated, and I think we should be more flexible and handle these case by case (and remove them from the auto-update email reminder).
>>>>>>>>>
>>>>>>>>> Some open source projects have at least three maintained versions:
>>>>>>>>> - LTS – maps to what most people have installed (or what the big data distributions use), e.g. HBase 1.1.x, Hadoop 2.6.x
>>>>>>>>> - Stable – the current recommended version, e.g. HBase 1.4.x, Hadoop 2.8.x
>>>>>>>>> - Next – the latest release, e.g. HBase 2.1.x, Hadoop 3.1.x
>>>>>>>>>
>>>>>>>>> Following the most recent versions can be good for staying close to the current development of other projects and some of the fixes, but these versions are commonly not deployed for most users, and adopting an LTS-only or stable-only approach won't satisfy all cases either. To understand why this is complex, let's look at some historical issues:
>>>>>>>>>
>>>>>>>>> IO versioning
>>>>>>>>> * Elasticsearch. We delayed the move to version 6 until we heard of more active users needing it (more deployments). We support 2.x and 5.x (but 2.x recently went EOL). Support for 6.x is in progress.
>>>>>>>>> * SolrIO: the stable version is 7.x, the LTS is 6.x. We support only 5.x because most big data distributions still use 5.x (however, 5.x has been EOL).
>>>>>>>>> * KafkaIO uses version 1.x, but Kafka recently moved to 2.x; however, most deployments of Kafka use versions earlier than 1.x. This module uses a single version with the kafka client as a provided dependency, and so far it works (but we don't have multi-version tests).
>>>>>>>>> Runners versioning
>>>>>>>>> * The move from Spark 1 to Spark 2 was decided after evaluating the tradeoffs between maintaining support for multiple versions and introducing breaking changes. This is a rare case, but also one with consequences. This dependency is provided, but we don't actively test issues on version migration.
>>>>>>>>> * Flink moved to version 1.5, introducing an incompatibility in checkpointing (discussed recently, with no consensus yet on how to handle it).
>>>>>>>>>
>>>>>>>>> As you can see, it seems really hard to have a solution that fits all cases. Probably the only rule that I see from this list is that we should upgrade versions for connectors that have been deprecated or reached EOL (e.g. Solr 5.x, Elasticsearch 2.x).
>>>>>>>>>
>>>>>>>>> For the case of the provided dependencies, I wonder if as part of the tests we should provide tests with multiple versions (note that this is currently blocked by BEAM-4087).
>>>>>>>>>
>>>>>>>>> Any other ideas or opinions on how we can handle this? What do other people in the community think? (Notice that this can have a relation to the ongoing LTS discussion.)
>>>>>>>>>
>>>>>>>>> On Tue, Aug 28, 2018 at 10:44 AM Tim Robertson <timrobertson...@gmail.com> wrote:
>>>>>>>>> >
>>>>>>>>> > Hi folks,
>>>>>>>>> >
>>>>>>>>> > I'd like to revisit the discussion around our versioning policy, specifically for the Hadoop ecosystem, and make sure we are aware of the implications.
>>>>>>>>> >
>>>>>>>>> > As an example, our policy today would have us on HBase 2.1, and I have reminders to address this.
>>>>>>>>> > However, currently the versions of HBase in the major Hadoop distros are:
>>>>>>>>> >
>>>>>>>>> > - Cloudera 5 on HBase 1.2 (Cloudera 6 is 2.1 but is only in beta)
>>>>>>>>> > - Hortonworks HDP3 on HBase 2.0 (only recently released, so we can assume it is not widely adopted)
>>>>>>>>> > - AWS EMR HBase on 1.4
>>>>>>>>> >
>>>>>>>>> > On the versioning, I think we might need a more nuanced approach to ensure that we target real communities of existing and potential users. Enterprise users need to stick to the supported versions in the distributions to maintain support contracts from the vendors.
>>>>>>>>> >
>>>>>>>>> > Should our versioning policy have more room to consider things on a case-by-case basis?
>>>>>>>>> >
>>>>>>>>> > For Hadoop, might we benefit from a strategy on which community of users Beam is targeting?
>>>>>>>>> >
>>>>>>>>> > (OT: I'm collecting some thoughts on what we might consider to target enterprise Hadoop users - kerberos on all relevant IO, performance, leaking beyond encryption zones with temporary files, etc.)
>>>>>>>>> >
>>>>>>>>> > Thanks,
>>>>>>>>> > Tim
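The version-range approach Cham mentions above for the Python SDK's setup.py can be sketched briefly. The package names and bounds below are illustrative only, not Beam's actual pins: the point is the pattern of a tested lower bound plus an upper bound below the next potentially-breaking release, so compatible updates reach users without a new Beam release.

```python
# Illustrative sketch of version-range pins in a setup.py, modeled on the
# pattern referenced for sdks/python/setup.py. These specific packages and
# bounds are made up for illustration.
REQUIRED_PACKAGES = [
    'httplib2>=0.8,<0.12',   # lower bound tested, upper bound before breakage
    'protobuf>=3.5.0,<4',    # allow all 3.x releases from 3.5.0 on
    'pyyaml>=3.12,<4',
]

# setuptools would consume this roughly as:
#   setuptools.setup(name='example-sdk', install_requires=REQUIRED_PACKAGES)
```

With ranges like these, a new compatible patch release of a dependency is picked up automatically at install time, which is exactly why the tooling does not need to force upgrades for these dependencies.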