On Wed, Sep 5, 2018 at 12:50 PM Tim Robertson <timrobertson...@gmail.com> wrote:
> Thank you Cham, and everyone for contributing
>
> Sorry for the slow reply to a thread I started, but I've been swamped on non-Beam projects.
>
>> KafkaIO's policy of 'let the user decide exact version at runtime' has been quite useful so far. How feasible is that for other connectors?
>
> I presume shimming might be needed in a few places but it's certainly something we might want to explore more. I'll look into KafkaIO.
>
> On Cham's proposal:
>
> (1) +0.5. We can always then opt to either assign or take ownership of an issue, although I am also happy to stick with the owners model - it prompted me to investigate and resulted in this thread.
>
> (2) I think this makes sense. A bot informing us that we're falling behind versions is immensely useful as long as we can link issues to others which might have a wider discussion (remember many dependencies need to be treated together, such as "Support Hadoop 3.0.x" or "Support HBase 2.x"). Would it make sense to let owners use the Jira "fix versions" field to put in a future release to inform the bot when it should start alerting again?

I think this makes sense. Setting a "fix version" will be especially useful for dependency changes that result in API changes that have to be postponed until the next major version of Beam. On grouping, I believe we already group JIRAs into tasks and sub-tasks based on the group ids of dependencies. I suppose it will not be too hard to close multiple sub-tasks with the same reasoning.

> On Wed, Sep 5, 2018 at 3:18 AM Yifan Zou <yifan...@google.com> wrote:
>
>> Thanks Cham for putting this together. Also, after modifying the dependency tool based on the policy above, we will close all existing JIRA issues, which prevents creating duplicate bugs and stops pushing assignees to upgrade dependencies via old bugs.
>>
>> Please let us know if you have any comments on the revised policy in Cham's email.
>>
>> Thanks all.
>>
>> Regards,
>> Yifan Zou
>>
>> On Tue, Sep 4, 2018 at 5:35 PM Chamikara Jayalath <chamik...@google.com> wrote:
>>
>>> Based on this email thread and offline feedback from several folks, the current concerns regarding dependency upgrade policy and tooling seem to be the following.
>>>
>>> (1) We have to be careful when upgrading dependencies. For example, we should not create JIRAs for upgrading to dependency versions that have known issues.
>>>
>>> (2) The dependency owners list can get stale. Somebody who is interested in upgrading a dependency today might not be interested in the same task in six months. Responsibility for upgrading a dependency should lie with the community instead of pre-identified owner(s).
>>>
>>> On the other hand, we do not want Beam to significantly fall behind when it comes to dependencies. We should upgrade dependencies whenever it makes sense. This allows us to offer a more up-to-date system and to make things easy for users that deploy Beam along with other systems.
>>>
>>> I discussed these issues with Yifan and we would like to suggest the following changes to the current policy and tooling that might help alleviate some of the concerns.
>>>
>>> (1) Instead of a dependency "owners" list we will be maintaining an "interested parties" list. When we create a JIRA for a dependency we will not assign it to an owner, but rather CC all the folks that mentioned they would be interested in receiving updates related to that dependency. The hope is that some of the interested parties will also put forward the effort to upgrade dependencies they are interested in, but the responsibility of upgrading dependencies lies with the community as a whole.
>>>
>>> (2) We will be creating JIRAs for upgrading individual dependencies, not for upgrading to specific versions of those dependencies.
For example, if a given dependency X is three minor versions or a year behind, we will create a JIRA for upgrading it. But the specific version to upgrade to has to be determined by the Beam community. The Beam community might choose to close a JIRA if there are known issues with the available recent releases. The tool may reopen such a closed JIRA in the future if new information becomes available (for example, 3 new versions have been released since the JIRA was closed).
>>>
>>> Thoughts?
>>>
>>> Thanks,
>>> Cham
>>>
>>> On Tue, Aug 28, 2018 at 1:51 PM Chamikara Jayalath <chamik...@google.com> wrote:
>>>
>>>> On Tue, Aug 28, 2018 at 12:05 PM Thomas Weise <t...@apache.org> wrote:
>>>>
>>>>> I think there is an invalid assumption being made in this discussion, which is that most projects comply with semantic versioning. The reality in the open source big data space is unfortunately quite different. Ismaël has well characterized the situation and HBase isn't an exception. Another indicator of the scale of the problem is the extensive amount of shading used in Beam and other projects. It wouldn't be necessary if semver compliance were something we could rely on.
>>>>>
>>>>> Our recent Flink upgrade broke user(s). And we noticed a backward incompatible Flink change that affected the portable Flink runner even between patches.
>>>>>
>>>>> Many projects (including Beam) guarantee compatibility only for a subset of the public API. Sometimes a REST API is not covered, sometimes not-strictly-internal protocols change, and so on, all of which can break users despite the public API remaining "compatible". As much as I would love to rely on the version number to tell me whether an upgrade is safe or not, that's not practically possible.
>>>>>
>>>>> Furthermore, we need to proceed with caution when forcing upgrades on users that host the target systems. To stay with the Flink example, moving Beam from 1.4 to 1.5 is actually a major change to some, because they now have to upgrade their Flink clusters/deployments to be able to use the new version of Beam.
>>>>>
>>>>> Upgrades need to be done with caution and may require extensive verification beyond what our automation provides. I think the Spark change from 1.x to 2.x and also the JDK 1.8 change were good examples; they provided the community a window to give feedback and influence the change.
>>>>
>>>> Thanks for the clarification.
>>>>
>>>> The current policy indeed requests caution and explicit checks when upgrading all dependencies (including minor and patch versions), but the language might have to be updated to emphasize your concerns.
>>>>
>>>> Here's the current text.
>>>>
>>>> "Beam releases adhere to semantic versioning <https://beam.apache.org/get-started/downloads/>. Hence, community members should take care when updating dependencies. Minor version updates to dependencies should be backwards compatible in most cases. Some updates to dependencies though may result in backwards incompatible API or functionality changes to Beam. PR reviewers and committers should take care to detect any dependency updates that could potentially introduce backwards incompatible changes to Beam before merging, and PRs that update dependencies should include a statement regarding this verification in the form of a PR comment. Dependency updates that result in backwards incompatible changes to non-experimental features of Beam should be held till the next major version release of Beam. Any exceptions to this policy should only occur in extreme cases (for example, due to a security vulnerability of an existing dependency that is only fixed in a subsequent major version) and should be discussed in the Beam dev list.
Note that backwards incompatible changes to experimental features may be introduced in a minor version release."
>>>>
>>>> Also, are there any other steps we can take to make sure that Beam dependencies are not too old while offering a stable system? Note that having a lot of legacy dependencies that do not get upgraded regularly can also result in user pain, and in Beam being unusable for certain users who run into dependency conflicts when using Beam along with other systems (which will increase the amount of shading/vendoring we have to do).
>>>>
>>>> Please note that the current tooling does not force upgrades or automatically upgrade dependencies. It simply creates JIRAs that can be closed with a reason if needed. For the Python SDK though, we have version ranges in place for most dependencies [1], so these dependencies get updated automatically according to the corresponding ranges.
>>>>
>>>> [1] https://github.com/apache/beam/blob/master/sdks/python/setup.py#L103
>>>>
>>>> Thanks,
>>>> Cham
>>>>
>>>>> Thanks,
>>>>> Thomas
>>>>>
>>>>> On Tue, Aug 28, 2018 at 11:29 AM Raghu Angadi <rang...@google.com> wrote:
>>>>>
>>>>>> Thanks for the IO versioning summary.
>>>>>>
>>>>>> KafkaIO's policy of 'let the user decide exact version at runtime' has been quite useful so far. How feasible is that for other connectors?
>>>>>>
>>>>>> Also, KafkaIO does not limit itself to the minimum features available across all the supported versions. Some of the features (e.g. server-side timestamps) are disabled based on the runtime Kafka version. The unit tests currently run with a single recent version. Integration tests could certainly use multiple versions. With some more effort in writing tests, we could make multiple versions of the unit tests.
>>>>>>
>>>>>> Raghu.
>>>>>>
>>>>>>> IO versioning
>>>>>>> * Elasticsearch. We delayed the move to version 6 until we heard of more active users needing it (more deployments). We support 2.x and 5.x (but 2.x recently went EOL). Support for 6.x is in progress.
>>>>>>> * SolrIO, stable version is 7.x, LTS is 6.x. We support only 5.x because most big data distributions still use 5.x (however 5.x has been EOL).
>>>>>>> * KafkaIO uses version 1.x but Kafka recently moved to 2.x; however, most deployments of Kafka use versions earlier than 1.x. This module uses a single version with the kafka client as a provided dependency and so far it works (but we don't have multi-version tests).
>>>>>>
>>>>>> On Tue, Aug 28, 2018 at 8:38 AM Ismaël Mejía <ieme...@gmail.com> wrote:
>>>>>>
>>>>>>> I think we should refine the strategy on dependencies discussed recently. Sorry to come late with this (I did not follow the previous discussion closely), but the current approach is clearly not in line with the industry reality (at least not for IO connectors + Hadoop + Spark/Flink use).
>>>>>>>
>>>>>>> A really proactive approach to dependency updates is a good practice for the core dependencies we have, e.g. Guava, Bytebuddy, Avro, Protobuf, etc., and of course for the case of cloud-based IOs, e.g. GCS, Bigquery, AWS S3, etc. However, when we talk about self-hosted data sources or processing systems this gets more complicated, and I think we should be more flexible and do this case by case (and remove these from the auto-update email reminder).
>>>>>>>
>>>>>>> Some open source projects have at least three maintained versions:
>>>>>>> - LTS – maps to what most of the people have installed (or what the big data distributions use), e.g. HBase 1.1.x, Hadoop 2.6.x
>>>>>>> - Stable – current recommended version, e.g. HBase 1.4.x, Hadoop 2.8.x
>>>>>>> - Next – latest release, e.g. HBase 2.1.x, Hadoop 3.1.x
>>>>>>>
>>>>>>> Following the most recent versions can be good to stay close to the current development of other projects and some of the fixes, but these versions are commonly not deployed for most users, and adopting an LTS-only or stable-only approach won't satisfy all cases either. To understand why this is complex, let's see some historical issues:
>>>>>>>
>>>>>>> IO versioning
>>>>>>> * Elasticsearch. We delayed the move to version 6 until we heard of more active users needing it (more deployments). We support 2.x and 5.x (but 2.x recently went EOL). Support for 6.x is in progress.
>>>>>>> * SolrIO, stable version is 7.x, LTS is 6.x. We support only 5.x because most big data distributions still use 5.x (however 5.x has been EOL).
>>>>>>> * KafkaIO uses version 1.x but Kafka recently moved to 2.x; however, most deployments of Kafka use versions earlier than 1.x. This module uses a single version with the kafka client as a provided dependency and so far it works (but we don't have multi-version tests).
>>>>>>>
>>>>>>> Runners versioning
>>>>>>> * The move from Spark 1 to Spark 2 was decided after evaluating the tradeoff between maintaining support for multiple versions and accepting breaking changes. This is a rare case but also one with consequences. This dependency is provided but we don't actively test issues on version migration.
>>>>>>> * Flink moved to version 1.5, introducing an incompatibility in checkpointing (discussed recently, with no consensus yet on how to handle it).
>>>>>>>
>>>>>>> As you can see, it seems really hard to have a solution that fits all cases. Probably the only rule that I see from this list is that we should upgrade versions for connectors that have been deprecated or arrived at EOL (e.g. Solr 5.x, Elasticsearch 2.x).
>>>>>>>
>>>>>>> For the case of the provided dependencies, I wonder if as part of the tests we should provide tests with multiple versions (note that this is currently blocked by BEAM-4087).
>>>>>>>
>>>>>>> Any other ideas or opinions on how we can handle this? What do other people in the community think? (Notice that this can have a relation to the ongoing LTS discussion.)
>>>>>>>
>>>>>>> On Tue, Aug 28, 2018 at 10:44 AM Tim Robertson <timrobertson...@gmail.com> wrote:
>>>>>>> >
>>>>>>> > Hi folks,
>>>>>>> >
>>>>>>> > I'd like to revisit the discussion around our versioning policy, specifically for the Hadoop ecosystem, and make sure we are aware of the implications.
>>>>>>> >
>>>>>>> > As an example, our policy today would have us on HBase 2.1 and I have reminders to address this.
>>>>>>> >
>>>>>>> > However, the versions of HBase currently in the major Hadoop distros are:
>>>>>>> >
>>>>>>> > - Cloudera 5 on HBase 1.2 (Cloudera 6 is 2.1 but is only in beta)
>>>>>>> > - Hortonworks HDP3 on HBase 2.0 (only recently released, so we can assume it is not widely adopted)
>>>>>>> > - AWS EMR HBase on 1.4
>>>>>>> >
>>>>>>> > On the versioning, I think we might need a more nuanced approach to ensure that we target real communities of existing and potential users. Enterprise users need to stick to the supported versions in the distributions to maintain support contracts from the vendors.
>>>>>>> >
>>>>>>> > Should our versioning policy have more room to consider things on a case by case basis?
>>>>>>> >
>>>>>>> > For Hadoop, might we benefit from a strategy on which community of users Beam is targeting?
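[Editor's aside on the multi-version testing idea raised above by Raghu and Ismaël: the pattern of gating features on the runtime version of a provided client, checked against a small version matrix, can be sketched roughly as below. KafkaIO itself is Java; this is a language-neutral sketch in Python, and every name and version cutoff in it is hypothetical, chosen only to illustrate the shape of the approach.]

```python
# Hypothetical sketch of runtime feature gating against a provided client,
# in the spirit of KafkaIO's approach. Names and version cutoffs are made up.

def parse_version(version):
    """Parse a 'major.minor.patch' string into a comparable tuple."""
    return tuple(int(part) for part in version.split(".")[:3])

def supported_features(client_version):
    """Enable optional features only when the runtime client version allows them."""
    v = parse_version(client_version)
    features = {"basic_read", "basic_write"}
    if v >= (0, 10, 1):
        features.add("server_side_timestamps")  # hypothetical cutoff
    if v >= (0, 11, 0):
        features.add("exactly_once_writes")     # hypothetical cutoff
    return features

# A tiny "test matrix": run the same checks against several client versions,
# as suggested above for provided dependencies.
for version in ["0.9.0", "0.10.2", "1.1.0"]:
    assert "basic_read" in supported_features(version)  # baseline on every version
```

The same loop, pointed at real client jars or wheels instead of version strings, is essentially what a multi-version integration test suite would do.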
>>>>>>> >
>>>>>>> > (OT: I'm collecting some thoughts on what we might consider to target enterprise Hadoop users - Kerberos on all relevant IO, performance, leaking beyond encryption zones with temporary files, etc.)
>>>>>>> >
>>>>>>> > Thanks,
>>>>>>> > Tim
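[Editor's aside on the version ranges Cham references in the Python SDK's setup.py ([1] above): each dependency carries a lower and an upper bound, so patch and minor updates inside the range reach users automatically while a major bump still requires an explicit change to Beam. A rough sketch of how such a range admits or rejects a version follows; the entries are illustrative only, and real resolution is done by pip/setuptools, which handle many more spec forms.]

```python
# Illustrative version ranges of the kind used in the Python SDK's setup.py
# (example entries, not the actual pinned list).
install_requires_examples = [
    "httplib2>=0.8,<0.12",
    "oauth2client>=2.0.1,<5",
    "protobuf>=3.5.0,<4",
]

def allows(range_spec, version):
    """Check whether a 'pkg>=lo,<hi' spec admits a given version.

    Sketch only: supports exactly this one spec shape, with numeric
    dotted components; pip/setuptools implement the full PEP 440 rules.
    """
    def key(v):
        # Keep only numeric components so e.g. '3.5.0' compares cleanly.
        return tuple(int(p) for p in v.split(".") if p.isdigit())
    _, spec = range_spec.split(">=")
    lo, hi = spec.split(",<")
    return key(lo) <= key(version) < key(hi)
```

For example, `allows("httplib2>=0.8,<0.12", "0.11.3")` is true (a patch/minor update flows automatically), while `allows("httplib2>=0.8,<0.12", "0.12.0")` is false (stepping past the upper bound requires a deliberate change).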