On Wed, Sep 5, 2018 at 12:50 PM Tim Robertson <timrobertson...@gmail.com> wrote:
> Thank you Cham, and everyone for contributing
>
> Sorry for the slow reply to a thread I started, but I've been swamped on non-Beam projects.
>
>> KafkaIO's policy of 'let the user decide exact version at runtime' has been quite useful so far. How feasible is that for other connectors?
>
> I presume shimming might be needed in a few places but it's certainly something we might want to explore more. I'll look into KafkaIO.
>
> On Cham's proposal:
>
> (1) +0.5. We can always then opt to either assign or take ownership of an issue, although I am also happy to stick with the owners model - it prompted me to investigate and resulted in this thread.
>
> (2) I think this makes sense. A bot informing us that we're falling behind versions is immensely useful as long as we can link issues to others which might have a wider discussion (remember many dependencies need to be treated together, such as "Support Hadoop 3.0.x" or "Support HBase 2.x"). Would it make sense to let owners use the Jira "fix versions" field to put in a future release to inform the bot when it should start alerting again?

I think this makes sense. Setting a "fix version" will be especially useful for dependency changes that result in API changes that have to be postponed until the next major version of Beam. On grouping, I believe we already group JIRAs into tasks and sub-tasks based on the group ids of dependencies. I suppose it will not be too hard to close multiple sub-tasks with the same reasoning.

> On Wed, Sep 5, 2018 at 3:18 AM Yifan Zou <yifan...@google.com> wrote:
>
>> Thanks Cham for putting this together. Also, after modifying the dependency tool based on the policy above, we will close all existing JIRA issues, which prevents creating duplicate bugs and stops pushing assignees to upgrade dependencies via old bugs.
>>
>> Please let us know if you have any comments on the revised policy in Cham's email.
>>
>> Thanks all.
>>
>> Regards,
>> Yifan Zou
>>
>> On Tue, Sep 4, 2018 at 5:35 PM Chamikara Jayalath <chamik...@google.com> wrote:
>>
>>> Based on this email thread and offline feedback from several folks, the current concerns regarding dependency upgrade policy and tooling seem to be the following.
>>>
>>> (1) We have to be careful when upgrading dependencies. For example, we should not create JIRAs for upgrading to dependency versions that have known issues.
>>>
>>> (2) The dependency owners list can get stale. Somebody who is interested in upgrading a dependency today might not be interested in the same task in six months. Responsibility for upgrading a dependency should lie with the community instead of pre-identified owner(s).
>>>
>>> On the other hand, we do not want Beam to significantly fall behind when it comes to dependencies. We should upgrade dependencies whenever it makes sense. This allows us to offer a more up-to-date system and to make things easy for users that deploy Beam along with other systems.
>>>
>>> I discussed these issues with Yifan and we would like to suggest the following changes to the current policy and tooling that might help alleviate some of the concerns.
>>>
>>> (1) Instead of a dependency "owners" list we will be maintaining an "interested parties" list. When we create a JIRA for a dependency we will not assign it to an owner, but rather CC all the folks that mentioned they would be interested in receiving updates related to that dependency. The hope is that some of the interested parties will also put forward the effort to upgrade dependencies they are interested in, but the responsibility of upgrading dependencies lies with the community as a whole.
>>>
>>> (2) We will be creating JIRAs for upgrading individual dependencies, not for upgrading to specific versions of those dependencies.
For example, if a given dependency X is three minor versions or a year behind, we will create a JIRA for upgrading it. But the specific version to upgrade to has to be determined by the Beam community. The Beam community might choose to close a JIRA if there are known issues with the available recent releases. The tool may reopen such a closed JIRA in the future if new information becomes available (for example, 3 new versions have been released since the JIRA was closed).
>>>
>>> Thoughts?
>>>
>>> Thanks,
>>> Cham
>>>
>>> On Tue, Aug 28, 2018 at 1:51 PM Chamikara Jayalath <chamik...@google.com> wrote:
>>>
>>>> On Tue, Aug 28, 2018 at 12:05 PM Thomas Weise <t...@apache.org> wrote:
>>>>
>>>>> I think there is an invalid assumption being made in this discussion, which is that most projects comply with semantic versioning. The reality in the open source big data space is unfortunately quite different. Ismaël has well characterized the situation and HBase isn't an exception. Another indicator of the scale of the problem is the extensive amount of shading used in Beam and other projects. It wouldn't be necessary if semver compliance were something we could rely on.
>>>>>
>>>>> Our recent Flink upgrade broke user(s). And we noticed a backward incompatible Flink change that affected the portable Flink runner even between patches.
>>>>>
>>>>> Many projects (including Beam) guarantee compatibility only for a subset of the public API. Sometimes a REST API is not covered, sometimes not-strictly-internal protocols change, and so on, all of which can break users despite the public API remaining "compatible". As much as I would love to rely on the version number to tell me whether an upgrade is safe or not, that's not practically possible.
>>>>>
>>>>> Furthermore, we need to proceed with caution when forcing upgrades on users that host the target systems. To stay with the Flink example, moving Beam from 1.4 to 1.5 is actually a major change to some, because they now have to upgrade their Flink clusters/deployments to be able to use the new version of Beam.
>>>>>
>>>>> Upgrades need to be done with caution and may require extensive verification beyond what our automation provides. I think the Spark change from 1.x to 2.x and also the JDK 1.8 change were good examples; they provided the community a window to give feedback and influence the change.
>>>>
>>>> Thanks for the clarification.
>>>>
>>>> The current policy indeed requests caution and explicit checks when upgrading all dependencies (including minor and patch versions), but the language might have to be updated to emphasize your concerns.
>>>>
>>>> Here's the current text.
>>>>
>>>> "Beam releases adhere to semantic versioning <https://beam.apache.org/get-started/downloads/>. Hence, community members should take care when updating dependencies. Minor version updates to dependencies should be backwards compatible in most cases. Some updates to dependencies though may result in backwards incompatible API or functionality changes to Beam. PR reviewers and committers should take care to detect any dependency updates that could potentially introduce backwards incompatible changes to Beam before merging, and PRs that update dependencies should include a statement regarding this verification in the form of a PR comment. Dependency updates that result in backwards incompatible changes to non-experimental features of Beam should be held till the next major version release of Beam. Any exceptions to this policy should only occur in extreme cases (for example, due to a security vulnerability of an existing dependency that is only fixed in a subsequent major version) and should be discussed in the Beam dev list.
Note that backwards incompatible changes to experimental features may be introduced in a minor version release."
>>>>
>>>> Also, are there any other steps we can take to make sure that Beam dependencies are not too old while offering a stable system? Note that having a lot of legacy dependencies that do not get upgraded regularly can also result in user pain, and in Beam being unusable for certain users who run into dependency conflicts when using Beam along with other systems (which will increase the amount of shading/vendoring we have to do).
>>>>
>>>> Please note that the current tooling does not force upgrades or automatically upgrade dependencies. It simply creates JIRAs that can be closed with a reason if needed. For the Python SDK though, we have version ranges in place for most dependencies [1], so these dependencies get updated automatically according to the corresponding ranges.
>>>>
>>>> [1] https://github.com/apache/beam/blob/master/sdks/python/setup.py#L103
>>>>
>>>> Thanks,
>>>> Cham
>>>>
>>>>> Thanks,
>>>>> Thomas
>>>>>
>>>>> On Tue, Aug 28, 2018 at 11:29 AM Raghu Angadi <rang...@google.com> wrote:
>>>>>
>>>>>> Thanks for the IO versioning summary.
>>>>>>
>>>>>> KafkaIO's policy of 'let the user decide exact version at runtime' has been quite useful so far. How feasible is that for other connectors?
>>>>>>
>>>>>> Also, KafkaIO does not limit itself to the minimum features available across all the supported versions. Some of the features (e.g. server-side timestamps) are disabled based on the runtime Kafka version. The unit tests currently run with a single recent version. Integration tests could certainly use multiple versions. With some more effort in writing tests, we could make multiple versions of the unit tests.
>>>>>>
>>>>>> Raghu.
>>>>>>
>>>>>>> IO versioning
>>>>>>> * Elasticsearch. We delayed the move to version 6 until we heard of more active users needing it (more deployments). We support 2.x and 5.x (but 2.x recently went EOL). Support for 6.x is in progress.
>>>>>>> * SolrIO, stable version is 7.x, LTS is 6.x. We support only 5.x because most big data distributions still use 5.x (however 5.x has been EOL).
>>>>>>> * KafkaIO uses version 1.x but Kafka recently moved to 2.x; however, most deployments of Kafka use versions earlier than 1.x. This module uses a single version with the kafka client as a provided dependency and so far it works (but we don't have multi-version tests).
>>>>>>
>>>>>> On Tue, Aug 28, 2018 at 8:38 AM Ismaël Mejía <ieme...@gmail.com> wrote:
>>>>>>
>>>>>>> I think we should refine the strategy on dependencies discussed recently. Sorry to come late with this (I did not follow the previous discussion closely), but the current approach is clearly not in line with the industry reality (at least not for IO connectors + Hadoop + Spark/Flink use).
>>>>>>>
>>>>>>> A really proactive approach to dependency updates is a good practice for the core dependencies we have, e.g. Guava, Bytebuddy, Avro, Protobuf, etc., and of course for the case of cloud-based IOs, e.g. GCS, Bigquery, AWS S3, etc. However, when we talk about self-hosted data sources or processing systems this gets more complicated, and I think we should be more flexible and do this case by case (and remove these from the auto-update email reminder).
>>>>>>>
>>>>>>> Some open source projects have at least three maintained versions:
>>>>>>> - LTS – maps to what most of the people have installed (or what the big data distributions use), e.g. HBase 1.1.x, Hadoop 2.6.x
>>>>>>> - Stable – current recommended version, e.g. HBase 1.4.x, Hadoop 2.8.x
>>>>>>> - Next – latest release, e.g. HBase 2.1.x, Hadoop 3.1.x
>>>>>>>
>>>>>>> Following the most recent versions can be good to stay close to the current development of other projects and some of the fixes, but these versions are commonly not deployed for most users, and adopting an LTS-only or stable-only approach won't satisfy all cases either. To understand why this is complex, let's see some historical issues:
>>>>>>>
>>>>>>> IO versioning
>>>>>>> * Elasticsearch. We delayed the move to version 6 until we heard of more active users needing it (more deployments). We support 2.x and 5.x (but 2.x recently went EOL). Support for 6.x is in progress.
>>>>>>> * SolrIO, stable version is 7.x, LTS is 6.x. We support only 5.x because most big data distributions still use 5.x (however 5.x has been EOL).
>>>>>>> * KafkaIO uses version 1.x but Kafka recently moved to 2.x; however, most deployments of Kafka use versions earlier than 1.x. This module uses a single version with the kafka client as a provided dependency and so far it works (but we don't have multi-version tests).
>>>>>>>
>>>>>>> Runners versioning
>>>>>>> * The move from Spark 1 to Spark 2 was decided after evaluating the tradeoff between maintaining support for multiple versions and accepting breaking changes. This is a rare case but also one with consequences. This dependency is provided but we don't actively test issues on version migration.
>>>>>>> * Flink moved to version 1.5, introducing an incompatibility in checkpointing (discussed recently, with no consensus yet on how to handle it).
>>>>>>>
>>>>>>> As you can see, it seems really hard to have a solution that fits all cases. Probably the only rule that I see from this list is that we should upgrade versions for connectors that have been deprecated or arrived at EOL (e.g. Solr 5.x, Elasticsearch 2.x).
>>>>>>>
>>>>>>> For the case of the provided dependencies, I wonder if as part of the tests we should provide tests with multiple versions (note that this is currently blocked by BEAM-4087).
>>>>>>>
>>>>>>> Any other ideas or opinions on how we can handle this? What do other people in the community think? (Notice that this can have a relation to the ongoing LTS discussion.)
>>>>>>>
>>>>>>> On Tue, Aug 28, 2018 at 10:44 AM Tim Robertson <timrobertson...@gmail.com> wrote:
>>>>>>> >
>>>>>>> > Hi folks,
>>>>>>> >
>>>>>>> > I'd like to revisit the discussion around our versioning policy, specifically for the Hadoop ecosystem, and make sure we are aware of the implications.
>>>>>>> >
>>>>>>> > As an example, our policy today would have us on HBase 2.1 and I have reminders to address this.
>>>>>>> >
>>>>>>> > However, the versions of HBase currently in the major Hadoop distros are:
>>>>>>> >
>>>>>>> > - Cloudera 5 on HBase 1.2 (Cloudera 6 is 2.1 but is only in beta)
>>>>>>> > - Hortonworks HDP3 on HBase 2.0 (only recently released, so we can assume it is not widely adopted)
>>>>>>> > - AWS EMR HBase on 1.4
>>>>>>> >
>>>>>>> > On the versioning, I think we might need a more nuanced approach to ensure that we target real communities of existing and potential users. Enterprise users need to stick to the supported versions in the distributions to maintain support contracts from the vendors.
>>>>>>> >
>>>>>>> > Should our versioning policy have more room to consider things on a case by case basis?
>>>>>>> >
>>>>>>> > For Hadoop, might we benefit from a strategy on which community of users Beam is targeting?
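[Editor's aside on the multi-version testing idea raised above by Raghu and Ismaël: the pattern of gating features on the runtime version of a provided client, checked against a small version matrix, can be sketched roughly as below. KafkaIO itself is Java; this is a language-neutral sketch in Python, and every name and version cutoff in it is hypothetical, chosen only to illustrate the shape of the approach.]

```python
# Hypothetical sketch of runtime feature gating against a provided client,
# in the spirit of KafkaIO's approach. Names and version cutoffs are made up.

def parse_version(version):
    """Parse a 'major.minor.patch' string into a comparable tuple."""
    return tuple(int(part) for part in version.split(".")[:3])

def supported_features(client_version):
    """Enable optional features only when the runtime client version allows them."""
    v = parse_version(client_version)
    features = {"basic_read", "basic_write"}
    if v >= (0, 10, 1):
        features.add("server_side_timestamps")  # hypothetical cutoff
    if v >= (0, 11, 0):
        features.add("exactly_once_writes")     # hypothetical cutoff
    return features

# A tiny "test matrix": run the same checks against several client versions,
# as suggested above for provided dependencies.
for version in ["0.9.0", "0.10.2", "1.1.0"]:
    assert "basic_read" in supported_features(version)  # baseline on every version
```

The same loop, pointed at real client jars or wheels instead of version strings, is essentially what a multi-version integration test suite would do.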
>>>>>>> >
>>>>>>> > (OT: I'm collecting some thoughts on what we might consider to target enterprise Hadoop users - Kerberos on all relevant IO, performance, leaking beyond encryption zones with temporary files, etc.)
>>>>>>> >
>>>>>>> > Thanks,
>>>>>>> > Tim
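[Editor's aside on the version ranges Cham references in the Python SDK's setup.py ([1] above): each dependency carries a lower and an upper bound, so patch and minor updates inside the range reach users automatically while a major bump still requires an explicit change to Beam. A rough sketch of how such a range admits or rejects a version follows; the entries are illustrative only, and real resolution is done by pip/setuptools, which handle many more spec forms.]

```python
# Illustrative version ranges of the kind used in the Python SDK's setup.py
# (example entries, not the actual pinned list).
install_requires_examples = [
    "httplib2>=0.8,<0.12",
    "oauth2client>=2.0.1,<5",
    "protobuf>=3.5.0,<4",
]

def allows(range_spec, version):
    """Check whether a 'pkg>=lo,<hi' spec admits a given version.

    Sketch only: supports exactly this one spec shape, with numeric
    dotted components; pip/setuptools implement the full PEP 440 rules.
    """
    def key(v):
        # Keep only numeric components so e.g. '3.5.0' compares cleanly.
        return tuple(int(p) for p in v.split(".") if p.isdigit())
    _, spec = range_spec.split(">=")
    lo, hi = spec.split(",<")
    return key(lo) <= key(version) < key(hi)
```

For example, `allows("httplib2>=0.8,<0.12", "0.11.3")` is true (a patch/minor update flows automatically), while `allows("httplib2>=0.8,<0.12", "0.12.0")` is false (stepping past the upper bound requires a deliberate change).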