zentol commented on issue #10436: [FLINK-14920] [flink-end-to-end-perf-tests] Set up environment to run performance e2e tests URL: https://github.com/apache/flink/pull/10436#issuecomment-592501122

Okay, here we go. Let me first walk you through some of the problems I saw in this PR:

1) There is a weird inconsistency in the required SSH credentials: the library requires a username/password, but to start the cluster we rely on `start-cluster.sh`, which requires password-less SSH to be set up. This requires further investigation/clarification.
2) The SSH utilities would be easier to use if there were a dedicated container for credentials.
3) The SSH library used is really old (from 2006), and its latest release was in 2014; we may want to use another library.
4) You've added a lot of additional functionality to the FlinkDistribution, which overloads it a bit, and it also breaks some API contracts. For example, in its current form it should be possible to modify the configuration after `before()` was called, but these changes aren't applied to the remote hosts.
5) The DistributedStandaloneFlinkResource contains next to nothing, and may not work since it assumes at least one JM to be on the local host.

As mentioned earlier, while working on this I noticed a few things that are awkward in the existing e2e code, and have filed a number of tickets and prepared solutions for them:

- With FLINK-16307 we now always rely on `start-cluster.sh` to start the cluster, which is more consistent with the distributed setup.
- With FLINK-16190 tests no longer use the FlinkDistribution directly, which allows us to rework it into a simple wrapper.
- With FLINK-16189 the FlinkDistribution no longer contains any testing logic; parameters are passed in from the FlinkResource implementations instead.
This makes it easier to re-use the distribution in ways that don't fit the JUnit lifecycle, such as, in our case, creating a FlinkDistribution, mutating it, and afterwards distributing it across several hosts.

Based on the problems I saw in this PR and the filed tickets, I have created a work-in-progress branch that shows how I envision the solution: https://github.com/zentol/flink/tree/14920_wip

Let me outline some of the differences / decisions I made:

- The StandaloneCluster-/-JobController are now top-level classes to simplify re-use by multiple FlinkResource implementations.
- All options specific to the distributed setup are now evaluated by the `DistributedFlinkResourceFactory`; this reduces the number of places where parameters are processed, which makes it easier to discover which options need to be set.
- The `DistributedFlinkResource` takes care of the distributed setup. It mutates the local copy of the distribution (which is used to orchestrate the cluster startup / job submissions), distributes the distributions, and contains the logic for performing operations across multiple distributions.
- A remote distribution is represented by the `RemoteFlinkDistribution` class, a heavily stripped-down version of the `FlinkDistribution` that does not allow any modifications at this point. It is mostly used to copy/search logs and stop any processes running on that host (which may not even be required).
- The SSH utils are reworked into an `SshTool`, which acts both as a container for SSH configuration options (port, credentials) and as an access point to SSH methods that can be passed around conveniently. I also switched to `com.hierynomus:sshj`, which appears to be better maintained and has a fairly nice API.

I left a number of TODO comments for things that are still missing. Please take a look and tell me what you think.
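To illustrate the "container for credentials" idea, here is a minimal sketch of what such an `SshTool` value object could look like. All names and fields here are hypothetical (the actual class lives in the WIP branch linked above); the real version would delegate command execution and file transfer to sshj's `SSHClient` rather than only formatting targets.

```java
import java.util.Objects;

/**
 * Hypothetical sketch of an SshTool: a single immutable handle bundling
 * the SSH connection settings (port, credentials) so they can be passed
 * around together instead of as loose parameters. The real implementation
 * would additionally wrap com.hierynomus:sshj for exec/upload operations.
 */
public final class SshTool {
    private final String username;
    private final String password; // could alternatively be a private-key path
    private final int port;

    public SshTool(String username, String password, int port) {
        this.username = Objects.requireNonNull(username);
        this.password = Objects.requireNonNull(password);
        this.port = port;
    }

    public String username() {
        return username;
    }

    public int port() {
        return port;
    }

    /** Builds an scp-style target spec for copying a distribution to a host. */
    public String copyTarget(String host, String remotePath) {
        return username + "@" + host + ":" + remotePath;
    }

    public static void main(String[] args) {
        SshTool tool = new SshTool("flink", "secret", 22);
        // prints flink@worker-1:/opt/flink
        System.out.println(tool.copyTarget("worker-1", "/opt/flink"));
    }
}
```

The point of the design is that a `DistributedFlinkResource` or `RemoteFlinkDistribution` only needs to receive one object to perform any SSH operation, rather than threading usernames, passwords, and ports through every method signature.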
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services
