[ https://issues.apache.org/jira/browse/FLINK-3656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16255075#comment-16255075 ]
Fabian Hueske commented on FLINK-3656:
--------------------------------------

There's still a bit of duplication, but I agree. We can close this parent issue for now.

Rework Table API tests
----------------------

        Key: FLINK-3656
        URL: https://issues.apache.org/jira/browse/FLINK-3656
    Project: Flink
 Issue Type: Improvement
 Components: Table API & SQL
   Reporter: Vasia Kalavri
     Labels: starter

The {{flink-table}} component consists of several APIs:
* Scala-embedded Table API
* String-based Table API (for Java)
* SQL
and compiles to two execution backends:
* DataStream API
* DataSet API

There are many different translation paths involved until a query is executed, all of which need to be thoroughly tested:
# Table API String -> Table API logical plan
# Table API Scala expressions -> Table API logical plan
# Table API logical plan -> Calcite RelNode plan
# SQL -> Calcite RelNode plan (done exclusively by Calcite)
# Calcite RelNodes -> DataSet RelNodes
# DataSet RelNodes -> DataSet program
# Calcite RelNodes -> DataStream RelNodes
# DataStream RelNodes -> DataStream program
# Calcite RexNode expressions -> generated code

Initially, many tests were implemented as end-to-end integration tests with high overhead. However, due to the combinations of APIs and execution backends, this approach causes many redundant tests and long build times. Therefore, I propose the following testing scheme:

1. Table API String -> Table API logical plan
The String-based Table API is tested by comparing the resulting logical plan ({{Table.logicalPlan}}) to the logical plan of an equivalent Table program that uses the Scala-embedded syntax. The logical plan is the Table API's internal representation, which is later converted into a Calcite RelNode plan.
All existing integration tests that check the "Java" Table API should be ported to such unit tests. This also removes duplicated tests: so far, the Java Table API has been tested for both batch and streaming, which is no longer necessary.

2. Table API Scala expressions -> Table API logical plan -> Calcite RelNodes -> DataSet RelNodes / DataStream RelNodes
These tests cover the translation and optimization of Table API queries and verify the Calcite-optimized plan. We need distinct tests for DataSet and DataStream environments since features and translation rules vary. These tests will also show whether added or modified rules or cost functions result in different plans. They should be the main tests for the Table API and very extensive.
These tests should be implemented by extending {{TableTestBase}}, which is a base class for unit tests and hence very lightweight.

3. SQL -> Calcite RelNodes -> DataSet RelNodes / DataStream RelNodes
These are the same tests as described in 2. (Table API Scala expressions -> DataSet / DataStream RelNodes), but for SQL.

4. DataSet RelNodes -> DataSet program
Unfortunately, the DataSet API lacks a good mechanism to test generated programs, i.e., to obtain a traversable plan of all operators with access to all user-defined functions. Until such a testing utility is available, I propose to test the translation to DataSet programs as end-to-end integration tests. However, I think we can run most tests on a collection ExecutionEnvironment, which does not start a Flink cluster but runs all code on Java collections. This makes these tests much more lightweight than cluster-based ITCases.
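For illustration, a minimal sketch of what such a collection-based test could look like. The environment setup is shown inline here; in practice it would come from {{TableProgramsCollectionTestBase}}. The Scala Table API of the current flink-table module is assumed:

{code:scala}
import org.apache.flink.api.scala._
import org.apache.flink.table.api.TableEnvironment
import org.apache.flink.table.api.scala._
import org.apache.flink.types.Row
import org.junit.Assert.assertEquals
import org.junit.Test

class GroupedAggregateCollectionITCase {

  @Test
  def testGroupedAggregate(): Unit = {
    // Collection execution: no MiniCluster is started, all operators
    // run directly on Java collections in the local JVM.
    val env = ExecutionEnvironment.createCollectionsEnvironment
    val tEnv = TableEnvironment.getTableEnvironment(env)

    val table = env
      .fromElements(("hello", 1), ("world", 2), ("hello", 3))
      .toTable(tEnv, 'word, 'cnt)

    val result = table
      .groupBy('word)
      .select('word, 'cnt.sum as 'total)
      .toDataSet[Row]
      .collect()

    // Two groups are expected: hello -> 4, world -> 2.
    assertEquals(Set("hello,4", "world,2"), result.map(_.toString).toSet)
  }
}
{code}

Such a test still exercises the full DataSetRel-to-DataSet translation and the generated functions, but behaves like a plain unit test in terms of build time.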
The goal of these tests should be to cover all translation paths from DataSetRel to DataSet program, i.e., all DataSetRel nodes and their translation logic. These tests should be implemented by extending the {{TableProgramsCollectionTestBase}} (see FLINK-5268).
Moreover, we should have very few cluster-based ITCases in place that check the execution path with the actual operators, serializers, and comparators. However, we should limit these tests to the minimum to keep the build time low. They should be implemented by extending the {{TableProgramsClusterTestBase}} (FLINK-5268) and should all be located in the same class to avoid repeated instantiation of the Flink MiniCluster.

5. DataStream RelNodes -> DataStream program
Basically the same applies here as for the DataSet programs. I'm not aware of a good way to test generated DataStream programs without executing them. A testing utility would be great for all libraries that are built on top of the API. Until then, I propose to use end-to-end integration tests.
Unfortunately, the DataStream API does not feature a collection execution mode, so all tests need to run on a MiniCluster. Therefore, we should again keep these tests to a minimum. They should be implemented by extending the {{StreamingMultipleProgramsTestBase}} and be located in a few classes to avoid repeated instantiations of the Flink MiniCluster.

6. (Scala expressions | String-parsed expressions | SQL expressions) -> RexNode expressions -> generated code
In order to avoid extensive optimization tests for each supported expression or built-in function, we have the {{ExpressionTestBase}}, which compiles expressions into generated code and checks the results for correctness. All supported expressions and built-in functions should be tested by extending the {{ExpressionTestBase}} instead of running a full integration test (a sketch follows at the end of this comment).

I will add a few JIRAs to migrate existing tests to the new testing scheme.
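To illustrate scheme 6, a minimal sketch of such an expression test. The {{testAllApis}}, {{testData}}, and {{typeInfo}} hooks are assumed to match the current {{ExpressionTestBase}}; exact signatures may differ between versions:

{code:scala}
import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.api.java.typeutils.RowTypeInfo
import org.apache.flink.table.api.Types
import org.apache.flink.table.api.scala._
import org.apache.flink.table.expressions.utils.ExpressionTestBase
import org.apache.flink.types.Row
import org.junit.Test

class SubstringExpressionTest extends ExpressionTestBase {

  @Test
  def testSubstring(): Unit = {
    // One call checks the Scala-embedded, the String-parsed, and the SQL
    // variant of the same expression. The expressions are compiled into
    // generated code and evaluated on the test row; no optimizer run and
    // no cluster are involved.
    testAllApis(
      'f0.substring(2),    // Scala-embedded Table API
      "f0.substring(2)",   // String-based Table API
      "SUBSTRING(f0, 2)",  // SQL
      "ello")
  }

  // A single test row that the expressions are evaluated against.
  override def testData: Row = {
    val row = new Row(1)
    row.setField(0, "hello")
    row
  }

  override def typeInfo: TypeInformation[Any] =
    new RowTypeInfo(Types.STRING).asInstanceOf[TypeInformation[Any]]
}
{code}

A single {{testAllApis}} call covers all three expression APIs against the code generator, which would otherwise take three separate integration tests.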