Hi,

as you all know, Flink has a layered architecture with multiple alternatives for certain levels. Examples are:
- Programming APIs: Java, Scala, (and Python in progress)
- Processing Backends: distributed runtime (formerly Nephele), Java Collections, (and potentially Tez in the future)

The challenge with multiple alternatives that serve the same purpose is that they should be kept in sync. A feature that is added to the Java API should also be added to the Scala API (and to other APIs in the future). The same applies to new runtime strategies and operators, such as outer joins. I think we need a policy for how to keep the features of the different layer alternatives in sync.

With the recent update of the Scala API, a ScalaAPICompletenessTest was added that checks whether the Scala API offers the same methods as the Java API. Adding a feature to the Java API breaks the build and requires either adapting the Scala API as well or excluding the added methods from the test.

While this test is a great tool to make sure that the APIs are synced, it basically requires that the APIs are synced at all times, i.e., a modification of the Java API must go with an equivalent change of the Scala API. If we make this a tight policy and force compatibility at all times, contributors must know about several different technologies (Scala Compiler Macros, Python, the implementation details of multiple runtime backends, ...). This sounds like a huge entrance barrier to me.

To make it clear, I am definitely in favor of keeping APIs and backends in sync. However, I propose to enforce this only for releases, i.e., allow out-of-sync APIs on the master branch and fix the APIs for releases. With this additional requirement, we also need to think twice about which features to add, since multiple components of the system will be affected.

What do you guys think?
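P.S. For anyone who has not looked at the completeness test yet, the general idea can be sketched roughly as below. This is only an illustration under my own assumptions, not the actual ScalaAPICompletenessTest: the classes JavaStyleDataSet and ScalaStyleDataSet and the helper missingMethods are hypothetical stand-ins, and the comparison is by method name only (it ignores overloads and signatures).

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;

public class ApiCompletenessSketch {

    // Hypothetical stand-ins for two API flavours; these are NOT Flink classes.
    public static class JavaStyleDataSet {
        public void map() {}
        public void filter() {}
        public void outerJoin() {}
        public void getExecutionEnvironment() {}
    }

    public static class ScalaStyleDataSet {
        public void map() {}
        public void filter() {}
    }

    /**
     * Returns the names of public methods declared by 'reference' that
     * 'candidate' does not declare, skipping explicitly excluded names.
     */
    public static Set<String> missingMethods(
            Class<?> reference, Class<?> candidate, Set<String> excluded) {
        // Collect the public method names the candidate API offers.
        Set<String> candidateNames = new TreeSet<>();
        for (Method m : candidate.getDeclaredMethods()) {
            if (Modifier.isPublic(m.getModifiers())) {
                candidateNames.add(m.getName());
            }
        }
        // Report every public reference method with no counterpart and no exclusion.
        Set<String> missing = new TreeSet<>();
        for (Method m : reference.getDeclaredMethods()) {
            if (!Modifier.isPublic(m.getModifiers()) || m.isSynthetic()) {
                continue;
            }
            String name = m.getName();
            if (!candidateNames.contains(name) && !excluded.contains(name)) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Methods that are deliberately not mirrored go on the exclusion list.
        Set<String> excluded = Collections.singleton("getExecutionEnvironment");
        Set<String> missing =
                missingMethods(JavaStyleDataSet.class, ScalaStyleDataSet.class, excluded);
        System.out.println("Missing in second API: " + missing);
    }
}
```

With a check like this, enforcing sync only for releases could mean running it as a blocking test on release branches while master merely reports the missing methods.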
