Hey Mark & others, We've been following the structure proposed in this thread to extend test coverage for Beam Python SDK on Python 3.5, 3.6, 3.7 interpreters, see [1].
This structure allowed us to add 3.x suites without slowing down the pre/postcommit execution time. We can actually see a drop in precommit latency [2] around March 23 we first made some Python 3.x suites run in parallel, and we have added more suites since then without slowing down pre/postcommits. Therefore I am in favor of this proposal, especially since AFAIK we don't have better one. Thanks a lot! I do have some feedback on this proposal: 1. There is a duplication of gradle code between test suites for different python minor versions, for example see the identical definition of DirectRunner PostCommitIT suite for Python 3.6 and Python 3.7 [4,5]. Possible solution to reduce the duplication is to move common code that defines a task into a separate groovy file shared across multiple gradle files. We have an example of this, where enablePythonPerformanceTest() is defined in BeamModulePlugin.groovy, and used in several build.gradle files to create a gradle task required for performance tests, see: [6]. I followed the same example in a Python 3 test suite for Portable Flink Runner I am working on [3], however I am not sure if BeamModulePlugin is the best place to define common gradle tasks to needed for Python CI. Perhaps we can make a separate groovy file for this purpose in sdk/python/test-suites? 2. Python 3 test suites currently live in sdks/python/test-suites, while most Python 2 suites are still defined in sdks/python/build.gradle. This may cause confusion for folks working on adding new Python suites. If there is an overall agreement on proposed structure I suggest to start moving Python 2 CI tasks out of sdks/python/build.gradle into sdks/python/test-suites/[runner]/py27/build.gradle, or a common groovy file. If there are better alternatives we can continue discussing them here. Thanks, Valenyn [1] https://github.com/apache/beam/tree/master/sdks/python/test-suites [2] http://104.154.241.245/d/_TNndF2iz/pre-commit-test-latency?orgId=1&from=1546507894013&to=1554189164736 [3] https://github.com/apache/beam/pull/8745 [4] https://github.com/apache/beam/blob/291f1e9fb5ce5ee4bb7e2519ffe40334fb5c08c5/sdks/python/test-suites/direct/py36/build.gradle#L27 [5] https://github.com/apache/beam/blob/291f1e9fb5ce5ee4bb7e2519ffe40334fb5c08c5/sdks/python/test-suites/direct/py37/build.gradle#L27 [6] https://github.com/apache/beam/search?q=enablePythonPerformanceTest&unscoped_q=enablePythonPerformanceTest On Fri, Mar 29, 2019 at 9:45 AM Udi Meiri <eh...@google.com> wrote: > I don't use gradle commands for Python development either, because they > are slow (no incremental testing). > > > > On Fri, Mar 29, 2019 at 9:16 AM Michael Luckey <adude3...@gmail.com> > wrote: > >> >> >> On Fri, Mar 29, 2019 at 2:31 PM Robert Bradshaw <rober...@google.com> >> wrote: >> >>> On Fri, Mar 29, 2019 at 12:54 PM Michael Luckey <adude3...@gmail.com> >>> wrote: >>> > >>> > Really like the idea of improving here. >>> > >>> > Unfortunately, I haven't worked with python on that scale yet, so bear >>> with my naive understandings in this regard. If I understand correctly, the >>> suggestion will result in a couple of projects consisting only of a >>> build,gradle file to kind of workaround on gradles decision not to >>> parallelize within projects, right? In consequence, this also kind of >>> decouples projects from their content - they stuff which constitutes the >>> project - and forces the build file to 'somehow reach out to content of >>> other (only python root?) projects. E.g couples projects. This somehow >>> 'feels non natural' to me. But, of course, might be the path to go. As I >>> said before, never worked on python on that scale. >>> >>> It feels a bit odd to me as well. Is it possible to have multiple >>> projects per directory (e.g. a suite of testing ones) rather than >>> having to break things up like this, especially if the goal is >>> primarily to get parallel running of tests? Especially if we could >>> automatically create the cross-product rather than manually? There >>> also seems to be some redundancy with what tox is doing here. >>> >> >> Not sure, whether I understand correctly. But I do not think that's >> possible. If we are going to do some cross-product, we are probably better >> of doing that on tasks, e.g. by leveraging task rules or programmatically >> adding tasks (which is already done in parts). Of course, this will not >> help with parallelisation (but might enable that, see below). >> >> >>> >>> > But I believe to remember Robert talking about using in project >>> parallelisation for his development. Is this something which could also >>> work on CI? Of course, that will not help with different python versions, >>> but maybe that could be solved also by gradles variants which are >>> introduced in 5.3 - definitely need some time to investigate the >>> possibilities here. On first sight it feels like lots of duplication to >>> create 'builds' for any python version. Or wouldn't that be the case? >>> > >>> > And another naive thought on my side, isn't that non parallelizability >>> also caused by the monolithic setup of the python code base? E.g. if I >>> understand correctly, java sdk is split into core/runners/ios etc, each >>> encapsulate into full blown projects, i.e. buckets of sources, tests and >>> build file. Would it technically possible to do something similar with >>> python? I assume that being discussed before and teared apart, but couldn't >>> find on mailing list. >>> >>> Neither the culture nor the tooling of Python supports lots of >>> interdependent "sub-packages" for a single project--at least not >>> something smaller than one would want to deploy to Pypi. So while one >>> could do this, it'd be going against the grain. There are also much >>> lower-hanging opportunities for parallelization (e.g. running the test >>> suites for separate python versions in parallel). >>> >>> It's not very natural (as I understand it) with Go either. If we're >>> talking directory re-organization, I think it would make sense to >>> consider having top-level java, python, go, ... next to model, >>> website, etc. >>> >> >> Yes. We shouldn't work against common culture/practices, but try to >> embrace native tooling and add support where required. >> >> To reiterate on parallelisation, there are (at least) three opportunities: >> >> 1. Parallelise on test level. For python, this is detox? >> > > This is actually 2 levels. :) > 1a. Parallelise at the nosetest level - run unit tests in parallel in a > single tox environment. (I have a PR in progress to migrate to pytest, and > we should be able to do file-level parallelism provided we solve pickling > issues.) > 1b. Parallelise at the tox environment level, e..g, somehow running the > multiple tox environments (py27,py27-cython,py35,...) in parallel. > > >> 2. Parallelise on Gradle project level (--parallel option) >> 3. Parallelise on CI level >> >> So what I ve done before, if 1. does not help, nor 2. cause project is >> 'just to big' was 3, i.e. splitting on CI level. So the simplest thing I >> could imagine right now would be - as suggested above - to split >> pythonPreCommit into something like pythonX_YPrecommit, which then runs >> those different python versions in parallel. Of course, that could be done >> ad infinitum by splitting further into runners, IOs whatever. >> >> OTOH, we do already have tons of jobs running as we seem to map Gradle >> tasks to Jenkins jobs. So it might be more appropriate to leverage CI >> parallelization on jenkins pipeline level. Something like creating >> pythonPrecommit as a pipeline, which itself runs several steps in parallel. >> Did not contemplate on the impacts, though, on PRs and phrase triggering. >> Of course, we might need to improve on our 'verbosity' as currently it >> seems not always clear what command line actually triggered etc. With this >> approach, it seems possible to run tasks within same Gradle project in >> parallel with whatever granularity we might want to choose. (Of course we >> do add overhead here, but that's probably always the case by doing things >> in parallel) >> >> Of course 3. would not help on parallelisation on developers machine, but >> I never had a problem with that. >> >> >>> >>> > And as a last thought, will usage of pygradle help with better >>> python/gradle integration? Currently, we mainly use gradle to call into >>> shell scripts, which doesn't help gradle nor probably pythons tooling to do >>> the job very well? But deeper integration might cause problems on python >>> dev side, dunno :( >>> >>> Possibly. >>> >>> Are there any Python developers that primarily use the gradle >>> commands? Personally, I only use them if I'm using Java (or sometimes >>> work that is a mix of Java and Python, e.g. the Python-on-Flink >>> tests). Otherwise I use tox, or "python setup.py test [-s ...]" >>> directly. Gradle primarily has value as a top-level orchestration (in >>> particular for CI) and easy way for those who only touch Python >>> occasionally to run all the tests. If that's the case, optimizing our >>> gradle scripts for CI seems best. >>> >> >> My impression till today is, that nobody is actually using Gradle on >> daily python development. So I assume it is currently used on CI only. But >> I believe gradle could add value for devs also. We just need to improve our >> current integration. And to be very clear, gradle should not replace python >> tooling, but support usage in a way which looks familiar to any python dev. >> And probably render some shell scripting obsolete. >> >> >>> >>> > On Thu, Mar 28, 2019 at 6:37 PM Mark Liu <mark...@apache.org> wrote: >>> >> >>> >> Thank you Ahmet. Answer your questions below: >>> >> >>> >>> >>> >>> - Could you comment on what kind of parallelization we will gain by >>> this? In terms of real numbers, how would this affect build and test times? >>> >> >>> >> >>> >> The proposal is based on Gradle parallel execution: "you can force >>> Gradle to execute tasks in parallel as long as those tasks are in different >>> projects". In Beam, project is declared per build.gradle file and >>> registered in settings.gradle. Tasks that are included in single Gradle >>> execution will run in parallel only if they are declared in separate >>> build.gradle files. >>> >> >>> >> An example of applying parallel is beam_PreCommit_Python job which >>> runs :pythonPreCommit task that contains tasks distributed in 4 >>> build.gradle. The execution graph looks like >>> https://scans.gradle.com/s/4frpmto6o7hto/timeline: >>> >> >>> >> Without this proposal, all tasks will run in sequential which can be >>> ~2x longer. If more py36 and py37 tests added in the future, things will be >>> even worse. >>> >> >>> >>> - I am guessing this will reduce complexity. Is it possible to >>> quantify the improvement related to this? >>> >> >>> >> >>> >> The general code complexity of function/method/property may not >>> change here since we basically group tasks in a different way without >>> changing inside logic. I don't know if there is any tool to measure Gradle >>> build complexity. Would love to try if there is. >>> >> >>> >>> >>> >>> - Beyond the proposal, I am assuming you are willing to work on. >>> Just want to clarify this. In either case, would you need help? >>> >> >>> >> >>> >> Yes, I'd love to take on major refactor works. At the same time, I'll >>> create jira for each kind of tests (like flink/protable/hdfs tests) in >>> sdks/python/build.gradle to move into test-suites. Test owners or anyone >>> interested to this work are welcome to contribute! >>> >> >>> >> Mark >>> >> >>> >> On Wed, Mar 27, 2019 at 3:53 PM Ahmet Altay <al...@google.com> wrote: >>> >>> >>> >>> This sounds good to me. Thank you for doing this. Few questions: >>> >>> - Could you comment on what kind of parallelization we will gain by >>> this? In terms of real numbers, how would this affect build and test times? >>> >>> - I am guessing this will reduce complexity. Is it possible to >>> quantify the improvement related to this? >>> >>> - Beyond the proposal, I am assuming you are willing to work on. >>> Just want to clarify this. In either case, would you need help? >>> >>> >>> >>> Thank you, >>> >>> Ahmet >>> >>> >>> >>> On Wed, Mar 27, 2019 at 10:19 AM Mark Liu <mark...@apache.org> >>> wrote: >>> >>>> >>> >>>> Hi Python SDK Developers, >>> >>>> >>> >>>> You may notice that Gradle files changed a lot recently as >>> parallelization applied to Python tests and more python versions were >>> enabled in testing. There are tricks over the build scripts and tests are >>> grown naturally and distributed under sdks/python, which caused frictions >>> (like rollback PR-8059). >>> >>>> >>> >>>> Thus, I created BEAM-6907 and would like to initiate some works to >>> cleanup and standardize Gradle structure in Python SDK. In general, I think >>> we want to: >>> >>>> >>> >>>> - Apply parallel execution >>> >>>> - Share common tasks >>> >>>> - Centralize test related tasks >>> >>>> - Have a clear Gradle structure for projects/tasks >>> >>>> >>> >>>> This is Gradle directory structure I proposed: >>> >>>> >>> >>>> sdks/python/ >>> >>>> >>> >>>> build.gradle --> hold builds, snapshot, analytic tasks >>> >>>> test-suites/ --> all pre/post/VR test suites under here >>> >>>> >>> >>>> README.md >>> >>>> >>> >>>> dataflow/ --> grouped by runner or unit test (tox) >>> >>>> >>> >>>> py27/ --> grouped by py version >>> >>>> >>> >>>> build.gradle >>> >>>> >>> >>>> py35/ >>> >>>> >>> >>>> ... >>> >>>> >>> >>>> direct/ >>> >>>> >>> >>>> py27/ >>> >>>> >>> >>>> ... >>> >>>> >>> >>>> flink/ >>> >>>> >>> >>>> tox/ >>> >>>> ... >>> >>>> >>> >>>> >>> >>>> The ideas are: >>> >>>> - Only keep builds, snapshot and analytic jobs in >>> sdks/python/build.gradle >>> >>>> - Move all test related tasks to sdks/python/test-suites/ >>> >>>> - In sdks/python/test-suites, we first group by runners, unit test >>> or other testing that can't fit to them, and then group by py versions if >>> needed. >>> >>>> - An example of ../test-suites/../py35/build.gradle is this. >>> >>>> >>> >>>> Please feel free to explore existing Gradle scripts in Python SDK >>> and bring any thoughts on this proposal if you have. >>> >>>> >>> >>>> Thanks! >>> >>>> Mark >>> >>