Hey Mark & others,

We've been following the structure proposed in this thread to extend test
coverage for Beam Python SDK on Python 3.5, 3.6, 3.7 interpreters, see [1].

This structure allowed us to add 3.x suites without slowing down the
pre/postcommit execution time. We can actually see a drop in precommit
latency [2] around March 23 we first made some Python 3.x suites run in
parallel, and we have added more suites since then without slowing down
pre/postcommits. Therefore I am in favor of this proposal, especially since
AFAIK we don't have better one. Thanks a lot!

I do have some feedback on this proposal:

1. There is a duplication of gradle code between test suites for different
python minor versions, for example see the identical definition of
DirectRunner PostCommitIT suite for Python 3.6 and Python 3.7 [4,5].

Possible solution to reduce the duplication is to move common code that
defines a task into a separate groovy file shared across multiple gradle
files. We have an example of this, where enablePythonPerformanceTest() is
defined in BeamModulePlugin.groovy, and used in several build.gradle files
to create a gradle task required for performance tests, see: [6]. I
followed the same example in a Python 3 test suite for Portable Flink
Runner I am working on [3], however I am not sure if BeamModulePlugin is
the best place to define common gradle tasks to needed for Python CI.
Perhaps we can make a separate groovy file for this purpose in
sdk/python/test-suites?

2. Python 3 test suites currently live in sdks/python/test-suites, while
most Python 2 suites are still defined in sdks/python/build.gradle.

This may cause confusion for folks working on adding new Python suites. If
there is an overall agreement on proposed structure I suggest to  start
moving Python 2 CI tasks out of  sdks/python/build.gradle into
sdks/python/test-suites/[runner]/py27/build.gradle, or a common groovy
file. If there are better alternatives we can continue discussing them here.

Thanks,
Valenyn


[1] https://github.com/apache/beam/tree/master/sdks/python/test-suites
[2]
http://104.154.241.245/d/_TNndF2iz/pre-commit-test-latency?orgId=1&from=1546507894013&to=1554189164736
[3] https://github.com/apache/beam/pull/8745
[4]
https://github.com/apache/beam/blob/291f1e9fb5ce5ee4bb7e2519ffe40334fb5c08c5/sdks/python/test-suites/direct/py36/build.gradle#L27
[5]
https://github.com/apache/beam/blob/291f1e9fb5ce5ee4bb7e2519ffe40334fb5c08c5/sdks/python/test-suites/direct/py37/build.gradle#L27
[6]
https://github.com/apache/beam/search?q=enablePythonPerformanceTest&unscoped_q=enablePythonPerformanceTest


On Fri, Mar 29, 2019 at 9:45 AM Udi Meiri <eh...@google.com> wrote:

> I don't use gradle commands for Python development either, because they
> are slow (no incremental testing).
>
>
>
> On Fri, Mar 29, 2019 at 9:16 AM Michael Luckey <adude3...@gmail.com>
> wrote:
>
>>
>>
>> On Fri, Mar 29, 2019 at 2:31 PM Robert Bradshaw <rober...@google.com>
>> wrote:
>>
>>> On Fri, Mar 29, 2019 at 12:54 PM Michael Luckey <adude3...@gmail.com>
>>> wrote:
>>> >
>>> > Really like the idea of improving here.
>>> >
>>> > Unfortunately, I haven't worked with python on that scale yet, so bear
>>> with my naive understandings in this regard. If I understand correctly, the
>>> suggestion will result in a couple of projects consisting only of a
>>> build,gradle file to kind of workaround on gradles decision not to
>>> parallelize within projects, right? In consequence, this also kind of
>>> decouples projects from their content - they stuff which constitutes the
>>> project - and forces the build file to 'somehow reach out to content of
>>> other (only python root?) projects. E.g couples projects. This somehow
>>> 'feels non natural' to me. But, of course, might be the path to go. As I
>>> said before, never worked on python on that scale.
>>>
>>> It feels a bit odd to me as well. Is it possible to have multiple
>>> projects per directory (e.g. a suite of testing ones) rather than
>>> having to break things up like this, especially if the goal is
>>> primarily to get parallel running of tests? Especially if we could
>>> automatically create the cross-product rather than manually? There
>>> also seems to be some redundancy with what tox is doing here.
>>>
>>
>> Not sure, whether I understand correctly. But I do not think that's
>> possible. If we are going to do some cross-product, we are probably better
>> of doing that on tasks, e.g. by leveraging task rules or programmatically
>> adding tasks (which is already done in parts). Of course, this will not
>> help with parallelisation (but might enable that, see below).
>>
>>
>>>
>>> > But I believe to remember Robert talking about using in project
>>> parallelisation for his development. Is this something which could also
>>> work on CI? Of course, that will not help with different python versions,
>>> but maybe that could be solved also by gradles variants which are
>>> introduced in 5.3 - definitely need some time to investigate the
>>> possibilities here. On first sight it feels like lots of duplication to
>>> create 'builds' for any python version. Or wouldn't that be the case?
>>> >
>>> > And another naive thought on my side, isn't that non parallelizability
>>> also caused by the monolithic setup of the python code base? E.g. if I
>>> understand correctly, java sdk is split into core/runners/ios etc, each
>>> encapsulate into full blown projects, i.e. buckets of sources, tests and
>>> build file. Would it technically possible to do something similar with
>>> python? I assume that being discussed before and teared apart, but couldn't
>>> find on mailing list.
>>>
>>> Neither the culture nor the tooling of Python supports lots of
>>> interdependent "sub-packages" for a single project--at least not
>>> something smaller than one would want to deploy to Pypi. So while one
>>> could do this, it'd be going against the grain. There are also much
>>> lower-hanging opportunities for parallelization (e.g. running the test
>>> suites for separate python versions in parallel).
>>>
>>> It's not very natural (as I understand it) with Go either. If we're
>>> talking directory re-organization, I think it would make sense to
>>> consider having top-level java, python, go, ... next to model,
>>> website, etc.
>>>
>>
>> Yes. We shouldn't work against common culture/practices, but try to
>> embrace native tooling and add support where required.
>>
>> To reiterate on parallelisation, there are (at least) three opportunities:
>>
>> 1. Parallelise on test level. For python, this is detox?
>>
>
> This is actually 2 levels. :)
> 1a. Parallelise at the nosetest level - run unit tests in parallel in a
> single tox environment. (I have a PR in progress to migrate to pytest, and
> we should be able to do file-level parallelism provided we solve pickling
> issues.)
> 1b. Parallelise at the tox environment level, e..g, somehow running the
> multiple tox environments (py27,py27-cython,py35,...) in parallel.
>
>
>> 2. Parallelise on Gradle project level (--parallel option)
>> 3. Parallelise on CI level
>>
>> So what I ve done before, if 1. does not help, nor 2. cause project is
>> 'just to big' was 3, i.e. splitting on CI level. So the simplest thing I
>> could imagine right now would be - as suggested above - to split
>> pythonPreCommit into something like pythonX_YPrecommit, which then runs
>> those different python versions in parallel. Of course, that could be done
>> ad infinitum by splitting further into runners, IOs whatever.
>>
>> OTOH, we do already have tons of jobs running as we seem to map Gradle
>> tasks to Jenkins jobs. So it might be more appropriate to leverage CI
>> parallelization on jenkins pipeline level. Something like creating
>> pythonPrecommit as a pipeline, which itself runs several steps in parallel.
>> Did not contemplate on the impacts, though, on PRs and phrase triggering.
>> Of course, we might need to improve on our 'verbosity' as currently it
>> seems not always clear what command line actually triggered etc. With this
>> approach, it seems possible to run tasks within same Gradle project in
>> parallel with whatever granularity we might want to choose. (Of course we
>> do add overhead here, but that's probably always the case by doing things
>> in parallel)
>>
>> Of course 3. would not help on parallelisation on developers machine, but
>> I never had a problem with that.
>>
>>
>>>
>>> > And as a last thought, will usage of pygradle help with better
>>> python/gradle integration? Currently, we mainly use gradle to call into
>>> shell scripts, which doesn't help gradle nor probably pythons tooling to do
>>> the job very well? But deeper integration might cause problems on python
>>> dev side, dunno :(
>>>
>>> Possibly.
>>>
>>> Are there any Python developers that primarily use the gradle
>>> commands? Personally, I only use them if I'm using Java (or sometimes
>>> work that is a mix of Java and Python, e.g. the Python-on-Flink
>>> tests). Otherwise I use tox, or "python setup.py test [-s ...]"
>>> directly. Gradle primarily has value as a top-level orchestration (in
>>> particular for CI) and easy way for those who only touch Python
>>> occasionally to run all the tests. If that's the case, optimizing our
>>> gradle scripts for CI seems best.
>>>
>>
>> My impression till today is, that nobody is actually using Gradle on
>> daily python development. So I assume it is currently used on CI only. But
>> I believe gradle could add value for devs also. We just need to improve our
>> current integration. And to be very clear, gradle should not replace python
>> tooling, but support usage in a way which looks familiar to any python dev.
>> And probably render some shell scripting obsolete.
>>
>>
>>>
>>> > On Thu, Mar 28, 2019 at 6:37 PM Mark Liu <mark...@apache.org> wrote:
>>> >>
>>> >> Thank you Ahmet. Answer your questions below:
>>> >>
>>> >>>
>>> >>> - Could you comment on what kind of parallelization we will gain by
>>> this? In terms of real numbers, how would this affect build and test times?
>>> >>
>>> >>
>>> >> The proposal is based on Gradle parallel execution: "you can force
>>> Gradle to execute tasks in parallel as long as those tasks are in different
>>> projects". In Beam, project is declared per build.gradle file and
>>> registered in settings.gradle. Tasks that are included in single Gradle
>>> execution will run in parallel only if they are declared in separate
>>> build.gradle files.
>>> >>
>>> >> An example of applying parallel is beam_PreCommit_Python job which
>>> runs :pythonPreCommit task that contains tasks distributed in 4
>>> build.gradle. The execution graph looks like
>>> https://scans.gradle.com/s/4frpmto6o7hto/timeline:
>>> >>
>>> >> Without this proposal, all tasks will run in sequential which can be
>>> ~2x longer. If more py36 and py37 tests added in the future, things will be
>>> even worse.
>>> >>
>>> >>> - I am guessing this will reduce complexity. Is it possible to
>>> quantify the improvement related to this?
>>> >>
>>> >>
>>> >> The general code complexity of function/method/property may not
>>> change here since we basically group tasks in a different way without
>>> changing inside logic. I don't know if there is any tool to measure Gradle
>>> build complexity. Would love to try if there is.
>>> >>
>>> >>>
>>> >>> - Beyond the proposal, I am assuming you are willing to work on.
>>> Just want to clarify this. In either case, would you need help?
>>> >>
>>> >>
>>> >> Yes, I'd love to take on major refactor works. At the same time, I'll
>>> create jira for each kind of tests (like flink/protable/hdfs tests) in
>>> sdks/python/build.gradle to move into test-suites. Test owners or anyone
>>> interested to this work are welcome to contribute!
>>> >>
>>> >> Mark
>>> >>
>>> >> On Wed, Mar 27, 2019 at 3:53 PM Ahmet Altay <al...@google.com> wrote:
>>> >>>
>>> >>> This sounds good to me. Thank you for doing this. Few questions:
>>> >>> - Could you comment on what kind of parallelization we will gain by
>>> this? In terms of real numbers, how would this affect build and test times?
>>> >>> - I am guessing this will reduce complexity. Is it possible to
>>> quantify the improvement related to this?
>>> >>> - Beyond the proposal, I am assuming you are willing to work on.
>>> Just want to clarify this. In either case, would you need help?
>>> >>>
>>> >>> Thank you,
>>> >>> Ahmet
>>> >>>
>>> >>> On Wed, Mar 27, 2019 at 10:19 AM Mark Liu <mark...@apache.org>
>>> wrote:
>>> >>>>
>>> >>>> Hi Python SDK Developers,
>>> >>>>
>>> >>>> You may notice that Gradle files changed a lot recently as
>>> parallelization applied to Python tests and more python versions were
>>> enabled in testing. There are tricks over the build scripts and tests are
>>> grown naturally and distributed under sdks/python, which caused frictions
>>> (like rollback PR-8059).
>>> >>>>
>>> >>>> Thus, I created BEAM-6907 and would like to initiate some works to
>>> cleanup and standardize Gradle structure in Python SDK. In general, I think
>>> we want to:
>>> >>>>
>>> >>>> - Apply parallel execution
>>> >>>> - Share common tasks
>>> >>>> - Centralize test related tasks
>>> >>>> - Have a clear Gradle structure for projects/tasks
>>> >>>>
>>> >>>> This is Gradle directory structure I proposed:
>>> >>>>
>>> >>>> sdks/python/
>>> >>>>
>>> >>>> build.gradle    --> hold builds, snapshot, analytic tasks
>>> >>>> test-suites/    --> all pre/post/VR test suites under here
>>> >>>>
>>> >>>> README.md
>>> >>>>
>>> >>>> dataflow/    --> grouped by runner or unit test (tox)
>>> >>>>
>>> >>>> py27/    --> grouped by py version
>>> >>>>
>>> >>>> build.gradle
>>> >>>>
>>> >>>> py35/
>>> >>>>
>>> >>>> ...
>>> >>>>
>>> >>>> direct/
>>> >>>>
>>> >>>> py27/
>>> >>>>
>>> >>>> ...
>>> >>>>
>>> >>>> flink/
>>> >>>>
>>> >>>> tox/
>>> >>>> ...
>>> >>>>
>>> >>>>
>>> >>>> The ideas are:
>>> >>>> - Only keep builds, snapshot and analytic jobs in
>>> sdks/python/build.gradle
>>> >>>> - Move all test related tasks to sdks/python/test-suites/
>>> >>>> - In sdks/python/test-suites, we first group by runners, unit test
>>> or other testing that can't fit to them, and then group by py versions if
>>> needed.
>>> >>>> - An example of ../test-suites/../py35/build.gradle is this.
>>> >>>>
>>> >>>> Please feel free to explore existing Gradle scripts in Python SDK
>>> and bring any thoughts on this proposal if you have.
>>> >>>>
>>> >>>> Thanks!
>>> >>>> Mark
>>>
>>

Reply via email to