Hi Mark & others,

+1 on using this structure. I don't see any other alternative to gradle as
some of the Python tasks have Java tasks as
a dependency. You can't debug that using just `python nosetests... or
tox`.  Parallelizing such tasks requires different
projects (and I don't think gradle supports multiple projects per
directory), so for each python
version we need a different folder. Having seperate build.gradle files for
each python version would also enable the
different versions to diverge on which tests they execute (e.g. not running
some tests for python 3.5 and 3.6 to
reduce Jenkins footprint in a PreCommit).

Valentyn, the code duplication problem can be addressed using a Gradle
script plugin , which
is a build.script (maybe another name is better?)  in
/sdks/python/test-suites/[runner], which you then import in
/sdks/python/test-suites/[runner]/pyXX/build.gradle with the
correct pythonVersion set, and then using `apply from`. See [1] for an
example. The location of
the python2 test-suite can be remedied by moving it to your suggestion. +1
on that as well.

On the coupling of projects or the structure not being natural, I think
that we can look at
this differently. Right now, in the Python SDK, all common code for tests
that needs to be parallelized is
in placed in BeamModulePlugin, which in turn couples all projects that use
it. It's one centralized
item that couples many different projects. It has code from Java, Python
and Go, and
currently has almost 2000 lines of code, which is IMHO not the way to go.
Moving the python code
from this binary plugin to a script plugin file defined in the the parent
directory of children
projects that use that code (as described in the paragraph above) moves the
coupling of a lot of projects through
BeamModulePlugin (a global coupling), to a per sub-tree coupling (a local
coupling).
In short, yes it might be unnatural, but it's still better than before.

[1] https://github.com/apache/beam/pull/8877
<https://github.com/apache/beam/pull/8877/files>

Thanks,
Frederik Bode

On Mon, Jun 3, 2019 at 5:13 PM Valentyn Tymofieiev <valen...@google.com>
wrote:

> Hey Mark & others,
>
> We've been following the structure proposed in this thread to extend test
> coverage for Beam Python SDK on Python 3.5, 3.6, 3.7 interpreters, see [1].
>
> This structure allowed us to add 3.x suites without slowing down the
> pre/postcommit execution time. We can actually see a drop in precommit
> latency [2] around March 23 we first made some Python 3.x suites run in
> parallel, and we have added more suites since then without slowing down
> pre/postcommits. Therefore I am in favor of this proposal, especially since
> AFAIK we don't have better one. Thanks a lot!
>
> I do have some feedback on this proposal:
>
> 1. There is a duplication of gradle code between test suites for different
> python minor versions, for example see the identical definition of
> DirectRunner PostCommitIT suite for Python 3.6 and Python 3.7 [4,5].
>
> Possible solution to reduce the duplication is to move common code that
> defines a task into a separate groovy file shared across multiple gradle
> files. We have an example of this, where enablePythonPerformanceTest() is
> defined in BeamModulePlugin.groovy, and used in several build.gradle files
> to create a gradle task required for performance tests, see: [6]. I
> followed the same example in a Python 3 test suite for Portable Flink
> Runner I am working on [3], however I am not sure if BeamModulePlugin is
> the best place to define common gradle tasks to needed for Python CI.
> Perhaps we can make a separate groovy file for this purpose in
> sdk/python/test-suites?
>

I would suggest placing the shared code in a new file in
https://github.com/apache/beam/tree/master/buildSrc/src/main/groovy/org/apache/beam/gradle,
we have several other groovy files related to building defined there
already.


> 2. Python 3 test suites currently live in sdks/python/test-suites, while
> most Python 2 suites are still defined in sdks/python/build.gradle.
>
> This may cause confusion for folks working on adding new Python suites. If
> there is an overall agreement on proposed structure I suggest to  start
> moving Python 2 CI tasks out of  sdks/python/build.gradle into
> sdks/python/test-suites/[runner]/py27/build.gradle, or a common groovy
> file. If there are better alternatives we can continue discussing them here.
>

For the runner specific Java ITs, we have been trying to get the tests to
be placed inside the runners own directory instead of underneath the SDK
directories. This was to make it easy to associate test task -> runner. So
an alternative would be to have runners/[runner]/test-suites/py27/... be
that location.


> Thanks,
> Valenyn
>
>
> [1] https://github.com/apache/beam/tree/master/sdks/python/test-suites
> [2]
> http://104.154.241.245/d/_TNndF2iz/pre-commit-test-latency?orgId=1&from=1546507894013&to=1554189164736
> [3] https://github.com/apache/beam/pull/8745
> [4]
> https://github.com/apache/beam/blob/291f1e9fb5ce5ee4bb7e2519ffe40334fb5c08c5/sdks/python/test-suites/direct/py36/build.gradle#L27
> [5]
> https://github.com/apache/beam/blob/291f1e9fb5ce5ee4bb7e2519ffe40334fb5c08c5/sdks/python/test-suites/direct/py37/build.gradle#L27
> [6]
> https://github.com/apache/beam/search?q=enablePythonPerformanceTest&unscoped_q=enablePythonPerformanceTest
>
>
> On Fri, Mar 29, 2019 at 9:45 AM Udi Meiri <eh...@google.com> wrote:
>
>> I don't use gradle commands for Python development either, because they
>> are slow (no incremental testing).
>>
>>
>>
>> On Fri, Mar 29, 2019 at 9:16 AM Michael Luckey <adude3...@gmail.com>
>> wrote:
>>
>>>
>>>
>>> On Fri, Mar 29, 2019 at 2:31 PM Robert Bradshaw <rober...@google.com>
>>> wrote:
>>>
>>>> On Fri, Mar 29, 2019 at 12:54 PM Michael Luckey <adude3...@gmail.com>
>>>> wrote:
>>>> >
>>>> > Really like the idea of improving here.
>>>> >
>>>> > Unfortunately, I haven't worked with python on that scale yet, so
>>>> bear with my naive understandings in this regard. If I understand
>>>> correctly, the suggestion will result in a couple of projects consisting
>>>> only of a build,gradle file to kind of workaround on gradles decision not
>>>> to parallelize within projects, right? In consequence, this also kind of
>>>> decouples projects from their content - they stuff which constitutes the
>>>> project - and forces the build file to 'somehow reach out to content of
>>>> other (only python root?) projects. E.g couples projects. This somehow
>>>> 'feels non natural' to me. But, of course, might be the path to go. As I
>>>> said before, never worked on python on that scale.
>>>>
>>>> It feels a bit odd to me as well. Is it possible to have multiple
>>>> projects per directory (e.g. a suite of testing ones) rather than
>>>> having to break things up like this, especially if the goal is
>>>> primarily to get parallel running of tests? Especially if we could
>>>> automatically create the cross-product rather than manually? There
>>>> also seems to be some redundancy with what tox is doing here.
>>>>
>>>
>>> Not sure, whether I understand correctly. But I do not think that's
>>> possible. If we are going to do some cross-product, we are probably better
>>> of doing that on tasks, e.g. by leveraging task rules or programmatically
>>> adding tasks (which is already done in parts). Of course, this will not
>>> help with parallelisation (but might enable that, see below).
>>>
>>>
>>>>
>>>> > But I believe to remember Robert talking about using in project
>>>> parallelisation for his development. Is this something which could also
>>>> work on CI? Of course, that will not help with different python versions,
>>>> but maybe that could be solved also by gradles variants which are
>>>> introduced in 5.3 - definitely need some time to investigate the
>>>> possibilities here. On first sight it feels like lots of duplication to
>>>> create 'builds' for any python version. Or wouldn't that be the case?
>>>> >
>>>> > And another naive thought on my side, isn't that non
>>>> parallelizability also caused by the monolithic setup of the python code
>>>> base? E.g. if I understand correctly, java sdk is split into
>>>> core/runners/ios etc, each encapsulate into full blown projects, i.e.
>>>> buckets of sources, tests and build file. Would it technically possible to
>>>> do something similar with python? I assume that being discussed before and
>>>> teared apart, but couldn't find on mailing list.
>>>>
>>>> Neither the culture nor the tooling of Python supports lots of
>>>> interdependent "sub-packages" for a single project--at least not
>>>> something smaller than one would want to deploy to Pypi. So while one
>>>> could do this, it'd be going against the grain. There are also much
>>>> lower-hanging opportunities for parallelization (e.g. running the test
>>>> suites for separate python versions in parallel).
>>>>
>>>> It's not very natural (as I understand it) with Go either. If we're
>>>> talking directory re-organization, I think it would make sense to
>>>> consider having top-level java, python, go, ... next to model,
>>>> website, etc.
>>>>
>>>
>>> Yes. We shouldn't work against common culture/practices, but try to
>>> embrace native tooling and add support where required.
>>>
>>> To reiterate on parallelisation, there are (at least) three
>>> opportunities:
>>>
>>> 1. Parallelise on test level. For python, this is detox?
>>>
>>
>> This is actually 2 levels. :)
>> 1a. Parallelise at the nosetest level - run unit tests in parallel in a
>> single tox environment. (I have a PR in progress to migrate to pytest, and
>> we should be able to do file-level parallelism provided we solve pickling
>> issues.)
>> 1b. Parallelise at the tox environment level, e..g, somehow running the
>> multiple tox environments (py27,py27-cython,py35,...) in parallel.
>>
>>
>>> 2. Parallelise on Gradle project level (--parallel option)
>>> 3. Parallelise on CI level
>>>
>>> So what I ve done before, if 1. does not help, nor 2. cause project is
>>> 'just to big' was 3, i.e. splitting on CI level. So the simplest thing I
>>> could imagine right now would be - as suggested above - to split
>>> pythonPreCommit into something like pythonX_YPrecommit, which then runs
>>> those different python versions in parallel. Of course, that could be done
>>> ad infinitum by splitting further into runners, IOs whatever.
>>>
>>> OTOH, we do already have tons of jobs running as we seem to map Gradle
>>> tasks to Jenkins jobs. So it might be more appropriate to leverage CI
>>> parallelization on jenkins pipeline level. Something like creating
>>> pythonPrecommit as a pipeline, which itself runs several steps in parallel.
>>> Did not contemplate on the impacts, though, on PRs and phrase triggering.
>>> Of course, we might need to improve on our 'verbosity' as currently it
>>> seems not always clear what command line actually triggered etc. With this
>>> approach, it seems possible to run tasks within same Gradle project in
>>> parallel with whatever granularity we might want to choose. (Of course we
>>> do add overhead here, but that's probably always the case by doing things
>>> in parallel)
>>>
>>> Of course 3. would not help on parallelisation on developers machine,
>>> but I never had a problem with that.
>>>
>>>
>>>>
>>>> > And as a last thought, will usage of pygradle help with better
>>>> python/gradle integration? Currently, we mainly use gradle to call into
>>>> shell scripts, which doesn't help gradle nor probably pythons tooling to do
>>>> the job very well? But deeper integration might cause problems on python
>>>> dev side, dunno :(
>>>>
>>>> Possibly.
>>>>
>>>> Are there any Python developers that primarily use the gradle
>>>> commands? Personally, I only use them if I'm using Java (or sometimes
>>>> work that is a mix of Java and Python, e.g. the Python-on-Flink
>>>> tests). Otherwise I use tox, or "python setup.py test [-s ...]"
>>>> directly. Gradle primarily has value as a top-level orchestration (in
>>>> particular for CI) and easy way for those who only touch Python
>>>> occasionally to run all the tests. If that's the case, optimizing our
>>>> gradle scripts for CI seems best.
>>>>
>>>
>>> My impression till today is, that nobody is actually using Gradle on
>>> daily python development. So I assume it is currently used on CI only. But
>>> I believe gradle could add value for devs also. We just need to improve our
>>> current integration. And to be very clear, gradle should not replace python
>>> tooling, but support usage in a way which looks familiar to any python dev.
>>> And probably render some shell scripting obsolete.
>>>
>>>
>>>>
>>>> > On Thu, Mar 28, 2019 at 6:37 PM Mark Liu <mark...@apache.org> wrote:
>>>> >>
>>>> >> Thank you Ahmet. Answer your questions below:
>>>> >>
>>>> >>>
>>>> >>> - Could you comment on what kind of parallelization we will gain by
>>>> this? In terms of real numbers, how would this affect build and test times?
>>>> >>
>>>> >>
>>>> >> The proposal is based on Gradle parallel execution: "you can force
>>>> Gradle to execute tasks in parallel as long as those tasks are in different
>>>> projects". In Beam, project is declared per build.gradle file and
>>>> registered in settings.gradle. Tasks that are included in single Gradle
>>>> execution will run in parallel only if they are declared in separate
>>>> build.gradle files.
>>>> >>
>>>> >> An example of applying parallel is beam_PreCommit_Python job which
>>>> runs :pythonPreCommit task that contains tasks distributed in 4
>>>> build.gradle. The execution graph looks like
>>>> https://scans.gradle.com/s/4frpmto6o7hto/timeline:
>>>> >>
>>>> >> Without this proposal, all tasks will run in sequential which can be
>>>> ~2x longer. If more py36 and py37 tests added in the future, things will be
>>>> even worse.
>>>> >>
>>>> >>> - I am guessing this will reduce complexity. Is it possible to
>>>> quantify the improvement related to this?
>>>> >>
>>>> >>
>>>> >> The general code complexity of function/method/property may not
>>>> change here since we basically group tasks in a different way without
>>>> changing inside logic. I don't know if there is any tool to measure Gradle
>>>> build complexity. Would love to try if there is.
>>>> >>
>>>> >>>
>>>> >>> - Beyond the proposal, I am assuming you are willing to work on.
>>>> Just want to clarify this. In either case, would you need help?
>>>> >>
>>>> >>
>>>> >> Yes, I'd love to take on major refactor works. At the same time,
>>>> I'll create jira for each kind of tests (like flink/protable/hdfs tests) in
>>>> sdks/python/build.gradle to move into test-suites. Test owners or anyone
>>>> interested to this work are welcome to contribute!
>>>> >>
>>>> >> Mark
>>>> >>
>>>> >> On Wed, Mar 27, 2019 at 3:53 PM Ahmet Altay <al...@google.com>
>>>> wrote:
>>>> >>>
>>>> >>> This sounds good to me. Thank you for doing this. Few questions:
>>>> >>> - Could you comment on what kind of parallelization we will gain by
>>>> this? In terms of real numbers, how would this affect build and test times?
>>>> >>> - I am guessing this will reduce complexity. Is it possible to
>>>> quantify the improvement related to this?
>>>> >>> - Beyond the proposal, I am assuming you are willing to work on.
>>>> Just want to clarify this. In either case, would you need help?
>>>> >>>
>>>> >>> Thank you,
>>>> >>> Ahmet
>>>> >>>
>>>> >>> On Wed, Mar 27, 2019 at 10:19 AM Mark Liu <mark...@apache.org>
>>>> wrote:
>>>> >>>>
>>>> >>>> Hi Python SDK Developers,
>>>> >>>>
>>>> >>>> You may notice that Gradle files changed a lot recently as
>>>> parallelization applied to Python tests and more python versions were
>>>> enabled in testing. There are tricks over the build scripts and tests are
>>>> grown naturally and distributed under sdks/python, which caused frictions
>>>> (like rollback PR-8059).
>>>> >>>>
>>>> >>>> Thus, I created BEAM-6907 and would like to initiate some works to
>>>> cleanup and standardize Gradle structure in Python SDK. In general, I think
>>>> we want to:
>>>> >>>>
>>>> >>>> - Apply parallel execution
>>>> >>>> - Share common tasks
>>>> >>>> - Centralize test related tasks
>>>> >>>> - Have a clear Gradle structure for projects/tasks
>>>> >>>>
>>>> >>>> This is Gradle directory structure I proposed:
>>>> >>>>
>>>> >>>> sdks/python/
>>>> >>>>
>>>> >>>> build.gradle    --> hold builds, snapshot, analytic tasks
>>>> >>>> test-suites/    --> all pre/post/VR test suites under here
>>>> >>>>
>>>> >>>> README.md
>>>> >>>>
>>>> >>>> dataflow/    --> grouped by runner or unit test (tox)
>>>> >>>>
>>>> >>>> py27/    --> grouped by py version
>>>> >>>>
>>>> >>>> build.gradle
>>>> >>>>
>>>> >>>> py35/
>>>> >>>>
>>>> >>>> ...
>>>> >>>>
>>>> >>>> direct/
>>>> >>>>
>>>> >>>> py27/
>>>> >>>>
>>>> >>>> ...
>>>> >>>>
>>>> >>>> flink/
>>>> >>>>
>>>> >>>> tox/
>>>> >>>> ...
>>>> >>>>
>>>> >>>>
>>>> >>>> The ideas are:
>>>> >>>> - Only keep builds, snapshot and analytic jobs in
>>>> sdks/python/build.gradle
>>>> >>>> - Move all test related tasks to sdks/python/test-suites/
>>>> >>>> - In sdks/python/test-suites, we first group by runners, unit test
>>>> or other testing that can't fit to them, and then group by py versions if
>>>> needed.
>>>> >>>> - An example of ../test-suites/../py35/build.gradle is this.
>>>> >>>>
>>>> >>>> Please feel free to explore existing Gradle scripts in Python SDK
>>>> and bring any thoughts on this proposal if you have.
>>>> >>>>
>>>> >>>> Thanks!
>>>> >>>> Mark
>>>
>>> [image: https://ml6.eu]
<https://urldefense.proofpoint.com/v2/url?u=https-3A__ml6.eu_&d=DwMFaQ&c=fP4tf--1dS0biCFlB0saz0I0kjO5v7-GLPtvShAo4cc&r=pVqtPRV3xHPbewK5Cnv1OugvWbha6Poxqp5n4ssIg74&m=FLed4d0BjB5-R2hz9IHrat47LfDj7YhMNHbEVeZ0dw8&s=yd_him24QhfROm7uRZLbfSsUHaA68_8FMl6s1MgT5sM&e=>


* Frederik Bode*

ML6 Ghent
<https://urldefense.proofpoint.com/v2/url?u=https-3A__www.google.be_maps_place_ML6_-4051.037408-2C3.7044893-2C17z_data-3D-213m1-214b1-214m5-213m4-211s0x47c37161feeca14b-3A0xb8f72585fdd21c90-218m2-213d51.037408-214d3.706678-3Fhl-3Dnl&d=DwMFaQ&c=fP4tf--1dS0biCFlB0saz0I0kjO5v7-GLPtvShAo4cc&r=pVqtPRV3xHPbewK5Cnv1OugvWbha6Poxqp5n4ssIg74&m=FLed4d0BjB5-R2hz9IHrat47LfDj7YhMNHbEVeZ0dw8&s=26TZxPGXg0A_mqgeiw1lMeZYekpkExBAZ5MpavpUZmw&e=>
+32 4 92 78 96 18


**** DISCLAIMER ****

This email and any files transmitted with it are confidential and intended
solely for the use of the individual or entity to whom they are addressed.
If you have received this email in error please notify the system manager.
This message contains confidential information and is intended only for the
individual named. If you are not the named addressee you should not
disseminate, distribute or copy this e-mail. Please notify the sender
immediately by e-mail ifyou have received this e-mail by mistake and delete
this e-mail from your system. If you are not the intended recipient you are
notified that disclosing, copying, distributing or taking any action in
reliance on the contents of this information is strictly prohibited.

Reply via email to