Re: Version Beam Website Documentation

2019-12-05 Thread Alex Van Boxel
It also seems to be too complex for the Google crawler. A lot of times
I arrive at documentation for an older version of a product when I search
(aka Google) for something.

 _/
_/ Alex Van Boxel


On Fri, Dec 6, 2019 at 6:20 AM Kenneth Knowles  wrote:

> Since we are not making breaking changes (we hope) and we try to be
> careful about performance regressions, I think it is OK to simply encourage
> users to upgrade to the latest if they expect the narrative documentation
> to match their version. The versioned API docs are probably enough. We
> might consider putting more info into the javadocs / pydocs to bridge the
> gap, if you have seen any issues with users hitting trouble.
>
> I am saying this for two reasons:
>
>  - versioning the site is more work, and someone would need to do that work
>  - but more than that, a versioned site is more complex for users
>
> Kenn
>
> On Wed, Dec 4, 2019 at 1:48 PM Ankur Goenka  wrote:
>
>> I agree, having a single website showcasing the latest Beam version
>> encourages users to use the latest Beam version, which is very useful.
>> Calling out version limitations definitely makes users' lives easier.
>>
>> The use case I have in mind is more along the lines of best practices and
>> the recommended way of doing things.
>> One such example is the way we recommend new users to try Portable Flink.
>> We are overhauling and simplifying the user onboarding experience. Though
>> the old way of doing things is still supported, the easier new
>> recommendation for onboarding will only apply from Beam 2.18.
>> We can of course create sections in the documentation for this use case, but
>> it seems like a poor man's way of versioning :)
>>
>> You also highlighted a great use case about the LTS release. Should we simply
>> separate out the documentation for the LTS release and the current version to
>> make it easy for users to navigate the website and reduce the management
>> overhead of updating specific sections?
>>
>> A few areas which might benefit from having multiple versions are the
>> compatibility matrix, common pipeline patterns, the transform catalog and
>> the runner pages.
>>
>>
>> On Wed, Dec 4, 2019 at 6:19 AM Jeff Klukas  wrote:
>>
>>> The API reference docs (Java and Python at least) are versioned, so we
>>> have a durable reference there and it's possible to link to particular
>>> sections of API docs for particular versions.
>>>
>>> For the major bits of introductory documentation (like the Beam
>>> Programming Guide), I think it's a good thing to have only a single
>>> version, so that people referencing it are always getting the most
>>> up-to-date wording and explanations, although it may be worth adding
>>> callouts there about minimum versions anywhere we discuss newer features.
>>> We should be encouraging the community to stay reasonably current, so I
>>> think any feature that's present in the latest LTS release should be fine
>>> to assume is available to users, although perhaps we should also state that
>>> explicitly on the website.
>>>
>>> Are there particular parts of the Beam website that you have in mind
>>> that would benefit from versioning? Are there specific cases you see where
>>> the current website would be confusing for someone using a Beam SDK that's
>>> a few versions old?
>>>
>>> On Tue, Dec 3, 2019 at 6:46 PM Ankur Goenka  wrote:
>>>
 Hi,

 We are constantly adding features to Beam which makes each new Beam
 version more feature rich and compelling.
 This also means that old Beam releases don't have the new features
 and might have different ways to do certain things.

 (I might be wrong here) - Our Beam website only publishes a single
 version of the documentation, which is the latest.
 This means that users working with older SDKs don't really have an
 easy way to look up documentation for old versions of Beam.

 Proposal: Shall we consider publishing a versioned Beam website to help
 users on old Beam versions find the relevant information?

 Thanks,
 Ankur

>>>


Re: Portable runner bundle scheduling (Streaming/Python/Flink)

2019-12-05 Thread Thomas Weise
PR for this is now open: https://github.com/apache/beam/pull/10313

Hey Max,

Thanks for the feedback.

-->

On Sun, Nov 24, 2019 at 2:04 PM Maximilian Michels  wrote:

> Load-balancing the worker selection for bundle execution sounds like the
> solution to uneven work distribution across the workers. Some comments:
>
> (1) I could imagine that in case of long-running bundle execution (e.g.
> model execution), this could stall upstream operators because their busy
> downstream operators hold all available workers, thus also letting the
> pipeline throughput/latency suffer.
>

When there is a bottleneck in a downstream operator, the upstream operators
will eventually back up due to backpressure, regardless of whether workers are
available. Results only become available at the processing rate of
the slowest operator/stage.

The worker resources on a given node are limited. For the pipeline to
function, all operators need to make progress. There are typically more
subtasks/threads than there are worker processes, hence workers are shared.
The observation made was that with bundles of an executable stage always
pinned to the same worker, workers are not utilized properly.

The proposed change therefore is to (optionally) augment the distribution
of bundles over workers so that when any worker is available, progress can
be made.

With any executable stage able to execute on any available worker, we see
improved utilization.
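To illustrate the idea, here is a minimal sketch with made-up names (this is
only a conceptual model, not the actual change in the PR, which lives in the
Flink runner):

import queue
import threading

class SharedWorkerPool(object):
    # Illustrative sketch only: one FIFO pool of SDK workers shared by all
    # executable stages, so a bundle runs on whichever worker frees up first
    # (first come, first serve) instead of on a worker pinned to its stage.
    def __init__(self, workers):
        self._available = queue.Queue()
        for worker in workers:
            self._available.put(worker)

    def run_bundle(self, bundle, process_fn):
        worker = self._available.get()   # blocks until any worker is free
        try:
            return process_fn(worker, bundle)
        finally:
            self._available.put(worker)  # hand the worker back to the pool

pool = SharedWorkerPool(['sdk-worker-0', 'sdk-worker-1'])
threads = [threading.Thread(target=pool.run_bundle,
                            args=(n, lambda w, b: print(w, 'ran bundle', b)))
           for n in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()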


>
> Instead of balancing across _all_ the workers available on a particular
> node (aka TaskManager), it could make sense to just increase the share
> of SDK workers for a particular executable stage. At the moment, each
> stage just receives a single worker. Instead, it could receive a higher
> share of workers, which could either be exclusive or overlap with a
> share of another executable stage. Essentially, this is an extension to
> what you are proposing to ensure stages make progress.
>

If there is headroom, more workers can be allocated and they are going to
be used for any work available. The implementation as it stands ensures
fairness (first come first serve). I doubt a worker partitioning as you
suggest would improve the situation. It would essentially be a vertical
partitioning of resources, but we need all stages to make progress to
compute a result.


> (2) Another concern is that load balancing across multiple worker
> instances would render state caching useless. We need to make the Runner
> aware of it such that it can turn off state caching. With the approach
> of multiple workers per stage in (1), it would also be possible to keep
> the state caching, if we divided the key range across the workers.
>

State caching and load balancing are mutually exclusive, and I added a check
to enforce that. For the use case that I'm looking at, the cost of state
access is negligible compared to that of running the expensive / high-latency
operations.
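The guard is roughly of this shape (hypothetical names, not the actual Flink
runner pipeline options):

def validate_options(load_balance_bundles, state_cache_size):
    # Sketch with assumed option names: a cached key range is only valid
    # while a stage stays pinned to one worker, so the two features must
    # not be enabled together.
    if load_balance_bundles and state_cache_size > 0:
        raise ValueError(
            'load-balanced bundle scheduling requires state caching to be '
            'disabled (state_cache_size=0)')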


>
> Cheers,
> Max
>
> On 23.11.19 18:42, Thomas Weise wrote:
> > JIRA: https://issues.apache.org/jira/browse/BEAM-8816
> >
> >
> > On Thu, Nov 21, 2019 at 10:44 AM Thomas Weise  > > wrote:
> >
> > Hi Luke,
> >
> > Thanks for the background and it is exciting to see the progress on
> > the SDF side. It will help with this use case and many other
> > challenges. I imagine the Python user code would be able to
> > determine that it is bogged down with high latency record processing
> > (based on the duration it actually took to process previous records)
> > and opt to send back remaining work to the runner.
> >
> > Until the Flink runner supports reassignment of work, I'm planning
> > to implement the simple bundle distribution approach referred to
> > before. We will test it in our environment and contribute it back if
> > the results are good.
> >
> > Thomas
> >
> >
> >
> > On Wed, Nov 20, 2019 at 11:34 AM Luke Cwik  > > wrote:
> >
> > Dataflow has run into this issue as well. Dataflow has "work
> > items" that are converted into bundles that are executed on the
> > SDK. Each work item does a greedy assignment to the SDK worker
> > with the fewest work items assigned. As you surmised, we use SDF
> > splitting in batch pipelines to balance work. We would like to
> > use splitting of SDFs in streaming pipelines as well but
> > Dataflow can't handle it as of right now.
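To illustrate the greedy assignment described above, a minimal sketch with
made-up names (this is not Dataflow's actual code):

import heapq

def assign(work_items, workers):
    # Greedy: each work item goes to the worker with the fewest items
    # currently assigned to it.
    heap = [(0, i) for i in range(len(workers))]  # (assigned_count, worker_index)
    heapq.heapify(heap)
    assignment = {w: [] for w in workers}
    for item in work_items:
        count, i = heapq.heappop(heap)
        assignment[workers[i]].append(item)
        heapq.heappush(heap, (count + 1, i))
    return assignment

print(assign(range(5), ['sdk-worker-0', 'sdk-worker-1']))
# {'sdk-worker-0': [0, 2, 4], 'sdk-worker-1': [1, 3]}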
> >
> > As part of a few PRs, I have added basic SDF expansion to the
> > shared runner lib and slowly exposed the runner side hooks[2, 3]
> > for SDK initiated checkpointing and bundle finalization. There
> > are still a few pieces left:
> > * exposing an API so the bundle can be split during execution
> > * adding the limited depth splitting logic that would add a
> > basic form of dynamic work rebalancing for all runners that
> > decide to use it
> >
> 

Re: Version Beam Website Documentation

2019-12-05 Thread Kenneth Knowles
Since we are not making breaking changes (we hope) and we try to be careful
about performance regressions, I think it is OK to simply encourage users
to upgrade to the latest if they expect the narrative documentation to
match their version. The versioned API docs are probably enough. We might
consider putting more info into the javadocs / pydocs to bridge the gap, if
you have seen any issues with users hitting trouble.

I am saying this for two reasons:

 - versioning the site is more work, and someone would need to do that work
 - but more than that, a versioned site is more complex for users

Kenn

On Wed, Dec 4, 2019 at 1:48 PM Ankur Goenka  wrote:

> I agree, having a single website showcasing the latest Beam version
> encourages users to use the latest Beam version, which is very useful.
> Calling out version limitations definitely makes users' lives easier.
>
> The use case I have in mind is more along the lines of best practices and
> the recommended way of doing things.
> One such example is the way we recommend new users to try Portable Flink.
> We are overhauling and simplifying the user onboarding experience. Though
> the old way of doing things is still supported, the easier new
> recommendation for onboarding will only apply from Beam 2.18.
> We can of course create sections in the documentation for this use case, but
> it seems like a poor man's way of versioning :)
>
> You also highlighted a great use case about the LTS release. Should we simply
> separate out the documentation for the LTS release and the current version to
> make it easy for users to navigate the website and reduce the management
> overhead of updating specific sections?
>
> A few areas which might benefit from having multiple versions are the
> compatibility matrix, common pipeline patterns, the transform catalog and
> the runner pages.
>
>
> On Wed, Dec 4, 2019 at 6:19 AM Jeff Klukas  wrote:
>
>> The API reference docs (Java and Python at least) are versioned, so we
>> have a durable reference there and it's possible to link to particular
>> sections of API docs for particular versions.
>>
>> For the major bits of introductory documentation (like the Beam
>> Programming Guide), I think it's a good thing to have only a single
>> version, so that people referencing it are always getting the most
>> up-to-date wording and explanations, although it may be worth adding
>> callouts there about minimum versions anywhere we discuss newer features.
>> We should be encouraging the community to stay reasonably current, so I
>> think any feature that's present in the latest LTS release should be fine
>> to assume is available to users, although perhaps we should also state that
>> explicitly on the website.
>>
>> Are there particular parts of the Beam website that you have in mind that
>> would benefit from versioning? Are there specific cases you see where the
>> current website would be confusing for someone using a Beam SDK that's a
>> few versions old?
>>
>> On Tue, Dec 3, 2019 at 6:46 PM Ankur Goenka  wrote:
>>
>>> Hi,
>>>
>>> We are constantly adding features to Beam which makes each new Beam
>>> version more feature rich and compelling.
>>> This also means that old Beam releases don't have the new features
>>> and might have different ways to do certain things.
>>>
>>> (I might be wrong here) - Our Beam website only publishes a single version
>>> of the documentation, which is the latest.
>>> This means that users working with older SDKs don't really have an
>>> easy way to look up documentation for old versions of Beam.
>>>
>>> Proposal: Shall we consider publishing a versioned Beam website to help
>>> users on old Beam versions find the relevant information?
>>>
>>> Thanks,
>>> Ankur
>>>
>>


Re: [RELEASE] Tracking 2.18

2019-12-05 Thread Robert Bradshaw
Yeah, so I saw...

On Thu, Dec 5, 2019 at 4:31 PM Udi Meiri  wrote:
>
> Sorry Robert, the release was already cut yesterday.
>
>
>
> On Thu, Dec 5, 2019 at 8:37 AM Ismaël Mejía  wrote:
>>
>> Colm, I just merged your PR and cherry picked it into 2.18.0
>> https://github.com/apache/beam/pull/10296
>>
>> On Thu, Dec 5, 2019 at 10:54 AM jincheng sun  
>> wrote:
>>>
>>> Thanks for the Tracking Udi!
>>>
>>> I have updated the status of some release blockers issues as follows:
>>>
>>> - BEAM-8733 closed
>>> - BEAM-8620 reset the fix version to 2.19
>>> - BEAM-8618 reset the fix version to 2.19
>>>
>>> Best,
>>> Jincheng
>>>
>>> Colm O hEigeartaigh  wrote on Thu, Dec 5, 2019 at 5:38 PM:

 Could we get this one in 2.18 as well? 
 https://issues.apache.org/jira/browse/BEAM-8861

 Colm.

 On Wed, Dec 4, 2019 at 8:02 PM Udi Meiri  wrote:
>
> Following the release calendar, I plan on cutting the 2.18 release branch 
> today.
>
> There are currently 8 release blockers.
>


Re: Python staging file weirdness

2019-12-05 Thread Valentyn Tymofieiev
Filed https://issues.apache.org/jira/browse/BEAM-8900 to address the
inefficiency discussed here. Thanks everyone.

On Thu, Dec 5, 2019 at 2:53 PM Valentyn Tymofieiev 
wrote:

> Note that so far we have not been staging wheels, since the SDK does not
> have knowledge of the target platform, but there is
> https://issues.apache.org/jira/browse/BEAM-4032 to add this support.
>
> On Thu, Dec 5, 2019 at 2:35 PM Chad Dombrova  wrote:
>
>> On Thu, Dec 5, 2019 at 12:36 PM Valentyn Tymofieiev 
>> wrote:
>>
>> Ah nice, so then the workflow would be: download [missing] deps from pypi
>>> into a long-lived cache directory, then copy the same deps
>>> into a short-lived temporary directory, using the long-lived cache directory
>>> as the source of truth, then stage files from the short-lived temporary
>>> directory and clean it up. Is that what you are suggesting, Chad?
>>>
>> Yes, I just did a quick test to confirm:
>>
>> # download or build wheels of anything that's missing from the cache
>> # note: we're including gcp extras:
>> pip wheel apache_beam[gcp]==2.16 --wheel-dir /tmp/wheel-cache
>> # copy some of those wheels somewhere else
>> # note: we're excluding gcp extras
>> pip download apache_beam==2.16 --no-binary --find-links=/tmp/wheel-cache 
>> --dest /tmp/wheel-dest/
>> # rerun to confirm that cached wheels are being re-used instead of 
>> downloaded from pypi
>> pip wheel apache_beam[gcp]==2.16 --wheel-dir /tmp/wheel-cache
>>
>> /tmp/wheel-dest/ will now have a subset of the deps from
>> /tmp/wheel-cache, excluding the gcp extras.
>>
>> Note that for some reason the equal sign after --find-links is required,
>> at least for me on pip 19.1.1. Using a space resulted in an error.
>>
>> -chad
>>
>>
>>


Re: Python interactive runner: test dependencies removed

2019-12-05 Thread Ning Kang
Thanks Udi!

On Thu, Dec 5, 2019 at 2:07 PM Udi Meiri  wrote:

> The pytest tasks are there for me (or someone else) to verify that they
> can replace the nose ones.
> If you make changes to tox environments, please make changes to the
> corresponding -pytest env as well.
>
> Regarding extras, go ahead and add "interactive" to the extras option
> (both py3x and py3x-pytest targets please).
>
> On Thu, Dec 5, 2019 at 1:55 PM Ning Kang  wrote:
>
>> Hi Udi,
>>
>> Are the temporary pytest tasks in use for pre-commit check or anything
>> currently?
>> I see there is still WIP for BEAM-3713.
>>
>> There is only one task "pythonPreCommitPytest" depending on the pytest
>> tasks using the pytest environment configs.
>> And it's invoked here:
>>
>> PrecommitJobBuilder builderPytest = new PrecommitJobBuilder(
>>     scope: this,
>>     nameBase: 'Python_pytest',
>>     gradleTask: ':pythonPreCommitPytest',
>>     commitTriggering: false,
>>     timeoutMins: 180,
>> )
>>
>> builderPytest.build {...}
>>
>>
>> On Wed, Dec 4, 2019 at 5:51 PM Ning Kang  wrote:
>>
>>> Thanks for the heads up! I was wondering why the interactive tests are
>>> skipped, lol.
>>> So we are moving away from the deprecated pytest-runner (with the
>>> changes in setup.py) but still sticking to pytest since it's replacing
>>> nosetest.
>>>
>>> Can I add "interactive" as "extras" to testenv "py37-pytest" and
>>> "py36-pytest" in tox.ini then?
>>>
>>> @Ahmet Altay  fyi
>>>
>>> On Wed, Dec 4, 2019 at 5:22 PM Pablo Estrada  wrote:
>>>
 +Ning Kang  +Sam Rohde  fyi

 On Wed, Nov 27, 2019 at 5:09 PM Udi Meiri  wrote:

> As part of a move to stop using the deprecated (and racy) setup.py
> keywords setup_requires and tests_require, interactive runner dependencies
> have been removed from tests in
> https://github.com/apache/beam/pull/10227
>
> If this breaks any tests, please let me know.
>



Re: Python staging file weirdness

2019-12-05 Thread Valentyn Tymofieiev
Note that so far we have not been staging wheels, since the SDK does not have
knowledge of the target platform, but there is
https://issues.apache.org/jira/browse/BEAM-4032 to add this support.

On Thu, Dec 5, 2019 at 2:35 PM Chad Dombrova  wrote:

> On Thu, Dec 5, 2019 at 12:36 PM Valentyn Tymofieiev 
> wrote:
>
> Ah nice, so then the workflow would be: download [missing] deps from pypi
>> into a long-lived cache directory, then copy the same deps into
>> a short-lived temporary directory, using the long-lived cache directory as
>> the source of truth, then stage files from the short-lived temporary directory
>> and clean it up. Is that what you are suggesting, Chad?
>>
> Yes, I just did a quick test to confirm:
>
> # download or build wheels of anything that's missing from the cache
> # note: we're including gcp extras:
> pip wheel apache_beam[gcp]==2.16 --wheel-dir /tmp/wheel-cache
> # copy some of those wheels somewhere else
> # note: we're excluding gcp extras
> pip download apache_beam==2.16 --no-binary --find-links=/tmp/wheel-cache 
> --dest /tmp/wheel-dest/
> # rerun to confirm that cached wheels are being re-used instead of downloaded 
> from pypi
> pip wheel apache_beam[gcp]==2.16 --wheel-dir /tmp/wheel-cache
>
> /tmp/wheel-dest/ will now have a subset of the deps from /tmp/wheel-cache,
> excluding the gcp extras.
>
> Note that for some reason the equal sign after --find-links is required, at
> least for me on pip 19.1.1. Using a space resulted in an error.
>
> -chad
>
>
>


Re: Python staging file weirdness

2019-12-05 Thread Chad Dombrova
On Thu, Dec 5, 2019 at 12:36 PM Valentyn Tymofieiev 
wrote:

Ah nice, so then the workflow would be: download [missing] deps from pypi
> into a long-lived cache directory, then copy the same deps into
> a short-lived temporary directory, using the long-lived cache directory as
> the source of truth, then stage files from the short-lived temporary directory
> and clean it up. Is that what you are suggesting, Chad?
>
Yes, I just did a quick test to confirm:

# download or build wheels of anything that's missing from the cache
# note: we're including gcp extras:
pip wheel apache_beam[gcp]==2.16 --wheel-dir /tmp/wheel-cache
# copy some of those wheels somewhere else
# note: we're excluding gcp extras
pip download apache_beam==2.16 --no-binary
--find-links=/tmp/wheel-cache --dest /tmp/wheel-dest/
# rerun to confirm that cached wheels are being re-used instead of
downloaded from pypi
pip wheel apache_beam[gcp]==2.16 --wheel-dir /tmp/wheel-cache

/tmp/wheel-dest/ will now have a subset of the deps from /tmp/wheel-cache,
excluding the gcp extras.

Note that for some reason the equal sign after --find-links is required, at
least for me on pip 19.1.1. Using a space resulted in an error.

-chad


Re: Python interactive runner: test dependencies removed

2019-12-05 Thread Udi Meiri
The pytest tasks are there for me (or someone else) to verify that they can
replace the nose ones.
If you make changes to tox environments, please make changes to the
corresponding -pytest env as well.

Regarding extras, go ahead and add "interactive" to the extras option
(both py3x and py3x-pytest targets please).
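For reference, the change would look roughly like this (section and option
names here are assumptions - mirror whatever the existing py37 / py37-pytest
targets already declare in tox.ini):

# Hypothetical tox.ini sketch only; verify against the real file.
[testenv:py37-pytest]
extras =
  test
  interactive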

On Thu, Dec 5, 2019 at 1:55 PM Ning Kang  wrote:

> Hi Udi,
>
> Are the temporary pytest tasks in use for pre-commit check or anything
> currently?
> I see there is still WIP for BEAM-3713.
>
> There is only one task "pythonPreCommitPytest" depending on the pytest
> tasks using the pytest environment configs.
> And it's invoked here:
>
> PrecommitJobBuilder builderPytest = new PrecommitJobBuilder(
>     scope: this,
>     nameBase: 'Python_pytest',
>     gradleTask: ':pythonPreCommitPytest',
>     commitTriggering: false,
>     timeoutMins: 180,
> )
>
> builderPytest.build {...}
>
>
> On Wed, Dec 4, 2019 at 5:51 PM Ning Kang  wrote:
>
>> Thanks for the heads up! I was wondering why the interactive tests are
>> skipped, lol.
>> So we are moving away from the deprecated pytest-runner (with the changes
>> in setup.py) but still sticking to pytest since it's replacing nosetest.
>>
>> Can I add "interactive" as "extras" to testenv "py37-pytest" and
>> "py36-pytest" in tox.ini then?
>>
>> @Ahmet Altay  fyi
>>
>> On Wed, Dec 4, 2019 at 5:22 PM Pablo Estrada  wrote:
>>
>>> +Ning Kang  +Sam Rohde  fyi
>>>
>>> On Wed, Nov 27, 2019 at 5:09 PM Udi Meiri  wrote:
>>>
 As part of a move to stop using the deprecated (and racy) setup.py
 keywords setup_requires and tests_require, interactive runner dependencies
 have been removed from tests in
 https://github.com/apache/beam/pull/10227

 If this breaks any tests, please let me know.

>>>




Re: Python interactive runner: test dependencies removed

2019-12-05 Thread Ning Kang
Hi Udi,

Are the temporary pytest tasks in use for pre-commit check or anything
currently?
I see there is still WIP for BEAM-3713.

There is only one task "pythonPreCommitPytest" depending on the pytest
tasks using the pytest environment configs.
And it's invoked here:

PrecommitJobBuilder builderPytest = new PrecommitJobBuilder(
    scope: this,
    nameBase: 'Python_pytest',
    gradleTask: ':pythonPreCommitPytest',
    commitTriggering: false,
    timeoutMins: 180,
)

builderPytest.build {...}


On Wed, Dec 4, 2019 at 5:51 PM Ning Kang  wrote:

> Thanks for the heads up! I was wondering why the interactive tests are
> skipped, lol.
> So we are moving away from the deprecated pytest-runner (with the changes
> in setup.py) but still sticking to pytest since it's replacing nosetest.
>
> Can I add "interactive" as "extras" to testenv "py37-pytest" and
> "py36-pytest" in tox.ini then?
>
> @Ahmet Altay  fyi
>
> On Wed, Dec 4, 2019 at 5:22 PM Pablo Estrada  wrote:
>
>> +Ning Kang  +Sam Rohde  fyi
>>
>> On Wed, Nov 27, 2019 at 5:09 PM Udi Meiri  wrote:
>>
>>> As part of a move to stop using the deprecated (and racy) setup.py
>>> keywords setup_requires and tests_require, interactive runner dependencies
>>> have been removed from tests in
>>> https://github.com/apache/beam/pull/10227
>>>
>>> If this breaks any tests, please let me know.
>>>
>>


Re: Python staging file weirdness

2019-12-05 Thread Valentyn Tymofieiev
Ah nice, so then the workflow would be: download [missing] deps from pypi
into a long-lived cache directory, then copy the same deps into a
short-lived temporary directory, using the long-lived cache directory as the
source of truth, then stage files from the short-lived temporary directory and
clean it up. Is that what you are suggesting, Chad?


Re: Python staging file weirdness

2019-12-05 Thread Chad Dombrova
Another way to copy only the deps you care about is to use `pip download`
to do the copy.  I believe you can provide the cache dir to `pip download
--find-links` and it will read from that before reading from pypi (you may
also need to set --wheel-dir to the cache dir as well), and thus it acts as
a simple copy.

-chad


On Thu, Dec 5, 2019 at 12:07 PM Valentyn Tymofieiev 
wrote:

> I looked for a bit at the pip download command. The alternative seems to be
> to parse the output of
>
> python -m pip download  --dest . -r requirements.txt  --exists-action i
> --no-binary :all:
>
> and see which files were downloaded and/or skipped since they were already
> present, and then stage only the files that appear in the log output. Seems
> doable but may break if pip output changes between pip implementations, so
> we'd have to add a test as well.
>
> On Thu, Dec 5, 2019 at 11:10 AM Luke Cwik  wrote:
>
>> I think reusing the same cache directory makes sense during downloading
>> but why do we upload everything that is there?
>>
>> On Thu, Dec 5, 2019 at 9:24 AM Udi Meiri  wrote:
>>
>>> Looking at the source, it seems that it should be using
>>> os.path.join(tempfile.gettempdir(), 'dataflow-requirements-cache')
>>> to create a different tmp directory on each run.
>>>
>>> Also, sampling worker no. 2:
>>>
>>> *jenkins@apache-beam-jenkins-2*:*~*$ ls -l /tmp/dataflow-requirements-cache/
>>> total 7172
>>> -rw-rw-r-- 1 jenkins jenkins  27947 Sep  6 22:46 *funcsigs-1.0.2.tar.gz*
>>> -rw-rw-r-- 1 jenkins jenkins  28126 Sep  6 21:38 *mock-3.0.5.tar.gz*
>>> -rw-rw-r-- 1 jenkins jenkins 376623 Sep  6 21:38 *PyHamcrest-1.9.0.tar.gz*
>>> -rw-rw-r-- 1 jenkins jenkins 851251 Sep  6 21:38 *setuptools-41.2.0.zip*
>>> -rw-rw-r-- 1 jenkins jenkins 855608 Oct  7 06:03 *setuptools-41.4.0.zip*
>>> -rw-rw-r-- 1 jenkins jenkins 851068 Oct 28 06:10 *setuptools-41.5.0.zip*
>>> -rw-rw-r-- 1 jenkins jenkins 851097 Oct 28 19:46 *setuptools-41.5.1.zip*
>>> -rw-rw-r-- 1 jenkins jenkins 852541 Oct 29 14:06 *setuptools-41.6.0.zip*
>>> -rw-rw-r-- 1 jenkins jenkins 852125 Nov 24 08:10 *setuptools-42.0.0.zip*
>>> -rw-rw-r-- 1 jenkins jenkins 852264 Nov 25 20:55 *setuptools-42.0.1.zip*
>>> -rw-rw-r-- 1 jenkins jenkins 858444 Dec  1 18:12 *setuptools-42.0.2.zip*
>>> -rw-rw-r-- 1 jenkins jenkins  32725 Sep  6 21:38 *six-1.12.0.tar.gz*
>>> -rw-rw-r-- 1 jenkins jenkins  33726 Nov  5 19:18 *six-1.13.0.tar.gz*
>>>
>>>
>>> On Wed, Dec 4, 2019 at 8:00 PM Luke Cwik  wrote:
>>>
 Can we filter the cache directory only for the artifacts that we want
 and not everything that is there?

 On Wed, Dec 4, 2019 at 6:56 PM Valentyn Tymofieiev 
 wrote:

> Luke, I am not sure I understand the question. The caching that
> happens here is implemented in the SDK for requirements packages:
> https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L161
>
>
> On Wed, Dec 4, 2019 at 6:19 PM Luke Cwik  wrote:
>
>> Is there a way to use a cache on disk that is separate from the set
>> of packages we use as requirements?
>>
>> On Wed, Dec 4, 2019 at 5:58 PM Udi Meiri  wrote:
>>
>>> Thanks!
>>> Another reason to periodically refresh workers.
>>>
>>> On Wed, Nov 27, 2019 at 10:37 PM Valentyn Tymofieiev <
>>> valen...@google.com> wrote:
>>>
 Test jobs specify[1] a requirements.txt file that contains two
 entries: pyhamcrest, mock.

 We download[2] sources of the packages specified in the requirements file,
 and the packages they depend on. While doing so, it appears that we use a
 cache directory on Jenkins to store the sources of the packages [3], perhaps
 to save a trip to pypi and reduce pypi flakiness? Then, we stage the entire
 cache directory[4], which includes all packages ever cached. Over time the
 versions that our requirements packages need change, but I guess we don't
 clean the cache on Jenkins workers.

 [1]
 https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/scripts/run_integration_test.sh#L197
 [2]
 https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L469
 [3]
 https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L161

 [4]
 https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L172

 On Wed, Nov 27, 2019 at 11:55 AM Udi Meiri 
 wrote:

> I was investigating a Dataflow postcommit test failure
> (endpoints_pb2 missing), and saw this in the staging directory:
>
> $ gsutil ls 
> 

Re: Python staging file weirdness

2019-12-05 Thread Valentyn Tymofieiev
I looked for a bit at the pip download command. The alternative seems to be
to parse the output of

python -m pip download  --dest . -r requirements.txt  --exists-action i
--no-binary :all:

and see which files were downloaded and/or skipped since they were already
present, and then stage only the files that appear in the log output. Seems
doable but may break if pip output changes between pip implementations, so
we'd have to add a test as well.
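A rough sketch of that parsing (the 'Saved' / 'File was already downloaded'
prefixes are an assumption about pip's log format, which is exactly what the
test would need to pin down):

import re

def files_to_stage(pip_download_output):
    # Assumed pip log lines -- verify against the pip version in use:
    #   Saved ./mock-3.0.5.tar.gz
    #   File was already downloaded /tmp/dataflow-requirements-cache/six-1.12.0.tar.gz
    downloaded = re.findall(r'^\s*Saved (\S+)$', pip_download_output, re.MULTILINE)
    skipped = re.findall(r'^\s*File was already downloaded (\S+)$',
                         pip_download_output, re.MULTILINE)
    return downloaded + skipped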

On Thu, Dec 5, 2019 at 11:10 AM Luke Cwik  wrote:

> I think reusing the same cache directory makes sense during downloading
> but why do we upload everything that is there?
>
> On Thu, Dec 5, 2019 at 9:24 AM Udi Meiri  wrote:
>
>> Looking at the source, it seems that it should be using
>> os.path.join(tempfile.gettempdir(), 'dataflow-requirements-cache')
>> to create a different tmp directory on each run.
>>
>> Also, sampling worker no. 2:
>>
>> *jenkins@apache-beam-jenkins-2*:*~*$ ls -l /tmp/dataflow-requirements-cache/
>> total 7172
>> -rw-rw-r-- 1 jenkins jenkins  27947 Sep  6 22:46 *funcsigs-1.0.2.tar.gz*
>> -rw-rw-r-- 1 jenkins jenkins  28126 Sep  6 21:38 *mock-3.0.5.tar.gz*
>> -rw-rw-r-- 1 jenkins jenkins 376623 Sep  6 21:38 *PyHamcrest-1.9.0.tar.gz*
>> -rw-rw-r-- 1 jenkins jenkins 851251 Sep  6 21:38 *setuptools-41.2.0.zip*
>> -rw-rw-r-- 1 jenkins jenkins 855608 Oct  7 06:03 *setuptools-41.4.0.zip*
>> -rw-rw-r-- 1 jenkins jenkins 851068 Oct 28 06:10 *setuptools-41.5.0.zip*
>> -rw-rw-r-- 1 jenkins jenkins 851097 Oct 28 19:46 *setuptools-41.5.1.zip*
>> -rw-rw-r-- 1 jenkins jenkins 852541 Oct 29 14:06 *setuptools-41.6.0.zip*
>> -rw-rw-r-- 1 jenkins jenkins 852125 Nov 24 08:10 *setuptools-42.0.0.zip*
>> -rw-rw-r-- 1 jenkins jenkins 852264 Nov 25 20:55 *setuptools-42.0.1.zip*
>> -rw-rw-r-- 1 jenkins jenkins 858444 Dec  1 18:12 *setuptools-42.0.2.zip*
>> -rw-rw-r-- 1 jenkins jenkins  32725 Sep  6 21:38 *six-1.12.0.tar.gz*
>> -rw-rw-r-- 1 jenkins jenkins  33726 Nov  5 19:18 *six-1.13.0.tar.gz*
>>
>>
>> On Wed, Dec 4, 2019 at 8:00 PM Luke Cwik  wrote:
>>
>>> Can we filter the cache directory only for the artifacts that we want
>>> and not everything that is there?
>>>
>>> On Wed, Dec 4, 2019 at 6:56 PM Valentyn Tymofieiev 
>>> wrote:
>>>
 Luke, I am not sure I understand the question. The caching that happens
 here is implemented in the SDK for requirements packages:
 https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L161


 On Wed, Dec 4, 2019 at 6:19 PM Luke Cwik  wrote:

> Is there a way to use a cache on disk that is separate from the set of
> packages we use as requirements?
>
> On Wed, Dec 4, 2019 at 5:58 PM Udi Meiri  wrote:
>
>> Thanks!
>> Another reason to periodically refresh workers.
>>
>> On Wed, Nov 27, 2019 at 10:37 PM Valentyn Tymofieiev <
>> valen...@google.com> wrote:
>>
>>> Test jobs specify[1] a requirements.txt file that contains two
>>> entries: pyhamcrest, mock.
>>>
>>> We download[2] sources of the packages specified in the requirements file,
>>> and the packages they depend on. While doing so, it appears that we use a
>>> cache directory on Jenkins to store the sources of the packages [3], perhaps
>>> to save a trip to pypi and reduce pypi flakiness? Then, we stage the entire
>>> cache directory[4], which includes all packages ever cached. Over time the
>>> versions that our requirements packages need change, but I guess we don't
>>> clean the cache on Jenkins workers.
>>>
>>> [1]
>>> https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/scripts/run_integration_test.sh#L197
>>> [2]
>>> https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L469
>>> [3]
>>> https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L161
>>>
>>> [4]
>>> https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L172
>>>
>>> On Wed, Nov 27, 2019 at 11:55 AM Udi Meiri  wrote:
>>>
 I was investigating a Dataflow postcommit test failure
 (endpoints_pb2 missing), and saw this in the staging directory:

 $ gsutil ls 
 gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882
 gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/PyHamcrest-1.9.0.tar.gz
 gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/dataflow-worker.jar
 gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/dataflow_python_sdk.tar
 

Re: Python staging file weirdness

2019-12-05 Thread Luke Cwik
I think reusing the same cache directory makes sense during downloading but
why do we upload everything that is there?

On Thu, Dec 5, 2019 at 9:24 AM Udi Meiri  wrote:

> Looking at the source, it seems that it should be using
> os.path.join(tempfile.gettempdir(), 'dataflow-requirements-cache')
> to create a different tmp directory on each run.
>
> Also, sampling worker no. 2:
>
> *jenkins@apache-beam-jenkins-2*:*~*$ ls -l /tmp/dataflow-requirements-cache/
> total 7172
> -rw-rw-r-- 1 jenkins jenkins  27947 Sep  6 22:46 *funcsigs-1.0.2.tar.gz*
> -rw-rw-r-- 1 jenkins jenkins  28126 Sep  6 21:38 *mock-3.0.5.tar.gz*
> -rw-rw-r-- 1 jenkins jenkins 376623 Sep  6 21:38 *PyHamcrest-1.9.0.tar.gz*
> -rw-rw-r-- 1 jenkins jenkins 851251 Sep  6 21:38 *setuptools-41.2.0.zip*
> -rw-rw-r-- 1 jenkins jenkins 855608 Oct  7 06:03 *setuptools-41.4.0.zip*
> -rw-rw-r-- 1 jenkins jenkins 851068 Oct 28 06:10 *setuptools-41.5.0.zip*
> -rw-rw-r-- 1 jenkins jenkins 851097 Oct 28 19:46 *setuptools-41.5.1.zip*
> -rw-rw-r-- 1 jenkins jenkins 852541 Oct 29 14:06 *setuptools-41.6.0.zip*
> -rw-rw-r-- 1 jenkins jenkins 852125 Nov 24 08:10 *setuptools-42.0.0.zip*
> -rw-rw-r-- 1 jenkins jenkins 852264 Nov 25 20:55 *setuptools-42.0.1.zip*
> -rw-rw-r-- 1 jenkins jenkins 858444 Dec  1 18:12 *setuptools-42.0.2.zip*
> -rw-rw-r-- 1 jenkins jenkins  32725 Sep  6 21:38 *six-1.12.0.tar.gz*
> -rw-rw-r-- 1 jenkins jenkins  33726 Nov  5 19:18 *six-1.13.0.tar.gz*
>
>
> On Wed, Dec 4, 2019 at 8:00 PM Luke Cwik  wrote:
>
>> Can we filter the cache directory only for the artifacts that we want and
>> not everything that is there?
>>
>> On Wed, Dec 4, 2019 at 6:56 PM Valentyn Tymofieiev 
>> wrote:
>>
>>> Luke, I am not sure I understand the question. The caching that happens
>>> here is implemented in the SDK for requirements packages:
>>> https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L161
>>>
>>>
>>> On Wed, Dec 4, 2019 at 6:19 PM Luke Cwik  wrote:
>>>
 Is there a way to use a cache on disk that is separate from the set of
 packages we use as requirements?

 On Wed, Dec 4, 2019 at 5:58 PM Udi Meiri  wrote:

> Thanks!
> Another reason to periodically refresh workers.
>
> On Wed, Nov 27, 2019 at 10:37 PM Valentyn Tymofieiev <
> valen...@google.com> wrote:
>
>> Test jobs specify[1] a requirements.txt file that contains two
>> entries: pyhamcrest, mock.
>>
>> We download[2] sources of the packages specified in the requirements file,
>> and the packages they depend on. While doing so, it appears that we use a
>> cache directory on Jenkins to store the sources of the packages [3], perhaps
>> to save a trip to pypi and reduce pypi flakiness? Then, we stage the entire
>> cache directory[4], which includes all packages ever cached. Over time the
>> versions that our requirements packages need change, but I guess we don't
>> clean the cache on Jenkins workers.
>>
>> [1]
>> https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/scripts/run_integration_test.sh#L197
>> [2]
>> https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L469
>> [3]
>> https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L161
>>
>> [4]
>> https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L172
>>
>> On Wed, Nov 27, 2019 at 11:55 AM Udi Meiri  wrote:
>>
>>> I was investigating a Dataflow postcommit test failure
>>> (endpoints_pb2 missing), and saw this in the staging directory:
>>>
>>> $ gsutil ls 
>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882
>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/PyHamcrest-1.9.0.tar.gz
>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/dataflow-worker.jar
>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/dataflow_python_sdk.tar
>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/funcsigs-1.0.2.tar.gz
>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/mock-3.0.5.tar.gz
>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/pipeline.pb
>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/requirements.txt
>>> 

Re: [RELEASE] Tracking 2.18

2019-12-05 Thread Ismaël Mejía
Colm, I just merged your PR and cherry picked it into 2.18.0
https://github.com/apache/beam/pull/10296

On Thu, Dec 5, 2019 at 10:54 AM jincheng sun 
wrote:

> Thanks for the Tracking Udi!
>
> I have updated the status of some release blockers issues as follows:
>
> - BEAM-8733 closed
> - BEAM-8620 reset the fix version to 2.19
> - BEAM-8618 reset the fix version to 2.19
>
> Best,
> Jincheng
>
> Colm O hEigeartaigh  wrote on Thu, Dec 5, 2019 at 5:38 PM:
>
>> Could we get this one in 2.18 as well?
>> https://issues.apache.org/jira/browse/BEAM-8861
>>
>> Colm.
>>
>> On Wed, Dec 4, 2019 at 8:02 PM Udi Meiri  wrote:
>>
>>> Following the release calendar, I plan on cutting the 2.18 release
>>> branch today.
>>>
>>> There are currently 8 release blockers.
>>>
>>>


Precommits fire for wrong PRs

2019-12-05 Thread Michał Walenia
Hi all,
I noticed that sometimes the precommit jobs are launched for unrelated PRs,
e.g. Python or website precommits run for some Java and Gradle changes.
AFAIK, the 'responsibility regions' for the jobs are defined in Job DSL
scripts as regexes that changed file paths are checked against.
For the Python PreCommit the paths are:
  '^model/.*$',
  '^sdks/python/.*$',
  '^release/.*$',
  '^build.gradle$',
  '^buildSrc/.*$',
  '^gradle/.*$',
  '^gradle.properties$',
  '^gradlew$',
  '^gradle.bat$',
  '^settings.gradle$'

and for Website:
  '^website/.*$'
  '^build.gradle$',
  '^buildSrc/.*$',
  '^gradle/.*$',
  '^gradle.properties$',
  '^gradlew$',
  '^gradle.bat$',
  '^settings.gradle$'
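To reproduce the trigger decision offline, the matching boils down to something
like the following (the path list is copied from above; treating the entries as
Python regexes matched against each changed file path is my assumption about
how the Job DSL applies them):

import re

PYTHON_PRECOMMIT_PATHS = [
    r'^model/.*$', r'^sdks/python/.*$', r'^release/.*$', r'^build.gradle$',
    r'^buildSrc/.*$', r'^gradle/.*$', r'^gradle.properties$', r'^gradlew$',
    r'^gradle.bat$', r'^settings.gradle$',
]

def triggers(changed_files, patterns):
    # A job should fire if any changed path matches any of its patterns.
    return any(re.match(p, f) for f in changed_files for p in patterns)

print(triggers(['sdks/java/core/src/Foo.java'], PYTHON_PRECOMMIT_PATHS))  # False
print(triggers(['sdks/python/setup.py'], PYTHON_PRECOMMIT_PATHS))         # True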

What I don't understand is why they both triggered for Łukasz's PR
, which touches some Java files,
Jenkins scripts, and two non-top-level Gradle files.
Can anyone shed some light on this? I'd like to understand it thoroughly -
by fixing precommit misfires it would be possible to alleviate the
strain on our Jenkins workers.
Thanks!
Michal
-- 

Michał Walenia
Polidea  | Software Engineer

M: +48 791 432 002 <+48791432002>
E: michal.wale...@polidea.com

Unique Tech
Check out our projects! 


Re: [RELEASE] Tracking 2.18

2019-12-05 Thread jincheng sun
Thanks for the Tracking Udi!

I have updated the status of some release blockers issues as follows:

- BEAM-8733 closed
- BEAM-8620 reset the fix version to 2.19
- BEAM-8618 reset the fix version to 2.19

Best,
Jincheng

Colm O hEigeartaigh  wrote on Thu, Dec 5, 2019 at 5:38 PM:

> Could we get this one in 2.18 as well?
> https://issues.apache.org/jira/browse/BEAM-8861
>
> Colm.
>
> On Wed, Dec 4, 2019 at 8:02 PM Udi Meiri  wrote:
>
>> Following the release calendar, I plan on cutting the 2.18 release branch
>> today.
>>
>> There are currently 8 release blockers.
>>
>>


Re: [RELEASE] Tracking 2.18

2019-12-05 Thread Colm O hEigeartaigh
Could we get this one in 2.18 as well?
https://issues.apache.org/jira/browse/BEAM-8861

Colm.

On Wed, Dec 4, 2019 at 8:02 PM Udi Meiri  wrote:

> Following the release calendar, I plan on cutting the 2.18 release branch
> today.
>
> There are currently 8 release blockers.
>
>