Re: Python staging file weirdness

2019-12-04 Thread Luke Cwik
Can we filter the cache directory only for the artifacts that we want and
not everything that is there?
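
Something like the following, as a rough sketch of that idea - not the
stager's actual code, and the function and its parameters here are made
up for illustration:

import os
import shutil
import subprocess
import tempfile

def stage_filtered_requirements(requirements_file, cache_dir, staging_dir):
    # Resolve the requirements into a throwaway directory, reusing the
    # shared cache via --find-links so we still avoid most PyPI round trips.
    tmp = tempfile.mkdtemp()
    try:
        subprocess.check_call([
            'python', '-m', 'pip', 'download',
            '--dest', tmp,
            '--find-links', cache_dir,  # reuse previously downloaded sources
            '--no-binary', ':all:',     # download source distributions only
            '-r', requirements_file,
        ])
        for name in os.listdir(tmp):
            path = os.path.join(tmp, name)
            shutil.copy(path, cache_dir)    # keep the shared cache warm
            shutil.copy(path, staging_dir)  # stage only what this run resolved
    finally:
        shutil.rmtree(tmp)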

On Wed, Dec 4, 2019 at 6:56 PM Valentyn Tymofieiev 
wrote:

> Luke, I am not sure I understand the question. The caching that happens
> here is implemented in the SDK for requirements packages:
> https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L161
>
>
> On Wed, Dec 4, 2019 at 6:19 PM Luke Cwik  wrote:
>
>> Is there a way to use a cache on disk that is separate from the set of
>> packages we use as requirements?
>>
>> On Wed, Dec 4, 2019 at 5:58 PM Udi Meiri  wrote:
>>
>>> Thanks!
>>> Another reason to periodically refresh workers.
>>>
>>> On Wed, Nov 27, 2019 at 10:37 PM Valentyn Tymofieiev <
>>> valen...@google.com> wrote:
>>>
 Test jobs specify[1] a requirements.txt file that contains two entries:
 pyhamcrest, mock.

 We download[2] sources of the packages specified in the requirements file,
 and the packages they depend on. While doing so, it appears that we use a cache
 directory on Jenkins to store the sources of the packages [3], perhaps to
 save a trip to PyPI and reduce PyPI flakiness? Then, we stage the entire
 cache directory[4], which includes all packages ever cached. Over time the
 versions that our requirements packages need change, but I guess we don't
 clean the cache on Jenkins workers.

 [1]
 https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/scripts/run_integration_test.sh#L197
 [2]
 https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L469
 [3]
 https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L161

 [4]
 https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L172

 On Wed, Nov 27, 2019 at 11:55 AM Udi Meiri  wrote:

> I was investigating a Dataflow postcommit test failure (endpoints_pb2
> missing), and saw this in the staging directory:
>
> $ gsutil ls 
> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882
> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/PyHamcrest-1.9.0.tar.gz
> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/dataflow-worker.jar
> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/dataflow_python_sdk.tar
> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/funcsigs-1.0.2.tar.gz
> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/mock-3.0.5.tar.gz
> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/pipeline.pb
> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/requirements.txt
> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.2.0.zip
> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.4.0.zip
> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.5.0.zip
> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.5.1.zip
> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.6.0.zip
> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-42.0.0.zip
> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-42.0.1.zip
> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/six-1.12.0.tar.gz
> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/six-1.13.0.tar.gz
>
>
> Does anyone know why so many versions of setuptools need to be staged?
> Shouldn't 1 be enough?
>



Re: Python staging file weirdness

2019-12-04 Thread Valentyn Tymofieiev
Luke, I am not sure I understand the question. The caching that happens
here is implemented in the SDK for requirements packages:
https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L161


Re: Python staging file weirdness

2019-12-04 Thread Luke Cwik
Is there a way to use a cache on disk that is separate from the set of
packages we use as requirements?


Re: Python staging file weirdness

2019-12-04 Thread Udi Meiri
Thanks!
Another reason to periodically refresh workers.



Re: Python interactive runner: test dependencies removed

2019-12-04 Thread Ning Kang
Thanks for the heads up! I was wondering why the interactive tests are
skipped, lol.
So we are moving away from the deprecated pytest-runner (with the changes
in setup.py) but still sticking with pytest, since it's replacing nose.

Can I add "interactive" as an "extras" entry to the "py37-pytest" and
"py36-pytest" testenvs in tox.ini, then?

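For reference, a minimal sketch of the setup.py side of that wiring - the
extra's dependency list below is illustrative, not Beam's actual setup.py -
which tox can then request with "extras = interactive" in those testenvs:

import setuptools

setuptools.setup(
    name='example-sdk',
    version='0.0.1',
    packages=setuptools.find_packages(),
    extras_require={
        # Installed only when the 'interactive' extra is requested,
        # e.g. pip install example-sdk[interactive].
        'interactive': [
            'jupyter>=1.0.0',
            'ipython>=5.8.0',
        ],
    },
)
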
@Ahmet Altay  fyi



Re: Python interactive runner: test dependencies removed

2019-12-04 Thread Pablo Estrada
+Ning Kang  +Sam Rohde  fyi

On Wed, Nov 27, 2019 at 5:09 PM Udi Meiri  wrote:

> As part of a move to stop using the deprecated (and racy) setup.py
> keywords setup_requires and tests_require, interactive runner dependencies
> have been removed from tests in
> https://github.com/apache/beam/pull/10227
>
> If this breaks any tests, please let me know.
>


Re: Contributor permission for Beam Jira tickets

2019-12-04 Thread Luke Cwik
Welcome, I have added you as a contributor.



Contributor permission for Beam Jira tickets

2019-12-04 Thread Esun Kim
Hi,

This is Esun Kim from Google. I'm working on the GCS connector in Beam IO.
Can you add me as a contributor for Beam's Jira issue tracker? My Jira ID
is veblush.

Regards,
Esun.


Re: Version Beam Website Documentation

2019-12-04 Thread Ankur Goenka
I agree that having a single website showcasing the latest Beam version
encourages users to use the latest Beam version, which is very useful.
Calling out version limitations definitely makes users' lives easier.

The use case I have in mind is more along the lines of best practices and
recommended ways of doing things.
One such example is the way we recommend new users try Portable Flink.
We are overhauling and simplifying the user onboarding experience. Though
the old way of doing things is still supported, the easier new
onboarding recommendation will only apply from Beam 2.18 onward.
We can of course create sections in the documentation for this use case, but
it seems like a poor man's way of versioning :)

You also highlighted a great use case around the LTS release. Should we
simply separate out the documentation for the LTS release and the current
version, to make it easy for users to navigate the website and reduce the
management overhead of updating specific sections?

A few areas which might benefit from having multiple versions are the
compatibility matrix, common pipeline patterns, the transform catalog and
the runner pages.




[RELEASE] Tracking 2.18

2019-12-04 Thread Udi Meiri
Following the release calendar, I plan on cutting the 2.18 release branch
today.

There are currently 8 release blockers.




Re: [Interactive Beam] Changes to what to cache

2019-12-04 Thread Ning Kang
Thanks Pablo!
I've sent a user-geared update too.



Re: [Interactive Beam] Changes to what to cache

2019-12-04 Thread Pablo Estrada
Thanks for sharing, Ning!
Is this update valuable to users as well? If so, consider sending a
user-geared update to u...@beam.apache.org.

-P.



[Interactive Beam] Changes to what to cache

2019-12-04 Thread Ning Kang
*If you are not an Interactive Beam user, you can ignore this email.*

Hi everyone,

Recently, we've been actively developing on top of the existing
InteractiveRunner for more Interactive Beam features.

One of the things we've changed is what PCollections will be cached and
available for *get_result(pcoll)*.

If your unit tests or code depend on executing a pipeline with the
InteractiveRunner and check data of the PCollection through
*get_result(pcoll)*, that code might run into an error saying "raise
ValueError('PCollection not available, please run the pipeline.')".

This is because Interactive Beam now automatically figures out which
PCollections have been assigned to variables in the user-defined pipelines
in your code/tests/notebooks by looking at a "watched" scope of variable
definitions.
By default everything defined in "__main__" is watched.

So if you've defined a pipeline in a local scope such as a function,
Interactive Beam will not be able to "watch" it and then cache data for
those PCollections.
There is only one line change needed to fix the usage: watch your local
scope.

Something like,
from apache_beam.runners.interactive import interactive_beam
...
def some_func(...):
    p = beam.Pipeline(InteractiveRunner())
    pcoll = p | 'SomeTransform' >> SomeTransform()
    ...
    interactive_beam.watch(locals())
    result = p.run()
    ...
...

Thanks for using Interactive Beam!

Ning.


Re: Request for review of PR [Beam-8564]

2019-12-04 Thread Luke Cwik
Going with the Registrar/ServiceLoader route would allow for alternative
providers for the same compression algorithms, so if users don't like one
they can always contribute a different one.
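
As a sketch of that registrar idea - the thread concerns the Java SDK's
ServiceLoader, but a Python analogue via entry points illustrates the same
pattern; the entry-point group name below is made up for illustration:

import pkg_resources

def compression_provider(name):
    """Looks up a pluggable codec among installed provider packages.

    Each provider distribution would advertise itself under the
    hypothetical 'example.compression_providers' entry-point group, so
    adding or swapping an LZO implementation is just a matter of
    installing a different provider package.
    """
    for ep in pkg_resources.iter_entry_points('example.compression_providers'):
        if ep.name == name:
            return ep.load()
    raise ValueError('No provider registered for codec: %s' % name)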



Re: Request for review of PR [Beam-8564]

2019-12-04 Thread Ismaël Mejía
(1) seems not to be the issue because it is Apache licensed.
(2) and (3) are the big issues, because it requires a huge provided uber
jar that essentially leaks Hadoop classes into the core SDK [1], so it is
definitely concerning.

At some point during the PR that added Zstandard support we discussed
creating some sort of Registrar for compression algorithms [2], but we
decided not to go ahead because we could achieve that for the zstd case via
the optional dependencies of commons-compress. Maybe it is time to
reconsider whether such a mechanism is worthwhile, for example for users who
don't mind the Hadoop leakage in order to be able to use LZO.

Refs.
[1] https://mvnrepository.com/artifact/io.airlift/aircompressor/0.16
[2] https://issues.apache.org/jira/browse/BEAM-6422




On Tue, Dec 3, 2019 at 7:01 PM Robert Bradshaw  wrote:

> Is there a way to wrap this up as an optional dependency with multiple
> possible providers, if there's no good library satisfying all of the
> conditions (in particular (1))?
>
> On Tue, Dec 3, 2019 at 9:47 AM Luke Cwik  wrote:
> >
> > I was hoping that someone in the community would provide some
> alternatives since there are quite a few implementations.
> >
> > On Tue, Dec 3, 2019 at 8:20 AM Amogh Tiwari  wrote:
> >>
> >> Hi Luke,
> >>
> >> I agree with your thoughts and observations. But, airlift:aircompressor
> is the only implementation of LZO in pure Java. That straight away solves
> #5.
> >> The other implementations that I found either have licensing issues
> (since LZO natively uses the GNU GPL license) or are implemented using .c,
> .h and JNI (which again makes them dependent on the OS). Please refer to these:
> twitter/hadoop-lzo and shevek/lzo-java.
> >> These were the main reasons why we based this on airlift:aircompressor.
> >>
> >> Thanks and Regards,
> >> Amogh
> >>
> >>
> >>
> >> On Tue, Dec 3, 2019 at 2:59 AM Luke Cwik  wrote:
> >>>
> >>> I took a look. My biggest concern is finding a good LZO
> implementation. Looking for one that preferably has:
> >>> 1) Apache license
> >>> 2) Has zero transitive dependencies
> >>> 3) Is small
> >>> 4) Is performant
> >>> 5) Is native Java or supports execution on the three main OSs
> (Windows, Linux, Mac)
> >>>
> >>> In your PR you suggested using io.airlift:aircompressor:0.16 which
> doesn't meet item #2 and its transitive dependency fails #3.
> >>>
> >>> On Mon, Dec 2, 2019 at 12:16 PM Amogh Tiwari 
> wrote:
> 
>  Hi,
>  I have filed a PR for an extension that will enable Apache Beam to
> work with LZO/LZOP compression. Please refer.
>  I would love it if someone can take this up and review it.
>  Please feel free to share your thoughts/suggestions.
>  Regards,
>  Amogh
>


Re: Wiki edit access

2019-12-04 Thread Kamil Wasilewski
Thanks!



Re: Wiki edit access

2019-12-04 Thread Maximilian Michels

Done ;)



Re: real real-time beam

2019-12-04 Thread Jan Lukavský

Hi Kenn,

On 12/4/19 5:38 AM, Kenneth Knowles wrote:
Jan - let's try to defrag the threads on your time sorting proposal. 
This thread may have useful ideas but I want to focus on helping Aaron 
in this thread. You can link to this thread from other threads or from 
a design doc. Does this seem OK to you?


sure. :-)

I actually think the best thread to continue the discussion would be 
[1]. The reason why this discussion probably got fragmented is that the 
other threads seem to die out without any conclusion. :-(


Jan

[1] 
https://lists.apache.org/thread.html/e2f729c7cea22553fc34421d4547132fa1c2ec01035eb4fb1a426873%40%3Cdev.beam.apache.org%3E




Aaron - do you have the information you need to implement your sink? 
My impression is that you have quite a good grasp of the issues even 
before you asked.


Kenn

On Wed, Nov 27, 2019 at 3:05 AM Jan Lukavský  wrote:


> Trigger firings can have decreasing event timestamps w/ the
minimum timestamp combiner*. I do think the issue at hand is best
analyzed in terms of the explicit ordering on panes. And I do
think we need to have an explicit guarantee or annotation strong
enough to describe a correct-under-all-allowed runners sink. Today
an antagonistic runner could probably break a lot of things.

Thanks for this insight. I didn't know about the relation between
trigger firing (event) time - which is always non-decreasing - and
the resulting timestamp of the output pane - which can be affected by
the timestamp combiner and decrease in the cases you describe. What
actually correlates with the pane index at all times is the processing
time of the trigger firings. Would you say that the "annotation that
would guarantee ordering of panes" could be viewed as a time ordering
annotation with an additional time domain (event time, processing
time)? Could these two then be viewed as a single one with some
distinguishing parameter?

@RequiresTimeSortedInput(Domain.PANE_INDEX | Domain.EVENT_TIME)

?

Event time should probably be made the default, because that is
information that is accessible with every WindowedValue, while the
pane index is available only after a GBK (or generally might be
available after every keyed sequential operation, but is missing
after a source, for instance).
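
To make the point about decreasing pane timestamps concrete, a small
sketch (the data and window size are made up) of a configuration where
a later pane can carry a smaller timestamp than an earlier one, because
the EARLIEST timestamp combiner tracks the minimum element timestamp
seen so far:

import apache_beam as beam
from apache_beam.transforms import trigger, window

with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create([('k', 20), ('k', 5)])  # the second element is earlier
        | beam.Map(lambda kv: window.TimestampedValue(kv, kv[1]))
        | beam.WindowInto(
            window.FixedWindows(60),
            trigger=trigger.AfterWatermark(early=trigger.AfterCount(1)),
            accumulation_mode=trigger.AccumulationMode.ACCUMULATING,
            timestamp_combiner=window.TimestampCombiner.OUTPUT_AT_EARLIEST)
        # An early pane may fire carrying timestamp 20; a later pane that
        # has seen both elements carries timestamp 5. Whether the early
        # pane fires at all depends on the runner.
        | beam.GroupByKey())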

Jan

On 11/27/19 1:32 AM, Kenneth Knowles wrote:



On Tue, Nov 26, 2019 at 1:00 AM Jan Lukavský  wrote:

> I will not try to formalize this notion in this email. But
I will note that since it is universally assured, it would be
zero cost and significantly safer to formalize it and add an
annotation noting it was required. It has nothing to do with
event time ordering, only trigger firing ordering.

I cannot agree with the last sentence (and I'm really not
doing this on purpose :-)). Panes generally arrive out of
order, as mentioned several times in the discussions linked
from this thread. If we want to ensure "trigger firing
ordering", we can use the pane index, that is correct. But -
that is actually equivalent to sorting by event time, because
pane index order will be (nearly) the same as event time
order. This is due to the fact that pane index and event
time correlate (both are monotonic).

Trigger firings can have decreasing event timestamps w/ the
minimum timestamp combiner*. I do think the issue at hand is best
analyzed in terms of the explicit ordering on panes. And I do
think we need to have an explicit guarantee or annotation strong
enough to describe a correct-under-all-allowed runners sink.
Today an antagonistic runner could probably break a lot of things.

Kenn

*In fact, they can decrease via the "maximum" timestamp combiner
because timestamp combiners actually only apply to the elements in
that particular pane. This is weird, and maybe a design bug, but
good to know about.

The pane index "only" solves the issue of preserving ordering
even in cases where there are multiple firings within the same
timestamp (regardless of granularity). This was mentioned in
the initial discussion about event time ordering, and is part
of the design doc - users should be allowed to provide a UDF
for extracting a time-correlated ordering field (which means
the ability to choose a preferred, or authoritative, observer
which assigns unambiguous ordering to events). Examples of
this might include Kafka offsets, or any queue index
for that matter. This is not yet implemented, but could
(should) be in the future.

The only case where these two things are (somewhat) different
is the case mentioned by @Steve - if the output is stateless
ParDo, which will get fused. But that is only because the

Wiki edit access

2019-12-04 Thread Kamil Wasilewski
Hi all,

I'm going to contribute to the documentation pages that describe the
testing framework in Beam. May I get access to edit the Wiki? My username
is kamilwu.

Kamil


Re: Version Beam Website Documentation

2019-12-04 Thread Jeff Klukas
The API reference docs (Java and Python at least) are versioned, so we have
a durable reference there and it's possible to link to particular sections
of API docs for particular versions.

For the major bits of introductory documentation (like the Beam Programming
Guide), I think it's a good thing to have only a single version, so that
people referencing it are always getting the most up-to-date wording and
explanations, although it may be worth adding callouts there about minimum
versions anywhere we discuss newer features. We should be encouraging the
community to stay reasonably current, so I think any feature that's present
in the latest LTS release should be fine to assume is available to users,
although perhaps we should also state that explicitly on the website.

Are there particular parts of the Beam website that you have in mind that
would benefit from versioning? Are there specific cases you see where the
current website would be confusing for someone using a Beam SDK that's a
few versions old?

On Tue, Dec 3, 2019 at 6:46 PM Ankur Goenka  wrote:

> Hi,
>
> We are constantly adding features to Beam, which makes each new Beam
> version more feature-rich and compelling.
> This also means that older Beam releases don't have the new features and
> might have different ways to do certain things.
>
> (I might be wrong here) - Our Beam website only publishes a single version
> of the documentation, which is the latest.
> This means that users working with an older SDK don't really have an easy
> way to look up documentation for old versions of Beam.
>
> Proposal: Shall we consider publishing versioned Beam website to help
> users with old Beam version find the relevant information?
>
> Thanks,
> Ankur
>