Re: [EXT] Re: using avro instead of json for BigQueryIO.Write

2019-11-26 Thread Chamikara Jayalath
I don't believe so, please create one (we can dedup if we happen to find
another issue).

Even better if you can contribute to fix this :)

Thanks,
Cham

On Tue, Nov 26, 2019 at 7:07 PM Chuck Yang  wrote:

> Has anyone looked into implementing this for the Python SDK? It would
> be nice to have it if only for the ability to write float values with
> NaN and infinity values. I didn't see anything in Jira, happy to
> create a ticket, but wanted to ask around first.
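As a quick aside on the float limitation described above (standard-library behavior only, not Beam code): JSON has no representation for NaN or Infinity, which is exactly the gap an Avro-based load path would close.

    import json

    # json emits the non-standard tokens NaN/Infinity by default...
    print(json.dumps({"score": float("nan")}))   # -> {"score": NaN}  (not valid JSON)

    # ...and strict JSON refuses to serialize such values at all.
    try:
        json.dumps({"score": float("inf")}, allow_nan=False)
    except ValueError as err:
        print(err)   # "Out of range float values are not JSON compliant"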
>
> On Thu, Oct 17, 2019 at 12:53 PM Reuven Lax  wrote:
> >
> > I'll take a look as well. Thanks for doing this!
> >
> > On Fri, Oct 4, 2019 at 9:16 PM Pablo Estrada  wrote:
> >>
> >> Thanks Steve!
> >> I'll take a look next week. Sorry about the delay so far.
> >> Best
> >> -P.
> >>
> >> On Fri, Sep 27, 2019 at 10:37 AM Steve Niemitz 
> wrote:
> >>>
> >>> I put up a semi-WIP pull request
> https://github.com/apache/beam/pull/9665 for this.  The initial results
> look good.  I'll spend some time soon adding unit tests and documentation,
> but I'd appreciate it if someone could take a first pass over it.
> >>>
> >>> On Wed, Sep 18, 2019 at 6:14 PM Pablo Estrada 
> wrote:
> 
>  Thanks for offering to work on this! It would be awesome to have it.
> I can say that we don't have that for Python ATM.
> 
>  On Mon, Sep 16, 2019 at 10:56 AM Steve Niemitz 
> wrote:
> >
> > Our experience has actually been that avro is more efficient than
> even parquet, but that might also be skewed by our datasets.
> >
> > I might try to take a crack at this, I found
> https://issues.apache.org/jira/browse/BEAM-2879 tracking it (which
> coincidentally references my thread from a couple years ago on the read
> side of this :) ).
> >
> > On Mon, Sep 16, 2019 at 1:38 PM Reuven Lax  wrote:
> >>
> >> It's been talked about, but nobody's done anything. There are some
> difficulties related to type conversion (json and avro don't support the
> same types), but if those are overcome then an avro version would be much
> more efficient. I believe Parquet files would be even more efficient if you
> wanted to go that path, but there might be more code to write (as we
> already have some code in the codebase to convert between TableRows and
> Avro).
> >>
> >> Reuven
> >>
> >> On Mon, Sep 16, 2019 at 10:33 AM Steve Niemitz 
> wrote:
> >>>
> >>> Has anyone investigated using avro rather than json to load data
> into BigQuery using BigQueryIO (+ FILE_LOADS)?
> >>>
> >>> I'd be interested in enhancing it to support this, but I'm curious
> if there's any prior work here.
>


Re: cython test instability

2019-11-26 Thread Chad Dombrova
yeah, I've excised both tests_require and setup_requires in my test
simplification PR:  https://github.com/apache/beam/pull/10038

I'm happy to see those go sooner rather than later, as it'll reduce the
scope of my PR.  The rest of my PR is about ensuring that build
dependencies like cython and grpc are available at "build" time (i.e. when
setup.py gets called), and the modern solution for this is a
pep517/518-compliant build system, of which tox is one.

-chad



On Tue, Nov 26, 2019 at 6:39 PM Udi Meiri  wrote:

> I'm not sure where the error with the simplegeneric and timeloop .eggs
> directories comes from,
> but I did figure out that they don't get installed as eggs if you add them
> to the "test" extras in setup.py, e.g.:
>
> extras_require={
>     'docs': ['Sphinx>=1.5.2,<2.0'],
>     'test': REQUIRED_TEST_PACKAGES + INTERACTIVE_BEAM,
>     'gcp': GCP_REQUIREMENTS,
>     'interactive': INTERACTIVE_BEAM,
> },
>
> This is further proof of the wisdom of the pytest-runner deprecation
> notice  (emphasis mine):
> """
> Remove ‘pytest’ and any other testing requirements from ‘*tests_require*’,
> preferably removing the setup_requires option.
> """
>
> I believe we don't rely on the tests_require definition. Removing it might
> break developers running "python setup.py test", but the alternative is a
> simple "setup.py && pip install".
>
>
> On Tue, Nov 26, 2019 at 5:14 PM Chad Dombrova  wrote:
>
>> Sorry wrong link:  https://github.com/apache/beam/pull/9915
>>
>>
>>
>> On Tue, Nov 26, 2019 at 5:12 PM Udi Meiri  wrote:
>>
>>> I looked at #9959 but it doesn't seem to modify setup.py?
>>> The additional eggs for timeloop etc. are troubling though. Not sure
>>> where those come from.
>>>
>>> On Tue, Nov 26, 2019 at 4:59 PM Chad Dombrova  wrote:
>>>
 Is setup_requires being used somewhere else, because I'm still getting
 errors after removing it from sdks/python/setup.py.

 I removed it from this PR: https://github.com/apache/beam/pull/9959

 Here's the gradle scan:
 https://scans.gradle.com/s/oinh5xpaly3dk/failure#top=0

 The error shows up differently than before when
 setup_requires=['pytest-runner'] was present -- it's in a gradle traceback
 now rather than the console log.  I've also seen different packages listed
 as the culprit (simplegeneric, timeloop).

 -chad



 On Tue, Nov 26, 2019 at 4:47 PM Udi Meiri  wrote:

> Chad, I believe the answer is the "setup_requires" line is causing the
> sdks/python/.eggs directory to be created.
>
> This command fails with the setup_requires line (same Errno 17), but
> succeeds without it:
> $ \rm -r .eggs/; ../../gradlew installGcpTest
> [~8 failed tasks]
> $ ls .eggs
> pytest_runner-5.2-py2.7.egg  pytest_runner-5.2-py3.5.egg
>  pytest_runner-5.2-py3.6.egg  pytest_runner-5.2-py3.7.egg  README.txt
>
> I'll go ahead and create a PR to remove setup_requires from setup.py.
>
> On Tue, Nov 26, 2019 at 4:16 PM Chad Dombrova 
> wrote:
>
>> It seems like the offending packages are those that only have source
>> distributions (i.e. no wheels).  But why are the eggs being installed in
>> sdks/python/.eggs instead of into the virtualenv created by 
>> setupVirtualenv
>> gradle task or by tox?
>>
>>
>> On Tue, Nov 26, 2019 at 3:59 PM Udi Meiri  wrote:
>>
>>> Basically, I believe what's happening is that a new Gradle task was
>>> added that uses setup.py but doesn't have the same dependency on some 
>>> main
>>> setup.py task that all others depend on (like sdist).
>>>
>>> On Tue, Nov 26, 2019 at 3:49 PM Udi Meiri  wrote:
>>>
 Correction: the error is not gone after removing the line. I get
 instead:
 error: [Errno 17] File exists:
 '/usr/local/google/home/ehudm/src/beam/sdks/python/.eggs/dill-0.3.1.1-py2.7.egg'


 On Tue, Nov 26, 2019 at 3:45 PM Udi Meiri  wrote:

> I managed to recreate one of the issues with this command:
> ~/src/beam/sdks/python$ \rm -r .eggs/ && for i in $(seq 2); do
> echo "python setup.py -q nosetests --tests
> apache_beam.pipeline_test:DoFnTest.test_incomparable_default &" | sh 
> ; done
>
> This reliably gives me:
> OSError: [Errno 17] File exists:
> '/usr/local/google/home/ehudm/src/beam/sdks/python/.eggs/pytest_runner-5.2-py2.7.egg'
>
> If I remove this line from setup.py the error is gone:
>   setup_requires=['pytest_runner'],
>
>
> On Tue, Nov 26, 2019 at 2:54 PM Chad Dombrova 
> wrote:
>
>> Thanks for looking into this. It seems like it might be something
>> to do with data that is cached on the Jenkins slaves between runs, 
>> which
>> may be what prevents this from showing up 

Re: [DISCUSS] AWS IOs V1 Deprecation Plan

2019-11-26 Thread Chamikara Jayalath
On Tue, Nov 26, 2019 at 6:17 PM Reza Rokni  wrote:

> Hi Alexey,
>
> With regards to @Experimental there are a couple of discussions around its
> usage ( or rather over usage! ) on dev@. It is something that we need to
> clean up ( some of those IO are now being used on production env for
> years!).
>

I agree that we should move some IO connectors out of the experimental state,
and probably this should be a separate discussion. Also, this issue
probably goes beyond IO connectors, since there are other parts of the
code that are marked as experimental as well, sometimes for a good reason
(for example, SDF).



>
> Cheers
>
> Reza
>
> On Wed, 27 Nov 2019 at 04:50, Luke Cwik  wrote:
>
>> I suggested the wrapper because sometimes the intent of the APIs can be
>> translated easily but this is not always the case.
>>
>> Good to know that it is all marked @Experimental.
>>
>> On Tue, Nov 26, 2019 at 12:30 PM Cam Mach  wrote:
>>
>>> Thank you, Alex for sharing the information, and Luke for the questions.
>>> I like the idea of just deprecating the V1 IOs and maintaining only the V2
>>> IOs, so we can support whoever wants to continue with V1.
>>> Just as Alex said, a lot of users, including my teams :-), use the V1
>>> IOs in production for real workloads. So it'll be hard to remove the V1 IOs or
>>> force them to migrate to V2. But let's hear if there are any other ideas.
>>>
>>> Btw, making V1 a wrapper around V2 is not very appealing; the code will get
>>> more complicated since the V2 API is very different from V1's.
>>>
>>> Thanks,
>>>
>>>
>>>
>>> On Tue, Nov 26, 2019 at 8:21 AM Alexey Romanenko <
>>> aromanenko@gmail.com> wrote:
>>>
 AFAICT, all AWS SDK V1 IOs (SnsIO, SqsIO, DynamoDBIO, KinesisIO) are
 marked as "Experimental". So, it should not be a problem to gracefully
 deprecate and finally remove them. We already did a similar procedure for
 “HadoopInputFormatIO”, which was renamed to just “HadoopFormatIO” (since it
 started to support HadoopOutputFormat as well). The old “HadoopInputFormatIO”
 was deprecated and removed after *3 consecutive* Beam releases (as we
 agreed on the mailing list).

 At the same time, some users, for various reasons, would not be able or
 willing to move to AWS SDK V2. So, I’d prefer to just deprecate AWS SDK V1 IOs
 and accept new features/fixes *only* for V2 IOs.

>>>
+1 as well for deprecating the AWS V1 IO connectors as opposed to removing them,
unless we can confirm that usage is extremely limited.


>
 Talking about the “Experimental” annotation: sorry in advance if I missed
 that and am switching the subject a bit, but do we have clear rules or an
 agreement for when an IO becomes stable and should not be marked as experimental anymore?
 *Most* of our Java IOs are marked as Experimental, but many of them have
 been used in production by real users under real load. Does it mean that
 they are ready to be stable in terms of API? Perhaps, this topic deserves a
 new discussion if there are several opinions on that.

>>>
Probably, the decision to move component APIs (for example, an IO connector)
out of the experimental state should be made on a case-by-case basis.

Thanks,
Cham


>
 On 26 Nov 2019, at 00:39, Luke Cwik  wrote:

 Phase I sounds fine.

 Apache Beam follows semantic versioning and I believe removing the IOs
 will be a backwards incompatible change unless they were marked
 experimental which will be a problem for Phase 2.

 What is the feasibility of making the V1 transforms wrappers around V2?

 On Mon, Nov 25, 2019 at 1:46 PM Cam Mach  wrote:

> Hello Beam Devs,
>
> I have been working on the migration of Amazon Web Services IO
> connectors into the new AWS SDK for Java V2. The goal is to have an 
> updated
> implementation aligned with the most recent AWS improvements. So far we
> have already migrated the connectors for AWS SNS, SQS and  DynamoDB.
>
> In the meantime, some contributions are still going into the V1 IOs. So far
> we have dealt with those by porting (or asking contributors to port) the
> changes into the V2 IOs too, because we don’t want features of both versions to
> be unaligned, but this may quickly become a maintenance issue, so we want 
> to
> discuss a plan to stop supporting (deprecate) V1 IOs and encourage users 
> to
> move to V2.
>
> Phase I (ASAP):
>
>- Mark migrated AWS V1 IOs as deprecated
>- Document migration path to V2
>
> Phase II (end of 2020):
>
>- Decide a date or Beam release to remove the V1 IOs
>- Send a notification to the community 3 months before we remove
>them
>- Completely get rid of V1 IOs
>
>
> Please let me know what you think or if you see any potential issues?
>
> Thanks,
> Cam Mach
>
>

>

Re: cython test instability

2019-11-26 Thread Udi Meiri
I'm not sure where the error with the simplegeneric and timeloop .eggs
directories comes from,
but I did figure out that they don't get installed as eggs if you add them
to the "test" extras in setup.py, e.g.:

extras_require={
    'docs': ['Sphinx>=1.5.2,<2.0'],
    'test': REQUIRED_TEST_PACKAGES + INTERACTIVE_BEAM,
    'gcp': GCP_REQUIREMENTS,
    'interactive': INTERACTIVE_BEAM,
},

This is further proof of the wisdom of the pytest-runner deprecation notice
 (emphasis mine):
"""
Remove ‘pytest’ and any other testing requirements from ‘*tests_require*’,
preferably removing the setup_requires option.
"""

I believe we don't rely on the tests_require definition. Removing it might
break developers running "python setup.py test", but the alternative is a
simple "setup.py && pip install".


On Tue, Nov 26, 2019 at 5:14 PM Chad Dombrova  wrote:

> Sorry wrong link:  https://github.com/apache/beam/pull/9915
>
>
>
> On Tue, Nov 26, 2019 at 5:12 PM Udi Meiri  wrote:
>
>> I looked at #9959 but it doesn't seem to modify setup.py?
>> The additional eggs for timeloop etc. are troubling though. Not sure
>> where those come from.
>>
>> On Tue, Nov 26, 2019 at 4:59 PM Chad Dombrova  wrote:
>>
>>> Is setup_requires being used somewhere else, because I'm still getting
>>> errors after removing it from sdks/python/setup.py.
>>>
>>> I removed it from this PR: https://github.com/apache/beam/pull/9959
>>>
>>> Here's the gradle scan:
>>> https://scans.gradle.com/s/oinh5xpaly3dk/failure#top=0
>>>
>>> The error shows up differently than before when
>>> setup_requires=['pytest-runner'] was present -- it's in a gradle traceback
>>> now rather than the console log.  I've also seen different packages listed
>>> as the culprit (simplegeneric, timeloop).
>>>
>>> -chad
>>>
>>>
>>>
>>> On Tue, Nov 26, 2019 at 4:47 PM Udi Meiri  wrote:
>>>
 Chad, I believe the answer is the "setup_requires" line is causing the
 sdks/python/.eggs directory to be created.

 This command fails with the setup_requires line (same Errno 17), but
 succeeds without it:
 $ \rm -r .eggs/; ../../gradlew installGcpTest
 [~8 failed tasks]
 $ ls .eggs
 pytest_runner-5.2-py2.7.egg  pytest_runner-5.2-py3.5.egg
  pytest_runner-5.2-py3.6.egg  pytest_runner-5.2-py3.7.egg  README.txt

 I'll go ahead and create a PR to remove setup_requires from setup.py.

 On Tue, Nov 26, 2019 at 4:16 PM Chad Dombrova 
 wrote:

> It seems like the offending packages are those that only have source
> distributions (i.e. no wheels).  But why are the eggs being installed in
> sdks/python/.eggs instead of into the virtualenv created by 
> setupVirtualenv
> gradle task or by tox?
>
>
> On Tue, Nov 26, 2019 at 3:59 PM Udi Meiri  wrote:
>
>> Basically, I believe what's happening is that a new Gradle task was
>> added that uses setup.py but doesn't have the same dependency on some 
>> main
>> setup.py task that all others depend on (like sdist).
>>
>> On Tue, Nov 26, 2019 at 3:49 PM Udi Meiri  wrote:
>>
>>> Correction: the error is not gone after removing the line. I get
>>> instead:
>>> error: [Errno 17] File exists:
>>> '/usr/local/google/home/ehudm/src/beam/sdks/python/.eggs/dill-0.3.1.1-py2.7.egg'
>>>
>>>
>>> On Tue, Nov 26, 2019 at 3:45 PM Udi Meiri  wrote:
>>>
 I managed to recreate one of the issues with this command:
 ~/src/beam/sdks/python$ \rm -r .eggs/ && for i in $(seq 2); do echo
 "python setup.py -q nosetests --tests
 apache_beam.pipeline_test:DoFnTest.test_incomparable_default &" | sh ; 
 done

 This reliably gives me:
 OSError: [Errno 17] File exists:
 '/usr/local/google/home/ehudm/src/beam/sdks/python/.eggs/pytest_runner-5.2-py2.7.egg'

 If I remove this line from setup.py the error is gone:
   setup_requires=['pytest_runner'],


 On Tue, Nov 26, 2019 at 2:54 PM Chad Dombrova 
 wrote:

> Thanks for looking into this. It seems like it might be something
> to do with data that is cached on the Jenkins slaves between runs, 
> which
> may be what prevents this from showing up locally?
>
> If your theory about setuptools is correct, and it sounds likely,
> we should be able to lock down the version, which we should 
> definitely be
> doing for all of our dependencies.
>
> -chad
>
>
>
> On Tue, Nov 26, 2019 at 1:33 PM Ahmet Altay 
> wrote:
>
>> I tried to debug but did not make much progress. I cannot
>> reproduce locally, however all python precommits and postcommits are
>> failing.
>>
>> One guess is, setuptools released a new version that does not
>> support eggs a few 

Re: [DISCUSS] AWS IOs V1 Deprecation Plan

2019-11-26 Thread Reza Rokni
Hi Alexey,

With regards to @Experimental there are a couple of discussions around its
usage ( or rather over usage! ) on dev@. It is something that we need to
clean up ( some of those IO are now being used on production env for
years!).

Cheers

Reza

On Wed, 27 Nov 2019 at 04:50, Luke Cwik  wrote:

> I suggested the wrapper because sometimes the intent of the APIs can be
> translated easily but this is not always the case.
>
> Good to know that it is all marked @Experimental.
>
> On Tue, Nov 26, 2019 at 12:30 PM Cam Mach  wrote:
>
>> Thank you, Alex for sharing the information, and Luke for the questions.
>> I like the idea of just deprecating the V1 IOs and maintaining only the V2
>> IOs, so we can support whoever wants to continue with V1.
>> Just as Alex said, a lot of users, including my teams :-), use the V1
>> IOs in production for real workloads. So it'll be hard to remove the V1 IOs or
>> force them to migrate to V2. But let's hear if there are any other ideas.
>>
>> Btw, making V1 a wrapper around V2 is not very appealing; the code will get
>> more complicated since the V2 API is very different from V1's.
>>
>> Thanks,
>>
>>
>>
>> On Tue, Nov 26, 2019 at 8:21 AM Alexey Romanenko <
>> aromanenko@gmail.com> wrote:
>>
>>> AFAICT, all AWS SDK V1 IOs (SnsIO, SqsIO, DynamoDBIO, KinesisIO) are
>>> marked as "Experimental". So, it should not be a problem to gracefully
>>> deprecate and finally remove them. We already did a similar procedure for
>>> “HadoopInputFormatIO”, which was renamed to just “HadoopFormatIO” (since it
>>> started to support HadoopOutputFormat as well). The old “HadoopInputFormatIO”
>>> was deprecated and removed after *3 consecutive* Beam releases (as we
>>> agreed on the mailing list).
>>>
>>> At the same time, some users, for various reasons, would not be able or
>>> willing to move to AWS SDK V2. So, I’d prefer to just deprecate AWS SDK V1 IOs
>>> and accept new features/fixes *only* for V2 IOs.
>>>
>>> Talking about the “Experimental” annotation: sorry in advance if I missed
>>> that and am switching the subject a bit, but do we have clear rules or an
>>> agreement for when an IO becomes stable and should not be marked as experimental anymore?
>>> *Most* of our Java IOs are marked as Experimental, but many of them have
>>> been used in production by real users under real load. Does it mean that they
>>> are ready to be stable in terms of API? Perhaps, this topic deserves a new
>>> discussion if there are several opinions on that.
>>>
>>> On 26 Nov 2019, at 00:39, Luke Cwik  wrote:
>>>
>>> Phase I sounds fine.
>>>
>>> Apache Beam follows semantic versioning and I believe removing the IOs
>>> will be a backwards incompatible change unless they were marked
>>> experimental which will be a problem for Phase 2.
>>>
>>> What is the feasibility of making the V1 transforms wrappers around V2?
>>>
>>> On Mon, Nov 25, 2019 at 1:46 PM Cam Mach  wrote:
>>>
 Hello Beam Devs,

 I have been working on the migration of Amazon Web Services IO
 connectors into the new AWS SDK for Java V2. The goal is to have an updated
 implementation aligned with the most recent AWS improvements. So far we
 have already migrated the connectors for AWS SNS, SQS and  DynamoDB.

 In the meantime, some contributions are still going into the V1 IOs. So far we
 have dealt with those by porting (or asking contributors to port) the
 changes into the V2 IOs too, because we don’t want features of both versions to
 be unaligned, but this may quickly become a maintenance issue, so we want to
 discuss a plan to stop supporting (deprecate) V1 IOs and encourage users to
 move to V2.

 Phase I (ASAP):

- Mark migrated AWS V1 IOs as deprecated
- Document migration path to V2

 Phase II (end of 2020):

- Decide a date or Beam release to remove the V1 IOs
- Send a notification to the community 3 months before we remove
them
- Completely get rid of V1 IOs


 Please let me know what you think or if you see any potential issues?

 Thanks,
 Cam Mach


>>>



Re: [discuss] Using a logger hierarchy in Python

2019-11-26 Thread Pablo Estrada
Ah I'll try to add this tomorrow before going out for the weekend.
-P.

On Wed, Nov 20, 2019 at 12:15 PM Valentyn Tymofieiev 
wrote:

> Based on my recent debugging experience for
> https://issues.apache.org/jira/browse/BEAM-8651, I think it may be
> helpful to include thread IDs in the log entries, or have an option to
> easily enable this. I imagine that having the process ID may also be helpful in
> other situations.
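A minimal sketch of how the pieces fit together with the standard logging module:
per-module loggers plus a root handler whose format carries process and thread IDs
(the format string is illustrative, not Beam's actual configuration):

    import logging

    # Per-module logger: the name becomes e.g. "apache_beam.io.gcp.bigquery",
    # so levels and handlers can be tuned per subtree of the hierarchy.
    _LOGGER = logging.getLogger(__name__)

    # Root handler; %(process)d and %(thread)d add the IDs mentioned above.
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(process)d %(thread)d %(name)s %(levelname)s %(message)s",
    )

    _LOGGER.info("worker starting")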
>
> On Tue, Nov 19, 2019 at 11:17 AM Chad Dombrova  wrote:
>
>> Pablo, it might be necessary to set up a root logging handler if one does
>> not exist already.  I noticed that a LocalJobServicer that I was testing
>> against stopped emitting tracebacks when I rebased onto the latest from
>> master.  Setting up the root handler fixed it.  I'm still testing this, and
>> I might be misinterpreting what I saw, but I wanted to get eyes on it in
>> case I don't have time to get a definitive answer.
>>
>> -chad
>>
>>
>>
>> On Fri, Nov 15, 2019 at 4:30 PM Pablo Estrada  wrote:
>>
>>> Thanks all,
>>> 2/3 of PRs are merged (using _LOGGER). It should be pretty easy to
>>> switch the variable name to _log via sed.
>>> Best
>>> -P.
>>>
>>> On Fri, Nov 15, 2019 at 2:08 PM Kyle Weaver  wrote:
>>>
 +1 for per-module loggers (what Robert said).

 On Fri, Nov 15, 2019 at 1:48 PM Udi Meiri  wrote:

> +1, but can we use something less verbose and shift-key-heavy than
> _LOGGER like log or _log?
>
> Also please dedupe with these existing bugs:
> https://issues.apache.org/jira/browse/BEAM-3523
> https://issues.apache.org/jira/browse/BEAM-1825
>
> On Thu, Nov 14, 2019 at 8:02 AM Thomas Weise  wrote:
>
>> Awesome, thanks Chad!
>>
>> On Wed, Nov 13, 2019 at 10:26 PM Chad Dombrova 
>> wrote:
>>
>>> Hi Thomas,
>>>
>>>
 Will this include the ability for users to configure logging via
 pipeline options?

>>>
>>> We're working on a proposal to allow pluggable logging handlers that
>>> can be configured via pipeline options.  For example, it would allow 
>>> you to
>>> add a new logging handler for StackDriver or Elasticsearch.  Will 
>>> hopefully
>>> have a document to share soon.
>>>
>>> -chad
>>>
>>>


Re: cython test instability

2019-11-26 Thread Chad Dombrova
Sorry wrong link:  https://github.com/apache/beam/pull/9915



On Tue, Nov 26, 2019 at 5:12 PM Udi Meiri  wrote:

> I looked at #9959 but it doesn't seem to modify setup.py?
> The additional eggs for timeloop etc. are troubling though. Not sure where
> those come from.
>
> On Tue, Nov 26, 2019 at 4:59 PM Chad Dombrova  wrote:
>
>> Is setup_requires being used somewhere else, because I'm still getting
>> errors after removing it from sdks/python/setup.py.
>>
>> I removed it from this PR: https://github.com/apache/beam/pull/9959
>>
>> Here's the gradle scan:
>> https://scans.gradle.com/s/oinh5xpaly3dk/failure#top=0
>>
>> The error shows up differently than before when
>> setup_requires=['pytest-runner'] was present -- it's in a gradle traceback
>> now rather than the console log.  I've also seen different packages listed
>> as the culprit (simplegeneric, timeloop).
>>
>> -chad
>>
>>
>>
>> On Tue, Nov 26, 2019 at 4:47 PM Udi Meiri  wrote:
>>
>>> Chad, I believe the answer is the "setup_requires" line is causing the
>>> sdks/python/.eggs directory to be created.
>>>
>>> This command fails with the setup_requires line (same Errno 17), but
>>> succeeds without it:
>>> $ \rm -r .eggs/; ../../gradlew installGcpTest
>>> [~8 failed tasks]
>>> $ ls .eggs
>>> pytest_runner-5.2-py2.7.egg  pytest_runner-5.2-py3.5.egg
>>>  pytest_runner-5.2-py3.6.egg  pytest_runner-5.2-py3.7.egg  README.txt
>>>
>>> I'll go ahead and create a PR to remove setup_requires from setup.py.
>>>
>>> On Tue, Nov 26, 2019 at 4:16 PM Chad Dombrova  wrote:
>>>
 It seems like the offending packages are those that only have source
 distributions (i.e. no wheels).  But why are the eggs being installed in
 sdks/python/.eggs instead of into the virtualenv created by setupVirtualenv
 gradle task or by tox?


 On Tue, Nov 26, 2019 at 3:59 PM Udi Meiri  wrote:

> Basically, I believe what's happening is that a new Gradle task was
> added that uses setup.py but doesn't have the same dependency on some main
> setup.py task that all others depend on (like sdist).
>
> On Tue, Nov 26, 2019 at 3:49 PM Udi Meiri  wrote:
>
>> Correction: the error is not gone after removing the line. I get
>> instead:
>> error: [Errno 17] File exists:
>> '/usr/local/google/home/ehudm/src/beam/sdks/python/.eggs/dill-0.3.1.1-py2.7.egg'
>>
>>
>> On Tue, Nov 26, 2019 at 3:45 PM Udi Meiri  wrote:
>>
>>> I managed to recreate one of the issues with this command:
>>> ~/src/beam/sdks/python$ \rm -r .eggs/ && for i in $(seq 2); do echo
>>> "python setup.py -q nosetests --tests
>>> apache_beam.pipeline_test:DoFnTest.test_incomparable_default &" | sh ; 
>>> done
>>>
>>> This reliably gives me:
>>> OSError: [Errno 17] File exists:
>>> '/usr/local/google/home/ehudm/src/beam/sdks/python/.eggs/pytest_runner-5.2-py2.7.egg'
>>>
>>> If I remove this line from setup.py the error is gone:
>>>   setup_requires=['pytest_runner'],
>>>
>>>
>>> On Tue, Nov 26, 2019 at 2:54 PM Chad Dombrova 
>>> wrote:
>>>
 Thanks for looking into this. It seems like it might be something
 to do with data that is cached on the Jenkins slaves between runs, 
 which
 may be what prevents this from showing up locally?

 If your theory about setuptools is correct, and it sounds likely,
 we should be able to lock down the version, which we should definitely 
 be
 doing for all of our dependencies.

 -chad



 On Tue, Nov 26, 2019 at 1:33 PM Ahmet Altay 
 wrote:

> I tried to debug but did not make much progress. I cannot
> reproduce locally, however all python precommits and postcommits are
> failing.
>
> One guess is, setuptools released a new version that does not
> support eggs a few days ago, that might be the cause (
> https://github.com/pypa/setuptools/blob/master/CHANGES.rst) but
> that should have reproduced locally.
> Maybe something is wrong with the jenkins machines, and we could
> perhaps bring them to a clean state.
>
> I suspected this being related to pytest somehow (as the first 4
> JIRAs had pytest in the error line) but the error Chad saw is 
> different.
>
> +Valentyn Tymofieiev  and +Yifan Zou
>  could you help with looking into this?
>
>
> Ahmet
>
>
>
> On Tue, Nov 26, 2019 at 9:14 AM Luke Cwik 
> wrote:
>
>> I also started to see this on PRs that I'm reviewing.
>> BEAM-8793, BEAM-8653, BEAM-8631, BEAM-8249 mention issues with 
>> setup.py and
>> egg_info but this looks different then all of those so I filed 
>> BEAM-8831.
>>
>>
>> On Mon, Nov 25, 2019 at 

Re: cython test instability

2019-11-26 Thread Udi Meiri
I looked at #9959 but it doesn't seem to modify setup.py?
The additional eggs for timeloop etc. are troubling though. Not sure where
those come from.

On Tue, Nov 26, 2019 at 4:59 PM Chad Dombrova  wrote:

> Is setup_requires being used somewhere else, because I'm still getting
> errors after removing it from sdks/python/setup.py.
>
> I removed it from this PR: https://github.com/apache/beam/pull/9959
>
> Here's the gradle scan:
> https://scans.gradle.com/s/oinh5xpaly3dk/failure#top=0
>
> The error shows up differently than before when
> setup_requires=['pytest-runner'] was present -- it's in a gradle traceback
> now rather than the console log.  I've also seen different packages listed
> as the culprit (simplegeneric, timeloop).
>
> -chad
>
>
>
> On Tue, Nov 26, 2019 at 4:47 PM Udi Meiri  wrote:
>
>> Chad, I believe the answer is the "setup_requires" line is causing the
>> sdks/python/.eggs directory to be created.
>>
>> This command fails with the setup_requires line (same Errno 17), but
>> succeeds without it:
>> $ \rm -r .eggs/; ../../gradlew installGcpTest
>> [~8 failed tasks]
>> $ ls .eggs
>> pytest_runner-5.2-py2.7.egg  pytest_runner-5.2-py3.5.egg
>>  pytest_runner-5.2-py3.6.egg  pytest_runner-5.2-py3.7.egg  README.txt
>>
>> I'll go ahead and create a PR to remove setup_requires from setup.py.
>>
>> On Tue, Nov 26, 2019 at 4:16 PM Chad Dombrova  wrote:
>>
>>> It seems like the offending packages are those that only have source
>>> distributions (i.e. no wheels).  But why are the eggs being installed in
>>> sdks/python/.eggs instead of into the virtualenv created by setupVirtualenv
>>> gradle task or by tox?
>>>
>>>
>>> On Tue, Nov 26, 2019 at 3:59 PM Udi Meiri  wrote:
>>>
 Basically, I believe what's happening is that a new Gradle task was
 added that uses setup.py but doesn't have the same dependency on some main
 setup.py task that all others depend on (like sdist).

 On Tue, Nov 26, 2019 at 3:49 PM Udi Meiri  wrote:

> Correction: the error is not gone after removing the line. I get
> instead:
> error: [Errno 17] File exists:
> '/usr/local/google/home/ehudm/src/beam/sdks/python/.eggs/dill-0.3.1.1-py2.7.egg'
>
>
> On Tue, Nov 26, 2019 at 3:45 PM Udi Meiri  wrote:
>
>> I managed to recreate one of the issues with this command:
>> ~/src/beam/sdks/python$ \rm -r .eggs/ && for i in $(seq 2); do echo
>> "python setup.py -q nosetests --tests
>> apache_beam.pipeline_test:DoFnTest.test_incomparable_default &" | sh ; 
>> done
>>
>> This reliably gives me:
>> OSError: [Errno 17] File exists:
>> '/usr/local/google/home/ehudm/src/beam/sdks/python/.eggs/pytest_runner-5.2-py2.7.egg'
>>
>> If I remove this line from setup.py the error is gone:
>>   setup_requires=['pytest_runner'],
>>
>>
>> On Tue, Nov 26, 2019 at 2:54 PM Chad Dombrova 
>> wrote:
>>
>>> Thanks for looking into this. It seems like it might be something to
>>> do with data that is cached on the Jenkins slaves between runs, which 
>>> may
>>> be what prevents this from showing up locally?
>>>
>>> If your theory about setuptools is correct, and it sounds likely, we
>>> should be able to lock down the version, which we should definitely be
>>> doing for all of our dependencies.
>>>
>>> -chad
>>>
>>>
>>>
>>> On Tue, Nov 26, 2019 at 1:33 PM Ahmet Altay 
>>> wrote:
>>>
 I tried to debug but did not make much progress. I cannot reproduce
 locally, however all python precommits and postcommits are failing.

 One guess is, setuptools released a new version that does not
 support eggs a few days ago, that might be the cause (
 https://github.com/pypa/setuptools/blob/master/CHANGES.rst) but
 that should have reproduced locally.
 Maybe something is wrong with the jenkins machines, and we could
 perhaps bring them to a clean state.

 I suspected this being related to pytest somehow (as the first 4
 JIRAs had pytest in the error line) but the error Chad saw is 
 different.

 +Valentyn Tymofieiev  and +Yifan Zou
  could you help with looking into this?


 Ahmet



 On Tue, Nov 26, 2019 at 9:14 AM Luke Cwik  wrote:

> I also started to see this on PRs that I'm reviewing.
> BEAM-8793, BEAM-8653, BEAM-8631, BEAM-8249 mention issues with 
> setup.py and
> egg_info but this looks different then all of those so I filed 
> BEAM-8831.
>
>
> On Mon, Nov 25, 2019 at 10:27 PM Chad Dombrova 
> wrote:
>
>> Actually, it looks like I'm getting the same error on multiple
>> PRs: https://scans.gradle.com/s/ihfmrxr7evslw
>>
>>
>>
>>
>> On Mon, Nov 25, 2019 at 10:26 PM Chad 

Re: cython test instability

2019-11-26 Thread Chad Dombrova
Is setup_requires being used somewhere else, because I'm still getting
errors after removing it from sdks/python/setup.py.

I removed it from this PR: https://github.com/apache/beam/pull/9959

Here's the gradle scan:
https://scans.gradle.com/s/oinh5xpaly3dk/failure#top=0

The error shows up differently than before when
setup_requires=['pytest-runner'] was present -- it's in a gradle traceback
now rather than the console log.  I've also seen different packages listed
as the culprit (simplegeneric, timeloop).

-chad



On Tue, Nov 26, 2019 at 4:47 PM Udi Meiri  wrote:

> Chad, I believe the answer is the "setup_requires" line is causing the
> sdks/python/.eggs directory to be created.
>
> This command fails with the setup_requires line (same Errno 17), but
> succeeds without it:
> $ \rm -r .eggs/; ../../gradlew installGcpTest
> [~8 failed tasks]
> $ ls .eggs
> pytest_runner-5.2-py2.7.egg  pytest_runner-5.2-py3.5.egg
>  pytest_runner-5.2-py3.6.egg  pytest_runner-5.2-py3.7.egg  README.txt
>
> I'll go ahead and create a PR to remove setup_requires from setup.py.
>
> On Tue, Nov 26, 2019 at 4:16 PM Chad Dombrova  wrote:
>
>> It seems like the offending packages are those that only have source
>> distributions (i.e. no wheels).  But why are the eggs being installed in
>> sdks/python/.eggs instead of into the virtualenv created by setupVirtualenv
>> gradle task or by tox?
>>
>>
>> On Tue, Nov 26, 2019 at 3:59 PM Udi Meiri  wrote:
>>
>>> Basically, I believe what's happening is that a new Gradle task was
>>> added that uses setup.py but doesn't have the same dependency on some main
>>> setup.py task that all others depend on (like sdist).
>>>
>>> On Tue, Nov 26, 2019 at 3:49 PM Udi Meiri  wrote:
>>>
 Correction: the error is not gone after removing the line. I get
 instead:
 error: [Errno 17] File exists:
 '/usr/local/google/home/ehudm/src/beam/sdks/python/.eggs/dill-0.3.1.1-py2.7.egg'


 On Tue, Nov 26, 2019 at 3:45 PM Udi Meiri  wrote:

> I managed to recreate one of the issues with this command:
> ~/src/beam/sdks/python$ \rm -r .eggs/ && for i in $(seq 2); do echo
> "python setup.py -q nosetests --tests
> apache_beam.pipeline_test:DoFnTest.test_incomparable_default &" | sh ; 
> done
>
> This reliably gives me:
> OSError: [Errno 17] File exists:
> '/usr/local/google/home/ehudm/src/beam/sdks/python/.eggs/pytest_runner-5.2-py2.7.egg'
>
> If I remove this line from setup.py the error is gone:
>   setup_requires=['pytest_runner'],
>
>
> On Tue, Nov 26, 2019 at 2:54 PM Chad Dombrova 
> wrote:
>
>> Thanks for looking into this. It seems like it might be something to
>> do with data that is cached on the Jenkins slaves between runs, which may
>> be what prevents this from showing up locally?
>>
>> If your theory about setuptools is correct, and it sounds likely, we
>> should be able to lock down the version, which we should definitely be
>> doing for all of our dependencies.
>>
>> -chad
>>
>>
>>
>> On Tue, Nov 26, 2019 at 1:33 PM Ahmet Altay  wrote:
>>
>>> I tried to debug but did not make much progress. I cannot reproduce
>>> locally, however all python precommits and postcommits are failing.
>>>
>>> One guess is, setuptools released a new version that does not
>>> support eggs a few days ago, that might be the cause (
>>> https://github.com/pypa/setuptools/blob/master/CHANGES.rst) but
>>> that should have reproduced locally.
>>> Maybe something is wrong with the jenkins machines, and we could
>>> perhaps bring them to a clean state.
>>>
>>> I suspected this being related to pytest somehow (as the first 4
>>> JIRAs had pytest in the error line) but the error Chad saw is different.
>>>
>>> +Valentyn Tymofieiev  and +Yifan Zou
>>>  could you help with looking into this?
>>>
>>>
>>> Ahmet
>>>
>>>
>>>
>>> On Tue, Nov 26, 2019 at 9:14 AM Luke Cwik  wrote:
>>>
 I also started to see this on PRs that I'm reviewing.
 BEAM-8793, BEAM-8653, BEAM-8631, BEAM-8249 mention issues with 
 setup.py and
 egg_info but this looks different then all of those so I filed 
 BEAM-8831.


 On Mon, Nov 25, 2019 at 10:27 PM Chad Dombrova 
 wrote:

> Actually, it looks like I'm getting the same error on multiple
> PRs: https://scans.gradle.com/s/ihfmrxr7evslw
>
>
>
>
> On Mon, Nov 25, 2019 at 10:26 PM Chad Dombrova 
> wrote:
>
>> Hi all,
>> The cython tests started failing on one of my PRs which were
>> succeeding before.   The error is one that I've never seen before
>> (separated onto different lines to make it easier to read):
>>
>> Caused by: org.gradle.api.GradleException:
>> Could not copy 

Re: cython test instability

2019-11-26 Thread Udi Meiri
Chad, I believe the answer is that the "setup_requires" line is causing the
sdks/python/.eggs directory to be created.

This command fails with the setup_requires line (same Errno 17), but
succeeds without it:
$ \rm -r .eggs/; ../../gradlew installGcpTest
[~8 failed tasks]
$ ls .eggs
pytest_runner-5.2-py2.7.egg  pytest_runner-5.2-py3.5.egg
 pytest_runner-5.2-py3.6.egg  pytest_runner-5.2-py3.7.egg  README.txt

I'll go ahead and create a PR to remove setup_requires from setup.py.

On Tue, Nov 26, 2019 at 4:16 PM Chad Dombrova  wrote:

> It seems like the offending packages are those that only have source
> distributions (i.e. no wheels).  But why are the eggs being installed in
> sdks/python/.eggs instead of into the virtualenv created by setupVirtualenv
> gradle task or by tox?
>
>
> On Tue, Nov 26, 2019 at 3:59 PM Udi Meiri  wrote:
>
>> Basically, I believe what's happening is that a new Gradle task was added
>> that uses setup.py but doesn't have the same dependency on some main
>> setup.py task that all others depend on (like sdist).
>>
>> On Tue, Nov 26, 2019 at 3:49 PM Udi Meiri  wrote:
>>
>>> Correction: the error is not gone after removing the line. I get instead:
>>> error: [Errno 17] File exists:
>>> '/usr/local/google/home/ehudm/src/beam/sdks/python/.eggs/dill-0.3.1.1-py2.7.egg'
>>>
>>>
>>> On Tue, Nov 26, 2019 at 3:45 PM Udi Meiri  wrote:
>>>
 I managed to recreate one of the issues with this command:
 ~/src/beam/sdks/python$ \rm -r .eggs/ && for i in $(seq 2); do echo
 "python setup.py -q nosetests --tests
 apache_beam.pipeline_test:DoFnTest.test_incomparable_default &" | sh ; done

 This reliably gives me:
 OSError: [Errno 17] File exists:
 '/usr/local/google/home/ehudm/src/beam/sdks/python/.eggs/pytest_runner-5.2-py2.7.egg'

 If I remove this line from setup.py the error is gone:
   setup_requires=['pytest_runner'],


 On Tue, Nov 26, 2019 at 2:54 PM Chad Dombrova 
 wrote:

> Thanks for looking into this. It seems like it might be something to
> do with data that is cached on the Jenkins slaves between runs, which may
> be what prevents this from showing up locally?
>
> If your theory about setuptools is correct, and it sounds likely, we
> should be able to lock down the version, which we should definitely be
> doing for all of our dependencies.
>
> -chad
>
>
>
> On Tue, Nov 26, 2019 at 1:33 PM Ahmet Altay  wrote:
>
>> I tried to debug but did not make much progress. I cannot reproduce
>> locally, however all python precommits and postcommits are failing.
>>
>> One guess is, setuptools released a new version that does not support
>> eggs a few days ago, that might be the cause (
>> https://github.com/pypa/setuptools/blob/master/CHANGES.rst) but that
>> should have reproduced locally.
>> Maybe something is wrong with the jenkins machines, and we could
>> perhaps bring them to a clean state.
>>
>> I suspected this being related to pytest somehow (as the first 4
>> JIRAs had pytest in the error line) but the error Chad saw is different.
>>
>> +Valentyn Tymofieiev  and +Yifan Zou
>>  could you help with looking into this?
>>
>>
>> Ahmet
>>
>>
>>
>> On Tue, Nov 26, 2019 at 9:14 AM Luke Cwik  wrote:
>>
>>> I also started to see this on PRs that I'm reviewing.
>>> BEAM-8793, BEAM-8653, BEAM-8631, BEAM-8249 mention issues with setup.py 
>>> and
>>> egg_info but this looks different then all of those so I filed 
>>> BEAM-8831.
>>>
>>>
>>> On Mon, Nov 25, 2019 at 10:27 PM Chad Dombrova 
>>> wrote:
>>>
 Actually, it looks like I'm getting the same error on multiple PRs:
 https://scans.gradle.com/s/ihfmrxr7evslw




 On Mon, Nov 25, 2019 at 10:26 PM Chad Dombrova 
 wrote:

> Hi all,
> The cython tests started failing on one of my PRs which were
> succeeding before.   The error is one that I've never seen before
> (separated onto different lines to make it easier to read):
>
> Caused by: org.gradle.api.GradleException:
> Could not copy file
>
> '/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit@2
> /src/sdks/python/.eggs/simplegeneric-0.8.1-py2.7.egg'
> to
>
> '/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit@2
> /src/sdks/python/test-suites/tox/py2/build/srcs/sdks/python/.eggs/simplegeneric-0.8.1-py2.7.egg'.
>
> Followed immediately by an error about could not create a
> directory of the same name.  Here's the gradle scan:
>
>
> https://scans.gradle.com/s/ihfmrxr7evslw/failure?openFailures=WzFd=WzZd#top=0
>
> Any ideas?
>
> -chad
>
>
>
>

Re: real real-time beam

2019-11-26 Thread Kenneth Knowles
On Tue, Nov 26, 2019 at 1:00 AM Jan Lukavský  wrote:

> > I will not try to formalize this notion in this email. But I will note
> that since it is universally assured, it would be zero cost and
> significantly safer to formalize it and add an annotation noting it was
> required. It has nothing to do with event time ordering, only trigger
> firing ordering.
>
> I cannot agree with the last sentence (and I'm really not doing this on
> purpose :-)). Panes generally arrive out of order, as mentioned several
> times in the discussions linked from this thread. If we want to ensure
> "trigger firing ordering", we can use the pane index, that is correct. But
> - that is actually equivalent to sorting by event time, because pane index
> order will be (nearly) the same as event time order. This is due to the
> fact that pane index and event time correlate (both are monotonic).
>
Trigger firings can have decreasing event timestamps w/ the minimum
timestamp combiner*. I do think the issue at hand is best analyzed in terms
of the explicit ordering on panes. And I do think we need to have an
explicit guarantee or annotation strong enough to describe a
correct-under-all-allowed runners sink. Today an antagonistic runner could
probably break a lot of things.

Kenn

*In fact, they can decrease via the "maximum" timestamp combiner because
actually timestamp combiners only apply to the elements in that particular
pane. This is weird, and maybe a design bug, but good to know about.


> The pane index "only" solves the issue of preserving ordering even in case
> where there are multiple firings within the same timestamp (regardless of
> granularity). This was mentioned in the initial discussion about event time
> ordering, and is part of the design doc - users should be allowed to
> provide UDF for extracting time-correlated ordering field (which means
> ability to choose a preferred, or authoritative, observer which assigns
> unambiguous ordering to events). Example of this might include Kafka
> offsets as well, or any queue index for that matter. This is not yet
> implemented, but could (should) be in the future.
>
> The only case where these two things are (somewhat) different is the case
> mentioned by @Steve - if the output is a stateless ParDo, which will get
> fused. But that is only because the processing is single-threaded per key,
> and therefore the ordering is implied by timer ordering (and careful here,
> many runners don't have this ordering 100% correct, as of now - this
> problem luckily appears only when there are multiple timers per key).
> Moreover, if there should be a failure, then the output might (would) get
> back in time anyway. If there would be a shuffle operation after
> GBK/Combine, then the ordering is no longer guaranteed and must be
> explicitly taken care of.
>
> Last note, I must agree with @Rui that all these discussions are very much
> related to retractions (precisely the ability to implement them).
>
> Jan
> On 11/26/19 7:34 AM, Kenneth Knowles wrote:
>
> Hi Aaron,
>
> Another insightful observation.
>
> Whenever an aggregation (GBK / Combine per key) has a trigger firing,
> there is a per-key sequence number attached. It is included in metadata
> known as "PaneInfo" [1]. The value of PaneInfo.getIndex() is colloquially
> referred to as the "pane index". You can also make use of the "on time
> index" if you like. The best way to access this metadata is to add a
> parameter of type PaneInfo to your DoFn's @ProcessElement method. This
> works for stateful or stateless DoFn.
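A short sketch of the same pattern in the Python SDK (an assumption that the Python
analogue is beam.DoFn.PaneInfoParam with its index attribute; the Java route is the
PaneInfo parameter described above):

    import apache_beam as beam

    class TagWithPaneIndex(beam.DoFn):
        # Declaring a PaneInfoParam default asks the runner to pass in the pane metadata.
        def process(self, element, pane_info=beam.DoFn.PaneInfoParam):
            key, value = element
            # pane_info.index is the per-key sequence number of this trigger firing.
            yield key, (pane_info.index, value)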
>
> Most of Beam's IO connectors do not explicitly enforce that outputs occur
> in pane index order but instead rely on the hope that the runner delivers
> panes in order to the sink. IMO this is dangerous but it has not yet caused
> a known issue. In practice, each "input key to output key 'path' " through
> a pipeline's logic does preserve order for all existing runners AFAIK and
> it is the formalization that is missing. It is related to an observation by 
> +Rui
> Wang  that processing retractions requires the same
> key-to-key ordering.
>
> I will not try to formalize this notion in this email. But I will note
> that since it is universally assured, it would be zero cost and
> significantly safer to formalize it and add an annotation noting it was
> required. It has nothing to do with event time ordering, only trigger
> firing ordering.
>
> Kenn
>
> [1]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/windowing/PaneInfo.java
> [2]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/DoFn.java#L557
>
>
> On Mon, Nov 25, 2019 at 4:06 PM Pablo Estrada  wrote:
>
>> The blog posts on stateful and timely computation with Beam should help
>> clarify a lot about how to use state and timers to do this:
>> https://beam.apache.org/blog/2017/02/13/stateful-processing.html
>> 

Re: cython test instability

2019-11-26 Thread Chad Dombrova
It seems like the offending packages are those that only have source
distributions (i.e. no wheels).  But why are the eggs being installed in
sdks/python/.eggs instead of into the virtualenv created by setupVirtualenv
gradle task or by tox?


On Tue, Nov 26, 2019 at 3:59 PM Udi Meiri  wrote:

> Basically, I believe what's happening is that a new Gradle task was added
> that uses setup.py but doesn't have the same dependency on some main
> setup.py task that all others depend on (like sdist).
>
> On Tue, Nov 26, 2019 at 3:49 PM Udi Meiri  wrote:
>
>> Correction: the error is not gone after removing the line. I get instead:
>> error: [Errno 17] File exists:
>> '/usr/local/google/home/ehudm/src/beam/sdks/python/.eggs/dill-0.3.1.1-py2.7.egg'
>>
>>
>> On Tue, Nov 26, 2019 at 3:45 PM Udi Meiri  wrote:
>>
>>> I managed to recreate one of the issues with this command:
>>> ~/src/beam/sdks/python$ \rm -r .eggs/ && for i in $(seq 2); do echo
>>> "python setup.py -q nosetests --tests
>>> apache_beam.pipeline_test:DoFnTest.test_incomparable_default &" | sh ; done
>>>
>>> This reliably gives me:
>>> OSError: [Errno 17] File exists:
>>> '/usr/local/google/home/ehudm/src/beam/sdks/python/.eggs/pytest_runner-5.2-py2.7.egg'
>>>
>>> If I remove this line from setup.py the error is gone:
>>>   setup_requires=['pytest_runner'],
>>>
>>>
>>> On Tue, Nov 26, 2019 at 2:54 PM Chad Dombrova  wrote:
>>>
 Thanks for looking into this. It seems like it might be something to do
 with data that is cached on the Jenkins slaves between runs, which may be
 what prevents this from showing up locally?

 If your theory about setuptools is correct, and it sounds likely, we
 should be able to lock down the version, which we should definitely be
 doing for all of our dependencies.

 -chad



 On Tue, Nov 26, 2019 at 1:33 PM Ahmet Altay  wrote:

> I tried to debug but did not make much progress. I cannot reproduce
> locally, however all python precommits and postcommits are failing.
>
> One guess is, setuptools released a new version that does not support
> eggs a few days ago, that might be the cause (
> https://github.com/pypa/setuptools/blob/master/CHANGES.rst) but that
> should have reproduced locally.
> Maybe something is wrong with the jenkins machines, and we could
> perhaps bring them to a clean state.
>
> I suspected this being related to pytest somehow (as the first 4 JIRAs
> had pytest in the error line) but the error Chad saw is different.
>
> +Valentyn Tymofieiev  and +Yifan Zou
>  could you help with looking into this?
>
>
> Ahmet
>
>
>
> On Tue, Nov 26, 2019 at 9:14 AM Luke Cwik  wrote:
>
>> I also started to see this on PRs that I'm reviewing.
>> BEAM-8793, BEAM-8653, BEAM-8631, BEAM-8249 mention issues with setup.py 
>> and
>> egg_info but this looks different then all of those so I filed BEAM-8831.
>>
>>
>> On Mon, Nov 25, 2019 at 10:27 PM Chad Dombrova 
>> wrote:
>>
>>> Actually, it looks like I'm getting the same error on multiple PRs:
>>> https://scans.gradle.com/s/ihfmrxr7evslw
>>>
>>>
>>>
>>>
>>> On Mon, Nov 25, 2019 at 10:26 PM Chad Dombrova 
>>> wrote:
>>>
 Hi all,
 The cython tests started failing on one of my PRs which were
 succeeding before.   The error is one that I've never seen before
 (separated onto different lines to make it easier to read):

 Caused by: org.gradle.api.GradleException:
 Could not copy file

 '/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit@2
 /src/sdks/python/.eggs/simplegeneric-0.8.1-py2.7.egg'
 to

 '/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit@2
 /src/sdks/python/test-suites/tox/py2/build/srcs/sdks/python/.eggs/simplegeneric-0.8.1-py2.7.egg'.

 Followed immediately by an error about could not create a directory
 of the same name.  Here's the gradle scan:


 https://scans.gradle.com/s/ihfmrxr7evslw/failure?openFailures=WzFd=WzZd#top=0

 Any ideas?

 -chad







Re: cython test instability

2019-11-26 Thread Udi Meiri
Basically, I believe what's happening is that a new Gradle task was added
that uses setup.py but doesn't have the same dependency on some main
setup.py task that all others depend on (like sdist).

On Tue, Nov 26, 2019 at 3:49 PM Udi Meiri  wrote:

> Correction: the error is not gone after removing the line. I get instead:
> error: [Errno 17] File exists:
> '/usr/local/google/home/ehudm/src/beam/sdks/python/.eggs/dill-0.3.1.1-py2.7.egg'
>
>
> On Tue, Nov 26, 2019 at 3:45 PM Udi Meiri  wrote:
>
>> I managed to recreate one of the issues with this command:
>> ~/src/beam/sdks/python$ \rm -r .eggs/ && for i in $(seq 2); do echo
>> "python setup.py -q nosetests --tests
>> apache_beam.pipeline_test:DoFnTest.test_incomparable_default &" | sh ; done
>>
>> This reliably gives me:
>> OSError: [Errno 17] File exists:
>> '/usr/local/google/home/ehudm/src/beam/sdks/python/.eggs/pytest_runner-5.2-py2.7.egg'
>>
>> If I remove this line from setup.py the error is gone:
>>   setup_requires=['pytest_runner'],
>>
>>
>> On Tue, Nov 26, 2019 at 2:54 PM Chad Dombrova  wrote:
>>
>>> Thanks for looking into this. It seems like it might be something to do
>>> with data that is cached on the Jenkins slaves between runs, which may be
>>> what prevents this from showing up locally?
>>>
>>> If your theory about setuptools is correct, and it sounds likely, we
>>> should be able to lock down the version, which we should definitely be
>>> doing for all of our dependencies.
>>>
>>> -chad
>>>
>>>
>>>
>>> On Tue, Nov 26, 2019 at 1:33 PM Ahmet Altay  wrote:
>>>
 I tried to debug but did not make much progress. I cannot reproduce
 locally, however all python precommits and postcommits are failing.

 One guess is, setuptools released a new version that does not support
 eggs a few days ago, that might be the cause (
 https://github.com/pypa/setuptools/blob/master/CHANGES.rst) but that
 should have reproduced locally.
 Maybe something is wrong with the jenkins machines, and we could
 perhaps bring them to a clean state.

 I suspected this being related to pytest somehow (as the first 4 JIRAs
 had pytest in the error line) but the error Chad saw is different.

 +Valentyn Tymofieiev  and +Yifan Zou
  could you help with looking into this?


 Ahmet



 On Tue, Nov 26, 2019 at 9:14 AM Luke Cwik  wrote:

> I also started to see this on PRs that I'm reviewing.
> BEAM-8793, BEAM-8653, BEAM-8631, BEAM-8249 mention issues with setup.py 
> and
> egg_info but this looks different then all of those so I filed BEAM-8831.
>
>
> On Mon, Nov 25, 2019 at 10:27 PM Chad Dombrova 
> wrote:
>
>> Actually, it looks like I'm getting the same error on multiple PRs:
>> https://scans.gradle.com/s/ihfmrxr7evslw
>>
>>
>>
>>
>> On Mon, Nov 25, 2019 at 10:26 PM Chad Dombrova 
>> wrote:
>>
>>> Hi all,
>>> The cython tests started failing on one of my PRs which were
>>> succeeding before.   The error is one that I've never seen before
>>> (separated onto different lines to make it easier to read):
>>>
>>> Caused by: org.gradle.api.GradleException:
>>> Could not copy file
>>> '/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit@2
>>> /src/sdks/python/.eggs/simplegeneric-0.8.1-py2.7.egg'
>>> to
>>> '/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit@2
>>> /src/sdks/python/test-suites/tox/py2/build/srcs/sdks/python/.eggs/simplegeneric-0.8.1-py2.7.egg'.
>>>
>>> Followed immediately by an error about could not create a directory
>>> of the same name.  Here's the gradle scan:
>>>
>>>
>>> https://scans.gradle.com/s/ihfmrxr7evslw/failure?openFailures=WzFd=WzZd#top=0
>>>
>>> Any ideas?
>>>
>>> -chad
>>>
>>>
>>>
>>>
>>>




Re: cython test instability

2019-11-26 Thread Udi Meiri
Correction: the error is not gone after removing the line. I get instead:
error: [Errno 17] File exists:
'/usr/local/google/home/ehudm/src/beam/sdks/python/.eggs/dill-0.3.1.1-py2.7.egg'


On Tue, Nov 26, 2019 at 3:45 PM Udi Meiri  wrote:

> I managed to recreate one of the issues with this command:
> ~/src/beam/sdks/python$ \rm -r .eggs/ && for i in $(seq 2); do echo
> "python setup.py -q nosetests --tests
> apache_beam.pipeline_test:DoFnTest.test_incomparable_default &" | sh ; done
>
> This reliably gives me:
> OSError: [Errno 17] File exists:
> '/usr/local/google/home/ehudm/src/beam/sdks/python/.eggs/pytest_runner-5.2-py2.7.egg'
>
> If I remove this line from setup.py the error is gone:
>   setup_requires=['pytest_runner'],
>
>
> On Tue, Nov 26, 2019 at 2:54 PM Chad Dombrova  wrote:
>
>> Thanks for looking into this. It seems like it might be something to do
>> with data that is cached on the Jenkins slaves between runs, which may be
>> what prevents this from showing up locally?
>>
>> If your theory about setuptools is correct, and it sounds likely, we
>> should be able to lock down the version, which we should definitely be
>> doing for all of our dependencies.
>>
>> -chad
>>
>>
>>
>> On Tue, Nov 26, 2019 at 1:33 PM Ahmet Altay  wrote:
>>
>>> I tried to debug but did not make much progress. I cannot reproduce
>>> locally, however all python precommits and postcommits are failing.
>>>
>>> One guess is, setuptools released a new version that does not support
>>> eggs a few days ago, that might be the cause (
>>> https://github.com/pypa/setuptools/blob/master/CHANGES.rst) but that
>>> should have reproduced locally.
>>> Maybe something is wrong with the jenkins machines, and we could perhaps
>>> bring them to a clean state.
>>>
>>> I suspected this being related to pytest somehow (as the first 4 JIRAs
>>> had pytest in the error line) but the error Chad saw is different.
>>>
>>> +Valentyn Tymofieiev  and +Yifan Zou
>>>  could you help with looking into this?
>>>
>>>
>>> Ahmet
>>>
>>>
>>>
>>> On Tue, Nov 26, 2019 at 9:14 AM Luke Cwik  wrote:
>>>
 I also started to see this on PRs that I'm reviewing.
 BEAM-8793, BEAM-8653, BEAM-8631, BEAM-8249 mention issues with setup.py and
 egg_info but this looks different than all of those so I filed BEAM-8831.


 On Mon, Nov 25, 2019 at 10:27 PM Chad Dombrova 
 wrote:

> Actually, it looks like I'm getting the same error on multiple PRs:
> https://scans.gradle.com/s/ihfmrxr7evslw
>
>
>
>
> On Mon, Nov 25, 2019 at 10:26 PM Chad Dombrova 
> wrote:
>
>> Hi all,
>> The cython tests started failing on one of my PRs which were
>> succeeding before.   The error is one that I've never seen before
>> (separated onto different lines to make it easier to read):
>>
>> Caused by: org.gradle.api.GradleException:
>> Could not copy file
>> '/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit@2
>> /src/sdks/python/.eggs/simplegeneric-0.8.1-py2.7.egg'
>> to
>> '/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit@2
>> /src/sdks/python/test-suites/tox/py2/build/srcs/sdks/python/.eggs/simplegeneric-0.8.1-py2.7.egg'.
>>
>> Followed immediately by an error about could not create a directory
>> of the same name.  Here's the gradle scan:
>>
>>
>> https://scans.gradle.com/s/ihfmrxr7evslw/failure?openFailures=WzFd=WzZd#top=0
>>
>> Any ideas?
>>
>> -chad
>>
>>
>>
>>
>>




Re: cython test instability

2019-11-26 Thread Udi Meiri
I managed to recreate one of the issues with this command:
~/src/beam/sdks/python$ \rm -r .eggs/ && for i in $(seq 2); do echo "python
setup.py -q nosetests --tests
apache_beam.pipeline_test:DoFnTest.test_incomparable_default &" | sh ; done

This reliably gives me:
OSError: [Errno 17] File exists:
'/usr/local/google/home/ehudm/src/beam/sdks/python/.eggs/pytest_runner-5.2-py2.7.egg'

If I remove this line from setup.py the error is gone:
  setup_requires=['pytest_runner'],


On Tue, Nov 26, 2019 at 2:54 PM Chad Dombrova  wrote:

> Thanks for looking into this. It seems like it might be something to do
> with data that is cached on the Jenkins slaves between runs, which may be
> what prevents this from showing up locally?
>
> If your theory about setuptools is correct, and it sounds likely, we
> should be able to lock down the version, which we should definitely be
> doing for all of our dependencies.
>
> -chad
>
>
>
> On Tue, Nov 26, 2019 at 1:33 PM Ahmet Altay  wrote:
>
>> I tried to debug but did not make much progress. I cannot reproduce
>> locally, however all python precommits and postcommits are failing.
>>
>> One guess is, setuptools released a new version that does not support
>> eggs a few days ago, that might be the cause (
>> https://github.com/pypa/setuptools/blob/master/CHANGES.rst) but that
>> should have reproduced locally.
>> Maybe something is wrong with the jenkins machines, and we could perhaps
>> bring them to a clean state.
>>
>> I suspected this being related to pytest somehow (as the first 4 JIRAs
>> had pytest in the error line) but the error Chad saw is different.
>>
>> +Valentyn Tymofieiev  and +Yifan Zou
>>  could you help with looking into this?
>>
>>
>> Ahmet
>>
>>
>>
>> On Tue, Nov 26, 2019 at 9:14 AM Luke Cwik  wrote:
>>
>>> I also started to see this on PRs that I'm reviewing.
>>> BEAM-8793, BEAM-8653, BEAM-8631, BEAM-8249 mention issues with setup.py and
>>> egg_info but this looks different than all of those so I filed BEAM-8831.
>>>
>>>
>>> On Mon, Nov 25, 2019 at 10:27 PM Chad Dombrova 
>>> wrote:
>>>
 Actually, it looks like I'm getting the same error on multiple PRs:
 https://scans.gradle.com/s/ihfmrxr7evslw




 On Mon, Nov 25, 2019 at 10:26 PM Chad Dombrova 
 wrote:

> Hi all,
> The cython tests started failing on one of my PRs which were
> succeeding before.   The error is one that I've never seen before
> (separated onto different lines to make it easier to read):
>
> Caused by: org.gradle.api.GradleException:
> Could not copy file
> '/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit@2
> /src/sdks/python/.eggs/simplegeneric-0.8.1-py2.7.egg'
> to
> '/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit@2
> /src/sdks/python/test-suites/tox/py2/build/srcs/sdks/python/.eggs/simplegeneric-0.8.1-py2.7.egg'.
>
> Followed immediately by an error about could not create a directory of
> the same name.  Here's the gradle scan:
>
>
> https://scans.gradle.com/s/ihfmrxr7evslw/failure?openFailures=WzFd=WzZd#top=0
>
> Any ideas?
>
> -chad
>
>
>
>
>




Re: cython test instability

2019-11-26 Thread Chad Dombrova
Thanks for looking into this. It seems like it might be something to do
with data that is cached on the Jenkins slaves between runs, which may be
what prevents this from showing up locally?

If your theory about setuptools is correct, and it sounds likely, we should
be able to lock down the version, which we should definitely be doing for
all of our dependencies.

-chad



On Tue, Nov 26, 2019 at 1:33 PM Ahmet Altay  wrote:

> I tried to debug but did not make much progress. I cannot reproduce
> locally, however all python precommits and postcommits are failing.
>
> One guess is, setuptools released a new version that does not support eggs
> a few days ago, that might be the cause (
> https://github.com/pypa/setuptools/blob/master/CHANGES.rst) but that
> should have reproduced locally.
> Maybe something is wrong with the jenkins machines, and we could perhaps
> bring them to a clean state.
>
> I suspected this being related to pytest somehow (as the first 4 JIRAs had
> pytest in the error line) but the error Chad saw is different.
>
> +Valentyn Tymofieiev  and +Yifan Zou
>  could you help with looking into this?
>
>
> Ahmet
>
>
>
> On Tue, Nov 26, 2019 at 9:14 AM Luke Cwik  wrote:
>
>> I also started to see this on PRs that I'm reviewing.
>> BEAM-8793, BEAM-8653, BEAM-8631, BEAM-8249 mention issues with setup.py and
>> egg_info but this looks different than all of those so I filed BEAM-8831.
>>
>>
>> On Mon, Nov 25, 2019 at 10:27 PM Chad Dombrova  wrote:
>>
>>> Actually, it looks like I'm getting the same error on multiple PRs:
>>> https://scans.gradle.com/s/ihfmrxr7evslw
>>>
>>>
>>>
>>>
>>> On Mon, Nov 25, 2019 at 10:26 PM Chad Dombrova 
>>> wrote:
>>>
 Hi all,
 The cython tests started failing on one of my PRs which were succeeding
 before.   The error is one that I've never seen before (separated onto
 different lines to make it easier to read):

 Caused by: org.gradle.api.GradleException:
 Could not copy file
 '/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit@2
 /src/sdks/python/.eggs/simplegeneric-0.8.1-py2.7.egg'
 to
 '/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit@2
 /src/sdks/python/test-suites/tox/py2/build/srcs/sdks/python/.eggs/simplegeneric-0.8.1-py2.7.egg'.

 Followed immediately by an error about could not create a directory of
 the same name.  Here's the gradle scan:


 https://scans.gradle.com/s/ihfmrxr7evslw/failure?openFailures=WzFd=WzZd#top=0

 Any ideas?

 -chad







Re: cython test instability

2019-11-26 Thread Ahmet Altay
I tried to debug but did not make much progress. I cannot reproduce
locally; however, all python precommits and postcommits are failing.

One guess is that setuptools released a new version a few days ago that does
not support eggs, and that might be the cause (
https://github.com/pypa/setuptools/blob/master/CHANGES.rst), but that should
have reproduced locally.
Maybe something is wrong with the jenkins machines, and we could perhaps
bring them to a clean state.

I suspected this was related to pytest somehow (as the first 4 JIRAs had
pytest in the error line) but the error Chad saw is different.

+Valentyn Tymofieiev  and +Yifan Zou
 could you help with looking into this?

Ahmet



On Tue, Nov 26, 2019 at 9:14 AM Luke Cwik  wrote:

> I also started to see this on PRs that I'm reviewing.
> BEAM-8793, BEAM-8653, BEAM-8631, BEAM-8249 mention issues with setup.py and
> egg_info but this looks different than all of those so I filed BEAM-8831.
>
>
> On Mon, Nov 25, 2019 at 10:27 PM Chad Dombrova  wrote:
>
>> Actually, it looks like I'm getting the same error on multiple PRs:
>> https://scans.gradle.com/s/ihfmrxr7evslw
>>
>>
>>
>>
>> On Mon, Nov 25, 2019 at 10:26 PM Chad Dombrova  wrote:
>>
>>> Hi all,
>>> The cython tests started failing on one of my PRs which were succeeding
>>> before.   The error is one that I've never seen before (separated onto
>>> different lines to make it easier to read):
>>>
>>> Caused by: org.gradle.api.GradleException:
>>> Could not copy file
>>> '/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit@2
>>> /src/sdks/python/.eggs/simplegeneric-0.8.1-py2.7.egg'
>>> to
>>> '/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit@2
>>> /src/sdks/python/test-suites/tox/py2/build/srcs/sdks/python/.eggs/simplegeneric-0.8.1-py2.7.egg'.
>>>
>>> Followed immediately by an error about could not create a directory of
>>> the same name.  Here's the gradle scan:
>>>
>>>
>>> https://scans.gradle.com/s/ihfmrxr7evslw/failure?openFailures=WzFd=WzZd#top=0
>>>
>>> Any ideas?
>>>
>>> -chad
>>>
>>>
>>>
>>>
>>>


Re: [DISCUSS] AWS IOs V1 Deprecation Plan

2019-11-26 Thread Luke Cwik
I suggested the wrapper because sometimes the intent of the APIs can be
translated easily but this is not always the case.

Good to know that it is all marked @Experimental.
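
To make that concrete, a bare-bones sketch of the wrapper idea (every name below,
such as SnsIOV1Compat, V1PublishRequest, V2PublishRequest and V2SnsWrite, is a
placeholder for this example, not an actual Beam class):

import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PDone;

/** Sketch only: a deprecated V1-style entry point that delegates to a V2-style transform. */
@Deprecated
public class SnsIOV1Compat {

  public static Write write() {
    return new Write();
  }

  @Deprecated
  public static class Write extends PTransform<PCollection<V1PublishRequest>, PDone> {
    private String topicName;

    public Write withTopicName(String topicName) {
      this.topicName = topicName;
      return this;
    }

    @Override
    public PDone expand(PCollection<V1PublishRequest> input) {
      // The request types differ between the two SDKs, so the wrapper has to convert
      // every element before handing it to the V2 transform. This conversion layer is
      // where things get complicated once the V1 and V2 APIs diverge.
      return input
          .apply("ConvertToV2",
              MapElements.via(new SimpleFunction<V1PublishRequest, V2PublishRequest>() {
                @Override
                public V2PublishRequest apply(V1PublishRequest request) {
                  return new V2PublishRequest(request.getMessage(), request.getSubject());
                }
              }))
          .apply("WriteViaV2", new V2SnsWrite(topicName));
    }
  }
}

Where the V1 and V2 configuration surfaces map 1:1 this is mostly boilerplate;
where they don't, the wrapper ends up leaking V2 concepts, which is the "not
always the case" part.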

On Tue, Nov 26, 2019 at 12:30 PM Cam Mach  wrote:

> Thank you, Alex for sharing the information, and Luke for the questions.
> I like the idea that just depreciate the V1 IOs, and just maintain V2 IOs,
> so we can support whoever want continue with V1.
> Just as Alex said, a lot of users, including my teams :-) , use the V1 IOs
> in production for real workload. So it'll be hard to remove V1 IOs or force
> them migrate to V2. But let hear if there are any other ideas?
>
> Btw, making V1 a wrapper around V2 is not very positive, code will get
> more complicated since V2 API is very different from V1's.
>
> Thanks,
>
>
>
> On Tue, Nov 26, 2019 at 8:21 AM Alexey Romanenko 
> wrote:
>
>> AFAICT, all AWS SDK V1 IOs (SnsIO, SqsIO, DynamoDBIO, KinesisIO) are
>> marked as "Experimental". So, it should not be a problem to gracefully
>> deprecate and finally remove them. We already did the similar procedure for
>> “HadoopInputFormatIO”, which was renamed to just “HadoopFormatIO” (since it
>> started to support HadoopOutputFormatI as well). Old “HadoopInputFormatIO”
>> was deprecated and removed after *3 consecutive* Beam releases (as we
>> agreed on mailing list).
>>
>> In the same time, some users for some reasons would not be able or to
>> want to move on AWS SDK V2. So, I’d prefer to just deprecate AWS SDK V1 IOs
>> and accept new features/fixes *only* for V2 IOs.
>>
>> Talking about “Experimental” annotation. Sorry in advance If I missed
>> that and switch a subject a bit, but do we have clear rules or an agreement
>> when IO becomes stable and should not be marked as experimental anymore?
>> *Most* of our Java IOs are marked as Experimental but many of them were
>> using in production by real users under real load. Does it mean that they
>> are ready to be stable in terms of API? Perhaps, this topic deserves a new
>> discussion if there are several opinions on that.
>>
>> On 26 Nov 2019, at 00:39, Luke Cwik  wrote:
>>
>> Phase I sounds fine.
>>
>> Apache Beam follows semantic versioning and I believe removing the IOs
>> will be a backwards incompatible change unless they were marked
>> experimental which will be a problem for Phase 2.
>>
>> What is the feasibility of making the V1 transforms wrappers around V2?
>>
>> On Mon, Nov 25, 2019 at 1:46 PM Cam Mach  wrote:
>>
>>> Hello Beam Devs,
>>>
>>> I have been working on the migration of Amazon Web Services IO
>>> connectors into the new AWS SDK for Java V2. The goal is to have an updated
>>> implementation aligned with the most recent AWS improvements. So far we
>>> have already migrated the connectors for AWS SNS, SQS and  DynamoDB.
>>>
>>> In the meantime some contributions are still going on V1 IOs. So far we
>>> have dealt with those by porting (or asking contributors) to port the
>>> changes into V2 IOs too because we don’t want features of both versions to
>>> be unaligned but this may quickly become a maintenance issue, so we want to
>>> discuss a plan to stop supporting (deprecate) V1 IOs and encourage users to
>>> move to V2.
>>>
>>> Phase I (ASAP):
>>>
>>>- Mark migrated AWS V1 IOs as deprecated
>>>- Document migration path to V2
>>>
>>> Phase II (end of 2020):
>>>
>>>- Decide a date or Beam release to remove the V1 IOs
>>>- Send a notification to the community 3 months before we remove them
>>>- Completely get rid of V1 IOs
>>>
>>>
>>> Please let me know what you think or if you see any potential issues?
>>>
>>> Thanks,
>>> Cam Mach
>>>
>>>
>>


Re: [DISCUSS] AWS IOs V1 Deprecation Plan

2019-11-26 Thread Cam Mach
Thank you, Alex, for sharing the information, and Luke for the questions.
I like the idea of just deprecating the V1 IOs and maintaining only the V2 IOs,
so we can still support whoever wants to continue with V1.
Just as Alex said, a lot of users, including my teams :-), use the V1 IOs
in production for real workloads. So it'll be hard to remove the V1 IOs or force
them to migrate to V2. But let's hear if there are any other ideas.

Btw, making V1 a wrapper around V2 is not very appealing; the code will get more
complicated since the V2 API is very different from V1's.

Thanks,



On Tue, Nov 26, 2019 at 8:21 AM Alexey Romanenko 
wrote:

> AFAICT, all AWS SDK V1 IOs (SnsIO, SqsIO, DynamoDBIO, KinesisIO) are
> marked as "Experimental". So, it should not be a problem to gracefully
> deprecate and finally remove them. We already did the similar procedure for
> “HadoopInputFormatIO”, which was renamed to just “HadoopFormatIO” (since it
> started to support HadoopOutputFormatI as well). Old “HadoopInputFormatIO”
> was deprecated and removed after *3 consecutive* Beam releases (as we
> agreed on mailing list).
>
> In the same time, some users for some reasons would not be able or to want
> to move on AWS SDK V2. So, I’d prefer to just deprecate AWS SDK V1 IOs and
> accept new features/fixes *only* for V2 IOs.
>
> Talking about “Experimental” annotation. Sorry in advance If I missed that
> and switch a subject a bit, but do we have clear rules or an agreement when
> IO becomes stable and should not be marked as experimental anymore? *Most*
> of our Java IOs are marked as Experimental but many of them were using in
> production by real users under real load. Does it mean that they are ready
> to be stable in terms of API? Perhaps, this topic deserves a new discussion
> if there are several opinions on that.
>
> On 26 Nov 2019, at 00:39, Luke Cwik  wrote:
>
> Phase I sounds fine.
>
> Apache Beam follows semantic versioning and I believe removing the IOs
> will be a backwards incompatible change unless they were marked
> experimental which will be a problem for Phase 2.
>
> What is the feasibility of making the V1 transforms wrappers around V2?
>
> On Mon, Nov 25, 2019 at 1:46 PM Cam Mach  wrote:
>
>> Hello Beam Devs,
>>
>> I have been working on the migration of Amazon Web Services IO connectors
>> into the new AWS SDK for Java V2. The goal is to have an updated
>> implementation aligned with the most recent AWS improvements. So far we
>> have already migrated the connectors for AWS SNS, SQS and  DynamoDB.
>>
>> In the meantime some contributions are still going on V1 IOs. So far we
>> have dealt with those by porting (or asking contributors) to port the
>> changes into V2 IOs too because we don’t want features of both versions to
>> be unaligned but this may quickly become a maintenance issue, so we want to
>> discuss a plan to stop supporting (deprecate) V1 IOs and encourage users to
>> move to V2.
>>
>> Phase I (ASAP):
>>
>>- Mark migrated AWS V1 IOs as deprecated
>>- Document migration path to V2
>>
>> Phase II (end of 2020):
>>
>>- Decide a date or Beam release to remove the V1 IOs
>>- Send a notification to the community 3 months before we remove them
>>- Completely get rid of V1 IOs
>>
>>
>> Please let me know what you think or if you see any potential issues?
>>
>> Thanks,
>> Cam Mach
>>
>>
>


Update on push-down for SQL IOs.

2019-11-26 Thread Kirill Kozlov
Hello everyone!

I have been working on the push-down feature and would like to give a brief
update on what is done and what is still in the works.

*Things that are done*:
General API for SQL IOs to provide information about what filters/projects
they support [1]:
- *Filter* can be unsupported, supported with field reordering, and
supported without field reordering.
- *Predicate* is broken down into a conjunctive normal form (CNF) and
passed to a validator class to check what parts are supported or
unsupported by an IO.

A Calcite rule [2] that checks for push-down support, constructs a new IO
source Rel [3] with pushed-down projects and filters when applicable, and
preserves unsupported filters/projects.
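
To make the supported/unsupported split concrete, here is a tiny self-contained
toy (plain Java, not the actual Beam/Calcite types; in the real code the
conjuncts are RexNodes and the decision lives in the per-IO filter/validator
classes [1]):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy illustration only: each CNF conjunct is either pushed to the IO or kept
// in a Calc above the source.
public class PushDownSplitToy {

  static class Conjunct {
    final String expression;
    final boolean supportedBySource;

    Conjunct(String expression, boolean supportedBySource) {
      this.expression = expression;
      this.supportedBySource = supportedBySource;
    }
  }

  public static void main(String[] args) {
    List<Conjunct> cnf = Arrays.asList(
        new Conjunct("c_id > 100", true),
        new Conjunct("c_name LIKE 'A%'", true),
        // a column-to-column comparison the source cannot evaluate stays behind
        new Conjunct("c_created > c_updated", false));

    List<Conjunct> pushed = new ArrayList<>();
    List<Conjunct> residual = new ArrayList<>();
    for (Conjunct c : cnf) {
      (c.supportedBySource ? pushed : residual).add(c);
    }

    System.out.println("pushed to the IO: " + describe(pushed));
    System.out.println("kept in a Calc:   " + describe(residual));
  }

  static String describe(List<Conjunct> conjuncts) {
    StringBuilder sb = new StringBuilder();
    for (Conjunct c : conjuncts) {
      sb.append('[').append(c.expression).append("] ");
    }
    return sb.toString().trim();
  }
}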

BigQuery should perform push-down when running queries in DIRECT_READ
method [4].

MongoDB project push-down support is in a PR [5] and predicate support will
be added soon.


*Things that are in progress:*
Documenting how developers can enable push-down for IOs that support it.

Documenting certain limitations for BigQuery push-down (ex: comparing values
of 2 columns is not supported at the moment, so it is being preserved in a
Calc).

Updating google-cloud-bigquerystorage to 0.117.0-beta. Earlier versions
have a gRPC message limit set to ~11MB, which may cause some pipelines to
break when reading from a table with rows larger than the limit.

Adding some sort of performance tests to run continuously to
measure speed-up and detect regressions.

Deciding how cost should be computed for the IO source Rel with push-down
[6]. Right now the following formula is used: cost of an IO without
push-down minus the normalized (between 0.0 and 1.0) benefit of a performed
push-down.
The challenge here is to make the change to the cost small enough to not
break join reordering, but large enough to make the optimizer favor
pushed-down IO.
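
In rough pseudo-Java the heuristic is the following (a sketch of the formula
only, not the actual planner code from [6]; how the project and filter benefits
are combined here is illustrative):

class PushDownCostSketch {
  static double pushDownCost(
      double costWithoutPushDown,
      int fieldsPushedDown,
      int totalFields,
      double pushedFilterSelectivity) {
    double projectBenefit = (double) fieldsPushedDown / totalFields; // in [0.0, 1.0]
    double filterBenefit = 1.0 - pushedFilterSelectivity;            // in [0.0, 1.0]
    double normalizedBenefit = Math.max(projectBenefit, filterBenefit);
    // The delta stays below 1.0, so it is tiny next to the row-count-based cost
    // (keeping join reordering stable) while still making the pushed-down
    // alternative cheaper than the same scan without push-down.
    return costWithoutPushDown - normalizedBenefit;
  }
}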


If you have any suggestions/questions/concerns I would love to hear them.

[1]
https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/meta/BeamSqlTable.java#L36
[2]
https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rule/BeamIOPushDownRule.java
[3]
https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamPushDownIOSourceRel.java
[4]
https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/meta/provider/bigquery/BigQueryTable.java#L128
[5] https://github.com/apache/beam/pull/10095
[6] https://github.com/apache/beam/pull/10060

--
Kirill


Re: Beam Testing Tools FAQ

2019-11-26 Thread Pablo Estrada
Very cool. Thanks Lukasz!

On Tue, Nov 26, 2019 at 9:41 AM Alan Myrvold  wrote:

> Nice, thanks!
>
> On Tue, Nov 26, 2019 at 8:04 AM Robert Bradshaw 
> wrote:
>
>> Thanks!
>>
>> On Tue, Nov 26, 2019 at 7:43 AM Łukasz Gajowy  wrote:
>> >
>> > Hi all,
>> >
>> > our documentation (either confluence or the website docs) describes how
>> to create various integration and performance tests - there already are
>> core operations tests, nexmark and IO test documentation pages. However, we
>> are lacking some general docs to describe what tools do we have and what is
>> the purpose of them.
>> >
>> > Therefore, I took the liberty of creating the Beam Testing Tools FAQ on
>> our confluence:
>> > https://cwiki.apache.org/confluence/display/BEAM/Beam+Testing+Tools+FAQ
>> >
>> > Hopefully, this is helpful and sheds some more light on that important
>> part of our infrastructure. If you feel that something is missing there,
>> feel free to let me know or add it yourself. :)
>> >
>> > Thanks,
>> > Łukasz
>>
>


Re: Cleaning up Approximate Algorithms in Beam

2019-11-26 Thread Robert Bradshaw
I think this thread is sufficient.

On Mon, Nov 25, 2019 at 5:59 PM Reza Rokni  wrote:

> Hi,
>
> So do we need a vote for the final list of actions? Or is this thread
> enough to go ahead and raise the PR's?
>
> Cheers
>
> Reza
>
> On Tue, 26 Nov 2019 at 06:01, Ahmet Altay  wrote:
>
>>
>>
>> On Mon, Nov 18, 2019 at 10:57 AM Robert Bradshaw 
>> wrote:
>>
>>> On Sun, Nov 17, 2019 at 5:16 PM Reza Rokni  wrote:
>>>
 *Ahmet: FWIW, There is a python implementation only for this
 version: 
 https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38
 
  *
 Eventually we will be able to make use of cross language transforms to
 help with feature parity. Until then, are we ok with marking this
 deprecated in python, even though we do not have another solution. Or leave
 it as is in Python now, as it does not have sketch capability so can only
 be used for outputting results directly from the pipeline.

>>>
>> If it is our intention to add the capability eventually, IMO it makes
>> sense to mark the existing functionality deprecated in Python as well.
>>
>>
 *Reuven: I think this is the sort of thing that has been experimental
 forever, and therefore not experimental (e.g. the entire triggering API is
 experimental as are all our file-based sinks). I think that many users use
 this, and probably store the state implicitly in streaming pipelines.*
 True, I have an old action item to try and go through and PR against
 old @experimental annotations but need to find time. So for this
 discussion; I guess this should be marked as deprecated if we change it
 even though its @experimental.

>>>
>>> Agreed.
>>>
>>>
 *Rob: I'm not following this--by naming things after their
 implementation rather than their intent I think they will be harder to
 search for. *
 This is to add to the name the implementation, after the intent. For
 example ApproximateCountDistinctZetaSketch, I believe should be easy to
 search for and it is clear which implementation is used. Allowing for a
 potentially better implementation ApproximateCountDistinct.

>>>
>>> OK, if we have both I'm more OK with that. This is better than the names
>>> like HllCount, which seems to be what was suggested.
>>>
>>> Another approach would be to have a required  parameter which is an enum
 of the implementation options.
 ApproximateCountDistinct.of().usingImpl(ZETA) ?

>>>
>>> Ideally this could be an optional parameter, or possibly only required
>>> during update until we figure out a good way for the runner to plug this in
>>> appropriately.
>>>
>>> Rob/Kenn: On Combiner discussion, should we tie action items from the
 needs of this thread to this larger discussion?

 Cheers
 Reza

 On Fri, 15 Nov 2019 at 08:32, Robert Bradshaw 
 wrote:

> On Thu, Nov 14, 2019 at 1:06 AM Kenneth Knowles 
> wrote:
>
>> Wow. Nice summary, yes. Major calls to action:
>>
>> 0. Never allow a combiner that does not include the format of its
>> state clear in its name/URN. The "update compatibility" problem makes 
>> their
>> internal accumulator state essentially part of their public API. 
>> Combiners
>> named for what they do are an inherent risk, since we might have a new 
>> way
>> to do the same operation with different implementation-detail state.
>>
>
> It seems this will make for a worse user experience, motivated solely
> by limitations in our implementation. I think we can do better.
> Hypothetical idea: what if upgrade required access to the original graph
> (or at least metadata about it) during construction? In this case an
> ApproximateDistinct could look at what was used last time and try to do 
> the
> same, but be free to do something better when unconstrained. Another
> approach would be to encode several alternative expansions in the Beam
> graph and let the runner do the picking (based on prior submission).
> (Making the CombineFn, as opposed to the composite, have several
> alternatives seems harder to reason about, but maybe worth pursuing as
> well).
>
> This is not unique to Combiners, but any stateful DoFn, or composite
> operations with non-trivial internal structure (and coders). This has been
> discussed a lot, perhaps there are some ideas there we could borrow?
>
> And they will match search terms better, which is a major problem.
>>
>
> I'm not following this--by naming things after their implementation
> rather than their intent I think they will be harder to search for.
>
>
>> 1. Point users to HllCount. This seems to be the best of the three.
>> Does it have a name that is clear enough about the format of its state?

Re: [UPDATE] Preparing for Beam 2.17.0 release

2019-11-26 Thread Mikhail Gryzykhin
Hello everybody,

Got the release branch green except for the gradle build, which times out and
fails with go tests that look like a flake.

I'll go over the remaining PRs and Jiras today and do a final test validation.
Will start the RC process afterwards.

--Mikhail

On Fri, Nov 22, 2019 at 9:29 PM Jan Lukavský  wrote:

> Hi Mikhail,
> I created PR for [BEAM-8812]. It is linked in the JIRA.
> Jan
>
> On 23. 11. 2019 at 0:45, Mikhail Gryzykhin wrote:
>
> UPD:
> on current branch there's timeout on gradle build job, I'm mitigating it
> by increasing job time. Seems that this job runs most of python tests. We
> might look into adjusting the target.
>
> Second failure is https://issues.apache.org/jira/browse/BEAM-8812 . I
> would really appreciate if someone can help me debug this one.
>
> --Mikhail
>
> On Tue, Nov 19, 2019 at 10:14 PM Kenneth Knowles  wrote:
>
> I've poked through the bugs and there do seem to be a few that are
> finished and a few that may not be started that should probably be deferred
> if they can be triaged to not be blockers.
>
> Kenn
>
> On Fri, Nov 15, 2019 at 2:13 PM Mikhail Gryzykhin 
> wrote:
>
> Hi everyone,
>
> There's still an outstanding cherry-pick PR that I can't merge due to
> tests failing on it and release branch validation PR
> . Once I get tests green, I'll
> send another update and review outstanding open issues.
>
> --Mikhail
>
> On Fri, Nov 15, 2019 at 10:40 AM Thomas Weise  wrote:
>
> Any update regarding the release?
>
> The list still shows 10 open issues:
>
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20fixVersion%20%3D%202.17.0%20and%20resolution%20is%20EMPTY
>
> Is the RC blocked on those?
>
>
>
>
>
>
> On Mon, Oct 28, 2019 at 12:46 PM Ahmet Altay  wrote:
>
>
>
> On Mon, Oct 28, 2019 at 12:44 PM Gleb Kanterov  wrote:
>
> It looks like BigQueryIO DIRECT_READ is broken since 2.16.0, I've added a
> ticket describing the problem and possible fix, see BEAM-8504
>  [1].
>
>
> Should this be added to 2.16 blog post as a known issue?
>
>
>
> [1]: https://issues.apache.org/jira/browse/BEAM-8504
>
> On Wed, Oct 23, 2019 at 9:19 PM Kenneth Knowles  wrote:
>
> I opened https://github.com/apache/beam/pull/9862 to raise the
> documentation of Fix Version to the top level. It also includes the write
> up of Jira priorities, to make clear that "Blocker" priority does not refer
> to release blocking.
>
> On Wed, Oct 23, 2019 at 11:16 AM Kenneth Knowles  wrote:
>
> I've gone over the tickets and removed Fix Version from many of them that
> do not seem to be critical defects. If I removed Fix Version from a ticket
> you care about, please feel free to add it back. I am not trying to decide
> what is in/out of the release, just trying to triage the Jira data to match
> expected practices.
>
> It should probably be documented somewhere outside of the release guide.
> As far as I can tell, the fact that we triage them down to zero is the only
> place we mention that it is used to indicate release blockers and not used
> for feature targets.
>
> Kenn
>
> On Wed, Oct 23, 2019 at 10:40 AM Kenneth Knowles  wrote:
>
>  Wow, 28 release blocking tickets! That is the most I've ever seen, by
> far. Many appear to be feature requests, not release-blocking defects. I
> believe this is not according to our normal best practice. The release
> cadence should not wait for features in progress, with exceptions discussed
> on dev@. As a matter of best practice, I think we should triage feature
> requests to not have Fix Version set until it has been discussed on dev@.
>
> Kenn
>
> On Wed, Oct 23, 2019 at 9:55 AM Mikhail Gryzykhin 
> wrote:
>
> Hi all,
>
> Beam 2.17 release branch cut is scheduled today (2019/10/23) according to
> the release calendar [1].  I'll start working on the branch cutoff and
> later work on cherry picking blocker fixes.
>
> If you have release blocking issues for 2.17 please mark their "Fix
> Version" as 2.17.0 [2]. This tag is already created in JIRA in case you
> would like to move any non-blocking issues to that version.
>
> There is a decent amount of open bugs to be resolved in 2.17.0 [2] and
> only 4 [3] are marked as blockers. Please, review those if these bugs are
> actually to be resolved in 2.17.0 and prioritize fixes if possible.
>
> Any thoughts, comments, objections?
>
> Regards.
> Mikhail.
>
>
> [1]
> https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com
> [2]
> https://issues.apache.org/jira/browse/BEAM-8457?jql=project%20%3D%20BEAM%20AND%20status%20in%20(Reopened%2C%20Open%2C%20%22In%20Progress%22%2C%20%22Under%20Discussion%22%2C%20%22In%20Implementation%22%2C%20%22Triage%20Needed%22)%20AND%20fixVersion%20%3D%202.17.0
> 

Re: Beam Testing Tools FAQ

2019-11-26 Thread Alan Myrvold
Nice, thanks!

On Tue, Nov 26, 2019 at 8:04 AM Robert Bradshaw  wrote:

> Thanks!
>
> On Tue, Nov 26, 2019 at 7:43 AM Łukasz Gajowy  wrote:
> >
> > Hi all,
> >
> > our documentation (either confluence or the website docs) describes how
> to create various integration and performance tests - there already are
> core operations tests, nexmark and IO test documentation pages. However, we
> are lacking some general docs to describe what tools do we have and what is
> the purpose of them.
> >
> > Therefore, I took the liberty of creating the Beam Testing Tools FAQ on
> our confluence:
> > https://cwiki.apache.org/confluence/display/BEAM/Beam+Testing+Tools+FAQ
> >
> > Hopefully, this is helpful and sheds some more light on that important
> part of our infrastructure. If you feel that something is missing there,
> feel free to let me know or add it yourself. :)
> >
> > Thanks,
> > Łukasz
>


Re: Contributor Permission for Beam Jira tickets

2019-11-26 Thread Pablo Estrada
I've added you as a contributor! Thanks!
-P.

On Mon, Nov 25, 2019 at 11:13 PM David Song 
wrote:

> Hi,
>
> This is David from DataPLS EngProd team (wintermelons@). I am working on
> integration tests with some Beam runners over Dataflow.
> Can someone add me as a contributor for the Beam's Jira tracker? I have an
> open bug, and would like to assign myself to it.
> My Jira username is wintermelons, and the Jira ticket is
> https://issues.apache.org/jira/browse/BEAM-8814
>
> Thanks,
> David
>
>


Re: Failed retrieving service account

2019-11-26 Thread Pablo Estrada
Great catch. Thanks Yifan!

On Tue, Nov 26, 2019 at 8:54 AM Tomo Suzuki  wrote:

> Thank you very much. Looking forward to the next dependency report email.
>
> Regards,
> Tomo
>
> On Mon, Nov 25, 2019 at 4:17 PM Yifan Zou  wrote:
>
>> Hi,
>>
>> I've looked into this issue and found that the default service account
>> was removed during the weekend for some reason log viewer
>> 
>> .
>> I restored the default service account. All workers should be backing to
>> normal and jobs start passing now.
>>
>> -yifan
>>
>> On Mon, Nov 25, 2019 at 11:17 AM Tomo Suzuki  wrote:
>>
>>> Thank you for looking into this.
>>>
>>> On Mon, Nov 25, 2019 at 12:59 PM Yifan Zou  wrote:
>>>
 Greetings,

 We're seeing some tests encountering permission issues such as *'Failed
 to retrieve
 http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/844138762903-comp...@developer.gserviceaccount.com/token
 
 from the Google Compute Engine metadata service. Status: 404
 Response:\nb\'"The service account was not found.'*

 I am looking onto it. We might need to reboot some build workers to
 restore the service account access. I'll try to make as little impact as
 possible on current running jobs.

 -yifan

>>>
>>>
>>> --
>>> Regards,
>>> Tomo
>>>
>>
>
> --
> Regards,
> Tomo
>


Re: cython test instability

2019-11-26 Thread Luke Cwik
I also started to see this on PRs that I'm reviewing.
BEAM-8793, BEAM-8653, BEAM-8631, BEAM-8249 mention issues with setup.py and
egg_info but this looks different than all of those so I filed BEAM-8831.


On Mon, Nov 25, 2019 at 10:27 PM Chad Dombrova  wrote:

> Actually, it looks like I'm getting the same error on multiple PRs:
> https://scans.gradle.com/s/ihfmrxr7evslw
>
>
>
>
> On Mon, Nov 25, 2019 at 10:26 PM Chad Dombrova  wrote:
>
>> Hi all,
>> The cython tests started failing on one of my PRs which were succeeding
>> before.   The error is one that I've never seen before (separated onto
>> different lines to make it easier to read):
>>
>> Caused by: org.gradle.api.GradleException:
>> Could not copy file
>> '/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit@2
>> /src/sdks/python/.eggs/simplegeneric-0.8.1-py2.7.egg'
>> to
>> '/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit@2
>> /src/sdks/python/test-suites/tox/py2/build/srcs/sdks/python/.eggs/simplegeneric-0.8.1-py2.7.egg'.
>>
>> Followed immediately by an error about could not create a directory of
>> the same name.  Here's the gradle scan:
>>
>>
>> https://scans.gradle.com/s/ihfmrxr7evslw/failure?openFailures=WzFd=WzZd#top=0
>>
>> Any ideas?
>>
>> -chad
>>
>>
>>
>>
>>


Re: Failed retrieving service account

2019-11-26 Thread Tomo Suzuki
Thank you very much. Looking forward to the next dependency report email.

Regards,
Tomo

On Mon, Nov 25, 2019 at 4:17 PM Yifan Zou  wrote:

> Hi,
>
> I've looked into this issue and found that the default service account was
> removed during the weekend for some reason log viewer
> 
> .
> I restored the default service account. All workers should be backing to
> normal and jobs start passing now.
>
> -yifan
>
> On Mon, Nov 25, 2019 at 11:17 AM Tomo Suzuki  wrote:
>
>> Thank you for looking into this.
>>
>> On Mon, Nov 25, 2019 at 12:59 PM Yifan Zou  wrote:
>>
>>> Greetings,
>>>
>>> We're seeing some tests encountering permission issues such as *'Failed
>>> to retrieve
>>> http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/844138762903-comp...@developer.gserviceaccount.com/token
>>> 
>>> from the Google Compute Engine metadata service. Status: 404
>>> Response:\nb\'"The service account was not found.'*
>>>
>>> I am looking onto it. We might need to reboot some build workers to
>>> restore the service account access. I'll try to make as little impact as
>>> possible on current running jobs.
>>>
>>> -yifan
>>>
>>
>>
>> --
>> Regards,
>> Tomo
>>
>

-- 
Regards,
Tomo


Re: [DISCUSS] AWS IOs V1 Deprecation Plan

2019-11-26 Thread Alexey Romanenko
AFAICT, all AWS SDK V1 IOs (SnsIO, SqsIO, DynamoDBIO, KinesisIO) are marked as 
"Experimental". So, it should not be a problem to gracefully deprecate and 
finally remove them. We already did a similar procedure for 
“HadoopInputFormatIO”, which was renamed to just “HadoopFormatIO” (since it 
started to support HadoopOutputFormatI as well). The old “HadoopInputFormatIO” was 
deprecated and removed after 3 consecutive Beam releases (as we agreed on the 
mailing list).

At the same time, some users, for various reasons, would not be able or would not 
want to move to AWS SDK V2. So, I’d prefer to just deprecate the AWS SDK V1 IOs and 
accept new features/fixes only for the V2 IOs.

Talking about the “Experimental” annotation: sorry in advance if I missed that and 
am switching the subject a bit, but do we have clear rules or an agreement on when 
an IO becomes stable and should not be marked as experimental anymore? Most of our 
Java IOs are marked as Experimental, but many of them have been used in production 
by real users under real load. Does that mean they are ready to be considered stable 
in terms of API? Perhaps this topic deserves a new discussion if there are 
several opinions on that.
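
For Phase I the code change itself would be small; roughly something like this on
each V1 IO (the class below stands in for any of the V1 connectors, and the
Javadoc wording is only an example):

import org.apache.beam.sdk.annotations.Experimental;
import org.apache.beam.sdk.annotations.Experimental.Kind;

/**
 * IO to write to SNS using the AWS SDK for Java V1.
 *
 * @deprecated superseded by the AWS SDK for Java V2 connectors in the
 *     amazon-web-services2 module; see the migration notes for the V2 equivalent.
 */
@Deprecated
@Experimental(Kind.SOURCE_SINK)
public class SnsIO {
  // existing V1 implementation stays as-is and only receives critical fixes
}

The "new features only on V2" part is then mostly a review policy rather than
anything enforced in code.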

> On 26 Nov 2019, at 00:39, Luke Cwik  wrote:
> 
> Phase I sounds fine. 
> 
> Apache Beam follows semantic versioning and I believe removing the IOs will 
> be a backwards incompatible change unless they were marked experimental which 
> will be a problem for Phase 2.
> 
> What is the feasibility of making the V1 transforms wrappers around V2?
> 
> On Mon, Nov 25, 2019 at 1:46 PM Cam Mach  > wrote:
> Hello Beam Devs,
> 
> I have been working on the migration of Amazon Web Services IO connectors 
> into the new AWS SDK for Java V2. The goal is to have an updated 
> implementation aligned with the most recent AWS improvements. So far we have 
> already migrated the connectors for AWS SNS, SQS and  DynamoDB.
> 
> In the meantime some contributions are still going on V1 IOs. So far we have 
> dealt with those by porting (or asking contributors) to port the changes into 
> V2 IOs too because we don’t want features of both versions to be unaligned 
> but this may quickly become a maintenance issue, so we want to discuss a plan 
> to stop supporting (deprecate) V1 IOs and encourage users to move to V2.
> 
> Phase I (ASAP):
> Mark migrated AWS V1 IOs as deprecated
> Document migration path to V2
> Phase II (end of 2020):
> Decide a date or Beam release to remove the V1 IOs
> Send a notification to the community 3 months before we remove them
> Completely get rid of V1 IOs
> 
> Please let me know what you think or if you see any potential issues?
> 
> Thanks,
> Cam Mach
> 



Re: concurrent PRs

2019-11-26 Thread Robert Bradshaw
On Tue, Nov 26, 2019 at 6:15 AM Etienne Chauchot  wrote:
>
> Hi guys,
>
> I wanted your opinion about something:
>
> I have 2 concurrent PRs that do the same:
>
> https://github.com/apache/beam/pull/10010
>
> https://github.com/apache/beam/pull/10025
>
> The first one is a bit better because it addresses a deprecation that
> the other does not address. Except that they are the same. The first one
> is the older (1 day before) but the second one is the one that received
> reviews.
>
> I guess the problem is that there were 3 duplicate tickets of
> Elasticsearch7 upgrade (because people do not search for existing
> tickets before opening). As a result concurrent PRs were submitted
> despite the PR link on jira. I removed the duplicates but I need to
> close one of the PRs.
>
> The question is: which one do you think should be closed?

Are there (summary) pros and cons of the two PRs that
you're looking for feedback on? Otherwise, I think you could make the
call. (It's a good reminder to try to search for issues on JIRA
before filing a new one, though.)


Re: Beam Testing Tools FAQ

2019-11-26 Thread Robert Bradshaw
Thanks!

On Tue, Nov 26, 2019 at 7:43 AM Łukasz Gajowy  wrote:
>
> Hi all,
>
> our documentation (either confluence or the website docs) describes how to 
> create various integration and performance tests - there already are core 
> operations tests, nexmark and IO test documentation pages. However, we are 
> lacking some general docs to describe what tools do we have and what is the 
> purpose of them.
>
> Therefore, I took the liberty of creating the Beam Testing Tools FAQ on our 
> confluence:
> https://cwiki.apache.org/confluence/display/BEAM/Beam+Testing+Tools+FAQ
>
> Hopefully, this is helpful and sheds some more light on that important part 
> of our infrastructure. If you feel that something is missing there, feel free 
> to let me know or add it yourself. :)
>
> Thanks,
> Łukasz


Re: concurrent PRs

2019-11-26 Thread Maximilian Michels

Hi Etienne,

That is hard to tell from the outside. Based on the activity in the PRs, 
it looks like you already chose the second PR (#10025).


You should know best which one to merge. Make a call.

Cheers,
Max

On 26.11.19 15:14, Etienne Chauchot wrote:

Hi guys,

I wanted your opinion about something:

I have 2 concurrent PRs that do the same:

https://github.com/apache/beam/pull/10010

https://github.com/apache/beam/pull/10025

The first one is a bit better because it addresses a deprecation that 
the other does not address. Except that they are the same. The first one 
is the older (1 day before) but the second one is the one that received 
reviews.


I guess the problem is that there were 3 duplicate tickets of 
Elasticsearch7 upgrade (because people do not search for existing 
tickets before opening). As a result concurrent PRs were submitted 
despite the PR link on jira. I removed the duplicates but I need to 
close one of the PRs.


The question is: which one do you think should be closed?

Thanks for you opinion guys

Etienne





Beam Testing Tools FAQ

2019-11-26 Thread Łukasz Gajowy
Hi all,

our documentation (either confluence or the website docs) describes how to
create various integration and performance tests - there already are core
operations tests, nexmark and IO test documentation pages. However, we are
lacking some general docs describing what tools we have and what their
purpose is.

Therefore, I took the liberty of creating the Beam Testing Tools FAQ on our
confluence:
https://cwiki.apache.org/confluence/display/BEAM/Beam+Testing+Tools+FAQ

Hopefully, this is helpful and sheds some more light on that important part
of our infrastructure. If you feel that something is missing there, feel
free to let me know or add it yourself. :)

Thanks,
Łukasz


concurrent PRs

2019-11-26 Thread Etienne Chauchot

Hi guys,

I wanted your opinion about something:

I have 2 concurrent PRs that do the same:

https://github.com/apache/beam/pull/10010

https://github.com/apache/beam/pull/10025

The first one is a bit better because it addresses a deprecation that 
the other does not address. Except that they are the same. The first one 
is the older (1 day before) but the second one is the one that received 
reviews.


I guess the problem is that there were 3 duplicate tickets of 
Elasticsearch7 upgrade (because people do not search for existing 
tickets before opening). As a result concurrent PRs were submitted 
despite the PR link on jira. I removed the duplicates but I need to 
close one of the PRs.


The question is: which one do you think should be closed?

Thanks for you opinion guys

Etienne





Re: Full stream-stream join semantics

2019-11-26 Thread David Morávek
Yes, in the batch case with long-term historical data, this would be O(n^2) as
it is basically a bubble sort. If you have a large # of updates for a single
key, this would be super expensive.

Kenn, can this be re-implemented with your solution?
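
(For context, the buffer-and-flush approach being discussed would look roughly
like the sketch below in user code. It is simplified: it flushes once at the end
of the window, ignores allowed lateness, and may need an explicit coder for the
buffered values.)

import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.state.BagState;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.TimeDomain;
import org.apache.beam.sdk.state.Timer;
import org.apache.beam.sdk.state.TimerSpec;
import org.apache.beam.sdk.state.TimerSpecs;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TimestampedValue;

// Buffer elements per key in state and emit them time-sorted once the watermark
// reaches the end of the window.
class TimeSortedFlushFn<K, V> extends DoFn<KV<K, V>, KV<K, V>> {

  @StateId("buffer")
  private final StateSpec<BagState<TimestampedValue<KV<K, V>>>> bufferSpec = StateSpecs.bag();

  @TimerId("flush")
  private final TimerSpec flushSpec = TimerSpecs.timer(TimeDomain.EVENT_TIME);

  @ProcessElement
  public void process(
      ProcessContext ctx,
      BoundedWindow window,
      @StateId("buffer") BagState<TimestampedValue<KV<K, V>>> buffer,
      @TimerId("flush") Timer flush) {
    buffer.add(TimestampedValue.of(ctx.element(), ctx.timestamp()));
    flush.set(window.maxTimestamp());
  }

  @OnTimer("flush")
  public void flush(
      OnTimerContext ctx, @StateId("buffer") BagState<TimestampedValue<KV<K, V>>> buffer) {
    List<TimestampedValue<KV<K, V>>> sorted = new ArrayList<>();
    buffer.read().forEach(sorted::add);
    sorted.sort((a, b) -> a.getTimestamp().compareTo(b.getTimestamp()));
    for (TimestampedValue<KV<K, V>> element : sorted) {
      ctx.output(element.getValue()); // downstream logic then sees values in event-time order
    }
    buffer.clear();
  }
}

Re-reading and re-sorting this buffer on every incremental flush is what makes
the long-history batch case so expensive.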

On Tue, Nov 26, 2019 at 1:10 PM Jan Lukavský  wrote:

> Functionally yes. But this straightforward solution is not working for me
> for two main reasons:
>
>  - it either blows state in batch case or the time complexity of the sort
> would be O(n^2) (and reprocessing several years of dense time-series data
> makes it a no go)
>
>  - it is not reusable for different time-ordering needs, because the logic
> implemented purely in user-space cannot be transferred to different problem
> (there are two states needed, one for buffer, the other for user-state) and
> extending DoFns does not work (cannot create abstract SortedDoFn, because
> of the state annotation definitions)
>
> Jan
> On 11/26/19 12:56 PM, David Morávek wrote:
>
> Hi,
>
> I think what Jan has in mind would look something like this
> , if
> implemented in user code. Am I right?
>
> D.
>
>
> On Tue, Nov 26, 2019 at 10:23 AM Jan Lukavský  wrote:
>
>>
>> On 11/25/19 11:45 PM, Kenneth Knowles wrote:
>>
>>
>>
>> On Mon, Nov 25, 2019 at 1:56 PM Jan Lukavský  wrote:
>>
>>> Hi Rui,
>>>
>>> > Hi Kenn, you think stateful DoFn based join can emit joined rows that
>>> never to be retracted because in stateful DoFn case joined rows will be
>>> controlled by timers and emit will be only once? If so I will agree with
>>> it. Generally speaking, if only emit once is the factor of needing
>>> retraction or not.
>>>
>>> that would imply buffering elements up until watermark, then sorting and
>>> so reduces to the option a) again, is that true? This also has to deal with
>>> allowed lateness, that would mean, that with allowed lateness greater than
>>> zero, there can still be multiple firings and so retractions are needed.
>>>
>> Specifically, when I say "bi-temporal join" I mean unbounded-to-unbounded
>> join where one of the join conditions is that elements are within event
>> time distance d of one another. An element at time t will be saved until
>> time t + 2d and then garbage collected. Every matching pair can be emitted
>> immediately.
>>
>> OK, this might simplify things a little. Is there a design doc for that?
>> If there are multiple LHS elements within event time distance from RHS
>> element, which one should be joined? I suppose all of them, but that is not
>> "(time-varying-)relational" join semantics. In that semantics only the last
>> element must be joined, because that is how a (classical) relational
>> database would see the relation at time T (the old record would have been
>> overwritten and not be part of the output). Because of the time distance
>> constraint this is different from the join I have in mind, because that
>> simply joins every LHS element(s) to most recent RHS element(s) and vice
>> versa, without any additional time constraints (that is the RHS "update"
>> can happen arbitrarily far in past).
>>
>> Jan
>>
>>
>> In the triggered CoGBK + join-product implementation, you do need
>> retractions as a model concept. But you don't need full support, since they
>> only need to be shipped as deltas and only from the CoGBK to the
>> join-product transform where they are all consumed to create only positive
>> elements. Again a delay is not required; this yields correct results with
>> the "always" trigger.
>>
>> Neither case requires waiting or time sorting a whole buffer. The
>> bi-temporal join requires something more, in a way, since you need to query
>> by time range and GC time prefixes.
>>
>> Kenn
>>
>> Jan
>>> On 11/25/19 10:17 PM, Rui Wang wrote:
>>>
>>>
>>>
>>> On Mon, Nov 25, 2019 at 11:29 AM Jan Lukavský  wrote:
>>>

 On 11/25/19 7:47 PM, Kenneth Knowles wrote:



 On Sun, Nov 24, 2019 at 12:57 AM Jan Lukavský  wrote:

> I can put down a design document, but before that I need to clarify
> some things for me. I'm struggling to put all of this into a bigger
> picture. Sorry if the arguments are circulating, but I didn't notice any
> proposal of how to solve these. If anyone can disprove any of this logic 
> it
> would be very much appreciated as I might be able to get from a dead end:
>
>  a) in the bi-temporal join you can either buffer until watermark, or
> emit false data that has to be retracted
>
 This is not the case. A stateful DoFn based join can emit immediately
 joined rows that will never need to be retracted. The need for retractions
 has to do with CoGBK-based implementation of a join.

 I fail to see how this could work. If I emit joined rows immediately
 without waiting for watermark to pass, I can join two elements, that don't
 belong to each other, because later can arrive element with lower time
 distance, that 

Re: Full stream-stream join semantics

2019-11-26 Thread Jan Lukavský
Functionally yes. But this straightforward solution is not working for 
me for two main reasons:


 - it either blows up state in the batch case or the time complexity of the 
sort would be O(n^2) (and reprocessing several years of dense 
time-series data makes it a no-go)


 - it is not reusable for different time-ordering needs, because the 
logic implemented purely in user-space cannot be transferred to a 
different problem (there are two states needed, one for the buffer, the 
other for user state) and extending DoFns does not work (cannot create an 
abstract SortedDoFn, because of the state annotation definitions)


Jan

On 11/26/19 12:56 PM, David Morávek wrote:

Hi,

I think what Jan has in mind would look something like this 
, if 
implemented in user code. Am I right?


D.


On Tue, Nov 26, 2019 at 10:23 AM Jan Lukavský > wrote:



On 11/25/19 11:45 PM, Kenneth Knowles wrote:



On Mon, Nov 25, 2019 at 1:56 PM Jan Lukavský <je...@seznam.cz> wrote:

Hi Rui,

> Hi Kenn, you think stateful DoFn based join can emit joined
rows that never to be retracted because in stateful DoFn case
joined rows will be controlled by timers and emit will be
only once? If so I will agree with it. Generally speaking, if
only emit once is the factor of needing retraction or not.

that would imply buffering elements up until watermark, then
sorting and so reduces to the option a) again, is that true?
This also has to deal with allowed lateness, that would mean,
that with allowed lateness greater than zero, there can still
be multiple firings and so retractions are needed.

Specifically, when I say "bi-temporal join" I mean
unbounded-to-unbounded join where one of the join conditions is
that elements are within event time distance d of one another. An
element at time t will be saved until time t + 2d and then
garbage collected. Every matching pair can be emitted immediately.


OK, this might simplify things a little. Is there a design doc for
that? If there are multiple LHS elements within event time
distance from RHS element, which one should be joined? I suppose
all of them, but that is not "(time-varying-)relational" join
semantics. In that semantics only the last element must be joined,
because that is how a (classical) relational database would see
the relation at time T (the old record would have been overwritten
and not be part of the output). Because of the time distance
constraint this is different from the join I have in mind, because
that simply joins every LHS element(s) to most recent RHS
element(s) and vice versa, without any additional time constraints
(that is the RHS "update" can happen arbitrarily far in past).

Jan



In the triggered CoGBK + join-product implementation, you do need
retractions as a model concept. But you don't need full support,
since they only need to be shipped as deltas and only from the
CoGBK to the join-product transform where they are all consumed
to create only positive elements. Again a delay is not required;
this yields correct results with the "always" trigger.

Neither case requires waiting or time sorting a whole buffer. The
bi-temporal join requires something more, in a way, since you
need to query by time range and GC time prefixes.

Kenn

Jan

On 11/25/19 10:17 PM, Rui Wang wrote:



On Mon, Nov 25, 2019 at 11:29 AM Jan Lukavský <je...@seznam.cz> wrote:


On 11/25/19 7:47 PM, Kenneth Knowles wrote:



On Sun, Nov 24, 2019 at 12:57 AM Jan Lukavský <je...@seznam.cz> wrote:

I can put down a design document, but before that I
need to clarify some things for me. I'm struggling
to put all of this into a bigger picture. Sorry if
the arguments are circulating, but I didn't notice
any proposal of how to solve these. If anyone can
disprove any of this logic it would be very much
appreciated as I might be able to get from a dead end:

 a) in the bi-temporal join you can either buffer
until watermark, or emit false data that has to be
retracted

This is not the case. A stateful DoFn based join can
emit immediately joined rows that will never need to be
retracted. The need for retractions has to do with
CoGBK-based implementation of a join.

I fail to see how this could work. If I emit joined rows
immediately without waiting for watermark to pass, I can
join two elements, that don't belong to each other,
because later can arrive element with lower time

Re: Full stream-stream join semantics

2019-11-26 Thread David Morávek
Hi,

I think what Jan has in mind would look something like this, if
implemented in user code. Am I right?

D.


On Tue, Nov 26, 2019 at 10:23 AM Jan Lukavský  wrote:

>
> On 11/25/19 11:45 PM, Kenneth Knowles wrote:
>
>
>
> On Mon, Nov 25, 2019 at 1:56 PM Jan Lukavský  wrote:
>
>> Hi Rui,
>>
>> > Hi Kenn, you think stateful DoFn based join can emit joined rows that
>> never to be retracted because in stateful DoFn case joined rows will be
>> controlled by timers and emit will be only once? If so I will agree with
>> it. Generally speaking, if only emit once is the factor of needing
>> retraction or not.
>>
>> that would imply buffering elements up until watermark, then sorting and
>> so reduces to the option a) again, is that true? This also has to deal with
>> allowed lateness, that would mean, that with allowed lateness greater than
>> zero, there can still be multiple firings and so retractions are needed.
>>
> Specifically, when I say "bi-temporal join" I mean unbounded-to-unbounded
> join where one of the join conditions is that elements are within event
> time distance d of one another. An element at time t will be saved until
> time t + 2d and then garbage collected. Every matching pair can be emitted
> immediately.
>
> OK, this might simplify things a little. Is there a design doc for that?
> If there are multiple LHS elements within event time distance from RHS
> element, which one should be joined? I suppose all of them, but that is not
> "(time-varying-)relational" join semantics. In that semantics only the last
> element must be joined, because that is how a (classical) relational
> database would see the relation at time T (the old record would have been
> overwritten and not be part of the output). Because of the time distance
> constraint this is different from the join I have in mind, because that
> simply joins every LHS element(s) to most recent RHS element(s) and vice
> versa, without any additional time constraints (that is the RHS "update"
> can happen arbitrarily far in past).
>
> Jan
>
>
> In the triggered CoGBK + join-product implementation, you do need
> retractions as a model concept. But you don't need full support, since they
> only need to be shipped as deltas and only from the CoGBK to the
> join-product transform where they are all consumed to create only positive
> elements. Again a delay is not required; this yields correct results with
> the "always" trigger.
>
> Neither case requires waiting or time sorting a whole buffer. The
> bi-temporal join requires something more, in a way, since you need to query
> by time range and GC time prefixes.
>
> Kenn
>
> Jan
>> On 11/25/19 10:17 PM, Rui Wang wrote:
>>
>>
>>
>> On Mon, Nov 25, 2019 at 11:29 AM Jan Lukavský  wrote:
>>
>>>
>>> On 11/25/19 7:47 PM, Kenneth Knowles wrote:
>>>
>>>
>>>
>>> On Sun, Nov 24, 2019 at 12:57 AM Jan Lukavský  wrote:
>>>
 I can put down a design document, but before that I need to clarify
 some things for me. I'm struggling to put all of this into a bigger
 picture. Sorry if the arguments are circulating, but I didn't notice any
 proposal of how to solve these. If anyone can disprove any of this logic it
 would be very much appreciated as I might be able to get from a dead end:

  a) in the bi-temporal join you can either buffer until watermark, or
 emit false data that has to be retracted

>>> This is not the case. A stateful DoFn based join can emit immediately
>>> joined rows that will never need to be retracted. The need for retractions
>>> has to do with CoGBK-based implementation of a join.
>>>
>>> I fail to see how this could work. If I emit joined rows immediately
>>> without waiting for watermark to pass, I can join two elements, that don't
>>> belong to each other, because later can arrive element with lower time
>>> distance, that should have been joint in the place of the previously
>>> emitted one. This is wrong result that has to be retracted. Or what I'm
>>> missing?
>>>
>>
>> Hi Kenn, you think stateful DoFn based join can emit joined rows that
>> never to be retracted because in stateful DoFn case joined rows will be
>> controlled by timers and emit will be only once? If so I will agree with
>> it. Generally speaking, if only emit once is the factor of needing
>> retraction or not.
>>
>> In the past brainstorming, even having retractions ready, streaming join
>> with windowing are likely be implemented by a style of CoGBK + stateful
>> DoFn.
>>
>>
>>
>> I suggest that you work out the definition of the join you are interested
>>> in, with a good amount of mathematical rigor, and then consider the ways
>>> you can implement it. That is where a design doc will probably clarify
>>> things.
>>>
>>> Kenn
>>>
>>>  b) until retractions are 100% functional (and that is sort of holy
 grail for now), then the only solution is using a buffer holding data up to
 watermark *and then sort by 

Re: Artifact staging in cross-language pipelines

2019-11-26 Thread Maximilian Michels

Hey Heejong,

I don't think so. It would be great to push this forward.

Thanks,
Max

On 26.11.19 02:49, Heejong Lee wrote:

Hi,

Is anyone actively working on artifact staging extension for 
cross-language pipelines? I'm thinking I can contribute to it in coming 
Dec. If anyone has any progress on this and needs help, please let me know.


Thanks,

On Wed, Jun 12, 2019 at 2:42 AM Ismaël Mejía wrote:


Can you please add this to the design documents webpage.
https://beam.apache.org/contribute/design-documents/

On Wed, May 8, 2019 at 7:29 PM Chamikara Jayalath <chamik...@google.com> wrote:
 >
 >
 >
 > On Tue, May 7, 2019 at 10:21 AM Maximilian Michels <m...@apache.org> wrote:
 >>
 >> Here's the first draft:
 >>

https://docs.google.com/document/d/1XaiNekAY2sptuQRIXpjGAyaYdSc-wlJ-VKjl04c8N48/edit?usp=sharing
 >>
 >> It's rather high-level. We may want to add more details once we have
 >> finalized the design. Feel free to make comments and edits.
 >
 >
 > Thanks Max. Added some comments.
 >
 >>
 >>
 >> > All of this goes back to the idea that I think the listing of
 >> > artifacts (or more general dependencies) should be a property of the
 >> > environments themselves.
 >>
 >> +1 I came to the same conclusion while thinking about how to store
 >> artifact information for deferred execution of the pipeline.
 >>
 >> -Max
 >>
 >> On 07.05.19 18:10, Robert Bradshaw wrote:
 >> > Looking forward to your writeup, Max. In the meantime, some
comments below.
 >> >
 >> >
 >> > From: Lukasz Cwik <lc...@google.com>
 >> > Date: Thu, May 2, 2019 at 6:45 PM
 >> > To: dev
 >> >
 >> >>
 >> >>
 >> >>> On Thu, May 2, 2019 at 7:20 AM Robert Bradshaw <rober...@google.com> wrote:
 >> >>>
 >>  On Sat, Apr 27, 2019 at 1:14 AM Lukasz Cwik <lc...@google.com> wrote:
 >> 
 >>  We should stick with URN + payload + artifact metadata[1]
where the only mandatory one that all SDKs and expansion services
understand is the "bytes" artifact type. This allows us to add
optional URNs for file://, http://, Maven, PyPi, ... in the future.
I would make the artifact staging service use the same URN + payload
mechanism to get compatibility of artifacts across the different
services and also have the artifact staging service be able to be
queried for the list of artifact types it supports.
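
As a purely illustrative aside, the "URN + payload" shape being discussed
could look roughly like the following sketch; the class and field names, and
the example URN strings, are assumptions of mine, not Beam's actual protos.

  // Illustrative only: every artifact is described by a type URN plus an
  // opaque payload, with a "bytes" type as the only universally required one.
  final class ArtifactDescriptor {
    final String typeUrn;     // e.g. a hypothetical "beam:artifact:type:bytes:v1",
                              // or optional extensions for file://, Maven, PyPI, ...
    final byte[] typePayload; // type-specific data: raw bytes, a serialized path, etc.

    ArtifactDescriptor(String typeUrn, byte[] typePayload) {
      this.typeUrn = typeUrn;
      this.typePayload = typePayload;
    }
  }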
 >> >>>
 >> >>> +1
 >> >>>
 >>  Finally, we would need to have environments enumerate the
artifact types that they support.
 >> >>>
 >> >>> Meaning at runtime, or as another field statically set in
the proto?
 >> >>
 >> >>
 >> >> I don't believe runners/SDKs should have to know what
artifacts each environment supports at runtime and instead have
environments enumerate them explicitly in the proto. I have been
thinking about a more general "capabilities" block on environments
which allows them to enumerate URNs that the environment understands.
This would include artifact type URNs, PTransform URNs, coder URNs,
... I haven't proposed anything specific down this line yet because
I was wondering how environment resources (CPU, min memory, hardware
like GPU, AWS/GCP/Azure/... machine types) should/could tie into this.
 >> >>
 >> >>>
 >>  Having everyone have the same "artifact" representation
would be beneficial since:
 >>  a) Python environments could install dependencies from a
requirements.txt file (something that the Google Cloud Dataflow
Python docker container allows for today)
 >>  b) It provides an extensible and versioned mechanism for
SDKs, environments, and artifact staging/retrieval services to
support additional artifact types
 >>  c) Allow for expressing a canonical representation of an
artifact like a Maven package so a runner could merge environments
that the runner deems compatible.
 >> 
 >>  The flow I could see is:
 >>  1) (optional) query artifact staging service for supported
artifact types
 >>  2) SDK request expansion service to expand transform
passing in a list of artifact types the SDK and artifact staging
service support, the expansion service returns a list of artifact
types limited to those supported types + any supported by the
environment
 >> >>>
 >> >>> The crux of the issue seems to be how the expansion service
returns
 >> >>> the artifacts themselves. Is this going with the approach
that the
 >> >>> caller of the expansion service must host an artifact
staging service?
 >> >>
 >> >>
 >> >> The caller would not need to host an artifact staging service
(but would become effectively a proxy service, see 

Re: Full stream-stream join semantics

2019-11-26 Thread Jan Lukavský


On 11/25/19 11:45 PM, Kenneth Knowles wrote:



On Mon, Nov 25, 2019 at 1:56 PM Jan Lukavský wrote:


Hi Rui,

> Hi Kenn, do you think a stateful DoFn based join can emit joined rows
that never need to be retracted because, in the stateful DoFn case, joined
rows will be controlled by timers and emitted only once? If so I will
agree with it. Generally speaking, emitting only once is the deciding
factor for whether retractions are needed or not.

that would imply buffering elements up until the watermark, then
sorting, and so it reduces to option a) again; is that true? This
also has to deal with allowed lateness, which would mean that with
allowed lateness greater than zero, there can still be multiple
firings and so retractions are needed.

Specifically, when I say "bi-temporal join" I mean 
unbounded-to-unbounded join where one of the join conditions is that 
elements are within event time distance d of one another. An element 
at time t will be saved until time t + 2d and then garbage collected. 
Every matching pair can be emitted immediately.
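
For concreteness, a rough sketch of that pattern as a Beam Java stateful
DoFn follows. It is not an existing Beam transform: the LEFT/RIGHT tagging
of inputs, the String element types, and the single per-key GC timer are
simplifying assumptions; a production version would track per-element
expiry (for example with sorted or map state) instead of rewriting bags.

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.beam.sdk.coders.StringUtf8Coder;
  import org.apache.beam.sdk.state.BagState;
  import org.apache.beam.sdk.state.StateSpec;
  import org.apache.beam.sdk.state.StateSpecs;
  import org.apache.beam.sdk.state.TimeDomain;
  import org.apache.beam.sdk.state.Timer;
  import org.apache.beam.sdk.state.TimerSpec;
  import org.apache.beam.sdk.state.TimerSpecs;
  import org.apache.beam.sdk.transforms.DoFn;
  import org.apache.beam.sdk.values.KV;
  import org.apache.beam.sdk.values.TimestampedValue;
  import org.joda.time.Duration;
  import org.joda.time.Instant;

  // Sketch: inputs are KV<key, KV<"LEFT"|"RIGHT", payload>>. Both sides are
  // buffered in state, every pair within maxDistance is emitted immediately
  // (and never retracted), and elements are GC'd ~2*maxDistance later.
  class BiTemporalJoinFn extends DoFn<KV<String, KV<String, String>>, KV<String, String>> {
    private final Duration maxDistance;

    BiTemporalJoinFn(Duration maxDistance) { this.maxDistance = maxDistance; }

    @StateId("left")
    private final StateSpec<BagState<TimestampedValue<String>>> leftSpec =
        StateSpecs.bag(TimestampedValue.TimestampedValueCoder.of(StringUtf8Coder.of()));

    @StateId("right")
    private final StateSpec<BagState<TimestampedValue<String>>> rightSpec =
        StateSpecs.bag(TimestampedValue.TimestampedValueCoder.of(StringUtf8Coder.of()));

    @TimerId("gc")
    private final TimerSpec gcSpec = TimerSpecs.timer(TimeDomain.EVENT_TIME);

    @ProcessElement
    public void process(
        @Element KV<String, KV<String, String>> e,
        @Timestamp Instant ts,
        @StateId("left") BagState<TimestampedValue<String>> left,
        @StateId("right") BagState<TimestampedValue<String>> right,
        @TimerId("gc") Timer gc,
        OutputReceiver<KV<String, String>> out) {
      boolean isLeft = "LEFT".equals(e.getValue().getKey());
      String value = e.getValue().getValue();
      BagState<TimestampedValue<String>> mine = isLeft ? left : right;
      BagState<TimestampedValue<String>> other = isLeft ? right : left;

      // Emit every match within the allowed event-time distance right away.
      for (TimestampedValue<String> o : other.read()) {
        long distanceMillis = Math.abs(new Duration(o.getTimestamp(), ts).getMillis());
        if (distanceMillis <= maxDistance.getMillis()) {
          String joined = isLeft ? value + "," + o.getValue() : o.getValue() + "," + value;
          out.output(KV.of(e.getKey(), joined));
        }
      }
      mine.add(TimestampedValue.of(value, ts));
      // Simplification: one GC timer per key, pushed to 2*d past this element.
      gc.set(ts.plus(maxDistance).plus(maxDistance));
    }

    @OnTimer("gc")
    public void onGc(
        OnTimerContext ctx,
        @StateId("left") BagState<TimestampedValue<String>> left,
        @StateId("right") BagState<TimestampedValue<String>> right) {
      Instant horizon = ctx.timestamp().minus(maxDistance).minus(maxDistance);
      retainNewerThan(left, horizon);
      retainNewerThan(right, horizon);
    }

    private static void retainNewerThan(BagState<TimestampedValue<String>> bag, Instant horizon) {
      List<TimestampedValue<String>> keep = new ArrayList<>();
      for (TimestampedValue<String> v : bag.read()) {
        if (!v.getTimestamp().isBefore(horizon)) { keep.add(v); }
      }
      bag.clear();
      for (TimestampedValue<String> v : keep) { bag.add(v); }
    }
  }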


OK, this might simplify things a little. Is there a design doc for that? 
If there are multiple LHS elements within the event time distance from an RHS 
element, which one should be joined? I suppose all of them, but that is 
not "(time-varying-)relational" join semantics. In that semantics only 
the last element must be joined, because that is how a (classical) 
relational database would see the relation at time T (the old record 
would have been overwritten and not be part of the output). Because of 
the time distance constraint this is different from the join I have in 
mind, because that simply joins every LHS element to the most recent RHS 
element(s) and vice versa, without any additional time constraints (that 
is, the RHS "update" can happen arbitrarily far in the past).


Jan



In the triggered CoGBK + join-product implementation, you do need 
retractions as a model concept. But you don't need full support, since 
they only need to be shipped as deltas and only from the CoGBK to the 
join-product transform where they are all consumed to create only 
positive elements. Again a delay is not required; this yields correct 
results with the "always" trigger.


Neither case requires waiting or time sorting a whole buffer. The 
bi-temporal join requires something more, in a way, since you need to 
query by time range and GC time prefixes.


Kenn

Jan

On 11/25/19 10:17 PM, Rui Wang wrote:



On Mon, Nov 25, 2019 at 11:29 AM Jan Lukavský <je...@seznam.cz> wrote:


On 11/25/19 7:47 PM, Kenneth Knowles wrote:



On Sun, Nov 24, 2019 at 12:57 AM Jan Lukavský <je...@seznam.cz> wrote:

I can put down a design document, but before that I need
to clarify some things for myself. I'm struggling to put all
of this into a bigger picture. Sorry if the arguments
are going in circles, but I didn't notice any proposal of how
to solve these. If anyone can disprove any of this logic
it would be very much appreciated, as I might be able to
get out of a dead end:

 a) in the bi-temporal join you can either buffer until
watermark, or emit false data that has to be retracted

This is not the case. A stateful DoFn based join can emit
immediately joined rows that will never need to be
retracted. The need for retractions has to do with
CoGBK-based implementation of a join.

I fail to see how this could work. If I emit joined rows
immediately, without waiting for the watermark to pass, I can join
two elements that don't belong to each other, because an element
with a lower time distance can arrive later and should have
been joined in place of the previously emitted one. This
is a wrong result that has to be retracted. Or what am I missing?


Hi Kenn, do you think a stateful DoFn based join can emit joined rows
that never need to be retracted because, in the stateful DoFn case,
joined rows will be controlled by timers and emitted only once? If
so I will agree with it. Generally speaking, emitting only once is
the deciding factor for whether retractions are needed or not.

In the past brainstorming, even having retractions ready,
streaming joins with windowing are likely to be implemented in the
style of CoGBK + stateful DoFn.



I suggest that you work out the definition of the join you
are interested in, with a good amount of mathematical rigor,
and then consider the ways you can implement it. That is
where a design doc will probably clarify things.

Kenn

 b) until retractions are 100% functional (and that is
sort of holy grail for now), then the only solution is
using a buffer holding data up to watermark *and then
sort by event time*

 c) even if 

Re: real real-time beam

2019-11-26 Thread Jan Lukavský
> I will not try to formalize this notion in this email. But I will 
note that since it is universally assured, it would be zero cost and 
significantly safer to formalize it and add an annotation noting it was 
required. It has nothing to do with event time ordering, only trigger 
firing ordering.


I cannot agree with the last sentence (and I'm really not doing this on 
purpose :-)). Panes generally arrive out of order, as mentioned several 
times in the discussions linked from this thread. If we want to ensure 
"trigger firing ordering", we can use the pane index, that is correct. 
But - that is actually equivalent to sorting by event time, because pane 
index order will be (nearly) the same as event time order. This is due 
to the fact that pane index and event time correlate (both are 
monotonic). The pane index "only" solves the issue of preserving 
ordering even in cases where there are multiple firings within the same 
timestamp (regardless of granularity). This was mentioned in the initial 
discussion about event time ordering, and is part of the design doc - 
users should be allowed to provide UDF for extracting time-correlated 
ordering field (which means ability to choose a preferred, or 
authoritative, observer which assigns unambiguous ordering to events). 
Example of this might include Kafka offsets as well, or any queue index 
for that matter. This is not yet implemented, but could (should) be in 
the future.


The only case where these two things are (somewhat) different is the 
case mentioned by @Steve - if the output is a stateless ParDo, which will 
get fused. But that is only because the processing is single-threaded 
per key, and therefore the ordering is implied by timer ordering (and be 
careful here, many runners don't have this ordering 100% correct as of 
now - this problem luckily appears only when there are multiple timers 
per key). Moreover, if there is a failure, then the output might 
(in fact, would) go back in time anyway. If there were a shuffle operation 
after GBK/Combine, then the ordering would no longer be guaranteed and 
would have to be explicitly taken care of.
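
One hedged sketch of taking care of it explicitly, using the pane index as
the ordering field: keep the highest pane index already forwarded per key
(and window) in state, and drop any firing that arrives out of order after
the shuffle. The element type and the drop-stale-panes policy are my
assumptions, not a built-in Beam utility.

  import org.apache.beam.sdk.coders.VarLongCoder;
  import org.apache.beam.sdk.state.StateSpec;
  import org.apache.beam.sdk.state.StateSpecs;
  import org.apache.beam.sdk.state.ValueState;
  import org.apache.beam.sdk.transforms.DoFn;
  import org.apache.beam.sdk.transforms.windowing.PaneInfo;
  import org.apache.beam.sdk.values.KV;

  // Sketch: forward only panes that are newer than anything seen so far
  // for this key+window; ignore firings reordered by the shuffle.
  class DropStalePanesFn extends DoFn<KV<String, Long>, KV<String, Long>> {
    @StateId("maxPane")
    private final StateSpec<ValueState<Long>> maxPaneSpec = StateSpecs.value(VarLongCoder.of());

    @ProcessElement
    public void process(
        @Element KV<String, Long> element,
        PaneInfo pane,
        @StateId("maxPane") ValueState<Long> maxPane,
        OutputReceiver<KV<String, Long>> out) {
      Long newest = maxPane.read();
      if (newest == null || pane.getIndex() > newest) {
        maxPane.write(pane.getIndex());
        out.output(element);  // newest firing so far for this key: forward it
      }
      // otherwise an older firing arrived late after the shuffle; drop it
    }
  }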


Last note, I must agree with @Rui that all these discussions are very 
much related to retractions (precisely the ability to implement them).


Jan

On 11/26/19 7:34 AM, Kenneth Knowles wrote:

Hi Aaron,

Another insightful observation.

Whenever an aggregation (GBK / Combine per key) has a trigger firing, 
there is a per-key sequence number attached. It is included in 
metadata known as "PaneInfo" [1]. The value of PaneInfo.getIndex() is 
colloquially referred to as the "pane index". You can also make use of 
the "on time index" if you like. The best way to access this metadata 
is to add a parameter of type PaneInfo to your DoFn's @ProcessElement 
method. This works for stateful or stateless DoFn.
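
For concreteness, a minimal sketch of that; the KV<String, Long> element
type is just an assumption for illustration.

  import org.apache.beam.sdk.transforms.DoFn;
  import org.apache.beam.sdk.transforms.windowing.PaneInfo;
  import org.apache.beam.sdk.values.KV;

  // Observe the per-key trigger-firing metadata downstream of a GBK/Combine
  // by asking for PaneInfo in @ProcessElement.
  class ObservePanesFn extends DoFn<KV<String, Long>, String> {
    @ProcessElement
    public void process(@Element KV<String, Long> element, PaneInfo pane, OutputReceiver<String> out) {
      // getIndex() is the 0-based index of this firing for the key+window;
      // getTiming() reports EARLY / ON_TIME / LATE relative to the watermark.
      out.output(element.getKey() + " pane=" + pane.getIndex() + " timing=" + pane.getTiming());
    }
  }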


Most of Beam's IO connectors do not explicitly enforce that outputs 
occur in pane index order but instead rely on the hope that the runner 
delivers panes in order to the sink. IMO this is dangerous but it has 
not yet caused a known issue. In practice, each "input key to output 
key 'path' " through a pipeline's logic does preserve order for all 
existing runners AFAIK and it is the formalization that is missing. It 
is related to an observation by +Rui Wang that processing retractions requires the 
same key-to-key ordering.


I will not try to formalize this notion in this email. But I will note 
that since it is universally assured, it would be zero cost and 
significantly safer to formalize it and add an annotation noting it 
was required. It has nothing to do with event time ordering, only 
trigger firing ordering.


Kenn

[1] 
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/windowing/PaneInfo.java
[2] 
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/DoFn.java#L557



On Mon, Nov 25, 2019 at 4:06 PM Pablo Estrada wrote:


The blog posts on stateful and timely computation with Beam should
help clarify a lot about how to use state and timers to do this:
https://beam.apache.org/blog/2017/02/13/stateful-processing.html
https://beam.apache.org/blog/2017/08/28/timely-processing.html

You'll see there how there's an implicit per-single-element
grouping for each key, so state and timers should support your use
case very well.
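
As a rough illustration of that per-key state + timer pattern (the element
type and the "emit the latest value after a quiet period" policy are
assumptions for the sketch, not taken from the original question):

  import org.apache.beam.sdk.coders.KvCoder;
  import org.apache.beam.sdk.coders.StringUtf8Coder;
  import org.apache.beam.sdk.coders.VarLongCoder;
  import org.apache.beam.sdk.state.StateSpec;
  import org.apache.beam.sdk.state.StateSpecs;
  import org.apache.beam.sdk.state.TimeDomain;
  import org.apache.beam.sdk.state.Timer;
  import org.apache.beam.sdk.state.TimerSpec;
  import org.apache.beam.sdk.state.TimerSpecs;
  import org.apache.beam.sdk.state.ValueState;
  import org.apache.beam.sdk.transforms.DoFn;
  import org.apache.beam.sdk.values.KV;
  import org.joda.time.Duration;
  import org.joda.time.Instant;

  // Sketch: remember the latest element per key and emit it once no newer
  // element has arrived for a minute (an illustrative flush policy).
  class EmitLatestAfterQuietPeriodFn extends DoFn<KV<String, Long>, KV<String, Long>> {
    private static final Duration QUIET_PERIOD = Duration.standardMinutes(1);

    @StateId("latest")
    private final StateSpec<ValueState<KV<String, Long>>> latestSpec =
        StateSpecs.value(KvCoder.of(StringUtf8Coder.of(), VarLongCoder.of()));

    @TimerId("flush")
    private final TimerSpec flushSpec = TimerSpecs.timer(TimeDomain.EVENT_TIME);

    @ProcessElement
    public void process(
        @Element KV<String, Long> element,
        @Timestamp Instant ts,
        @StateId("latest") ValueState<KV<String, Long>> latest,
        @TimerId("flush") Timer flush) {
      latest.write(element);             // keep only the newest value per key
      flush.set(ts.plus(QUIET_PERIOD));  // (re)arm the event-time flush timer
    }

    @OnTimer("flush")
    public void onFlush(
        @StateId("latest") ValueState<KV<String, Long>> latest,
        OutputReceiver<KV<String, Long>> out) {
      KV<String, Long> value = latest.read();
      if (value != null) {
        out.output(value);
        latest.clear();
      }
    }
  }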

Best
-P.

On Mon, Nov 25, 2019 at 3:47 PM Steve Niemitz <sniem...@apache.org> wrote:

If you have a pipeline that looks like Input -> GroupByKey ->
ParDo, while it is not guaranteed, in practice the sink will
observe the trigger firings in order (per key), since it'll be
fused to the output of the GBK operation (in all runners I
know of).

There have been a couple threads about trigger ordering as
well on the list recently that might