Re: Labels on PR

2020-02-13 Thread Kyle Weaver
I'm really enjoying this feature so far! The "Pull Requests" page for Beam
is now way more readable. Thanks Alex :)

On Wed, Feb 12, 2020 at 9:18 PM Alex Van Boxel  wrote:

> What exactly do you mean by github grep... where is it an issue? I find
> it useful for searching here:
>
> [image: Screen Shot 2020-02-13 at 06.11.33.png]
>
> OK, you get some false positives, but then the color coding works. You
> can't search on a category so this looks like the only alternative. I was
> even thinking of adding more text in the description as it could help new
> contributors to identify something they could help with.
>
> It's also nice when you hover over the label.
>
> So, could you exactly pinpoint where you see a problem?
>
>  _/
> _/ Alex Van Boxel
>
>
> On Wed, Feb 12, 2020 at 10:22 PM Ismaël Mejía  wrote:
>
>> Alex, would you consider removing the descriptions from the labels? It
>> seems that github greps not only the text of the label but also the text
>> of the description, producing false positives. E.g. if I search the label
>> `io` it resolves not only all the IOs but also results like `core`
>> (because it matches the description `core-constructIOn-java`) and also
>> `extensIOns`, making the point of having general categories not really
>> useful.
>>
>> On Wed, Feb 12, 2020 at 3:01 PM Ismaël Mejía  wrote:
>>
>>> The prefix is just extra characters and makes readability worse. Notice
>>> that the full category (e.g. ios/runners/etc) will still match because we
>>> have an extra tag equivalent to the prefix, so filtering is still
>>> possible. You rarely need to filter on more than one criterion, which is
>>> why github does not allow it (and the reason to have the extra
>>> per-category labels).
>>>
>>> The only issue I can see is a possible name collision in the future, but
>>> until that happens I think this is a reasonable tradeoff.
>>>
>>> Excellent idea to unify the colors for the categories, +1!
>>>
>>> On Wed, Feb 12, 2020 at 2:33 PM Alex Van Boxel  wrote:
>>>
 Ismael, I saw that you removed the prefix. I still want to have some
 grouping between the subtypes, so I color coded them.

 I also added all the labels in the file. We now have 62 labels.

  _/
 _/ Alex Van Boxel


 On Wed, Feb 12, 2020 at 12:03 PM Ismaël Mejía 
 wrote:

> Forgot to mention, older PRs will look unclassified; it's up to you guys
> if you want to label them manually. All new PRs will be automatically
> labeled.
>
> On Wed, Feb 12, 2020 at 12:02 PM Ismaël Mejía 
> wrote:
>
>> For info, Alex's PR to support the autolabeler was merged today and INFRA
>> enabled the plugin.
>> I created an artificial PR to check it was autolabeled correctly.
>> It is working perfectly now.
>> Thanks Alex !
>>
>> On Tue, Feb 11, 2020 at 5:23 PM Robert Bradshaw 
>> wrote:
>>
>>> +1 to finding the right balance.
>>>
>>> I do think per-runner makes sense, rather than a general "runners."
>>> IOs might make sense as well. Not sure about all the extensions-*; I'd
>>> leave those out for now.
>>>
>>> On Tue, Feb 11, 2020 at 5:56 AM Ismaël Mejía 
>>> wrote:
>>> >
>>> > > So I propose going simple with a limited set of labels. Later on
>>> we can refine. Don't forget that those labels are only useful during the
>>> life-cycle of a PR.
>>> >
>>> > Labels are handy for quick filtering and finding PRs we care about
>>> for example
>>> > to review.
>>> >
>>> > I agree with the feeling that we should not go to the extremes,
>>> but what is
>>> > requested in the PR would rarely produce more than 5 labels per
>>> PR.  For example
>>> > if a PR changes KafkaIO and something in the CI it will produce
>>> "java io kafka
>>> > infra", a pure change on Flink runer will produce "runners flink"
>>> >
>>> > 100% agreed with not having too many labels and keeping them short,
>>> but the current
>>> > classification lacks detail, e.g. few people care about some
>>> general categories
>>> > "runners" or "io", but maintainers may care about their specific
>>> categories like
>>> > "spark" or "kafka" so I don't think that this extra level of
>>> detail is
>>> > inappropriate and in the end it will only add one extra label per
>>> matching path.
>>> >
>>> > Let's give it a try; if it is too excessive we can take the opposite
>>> path and reduce it.
>>> >
>>> > Ismaël
>>> >
>>> >
>>> > On Tue, Feb 11, 2020 at 1:04 PM Alex Van Boxel 
>>> wrote:
>>> >>
>>> >> I'm wondering if we're not taking it too far with those detailed
>>> labels. It's like going from nothing to super details. The simples 
>>> use-case
>>> hasn't proven itself in practice yet.
>>> >>
>>> >> So I propose going simple with a limited set of labels. Later on
>>> we can refine. Don't forget that those labels 

Contributor permission for Beam Jira tickets

2020-02-13 Thread Wenbing Bai
Hi there,

I am Wenbing from Cruise. I would like to make some contributions to
the Python SDK for Beam. Can someone add me as a contributor in the
Beam Jira? My username is wenbing-bai.

Thank you!
Wenbing

-- 





Wenbing Bai

Senior Software Engineer, MLP

Cruise

Pronouns: She/Her



Re: FnAPI proto backwards compatibility

2020-02-13 Thread Kyle Weaver
> we can take advantage of these pipeline features to get rid of the
categories of @ValidatesRunner tests, because we could have just simply
@ValidatesRunner and each test would be matched against runner capabilities

+1, I think the potential to formally integrate our idea of compatibility
and unit testing is a big advantage of this proposal. Also, when deciding
where to draw lines between different URNs, it may help to look at the
existing validates runner test categories, which are currently the most
accurate signal we have regarding a runner's capabilities.
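
As a rough illustration of that matching idea (the URN strings and helper
names below are hypothetical, not Beam's actual test framework): a test
declares the features it requires, and it only runs against runners whose
capabilities would not reject it.

# Hypothetical URNs and helpers, for illustration only.
REQUIRES_TIME_SORTED_INPUT = "beam:requirement:pardo:time_sorted_input:v1"

def requires(*urns):
    """Attach required-feature URNs to a test function."""
    def decorator(test_fn):
        test_fn._required_urns = frozenset(urns)
        return test_fn
    return decorator

def should_run(test_fn, runner_capabilities):
    """A test runs iff the runner would not reject its requirements."""
    return getattr(test_fn, "_required_urns", frozenset()) <= frozenset(runner_capabilities)

@requires(REQUIRES_TIME_SORTED_INPUT)
def test_time_sorted_stateful_pardo():
    pass  # the actual pipeline and assertions would live here

# A runner that does not advertise time-sorted input simply skips the test:
print(should_run(test_time_sorted_stateful_pardo, {"beam:transform:pardo:v1"}))  # False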

On Thu, Feb 13, 2020 at 4:04 PM Robert Burke  wrote:

> Wrt the per-DoFn/ParDo level, there's the similar case of whether the DoFn
> has a URN for requiring something or an annotation saying the DoFn
> provides something (e.g. provides k-anonymization with a defined k).
>
> The general theme of this thread seems to be trying to ensure a runner can
> reject a pipeline if it's not able to provide the right guarantees, so that
> latter case isn't handled.
>
> Eg. The latter provisions could be used to analyze a pipeline to ensure
> the outputs are all properly anonymized to a certain degree at construction
> time.
>
> On Thu, Feb 13, 2020, 2:26 PM Kenneth Knowles  wrote:
>
>>
>>
>> On Thu, Feb 13, 2020 at 12:42 PM Jan Lukavský  wrote:
>>
>>> Hi,
>>>
>>> +1 for adding pipeline required features. I think being able to reject
>>> pipeline with unknown requirement is pretty much needed, mostly because
>>> that enables runners to completely decouple from SDKs, while being able to
>>> recognize when a pipeline constructed with an incompatible version of the SDK is
>>> run.
>>>
>>> I'll add some observations I made when implementing the latest "requires
>>> time sorted input" addition with regards to this discussion:
>>>
>>>  a) the features of pipeline are not simple function of set of
>>> PTransforms being present in the pipeline, but also depend on (type of)
>>> inputs. For instance a PTransform might have a simple expansion to
>>> primitive PTransforms in streaming case, but don't have such expansion in
>>> batch case. That is to say, runner that doesn't actually know of a specific
>>> extension to some PTransform _might_ actually execute it correctly under
>>> some conditions. But _must_ fail in other cases.
>>>
>>>  b) it would be good if this feature would work independently of
>>> portability (for Java SDK). We still have (at least two) non-portable
>>> runners that are IMO widely used in production and are likely to last for
>>> some time.
>>>
>> I think even if these runners keep their execution not using portability,
>> they should migrate to use the portable pipeline definition. Then they can
>> share the same model w/ runners that execute using portability. The Fn API
>> is not required to be used as long as the runner implements the semantics
>> of the pipeline.
>>
>> Kenn
>>
>>
>>>  c) we can take advantage of these pipeline features to get rid of the
>>> categories of @ValidatesRunner tests, because we could have just simply
>>> @ValidatesRunner and each test would be matched against runner capabilities
>>> (i.e. a runner would be tested with given test if and only if it would not
>>> reject it)
>>>
>>> Jan
>>> On 2/13/20 8:42 PM, Robert Burke wrote:
>>>
>>> +1 to deferring for now. Since they should not be modified after
>>> adoption, it makes sense not to get ahead of ourselves.
>>>
>>> On Thu, Feb 13, 2020, 10:59 AM Robert Bradshaw 
>>> wrote:
>>>
 On Thu, Feb 13, 2020 at 10:12 AM Robert Burke 
 wrote:
 >
 > One thing that doesn't appear to have been suggested yet is we could
 "batch" urns together under a "super urn" so that adding one super urn is
 like adding each of the represented batch of features. This prevents
 needing to send dozens of urns to be individually sent over.
 >
 >
 > The super urns would need to be static after definition to avoid
 mismatched definitions down the road.
 >
 > We collect together urns what is reasonably consider "vX" support,
 and can then increment that later.
 >
 > This would simplify new SDKs, as they can have a goal of initial v1
 support as we define what level of feature support it has, and doesn't
 prevent new capabilities from being added incrementally.

 Yes, this is a very good idea. I've also been thinking of certain sets
 of common operations/well known DoFns that often occur on opposite
 sides of GBKs (e.g. the pair-with-one, sum-ints, drop-keys, ...) that
 are commonly supported that could be grouped under these meta-urns.

 Note that these need not be monotonic, for example a current v1 might
 be requiring LengthPrefixCoderV1, but if a more efficient
 LengthPrefixCoderV2 comes along eventually v2 could require that and
 *not* require the old, now rarely used LengthPrefixCoderV1.

 Probably makes sense to defer adding such super-urns until we notice a
 set that is commonly 

Re: FnAPI proto backwards compatibility

2020-02-13 Thread Robert Burke
Wrt the per-DoFn/ParDo level, there's the similar case of whether the DoFn
has a URN for requiring something or an annotation saying the DoFn
provides something (e.g. provides k-anonymization with a defined k).

The general theme of this thread seems to be trying to ensure a runner can
reject a pipeline if it's not able to provide the right guarantees, so that
latter case isn't handled.

Eg. The latter provisions could be used to analyze a pipeline to ensure the
outputs are all properly anonymized to a certain degree at construction
time.
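
A minimal sketch of that distinction (the URN strings and structures below are
invented for illustration): only unmet requirements force a runner to reject,
while "provides"-style annotations are informational and can be ignored by a
runner that doesn't understand them.

REQUIRES_STABLE_INPUT = "beam:requirement:pardo:stable_input:v1"   # hypothetical
PROVIDES_K_ANONYMITY = "beam:annotation:provides_k_anonymity:v1"   # hypothetical

dofn_annotations = {
    "requirements": {REQUIRES_STABLE_INPUT},        # runner must honor these or reject
    "provides": {PROVIDES_K_ANONYMITY: {"k": 5}},   # e.g. k-anonymization with k defined
}

def must_reject(annotations, runner_capabilities):
    # Unknown "provides" entries are simply unused; only unmet requirements
    # make the pipeline unrunnable on this runner.
    return bool(annotations["requirements"] - set(runner_capabilities))

print(must_reject(dofn_annotations, {REQUIRES_STABLE_INPUT}))  # False: requirement met
print(must_reject(dofn_annotations, set()))                    # True: requirement unmet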

On Thu, Feb 13, 2020, 2:26 PM Kenneth Knowles  wrote:

>
>
> On Thu, Feb 13, 2020 at 12:42 PM Jan Lukavský  wrote:
>
>> Hi,
>>
>> +1 for adding pipeline required features. I think being able to reject
>> pipeline with unknown requirement is pretty much needed, mostly because
>> that enables runners to completely decouple from SDKs, while being able to
>> recognize when a pipeline constructed with an incompatible version of the SDK is
>> run.
>>
>> I'll add some observations I made when implementing the latest "requires
>> time sorted input" addition with regards to this discussion:
>>
>>  a) the features of pipeline are not simple function of set of
>> PTransforms being present in the pipeline, but also depend on (type of)
>> inputs. For instance a PTransform might have a simple expansion to
>> primitive PTransforms in streaming case, but don't have such expansion in
>> batch case. That is to say, runner that doesn't actually know of a specific
>> extension to some PTransform _might_ actually execute it correctly under
>> some conditions. But _must_ fail in other cases.
>>
>>  b) it would be good if this feature would work independently of
>> portability (for Java SDK). We still have (at least two) non-portable
>> runners that are IMO widely used in production and are likely to last for
>> some time.
>>
> I think even if these runners keep their execution not using portability,
> they should migrate to use the portable pipeline definition. Then they can
> share the same model w/ runners that execute using portability. The Fn API
> is not required to be used as long as the runner implements the semantics
> of the pipeline.
>
> Kenn
>
>
>>  c) we can take advantage of these pipeline features to get rid of the
>> categories of @ValidatesRunner tests, because we could have just simply
>> @ValidatesRunner and each test would be matched against runner capabilities
>> (i.e. a runner would be tested with given test if and only if it would not
>> reject it)
>>
>> Jan
>> On 2/13/20 8:42 PM, Robert Burke wrote:
>>
>> +1 to deferring for now. Since they should not be modified after
>> adoption, it makes sense not to get ahead of ourselves.
>>
>> On Thu, Feb 13, 2020, 10:59 AM Robert Bradshaw 
>> wrote:
>>
>>> On Thu, Feb 13, 2020 at 10:12 AM Robert Burke 
>>> wrote:
>>> >
>>> > One thing that doesn't appear to have been suggested yet is we could
>>> "batch" urns together under a "super urn" so that adding one super urn is
>>> like adding each of the represented batch of features. This prevents
>>> needing to send dozens of urns to be individually sent over.
>>> >
>>> >
>>> > The super urns would need to be static after definition to avoid
>>> mismatched definitions down the road.
>>> >
>>> > We collect together urns what is reasonably consider "vX" support, and
>>> can then increment that later.
>>> >
>>> > This would simplify new SDKs, as they can have a goal of initial v1
>>> support as we define what level of feature support it has, and doesn't
>>> prevent new capabilities from being added incrementally.
>>>
>>> Yes, this is a very good idea. I've also been thinking of certain sets
>>> of common operations/well known DoFns that often occur on opposite
>>> sides of GBKs (e.g. the pair-with-one, sum-ints, drop-keys, ...) that
>>> are commonly supported that could be grouped under these meta-urns.
>>>
>>> Note that these need not be monotonic, for example a current v1 might
>>> be requiring LengthPrefixCoderV1, but if a more efficient
>>> LengthPrefixCoderV2 comes along eventually v2 could require that and
>>> *not* require the old, now rarely used LengthPrefixCoderV1.
>>>
>>> Probably makes sense to defer adding such super-urns until we notice a
>>> set that is commonly used together in practice.
>>>
>>> Of course there's still value in SDKs being able to support features
>>> piecemeal as well, which is the big reason we're avoiding a simple
>>> monotonically-increasing version number.
>>>
>>> > Similarly, certain feature sets could stand alone, e.g. around SQL.
>>> It's beneficial for optimization reasons if an SDK has native projection
>>> and UDF support for example, which a runner could take advantage of by
>>> avoiding extra cross language hops. These could then also be grouped under
>>> a SQL super urn.
>>> >
>>> > This is from the SDK capability side of course, rather than the SDK
>>> pipeline requirements side.
>>> >
>>> > ---
>>> > Related to that last point, it might be 

Re: FnAPI proto backwards compatibility

2020-02-13 Thread Kenneth Knowles
On Thu, Feb 13, 2020 at 12:42 PM Jan Lukavský  wrote:

> Hi,
>
> +1 for adding pipeline required features. I think being able to reject
> pipeline with unknown requirement is pretty much needed, mostly because
> that enables runners to completely decouple from SDKs, while being able to
> recognize when a pipeline constructed with an incompatible version of the SDK is
> run.
>
> I'll add some observations I made when implementing the latest "requires
> time sorted input" addition with regards to this discussion:
>
>  a) the features of pipeline are not simple function of set of PTransforms
> being present in the pipeline, but also depend on (type of) inputs. For
> instance a PTransform might have a simple expansion to primitive
> PTransforms in streaming case, but don't have such expansion in batch case.
> That is to say, runner that doesn't actually know of a specific extension
> to some PTransform _might_ actually execute it correctly under some
> conditions. But _must_ fail in other cases.
>
>  b) it would be good if this feature would work independently of
> portability (for Java SDK). We still have (at least two) non-portable
> runners that are IMO widely used in production and are likely to last for
> some time.
>
I think even if these runners keep their execution not using portability,
they should migrate to use the portable pipeline definition. Then they can
share the same model w/ runners that execute using portability. The Fn API
is not required to be used as long as the runner implements the semantics
of the pipeline.

Kenn


>  c) we can take advantage of these pipeline features to get rid of the
> categories of @ValidatesRunner tests, because we could have just simply
> @ValidatesRunner and each test would be matched against runner capabilities
> (i.e. a runner would be tested with given test if and only if it would not
> reject it)
>
> Jan
> On 2/13/20 8:42 PM, Robert Burke wrote:
>
> +1 to deferring for now. Since they should not be modified after adoption,
> it makes sense not to get ahead of ourselves.
>
> On Thu, Feb 13, 2020, 10:59 AM Robert Bradshaw 
> wrote:
>
>> On Thu, Feb 13, 2020 at 10:12 AM Robert Burke  wrote:
>> >
>> > One thing that doesn't appear to have been suggested yet is we could
>> "batch" urns together under a "super urn" so that adding one super urn is
>> like adding each of the represented batch of features. This prevents
>> needing to send dozens of urns to be individually sent over.
>> >
>> >
>> > The super urns would need to be static after definition to avoid
>> mismatched definitions down the road.
>> >
>> > We collect together urns what is reasonably consider "vX" support, and
>> can then increment that later.
>> >
>> > This would simplify new SDKs, as they can have a goal of initial v1
>> support as we define what level of feature support it has, and doesn't
>> prevent new capabilities from being added incrementally.
>>
>> Yes, this is a very good idea. I've also been thinking of certain sets
>> of common operations/well known DoFns that often occur on opposite
>> sides of GBKs (e.g. the pair-with-one, sum-ints, drop-keys, ...) that
>> are commonly supported that could be grouped under these meta-urns.
>>
>> Note that these need not be monotonic, for example a current v1 might
>> be requiring LengthPrefixCoderV1, but if a more efficient
>> LengthPrefixCoderV2 comes along eventually v2 could require that and
>> *not* require the old, now rarely used LengthPrefixCoderV1.
>>
>> Probably makes sense to defer adding such super-urns until we notice a
>> set that is commonly used together in practice.
>>
>> Of course there's still value in SDKs being able to support features
>> piecemeal as well, which is the big reason we're avoiding a simple
>> monotonically-increasing version number.
>>
>> > Similarly, certain feature sets could stand alone, e.g. around SQL. It's
>> beneficial for optimization reasons if an SDK has native projection and UDF
>> support for example, which a runner could take advantage of by avoiding
>> extra cross language hops. These could then also be grouped under a SQL
>> super urn.
>> >
>> > This is from the SDK capability side of course, rather than the SDK
>> pipeline requirements side.
>> >
>> > ---
>> > Related to that last point, it might be good to nail down early the
>> perspective used when discussing these things, as there's a dual between
>> "what and SDK can do", and "what the runner will do to a pipeline that the
>> SDK can understand" (eg. Combiner lifting, and state backed iterables), as
>> well as "what the pipeline requires from the runner" and "what the runner
>> is able to do" (eg. Requires sorted input)
>> >
>> >
>> > On Thu, Feb 13, 2020, 9:06 AM Luke Cwik  wrote:
>> >>
>> >>
>> >>
>> >> On Wed, Feb 12, 2020 at 2:24 PM Kenneth Knowles 
>> wrote:
>> >>>
>> >>>
>> >>>
>> >>> On Wed, Feb 12, 2020 at 12:04 PM Robert Bradshaw 
>> wrote:
>> 
>>  On Wed, Feb 12, 2020 at 11:08 AM Luke Cwik  wrote:
>>  >
>>  > We can 

Re: daily dataflow job failing today

2020-02-13 Thread Kenneth Knowles
But pip doesn't try to reconcile user's requested version and Beam's listed
dep, right? (https://github.com/pypa/pip/issues/988 still open)

Kenn
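
For context, a sketch of the two dependency-specifier styles debated in the
quoted thread below, assuming standard setuptools syntax (this is not Beam's
actual setup.py): a hard pin versus a bounded range that excludes the broken
1.9.2 release, with the <1.10.0 cap suggested because Avro's numbering is not
semver.

from setuptools import setup

setup(
    name="example-package",  # placeholder project, for illustration only
    install_requires=[
        # Option 1: pin exactly (reproducible, but users cannot mix newer releases):
        #   'avro-python3==1.9.1; python_version >= "3.0"',
        # Option 2: bounded range, excluding the broken 1.9.2 release:
        'avro-python3>=1.8.1,!=1.9.2,<1.10.0; python_version >= "3.0"',
    ],
)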

On Thu, Feb 13, 2020 at 9:48 AM Ahmet Altay  wrote:

> Thank you, Ismaël. I did not know that Avro was not using semantic
> versioning either.
>
> On Thu, Feb 13, 2020 at 9:44 AM Valentyn Tymofieiev 
> wrote:
>
>> Thank you, Ismaël. Good to know Avro doesn't follow semantic versioning.
>> Replied on the PR.
>>
>> On Thu, Feb 13, 2020 at 5:24 AM Ismaël Mejía  wrote:
>>
>>> For info Avro has published a new version 1.9.2.1 that fixes the issue:
>>> https://issues.apache.org/jira/browse/AVRO-2737
>>>
>>> I just submitted a PR to make the dependency consistent with Avro
>>> versioning and
>>> verify that everything works as intended with the upgraded dependency on
>>> the
>>> python SDK. Can you PTAL?
>>> https://github.com/apache/beam/pull/10851
>>>
>>>
>>> On Thu, Feb 13, 2020 at 9:39 AM Ismaël Mejía  wrote:
>>>

 > I can argue for not pinning and bounding with major version ranges.
 This gives flexibility to users to mix other third party libraries that
 share common dependencies with Beam. Our expectation is that dependencies
 follow semantic versioning and do not introduce breaking changes unless
 there is a major version change. A good example of this is Beam's
 dependency on "pytz>=2018.3". It is a simple wrapper around a time zone
 file. Latest version of the dependency is 2019.3, it is updated a few times
 a year. Beam users do not have to update Beam just to be able to use a
 later version of it since Beam does not pin it.

 Avro does not follow semantic versioning (the first number corresponds
 to the version of the Avro binary format the release is compatible with,
 the second corresponds to the MAJOR and the third to the MINOR in semver),
 so we should then fix the upper bound to 1.10.0 instead of 2.0.0,
 considering that 1.10.x is expected before the summer and may contain
 breaking changes.

 > There is also a middle ground, where we can pin certain dependencies
 if we are not confident about their releases. And allow ranges for rest of
 the dependencies. In general, we are currently following this practice.

 I see your point, like many things in software it is all about
 tradeoffs, and it is good to find a middle ground, do we have a robust
 reproducible release experience, or do we deal with the annoyance of doing
 manual minor version upgrades. Choices choices...




 On Thu, Feb 13, 2020 at 2:26 AM Ahmet Altay  wrote:

>
>
> On Wed, Feb 12, 2020 at 12:54 PM Ismaël Mejía 
> wrote:
>
>> Independently of the bug in the dependency release, the fact that the
>> Beam Python SDK does not have pinned fixed dependency numbers is
>> error-prone. We may continue to have this kind of problem until we fix
>> this (with other dependencies too). In the Java SDK we do not accept such
>> dynamic dependency numbers, and Python should probably follow this
>> practice to avoid issues like the present one.
>>
>> Why don't we just do:
>>
>> 'avro-python3==1.9.1',
>>
>> instead of the current:
>>
>> 'avro-python3>=1.8.1,!=1.9.2,<2.0.0; python_version >= "3.0"',
>>
>
> I agree this is error prone. Your argument for pinning makes sense and
> I agree with it.
>
> I can argue for not pinning and bounding with major version ranges.
> This gives flexibility to users to mix other third party libraries that
> share common dependencies with Beam. Our expectation is that dependencies
> follow semantic versioning and do not introduce breaking changes unless
> there is a major version change. A good example of this is Beam's
> dependency on "pytz>=2018.3". It is a simple wrapper around a time zone
> file. Latest version of the dependency is 2019.3, it is updated a few 
> times
> a year. Beam users do not have to update Beam just to be able to use a
> later version of it since Beam does not pin it.
>
> There is also a middle ground, where we can pin certain dependencies
> if we are not confident about their releases. And allow ranges for rest of
> the dependencies. In general, we are currently following this practice.
>
>
>>
>>
>> On Wed, Feb 12, 2020 at 9:14 PM Ahmet Altay  wrote:
>>
>>> Related: we have dependencies on avro, avro-python3, and fastavro.
>>> fastavro supports both python 2 and 3. Could we reduce this dependency 
>>> list
>>> and depend only on fastavro? If we need avro and avro-python3 for the
>>> purposes of testing only, we can move them to test only dependencies.
>>>
>>> +Chamikara Jayalath , because I vaguely
>>> remember him working on this.
>>>
>>> The reason I am calling for this is the 

Re: Jenkins jobs not running for my PR 10438

2020-02-13 Thread Tomo Suzuki
Ahmet, thanks. But it seems Jenkins is not reporting the status
correctly. Will check tomorrow.

On Thu, Feb 13, 2020 at 2:45 PM Tomo Suzuki  wrote:
>
> Hi Beam committers,
>
> Would you run precommit checks on https://github.com/apache/beam/pull/10765
> with the following 6 additional commands?
> Run Java PostCommit
> Run Java HadoopFormatIO Performance Test
> Run BigQueryIO Streaming Performance Test Java
> Run Dataflow ValidatesRunner
> Run Spark ValidatesRunner
> Run SQL Postcommit
>
> Regards,
> Tomo



-- 
Regards,
Tomo


Re: FnAPI proto backwards compatibility

2020-02-13 Thread Jan Lukavský

Hi,

+1 for adding pipeline required features. I think being able to reject a 
pipeline with an unknown requirement is pretty much needed, mostly because 
that enables runners to completely decouple from SDKs, while still being able 
to recognize when a pipeline constructed with an incompatible version of the 
SDK is run.
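
As a rough sketch of the rejection logic being argued for (made-up URNs, not
actual Beam runner code): the runner compares the pipeline's declared
requirement URNs against the set it understands and fails fast on anything
unknown.

SUPPORTED_REQUIREMENTS = {
    "beam:requirement:pardo:stateful:v1",         # illustrative URN strings
    "beam:requirement:pardo:splittable_dofn:v1",
}

def validate_pipeline(requirement_urns):
    unknown = set(requirement_urns) - SUPPORTED_REQUIREMENTS
    if unknown:
        raise ValueError(
            "Pipeline declares requirements this runner does not understand: "
            + ", ".join(sorted(unknown)))

validate_pipeline({"beam:requirement:pardo:stateful:v1"})  # accepted
# A pipeline built with a newer/incompatible SDK is rejected up front, e.g.:
# validate_pipeline({"beam:requirement:requires_time_sorted_input:v1"})  # raises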


I'll add some observations I made when implementing the latest "requires 
time sorted input" addition with regards to this discussion:


 a) the features of a pipeline are not a simple function of the set of 
PTransforms being present in the pipeline, but also depend on the (type of) 
inputs. For instance, a PTransform might have a simple expansion to 
primitive PTransforms in the streaming case, but not have such an expansion 
in the batch case. That is to say, a runner that doesn't actually know of a 
specific extension to some PTransform _might_ actually execute it correctly 
under some conditions, but _must_ fail in other cases.


 b) it would be good if this feature would work independently of 
portability (for Java SDK). We still have (at least two) non-portable 
runners that are IMO widely used in production and are likely to last 
for some time.


 c) we can take advantage of these pipeline features to get rid of the 
categories of @ValidatesRunner tests, because we could have just simply 
@ValidatesRunner and each test would be matched against runner 
capabilities (i.e. a runner would be tested with a given test if and only 
if it would not reject it)


Jan

On 2/13/20 8:42 PM, Robert Burke wrote:
+1 to deferring for now. Since they should not be modified after 
adoption, it makes sense not to get ahead of ourselves.


On Thu, Feb 13, 2020, 10:59 AM Robert Bradshaw  wrote:


On Thu, Feb 13, 2020 at 10:12 AM Robert Burke  wrote:
>
> One thing that doesn't appear to have been suggested yet is we
could "batch" urns together under a "super urn" so that adding one
super urn is like adding each of the represented batch of
features. This prevents needing to send dozens of urns to be
individually sent over.
>
>
> The super urns would need to be static after definition to avoid
mismatched definitions down the road.
>
> We collect together urns what is reasonably consider "vX"
support, and can then increment that later.
>
> This would simplify new SDKs, as they can have a goal of initial
v1 support as we define what level of feature support it has, and
doesn't prevent new capabilities from being added incrementally.

Yes, this is a very good idea. I've also been thinking of certain sets
of common operations/well known DoFns that often occur on opposite
sides of GBKs (e.g. the pair-with-one, sum-ints, drop-keys, ...) that
are commonly supported that could be grouped under these meta-urns.

Note that these need not be monotonic, for example a current v1 might
be requiring LengthPrefixCoderV1, but if a more efficient
LengthPrefixCoderV2 comes along eventually v2 could require that and
*not* require the old, now rarely used LengthPrefixCoderV1.

Probably makes sense to defer adding such super-urns until we notice a
set that is commonly used together in practice.

Of course there's still value in SDKs being able to support features
piecemeal as well, which is the big reason we're avoiding a simple
monotonically-increasing version number.

> Similarly, certain features sets could stand alone, eg around
SQL. It's benefitial for optimization reasons if an SDK has native
projection and UDF support for example, which a runner could take
advantage of by avoiding extra cross language hops. These could
then also be grouped under a SQL super urn.
>
> This is from the SDK capability side of course, rather than the
SDK pipeline requirements side.
>
> ---
> Related to that last point, it might be good to nail down early
the perspective used when discussing these things, as there's a
dual between "what and SDK can do", and "what the runner will do
to a pipeline that the SDK can understand" (eg. Combiner lifting,
and state backed iterables), as well as "what the pipeline
requires from the runner" and "what the runner is able to do" (eg.
Requires sorted input)
>
>
> On Thu, Feb 13, 2020, 9:06 AM Luke Cwik  wrote:
>>
>>
>>
>> On Wed, Feb 12, 2020 at 2:24 PM Kenneth Knowles  wrote:
>>>
>>>
>>>
>>> On Wed, Feb 12, 2020 at 12:04 PM Robert Bradshaw  wrote:

 On Wed, Feb 12, 2020 at 11:08 AM Luke Cwik  wrote:
 >
 > We can always detect on the runner/SDK side whether there
is an unknown field[1] within a payload and fail to process it but
this is painful in two situations:
 > 1) It doesn't provide for a good 

Re: Jenkins jobs not running for my PR 10438

2020-02-13 Thread Tomo Suzuki
Hi Beam committers,

Would you run precommit checks on https://github.com/apache/beam/pull/10765
with the following 6 additional commands?
Run Java PostCommit
Run Java HadoopFormatIO Performance Test
Run BigQueryIO Streaming Performance Test Java
Run Dataflow ValidatesRunner
Run Spark ValidatesRunner
Run SQL Postcommit

Regards,
Tomo


Re: FnAPI proto backwards compatibility

2020-02-13 Thread Robert Burke
+1 to deferring for now. Since they should not be modified after adoption,
it makes sense not to get ahead of ourselves.

On Thu, Feb 13, 2020, 10:59 AM Robert Bradshaw  wrote:

> On Thu, Feb 13, 2020 at 10:12 AM Robert Burke  wrote:
> >
> > One thing that doesn't appear to have been suggested yet is we could
> "batch" urns together under a "super urn" so that adding one super urn is
> like adding each of the represented batch of features. This prevents
> needing to send dozens of urns to be individually sent over.
> >
> >
> > The super urns would need to be static after definition to avoid
> mismatched definitions down the road.
> >
> > We collect together urns what is reasonably consider "vX" support, and
> can then increment that later.
> >
> > This would simplify new SDKs, as they can have a goal of initial v1
> support as we define what level of feature support it has, and doesn't
> prevent new capabilities from being added incrementally.
>
> Yes, this is a very good idea. I've also been thinking of certain sets
> of common operations/well known DoFns that often occur on opposite
> sides of GBKs (e.g. the pair-with-one, sum-ints, drop-keys, ...) that
> are commonly supported that could be grouped under these meta-urns.
>
> Note that these need not be monotonic, for example a current v1 might
> be requiring LengthPrefixCoderV1, but if a more efficient
> LengthPrefixCoderV2 comes along eventually v2 could require that and
> *not* require the old, now rarely used LengthPrefixCoderV1.
>
> Probably makes sense to defer adding such super-urns until we notice a
> set that is commonly used together in practice.
>
> Of course there's still value in SDKs being able to support features
> piecemeal as well, which is the big reason we're avoiding a simple
> monotonically-increasing version number.
>
> > Similarly, certain feature sets could stand alone, e.g. around SQL. It's
> beneficial for optimization reasons if an SDK has native projection and UDF
> support for example, which a runner could take advantage of by avoiding
> extra cross language hops. These could then also be grouped under a SQL
> super urn.
> >
> > This is from the SDK capability side of course, rather than the SDK
> pipeline requirements side.
> >
> > ---
> > Related to that last point, it might be good to nail down early the
> perspective used when discussing these things, as there's a dual between
> "what and SDK can do", and "what the runner will do to a pipeline that the
> SDK can understand" (eg. Combiner lifting, and state backed iterables), as
> well as "what the pipeline requires from the runner" and "what the runner
> is able to do" (eg. Requires sorted input)
> >
> >
> > On Thu, Feb 13, 2020, 9:06 AM Luke Cwik  wrote:
> >>
> >>
> >>
> >> On Wed, Feb 12, 2020 at 2:24 PM Kenneth Knowles 
> wrote:
> >>>
> >>>
> >>>
> >>> On Wed, Feb 12, 2020 at 12:04 PM Robert Bradshaw 
> wrote:
> 
>  On Wed, Feb 12, 2020 at 11:08 AM Luke Cwik  wrote:
>  >
>  > We can always detect on the runner/SDK side whether there is an
> unknown field[1] within a payload and fail to process it but this is
> painful in two situations:
>  > 1) It doesn't provide for a good error message since you can't say
> what the purpose of the field is. With a capability URN, the runner/SDK
> could say which URN it doesn't understand.
>  > 2) It doesn't allow for the addition of fields which don't impact
> semantics of execution. For example, if the display data feature was being
> developed, a runner could ignore it and still execute the pipeline
> correctly.
> 
>  Yeah, I don't think proto reflection is a flexible enough tool to do
>  this well either.
> 
>  > If we think this to be common enough, we can add capabilities list
> to the PTransform so each PTransform can do this and has a natural way of
> being extended for additions which are forwards compatible. The alternative
> to having capabilities on PTransform (and other constructs) is that we
> would have a new URN when the specification of the transform changes. For
> forwards compatible changes, each SDK/runner would map older versions of
> the URN onto the latest and internally treat it as the latest version but
> always downgrade it to the version the other party expects when
> communicating with it. Backwards incompatible changes would always require
> a new URN which capabilities at the PTransform level would not help with.
> 
>  As you point out, stateful+splittable may not be a particularly useful
>  combination, but as another example, we have
>  (backwards-incompatible-when-introduced) markers on DoFn as to whether
>  it requires finalization, stable inputs, and now time sorting. I don't
>  think we should have a new URN for each combination.
> >>>
> >>>
> >>> Agree with this. I don't think stateful, splittable, and "plain" ParDo
> are comparable to these. Each is an entirely different computational
> paradigm: per-element independent 

Re: FnAPI proto backwards compatibility

2020-02-13 Thread Robert Bradshaw
On Thu, Feb 13, 2020 at 10:12 AM Robert Burke  wrote:
>
> One thing that doesn't appear to have been suggested yet is we could "batch" 
> urns together under a "super urn" so that adding one super urn is like adding 
> each of the represented batch of features. This prevents needing to send 
> dozens of urns to be individually sent over.
>
>
> The super urns would need to be static after definition to avoid mismatched 
> definitions down the road.
>
> We collect together urns what is reasonably consider "vX" support, and can 
> then increment that later.
>
> This would simplify new SDKs, as they can have a goal of initial v1 support 
> as we define what level of feature support it has, and doesn't prevent new 
> capabilities from being added incrementally.

Yes, this is a very good idea. I've also been thinking of certain sets
of common operations/well known DoFns that often occur on opposite
sides of GBKs (e.g. the pair-with-one, sum-ints, drop-keys, ...) that
are commonly supported that could be grouped under these meta-urns.

Note that these need not be monotonic, for example a current v1 might
be requiring LengthPrefixCoderV1, but if a more efficient
LengthPrefixCoderV2 comes along eventually v2 could require that and
*not* require the old, now rarely used LengthPrefixCoderV1.

Probably makes sense to defer adding such super-urns until we notice a
set that is commonly used together in practice.

Of course there's still value in SDKs being able to support features
piecemeal as well, which is the big reason we're avoiding a simple
monotonically-increasing version number.
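
A tiny sketch of that non-monotonicity (URN strings invented for
illustration): a later "v2" grouping need not be a superset of "v1".

SUPER_URNS = {
    "beam:capability_set:v1": {"beam:coder:length_prefix:v1", "beam:transform:gbk:v1"},
    "beam:capability_set:v2": {"beam:coder:length_prefix:v2", "beam:transform:gbk:v1"},
}

v1 = SUPER_URNS["beam:capability_set:v1"]
v2 = SUPER_URNS["beam:capability_set:v2"]
print(v1 <= v2)  # False: v2 drops the old length-prefix coder rather than keeping it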

> Similarly, certain feature sets could stand alone, e.g. around SQL. It's 
> beneficial for optimization reasons if an SDK has native projection and UDF 
> support for example, which a runner could take advantage of by avoiding extra 
> cross language hops. These could then also be grouped under a SQL super urn.
>
> This is from the SDK capability side of course, rather than the SDK pipeline 
> requirements side.
>
> ---
> Related to that last point, it might be good to nail down early the 
> perspective used when discussing these things, as there's a dual between 
> "what and SDK can do", and "what the runner will do to a pipeline that the 
> SDK can understand" (eg. Combiner lifting, and state backed iterables), as 
> well as "what the pipeline requires from the runner" and "what the runner is 
> able to do" (eg. Requires sorted input)
>
>
> On Thu, Feb 13, 2020, 9:06 AM Luke Cwik  wrote:
>>
>>
>>
>> On Wed, Feb 12, 2020 at 2:24 PM Kenneth Knowles  wrote:
>>>
>>>
>>>
>>> On Wed, Feb 12, 2020 at 12:04 PM Robert Bradshaw  
>>> wrote:

 On Wed, Feb 12, 2020 at 11:08 AM Luke Cwik  wrote:
 >
 > We can always detect on the runner/SDK side whether there is an unknown 
 > field[1] within a payload and fail to process it but this is painful in 
 > two situations:
 > 1) It doesn't provide for a good error message since you can't say what 
 > the purpose of the field is. With a capability URN, the runner/SDK could 
 > say which URN it doesn't understand.
 > 2) It doesn't allow for the addition of fields which don't impact 
 > semantics of execution. For example, if the display data feature was 
 > being developed, a runner could ignore it and still execute the pipeline 
 > correctly.

 Yeah, I don't think proto reflection is a flexible enough tool to do
 this well either.

 > If we think this to be common enough, we can add capabilities list to 
 > the PTransform so each PTransform can do this and has a natural way of 
 > being extended for additions which are forwards compatible. The 
 > alternative to having capabilities on PTransform (and other constructs) 
 > is that we would have a new URN when the specification of the transform 
 > changes. For forwards compatible changes, each SDK/runner would map 
 > older versions of the URN onto the latest and internally treat it as the 
 > latest version but always downgrade it to the version the other party 
 > expects when communicating with it. Backwards incompatible changes would 
 > always require a new URN which capabilities at the PTransform level 
 > would not help with.

 As you point out, stateful+splittable may not be a particularly useful
 combination, but as another example, we have
 (backwards-incompatible-when-introduced) markers on DoFn as to whether
 it requires finalization, stable inputs, and now time sorting. I don't
 think we should have a new URN for each combination.
>>>
>>>
>>> Agree with this. I don't think stateful, splittable, and "plain" ParDo are 
>>> comparable to these. Each is an entirely different computational paradigm: 
>>> per-element independent processing, per-key-and-window linear processing, 
>>> and per-element-and-restriction splittable processing. Most relevant IMO is 
>>> the nature of the parallelism. If you added 

Re: FnAPI proto backwards compatibility

2020-02-13 Thread Robert Burke
One thing that doesn't appear to have been suggested yet is we could
"batch" urns together under a "super urn" so that adding one super urn is
like adding each of the represented batch of features. This avoids
needing to send dozens of urns individually.


The super urns would need to be static after definition to avoid mismatched
definitions down the road.

We collect together urns into what is reasonably considered "vX" support, and
can then increment that later.

This would simplify new SDKs, as they can have a goal of initial v1 support
as we define what level of feature support it has, and doesn't prevent new
capabilities from being added incrementally.
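
A minimal sketch of that batching idea (URN strings invented for
illustration): declaring one super urn is treated as declaring every member
of its frozen set, while plain urns pass through unchanged.

SUPER_URNS = {
    "beam:capability_set:v1": frozenset({
        "beam:coder:length_prefix:v1",
        "beam:transform:pardo:v1",
        "beam:transform:group_by_key:v1",
    }),
}

def expand(declared_urns):
    """Expand any super urn into its members."""
    expanded = set()
    for urn in declared_urns:
        expanded |= SUPER_URNS.get(urn, frozenset({urn}))
    return expanded

print(sorted(expand({"beam:capability_set:v1", "beam:transform:sql:v1"})))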

Similarly, certain feature sets could stand alone, e.g. around SQL. It's
beneficial for optimization reasons if an SDK has native projection and UDF
support, for example, which a runner could take advantage of by avoiding
extra cross-language hops. These could then also be grouped under a SQL
super urn.

This is from the SDK capability side of course, rather than the SDK
pipeline requirements side.

---
Related to that last point, it might be good to nail down early the
perspective used when discussing these things, as there's a dual between
"what and SDK can do", and "what the runner will do to a pipeline that the
SDK can understand" (eg. Combiner lifting, and state backed iterables), as
well as "what the pipeline requires from the runner" and "what the runner
is able to do" (eg. Requires sorted input)


On Thu, Feb 13, 2020, 9:06 AM Luke Cwik  wrote:

>
>
> On Wed, Feb 12, 2020 at 2:24 PM Kenneth Knowles  wrote:
>
>>
>>
>> On Wed, Feb 12, 2020 at 12:04 PM Robert Bradshaw 
>> wrote:
>>
>>> On Wed, Feb 12, 2020 at 11:08 AM Luke Cwik  wrote:
>>> >
>>> > We can always detect on the runner/SDK side whether there is an
>>> unknown field[1] within a payload and fail to process it but this is
>>> painful in two situations:
>>> > 1) It doesn't provide for a good error message since you can't say
>>> what the purpose of the field is. With a capability URN, the runner/SDK
>>> could say which URN it doesn't understand.
>>> > 2) It doesn't allow for the addition of fields which don't impact
>>> semantics of execution. For example, if the display data feature was being
>>> developed, a runner could ignore it and still execute the pipeline
>>> correctly.
>>>
>>> Yeah, I don't think proto reflection is a flexible enough tool to do
>>> this well either.
>>>
>>> > If we think this to be common enough, we can add capabilities list to
>>> the PTransform so each PTransform can do this and has a natural way of
>>> being extended for additions which are forwards compatible. The alternative
>>> to having capabilities on PTransform (and other constructs) is that we
>>> would have a new URN when the specification of the transform changes. For
>>> forwards compatible changes, each SDK/runner would map older versions of
>>> the URN onto the latest and internally treat it as the latest version but
>>> always downgrade it to the version the other party expects when
>>> communicating with it. Backwards incompatible changes would always require
>>> a new URN which capabilities at the PTransform level would not help with.
>>>
>>> As you point out, stateful+splittable may not be a particularly useful
>>> combination, but as another example, we have
>>> (backwards-incompatible-when-introduced) markers on DoFn as to whether
>>> it requires finalization, stable inputs, and now time sorting. I don't
>>> think we should have a new URN for each combination.
>>>
>>
>> Agree with this. I don't think stateful, splittable, and "plain" ParDo
>> are comparable to these. Each is an entirely different computational
>> paradigm: per-element independent processing, per-key-and-window linear
>> processing, and per-element-and-restriction splittable processing. Most
>> relevant IMO is the nature of the parallelism. If you added state to
>> splittable processing, it would still be splittable processing. Just as
>> Combine and ParDo can share the SideInput specification, it is easy to
>> share relevant sub-structures like state declarations. But it is a fair
>> point that the ability to split can be ignored and run as a plain-old
>> ParDo. It brings up the question of whether a runner that doesn't know SDF
>> should have to reject it or should be allowed to run poorly.
>>
>
> Being splittable means that the SDK could choose to return a continuation
> saying please process the rest of my element in X amount of time which
> would require the runner to inspect certain fields on responses. One
> example would be I don't have many more messages to read from this message
> stream at the moment and another example could be that I detected that this
> filesystem is throttling me or is down and I would like to resume
> processing later.
>
>
>> It isn't a huge deal. Three different top-level URNS versus three
>> different sub-URNs will achieve the same result in the end if we get this
>> 

Re: [PROPOSAL] Transition released containers to the official ASF dockerhub organization

2020-02-13 Thread Robert Burke
+1 to a bulk add. A shared account removes all accountability and is at risk
for abuse.

As it stands, the release managers could abuse their privilege, but we'd
have the opportunity to know about whodunnit.

On Thu, Feb 13, 2020, 9:51 AM Robert Bradshaw  wrote:

> +1, granting permission to individual accounts is preferable to trying
> to share a single account.
>
> On Thu, Feb 13, 2020 at 9:44 AM Ahmet Altay  wrote:
> >
> > Could we ask them to bulk add a list of people to the list? We
> could add all PMC members and previous release managers to the list. That
> might cover a good chunk of the future releases.
> >
> > On Wed, Feb 12, 2020 at 10:10 PM Hannah Jiang 
> wrote:
> >>
> >> Thanks everyone for supporting it.
> >>
> >> Yes, it's very slow to get tickets resolved by infra. I propose a minor
> improvement to reduce interactions with infra.
> >>
> >> So far, we have granted maintainer permission(read & write) to release
> managers' personal accounts. This step needs help from infra to add new
> members to the group for every new release manager.
> >> In order to avoid this, I proposed that we create a new account for
> release purpose only and share it with release managers. The new account
> will have read & write permissions to all Apache Beam docker repositories.
> A password will be shared on an as-needed basis and we can change the
> password periodically if needed, which is in our control. Are there any
> concerns which I am not aware of with the sharing account approach?
> >>
> >> Thanks,
> >> Hannah
> >>
> >>
> >> On Thu, Jan 16, 2020 at 10:41 AM Kenneth Knowles 
> wrote:
> >>>
> >>> +1 very nice explanation
> >>>
> >>> On Wed, Jan 15, 2020 at 1:57 PM Ahmet Altay  wrote:
> 
>  +1 - Thank you for driving this!
> 
>  On Wed, Jan 15, 2020 at 1:55 PM Thomas Weise  wrote:
> >
> > +1 for the namespace proposal.
> >
> > It is similar to github repos. Top-level is the org, then single
> level for repo (beam-abc, beam-xzy, ..)
> >
> >
> >
> > On Wed, Jan 15, 2020 at 1:45 PM Robert Bradshaw 
> wrote:
> >>
> >> Various tags of the same image should at least logically be the same
> >> thing, so I agree that we should not be trying to share a single
> >> repository in that way. A full suite of apache/beam-{image_desc}
> >> repositories, if apache is fine with that, seems like the best
> >> approach.
> >>
> >> On Wed, Jan 15, 2020 at 1:32 PM Kyle Weaver 
> wrote:
> >> >
> >> > +1, agree that moving current image name to tags is a
> non-starter. Thanks for driving this Hannah. Let us know what they say
> about repo creation.
> >> >
> >> > On Wed, Jan 15, 2020 at 1:16 PM Udi Meiri 
> wrote:
> >> >>
> >> >> SG +1
> >> >>
> >> >> On Wed, Jan 15, 2020 at 12:59 PM Hannah Jiang <
> hannahji...@google.com> wrote:
> >> >>>
> >> >>> I have done some research about images released under apache
> namespace at docker hub, and here is my proposal.
> >> >>>
> >> >>> Currently, we are using apachebeam as our namespace and each
> image has its own repository. Version number is used to tag the images.
> >> >>> ie: apachebeam/python2.7_sdk:2.19.0,
> apachebeam/flink1.9_job_server:2.19.0
> >> >>>
> >> >>> Now we are migrating to apache namespace and docker hub doesn't
> support nested repository names, so we cannot use
> apache/beam/{image-desc}:{version}.
> >> >>> Instead, I propose to use apache/beam-{image_desc}:{version} as
> our repository name.
> >> >>> ie: apache/beam-python2.7_sdk:2.19.0,
> apache/beam-flink1.9_job_server:2.19.0
> >> >>> => When a user searches for apache/beam at docker hub, it will
> list all the repositories we deployed with apache/beam-, so no concerns
> that some released images are missed by users.
> >> >>> => Repository names give insights to the users which
> repositories they should use.
> >> >>> => A downside with this approach is we need to create a new
> repository whenever we release a new image, time and effort needed for this
> is pending, I am contacting apache docker hub management team.
> >> >>>
> >> >>> I have considered using beam as repository name and moving
> image name and version to tags, (ie: apache/beam:python3.7_sdk_2.19.0),
> which means put all images to a single repository, however, this approach
> has some downsides.
> >> >>> => When a user searches for apache/beam, only one repository is
> returned. Users need to use tags to identify which images they should use.
> Since we release images with new tags for each version, it will overwhelm
> the users and give them an impression that the images are not organized
> well. It's also difficult to know what kind of images we deployed.
> >> >>> => With both image name and version included at tags, it is a
> little bit more complicated to maintain the code.
> >> >>> => There is no correct answer which image the latest tag should
> point to.
> 

Re: [PROPOSAL] Transition released containers to the official ASF dockerhub organization

2020-02-13 Thread Robert Bradshaw
+1, granting permission to individual accounts is preferable to trying
to share a single account.

On Thu, Feb 13, 2020 at 9:44 AM Ahmet Altay  wrote:
>
> Could we ask them to bulk add a list of people to the list? We could 
> add all PMC members and previous release managers to the list. That might 
> cover a good chunk of the future releases.
>
> On Wed, Feb 12, 2020 at 10:10 PM Hannah Jiang  wrote:
>>
>> Thanks everyone for supporting it.
>>
>> Yes, it's very slow to get tickets resolved by infra. I propose a minor 
>> improvement to reduce interactions with infra.
>>
>> So far, we have granted maintainer permission(read & write) to release 
>> managers' personal accounts. This step needs help from infra to add new 
>> members to the group for every new release manager.
>> In order to avoid this, I proposed that we create a new account for release 
>> purpose only and share it with release managers. The new account will have 
>> read & write permissions to all Apache Beam docker repositories. A password 
>> will be shared on an as-needed basis and we can change the password 
>> periodically if needed, which is in our control. Are there any concerns 
>> which I am not aware of with the sharing account approach?
>>
>> Thanks,
>> Hannah
>>
>>
>> On Thu, Jan 16, 2020 at 10:41 AM Kenneth Knowles  wrote:
>>>
>>> +1 very nice explanation
>>>
>>> On Wed, Jan 15, 2020 at 1:57 PM Ahmet Altay  wrote:

 +1 - Thank you for driving this!

 On Wed, Jan 15, 2020 at 1:55 PM Thomas Weise  wrote:
>
> +1 for the namespace proposal.
>
> It is similar to github repos. Top-level is the org, then single level 
> for repo (beam-abc, beam-xzy, ..)
>
>
>
> On Wed, Jan 15, 2020 at 1:45 PM Robert Bradshaw  
> wrote:
>>
>> Various tags of the same image should at least logically be the same
>> thing, so I agree that we should not be trying to share a single
>> repository in that way. A full suite of apache/beam-{image_desc}
>> repositories, if apache is fine with that, seems like the best
>> approach.
>>
>> On Wed, Jan 15, 2020 at 1:32 PM Kyle Weaver  wrote:
>> >
>> > +1, agree that moving current image name to tags is a non-starter. 
>> > Thanks for driving this Hannah. Let us know what they say about repo 
>> > creation.
>> >
>> > On Wed, Jan 15, 2020 at 1:16 PM Udi Meiri  wrote:
>> >>
>> >> SG +1
>> >>
>> >> On Wed, Jan 15, 2020 at 12:59 PM Hannah Jiang 
>> >>  wrote:
>> >>>
>> >>> I have done some research about images released under apache 
>> >>> namespace at docker hub, and here is my proposal.
>> >>>
>> >>> Currently, we are using apachebeam as our namespace and each image 
>> >>> has its own repository. Version number is used to tag the images.
>> >>> ie: apachebeam/python2.7_sdk:2.19.0, 
>> >>> apachebeam/flink1.9_job_server:2.19.0
>> >>>
>> >>> Now we are migrating to apache namespace and docker hub doesn't 
>> >>> support nested repository names, so we cannot use 
>> >>> apache/beam/{image-desc}:{version}.
>> >>> Instead, I propose to use apache/beam-{image_desc}:{version} as our 
>> >>> repository name.
>> >>> ie: apache/beam-python2.7_sdk:2.19.0, 
>> >>> apache/beam-flink1.9_job_server:2.19.0
>> >>> => When a user searches for apache/beam at docker hub, it will list 
>> >>> all the repositories we deployed with apache/beam-, so no concerns 
>> >>> that some released images are missed by users.
>> >>> => Repository names give insights to the users which repositories 
>> >>> they should use.
>> >>> => A downside with this approach is we need to create a new 
>> >>> repository whenever we release a new image, time and effort needed 
>> >>> for this is pending, I am contacting apache docker hub management 
>> >>> team.
>> >>>
>> >>> I have considered using beam as repository name and moving image 
>> >>> name and version to tags, (ie: apache/beam:python3.7_sdk_2.19.0), 
>> >>> which means put all images to a single repository, however, this 
>> >>> approach has some downsides.
>> >>> => When a user searches for apache/beam, only one repository is 
>> >>> returned. Users need to use tags to identify which images they 
>> >>> should use. Since we release images with new tags for each version, 
>> >>> it will overwhelm the users and give them an impression that the 
>> >>> images are not organized well. It's also difficult to know what kind 
>> >>> of images we deployed.
>> >>> => With both image name and version included at tags, it is a little 
>> >>> bit more complicated to maintain the code.
>> >>> => There is no correct answer which image the latest tag should 
>> >>> point to.
>> >>>
>> >>> Are there any concerns with this proposal?
>> >>>
>> >>> Thanks,
>> >>> Hannah
>> >>>
>> >>>
>> >>>

Re: daily dataflow job failing today

2020-02-13 Thread Ahmet Altay
Thank you, Ismaël. I did not know that Avro was not using semantic
versioning either.

On Thu, Feb 13, 2020 at 9:44 AM Valentyn Tymofieiev 
wrote:

> Thank you, Ismaël. Good to know Avro doesn't follow semantic versioning.
> Replied on the PR.
>
> On Thu, Feb 13, 2020 at 5:24 AM Ismaël Mejía  wrote:
>
>> For info Avro has published a new version 1.9.2.1 that fixes the issue:
>> https://issues.apache.org/jira/browse/AVRO-2737
>>
>> I just submitted a PR to make the dependency consistent with Avro
>> versioning and
>> verify that everything works as intended with the upgraded dependency on
>> the
>> python SDK. Can you PTAL?
>> https://github.com/apache/beam/pull/10851
>>
>>
>> On Thu, Feb 13, 2020 at 9:39 AM Ismaël Mejía  wrote:
>>
>>>
>>> > I can argue for not pinning and bounding with major version ranges.
>>> This gives flexibility to users to mix other third party libraries that
>>> share common dependencies with Beam. Our expectation is that dependencies
>>> follow semantic versioning and do not introduce breaking changes unless
>>> there is a major version change. A good example of this is Beam's
>>> dependency on "pytz>=2018.3". It is a simple wrapper around a time zone
>>> file. Latest version of the dependency is 2019.3, it is updated a few times
>>> a year. Beam users do not have to update Beam just to be able to use a
>>> later version of it since Beam does not pin it.
>>>
>>> Avro does not follow semantic versioning (the first number corresponds
>>> to the version of the Avro binary format the release is compatible with,
>>> the second corresponds to the MAJOR and the third to the MINOR in semver),
>>> so we should then fix the upper bound to 1.10.0 instead of 2.0.0,
>>> considering that 1.10.x is expected before the summer and may contain
>>> breaking changes.
>>>
>>> > There is also a middle ground, where we can pin certain dependencies
>>> if we are not confident about their releases. And allow ranges for rest of
>>> the dependencies. In general, we are currently following this practice.
>>>
>>> I see your point, like many things in software it is all about
>>> tradeoffs, and it is good to find a middle ground, do we have a robust
>>> reproducible release experience, or do we deal with the annoyance of doing
>>> manual minor version upgrades. Choices choices...
>>>
>>>
>>>
>>>
>>> On Thu, Feb 13, 2020 at 2:26 AM Ahmet Altay  wrote:
>>>


 On Wed, Feb 12, 2020 at 12:54 PM Ismaël Mejía 
 wrote:

> Independently of the bug in the dependency release, the fact that the
> Beam Python
> SDK does not have pinned, fixed dependency versions is error-prone. We
> may
> continue to have this kind of problem until we fix this (with other
> dependencies too). In the Java SDK we do not accept such dynamic
> dependency versions, and Python should probably follow this practice to
> avoid
> issues like the present one.
>
> Why don't we just do:
>
> 'avro-python3==1.9.1',
>
> instead of the current:
>
> 'avro-python3>=1.8.1,!=1.9.2,<2.0.0; python_version >= "3.0"',
>

 I agree this is error prone. Your argument for pinning makes sense and
 I agree with it.

 I can argue for not pinning and bounding with major version ranges.
 This gives flexibility to users to mix other third party libraries that
 share common dependencies with Beam. Our expectation is that dependencies
 follow semantic versioning and do not introduce breaking changes unless
 there is a major version change. A good example of this is Beam's
 dependency on "pytz>=2018.3". It is a simple wrapper around a time zone
 file. Latest version of the dependency is 2019.3, it is updated a few times
 a year. Beam users do not have to update Beam just to be able to use a
 later version of it since Beam does not pin it.

 There is also a middle ground, where we can pin certain dependencies if
 we are not confident about their releases. And allow ranges for rest of the
 dependencies. In general, we are currently following this practice.


>
>
> On Wed, Feb 12, 2020 at 9:14 PM Ahmet Altay  wrote:
>
>> Related: we have dependencies on avro, avro-python3, and fastavro.
>> fastavro supports both python 2 and 3. Could we reduce this dependency 
>> list
>> and depend only on fastavro? If we need avro and avro-python3 for the
>> purposes of testing only, we can move them to test only dependencies.
>>
>> +Chamikara Jayalath , because I vaguely
>> remember him working on this.
>>
>> The reason I am calling for this is that the impact of bad dependency
>> releases is high. All previously released Beam versions will be 
>> impacted.
>> Reducing the dependency list will reduce the risk.
>>
>> Ahmet
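If avro and avro-python3 were moved to test-only dependencies as suggested
above, a minimal setup.py sketch could look like the following (package names
and markers are illustrative, not the actual Beam configuration):

    # Illustrative sketch only -- not the actual Beam setup.py.
    from setuptools import setup

    setup(
        name='example-package',
        # fastavro stays as the runtime Avro dependency.
        install_requires=['fastavro'],
        extras_require={
            # avro / avro-python3 become test-only, installed with
            # `pip install example-package[test]`.
            'test': [
                'avro; python_version < "3.0"',
                'avro-python3; python_version >= "3.0"',
            ],
        },
    )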
>>
>> On Wed, Feb 12, 2020 at 12:02 PM Ahmet Altay 
>> wrote:
>>
>>> Thank you Valentyn!
>>>
>>> On Wed, 

Re: [PROPOSAL] Transition released containers to the official ASF dockerhub organization

2020-02-13 Thread Ahmet Altay
Could we ask them to bulk-add a list of people? We could
add all PMC members and previous release managers to the list. That might
cover a good chunk of future releases.

On Wed, Feb 12, 2020 at 10:10 PM Hannah Jiang 
wrote:

> Thanks everyone for supporting it.
>
> Yes, it's very slow to get tickets resolved by infra. I propose a minor
> improvement to reduce interactions with infra.
>
> So far, we have granted maintainer permission (read & write) to release
> managers' personal accounts. This step needs help from infra to add new
> members to the group for every new release manager.
> In order to avoid this, I propose that we create a new account for
> release purposes only and share it with release managers. The new account
> will have read & write permissions to all Apache Beam docker repositories.
> The password will be shared on an as-needed basis and we can change the
> password periodically if needed, which is in our control. Are there any
> concerns which I am not aware of with the shared-account approach?
>
> Thanks,
> Hannah
>
>
> On Thu, Jan 16, 2020 at 10:41 AM Kenneth Knowles  wrote:
>
>> +1 very nice explanation
>>
>> On Wed, Jan 15, 2020 at 1:57 PM Ahmet Altay  wrote:
>>
>>> +1 - Thank you for driving this!
>>>
>>> On Wed, Jan 15, 2020 at 1:55 PM Thomas Weise  wrote:
>>>
 +1 for the namespace proposal.

 It is similar to github repos. Top-level is the org, then single level
 for repo (beam-abc, beam-xzy, ..)



 On Wed, Jan 15, 2020 at 1:45 PM Robert Bradshaw 
 wrote:

> Various tags of the same image should at least logically be the same
> thing, so I agree that we should not be trying to share a single
> repository in that way. A full suite of apache/beam-{image_desc}
> repositories, if apache is fine with that, seems like the best
> approach.
>
> On Wed, Jan 15, 2020 at 1:32 PM Kyle Weaver 
> wrote:
> >
> > +1, agree that moving current image name to tags is a non-starter.
> Thanks for driving this Hannah. Let us know what they say about repo
> creation.
> >
> > On Wed, Jan 15, 2020 at 1:16 PM Udi Meiri  wrote:
> >>
> >> SG +1
> >>
> >> On Wed, Jan 15, 2020 at 12:59 PM Hannah Jiang <
> hannahji...@google.com> wrote:
> >>>
> >>> I have done some research about images released under apache
> namespace at docker hub, and here is my proposal.
> >>>
> >>> Currently, we are using apachebeam as our namespace and each image
> has its own repository. Version number is used to tag the images.
> >>> ie: apachebeam/python2.7_sdk:2.19.0,
> apachebeam/flink1.9_job_server:2.19.0
> >>>
> >>> Now we are migrating to apache namespace and docker hub doesn't
> support nested repository names, so we cannot use
> apache/beam/{image-desc}:{version}.
> >>> Instead, I propose to use apache/beam-{image_desc}:{version} as
> our repository name.
> >>> ie: apache/beam-python2.7_sdk:2.19.0,
> apache/beam-flink1.9_job_server:2.19.0
> >>> => When a user searches for apache/beam at docker hub, it will
> list all the repositories we deployed with apache/beam-, so no concerns
> that some released images are missed by users.
> >>> => Repository names give users insight into which repositories
> they should use.
> >>> => A downside of this approach is that we need to create a new
> repository whenever we release a new image; the time and effort needed for 
> this
> is still to be determined, and I am contacting the apache docker hub management team.
> >>>
> >>> I have considered using beam as the repository name and moving the image
> name and version to tags (ie: apache/beam:python3.7_sdk_2.19.0), which
> means putting all images in a single repository; however, this approach has
> some downsides.
> >>> => When a user searches for apache/beam, only one repository is
> returned. Users need to use tags to identify which images they should use.
> Since we release images with new tags for each version, it will overwhelm
> the users and give them an impression that the images are not organized
> well. It's also difficult to know what kind of images we deployed.
> >>> => With both image name and version included at tags, it is a
> little bit more complicated to maintain the code.
> >>> => There is no correct answer as to which image the latest tag should
> point to.
> >>>
> >>> Are there any concerns with this proposal?
> >>>
> >>> Thanks,
> >>> Hannah
> >>>
> >>>
> >>>
> >>>
> >>> On Fri, Jan 10, 2020 at 4:19 PM Ahmet Altay 
> wrote:
> 
> 
> 
>  On Fri, Jan 10, 2020 at 3:33 PM Ahmet Altay 
> wrote:
> >
> >
> >
> > On Fri, Jan 10, 2020 at 3:32 PM Ankur Goenka 
> wrote:
> >>
> >> Also curious to know if apache provides any infra support for
> projects 

Re: daily dataflow job failing today

2020-02-13 Thread Valentyn Tymofieiev
Thank you, Ismaël. Good to know Avro doesn't follow semantic versioning.
Replied on the PR.

On Thu, Feb 13, 2020 at 5:24 AM Ismaël Mejía  wrote:

> For info Avro has published a new version 1.9.2.1 that fixes the issue:
> https://issues.apache.org/jira/browse/AVRO-2737
>
> I just submitted a PR to make the dependency consistent with Avro
> versioning and
> verify that everything works as intended with the upgraded dependency on
> the
> python SDK. Can you PTAL?
> https://github.com/apache/beam/pull/10851
>
>
> On Thu, Feb 13, 2020 at 9:39 AM Ismaël Mejía  wrote:
>
>>
>> > I can argue for not pinning and bounding with major version ranges.
>> This gives flexibility to users to mix other third party libraries that
>> share common dependencies with Beam. Our expectation is that dependencies
>> follow semantic versioning and do not introduce breaking changes unless
>> there is a major version change. A good example of this is Beam's
>> dependency on "pytz>=2018.3". It is a simple wrapper around a time zone
>> file. Latest version of the dependency is 2019.3, it is updated a few times
>> a year. Beam users do not have to update Beam just to be able to use a
>> later version of it since Beam does not pin it.
>>
>> Avro does not follow semantic versioning (the first number corresponds to
>> the version of the Avro binary format the release is compatible with, the
>> second corresponds to the MAJOR and the third to the MINOR in semver), so we
>> should then fix the upper bound to 1.10.0 instead of 2.0.0, considering that
>> 1.10.x is expected before the summer and may contain breaking changes.
>>
>> > There is also a middle ground, where we can pin certain dependencies if
>> we are not confident about their releases. And allow ranges for rest of the
>> dependencies. In general, we are currently following this practice.
>>
>> I see your point, like many things in software it is all about tradeoffs,
>> and it is good to find a middle ground, do we have a robust reproducible
>> release experience, or do we deal with the annoyance of doing manual minor
>> version upgrades. Choices choices...
>>
>>
>>
>>
>> On Thu, Feb 13, 2020 at 2:26 AM Ahmet Altay  wrote:
>>
>>>
>>>
>>> On Wed, Feb 12, 2020 at 12:54 PM Ismaël Mejía  wrote:
>>>
 Independently of the bug in the dependency release, the fact that the
 Beam Python
 SDK does not have pinned, fixed dependency versions is error-prone. We may
 continue to have this kind of problem until we fix this (with other
 dependencies too). In the Java SDK we do not accept such dynamic
 dependency versions, and Python should probably follow this practice to
 avoid
 issues like the present one.

 Why don't we just do:

 'avro-python3==1.9.1',

 instead of the current:

 'avro-python3>=1.8.1,!=1.9.2,<2.0.0; python_version >= "3.0"',

>>>
>>> I agree this is error prone. Your argument for pinning makes sense and I
>>> agree with it.
>>>
>>> I can argue for not pinning and bounding with major version ranges. This
>>> gives flexibility to users to mix other third party libraries that share
>>> common dependencies with Beam. Our expectation is that dependencies follow
>>> semantic versioning and do not introduce breaking changes unless there is a
>>> major version change. A good example of this is Beam's dependency on
>>> "pytz>=2018.3". It is a simple wrapper around a time zone file. Latest
>>> version of the dependency is 2019.3, it is updated a few times a year. Beam
>>> users do not have to update Beam just to be able to use a later version of
>>> it since Beam does not pin it.
>>>
>>> There is also a middle ground, where we can pin certain dependencies if
>>> we are not confident about their releases. And allow ranges for rest of the
>>> dependencies. In general, we are currently following this practice.
>>>
>>>


 On Wed, Feb 12, 2020 at 9:14 PM Ahmet Altay  wrote:

> Related: we have dependencies on avro, avro-python3, and fastavro.
> fastavro supports both python 2 and 3. Could we reduce this dependency 
> list
> and depend only on fastavro? If we need avro and avro-python3 for the
> purposes of testing only, we can move them to test only dependencies.
>
> +Chamikara Jayalath , because I vaguely
> remember him working on this.
>
> The reason I am calling for this is that the impact of bad dependency
> releases is high. All previously released Beam versions will be impacted.
> Reducing the dependency list will reduce the risk.
>
> Ahmet
>
> On Wed, Feb 12, 2020 at 12:02 PM Ahmet Altay  wrote:
>
>> Thank you Valentyn!
>>
>> On Wed, Feb 12, 2020 at 11:32 AM Valentyn Tymofieiev <
>> valen...@google.com> wrote:
>>
>>> Yes, otherwise all Python tests will continue to fail until Avro
>>> comes up with a new release. Sent:
>>> https://github.com/apache/beam/pull/10844
>>>
>>> On Wed, Feb 12, 2020 at 

Re: FnAPI proto backwards compatibility

2020-02-13 Thread Luke Cwik
On Wed, Feb 12, 2020 at 2:24 PM Kenneth Knowles  wrote:

>
>
> On Wed, Feb 12, 2020 at 12:04 PM Robert Bradshaw 
> wrote:
>
>> On Wed, Feb 12, 2020 at 11:08 AM Luke Cwik  wrote:
>> >
>> > We can always detect on the runner/SDK side whether there is an unknown
>> field[1] within a payload and fail to process it but this is painful in two
>> situations:
>> > 1) It doesn't provide for a good error message since you can't say what
>> the purpose of the field is. With a capability URN, the runner/SDK could
>> say which URN it doesn't understand.
>> > 2) It doesn't allow for the addition of fields which don't impact
>> semantics of execution. For example, if the display data feature was being
>> developed, a runner could ignore it and still execute the pipeline
>> correctly.
>>
>> Yeah, I don't think proto reflection is a flexible enough tool to do
>> this well either.
>>
>> > If we think this to be common enough, we can add capabilities list to
>> the PTransform so each PTransform can do this and has a natural way of
>> being extended for additions which are forwards compatible. The alternative
>> to having capabilities on PTransform (and other constructs) is that we
>> would have a new URN when the specification of the transform changes. For
>> forwards compatible changes, each SDK/runner would map older versions of
>> the URN onto the latest and internally treat it as the latest version but
>> always downgrade it to the version the other party expects when
>> communicating with it. Backwards incompatible changes would always require
>> a new URN which capabilities at the PTransform level would not help with.
>>
>> As you point out, stateful+splittable may not be a particularly useful
>> combination, but as another example, we have
>> (backwards-incompatible-when-introduced) markers on DoFn as to whether
>> it requires finalization, stable inputs, and now time sorting. I don't
>> think we should have a new URN for each combination.
>>
>
> Agree with this. I don't think stateful, splittable, and "plain" ParDo are
> comparable to these. Each is an entirely different computational paradigm:
> per-element independent processing, per-key-and-window linear processing,
> and per-element-and-restriction splittable processing. Most relevant IMO is
> the nature of the parallelism. If you added state to splittable processing,
> it would still be splittable processing. Just as Combine and ParDo can
> share the SideInput specification, it is easy to share relevant
> sub-structures like state declarations. But it is a fair point that the
> ability to split can be ignored and run as a plain-old ParDo. It brings up
> the question of whether a runner that doesn't know SDF should have to
> reject it or should be allowed to run it poorly.
>

Being splittable means that the SDK could choose to return a continuation
saying "please process the rest of my element in X amount of time", which
would require the runner to inspect certain fields on responses. One
example would be "I don't have many more messages to read from this message
stream at the moment"; another could be "I detected that this
filesystem is throttling me or is down and I would like to resume
processing later."
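
As a purely illustrative sketch of the capability-URN idea discussed above
(the URN strings and the supported set below are invented, not real Beam
URNs), the runner-side check could be as simple as:

    # Hypothetical runner-side capability validation; URNs are invented
    # for illustration only.
    SUPPORTED_CAPABILITIES = {
        "beam:capability:example_a:v1",
        "beam:capability:example_b:v1",
    }

    def validate_capabilities(declared_capabilities):
        # Fail with a descriptive error naming the unknown URNs instead of
        # silently ignoring payload fields the runner does not understand.
        unknown = set(declared_capabilities) - SUPPORTED_CAPABILITIES
        if unknown:
            raise ValueError(
                "Runner does not understand capability URNs: %s" % sorted(unknown))

    validate_capabilities(["beam:capability:example_a:v1"])  # accepted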


> It isn't a huge deal. Three different top-level URNS versus three
> different sub-URNs will achieve the same result in the end if we get this
> "capability" thing in place.
>
> Kenn
>
>
>>
>> >> > I do think that splittable ParDo and stateful ParDo should have
>> separate PTransform URNs since they are different paradigms than "vanilla"
>> ParDo.
>> >>
>> >> Here I disagree. What about one that is both splittable and stateful?
>> Would one have a fourth URN for that? If/when another flavor of DoFn comes
>> out, would we then want 8 distinct URNs? (SplitableParDo in particular can
>> be executed as a normal ParDo as long as the output is bounded.)
>> >
>> > I agree that you could have stateful and splittable dofns where the
>> element is the key and you share state and timers across restrictions. No
>> runner is capable of executing this efficiently.
>> >
>> >> >> > On the SDK requirements side: the constructing SDK owns the
>> Environment proto completely, so it is in a position to ensure the involved
>> docker images support the necessary features.
>> >> >>
>> >> >> Yes.
>> >
>> >
>> > I believe capabilities do exist on a Pipeline and it informs runners
>> about new types of fields to be aware of either within Components or on the
>> Pipeline object itself but for this discussion it makes sense that an
>> environment would store most "capabilities" related to execution.
>> >
>> >> [snip]
>> >
>> > As for the proto clean-ups, the scope is to cover almost all things
>> needed for execution now and to follow up with optional transforms,
>> payloads, and coders later, which would exclude job management APIs and
>> artifact staging. A formal enumeration would be useful here. Also, we
>> should provide formal guidance about adding new fields, adding new 

Re: big data blog

2020-02-13 Thread Etienne Chauchot

Hi all,

I just sent the link to the blog articles on @ApacheBeam twitter as Kenn 
suggested.


Etienne

On 10/02/2020 10:01, Etienne Chauchot wrote:


Yes sure,

Here is the link to the spreadsheet for review of the tweet: 
https://docs.google.com/spreadsheets/d/1mz36njTtn1UJwDF50GdqyZVbX_F0n_A6eMYcxsktpSM/edit#gid=1413052381


thanks all for your encouragement !

Best

Etienne

On 08/02/2020 08:09, Kenneth Knowles wrote:
Nice! Yes, I think we should promote insightful Beam 
articles from a longtime contributor.


Etienne - can you add twitter announcements/retweets to the social 
media spreadsheet when you write new articles?


Kenn

On Fri, Feb 7, 2020 at 5:44 PM Ahmet Altay  wrote:


Cool, thank you. Would it make sense to promote Beam related
posts on our twitter channel?

On Fri, Feb 7, 2020 at 2:47 PM Pablo Estrada  wrote:

Very nice. Thanks for sharing Etienne!

On Fri, Feb 7, 2020 at 2:19 PM Reuven Lax  wrote:

Cool!

On Fri, Feb 7, 2020 at 7:24 AM Etienne Chauchot  wrote:

Hi all,

FYI, I just started a blog around big data
technologies and for now it
is focused on Beam.

https://echauchot.blogspot.com/

Feel free to comment, suggest or anything.

Etienne



Re: daily dataflow job failing today

2020-02-13 Thread Ismaël Mejía
For info Avro has published a new version 1.9.2.1 that fixes the issue:
https://issues.apache.org/jira/browse/AVRO-2737

I just submitted a PR to make the dependency consistent with Avro
versioning and
verify that everything works as intended with the upgraded dependency on
the
python SDK. Can you PTAL?
https://github.com/apache/beam/pull/10851


On Thu, Feb 13, 2020 at 9:39 AM Ismaël Mejía  wrote:

>
> > I can argue for not pinning and bounding with major version ranges. This
> gives flexibility to users to mix other third party libraries that share
> common dependencies with Beam. Our expectation is that dependencies follow
> semantic versioning and do not introduce breaking changes unless there is a
> major version change. A good example of this is Beam's dependency on
> "pytz>=2018.3". It is a simple wrapper around a time zone file. Latest
> version of the dependency is 2019.3, it is updated a few times a year. Beam
> users do not have to update Beam just to be able to use a later version of
> it since Beam does not pin it.
>
> Avro does not follow semantic versioning (the first number corresponds to
> the version of the Avro binary format the release is compatible with, the
> second corresponds to the MAJOR and the third to the MINOR in semver), so we
> should then fix the upper bound to 1.10.0 instead of 2.0.0, considering that
> 1.10.x is expected before the summer and may contain breaking changes.
>
> > There is also a middle ground, where we can pin certain dependencies if
> we are not confident about their releases. And allow ranges for rest of the
> dependencies. In general, we are currently following this practice.
>
> I see your point, like many things in software it is all about tradeoffs,
> and it is good to find a middle ground, do we have a robust reproducible
> release experience, or do we deal with the annoyance of doing manual minor
> version upgrades. Choices choices...
>
>
>
>
> On Thu, Feb 13, 2020 at 2:26 AM Ahmet Altay  wrote:
>
>>
>>
>> On Wed, Feb 12, 2020 at 12:54 PM Ismaël Mejía  wrote:
>>
>>> Independently of the bug in the dependency release, the fact that the
>>> Beam Python
>>> SDK does not have pinned, fixed dependency versions is error-prone. We may
>>> continue to have this kind of problem until we fix this (with other
>>> dependencies too). In the Java SDK we do not accept such dynamic
>>> dependency versions, and Python should probably follow this practice to
>>> avoid
>>> issues like the present one.
>>>
>>> Why don't we just do:
>>>
>>> 'avro-python3==1.9.1',
>>>
>>> instead of the current:
>>>
>>> 'avro-python3>=1.8.1,!=1.9.2,<2.0.0; python_version >= "3.0"',
>>>
>>
>> I agree this is error prone. Your argument for pinning makes sense and I
>> agree with it.
>>
>> I can argue for not pinning and bounding with major version ranges. This
>> gives flexibility to users to mix other third party libraries that share
>> common dependencies with Beam. Our expectation is that dependencies follow
>> semantic versioning and do not introduce breaking changes unless there is a
>> major version change. A good example of this is Beam's dependency on
>> "pytz>=2018.3". It is a simple wrapper around a time zone file. Latest
>> version of the dependency is 2019.3, it is updated a few times a year. Beam
>> users do not have to update Beam just to be able to use a later version of
>> it since Beam does not pin it.
>>
>> There is also a middle ground, where we can pin certain dependencies if
>> we are not confident about their releases. And allow ranges for rest of the
>> dependencies. In general, we are currently following this practice.
>>
>>
>>>
>>>
>>> On Wed, Feb 12, 2020 at 9:14 PM Ahmet Altay  wrote:
>>>
 Related: we have dependencies on avro, avro-python3, and fastavro.
 fastavro supports both python 2 and 3. Could we reduce this dependency list
 and depend only on fastavro? If we need avro and avro-python3 for the
 purposes of testing only, we can move them to test only dependencies.

 +Chamikara Jayalath , because I vaguely remember
 him working on this.

 The reason I am calling for this is that the impact of bad dependency
 releases is high. All previously released Beam versions will be impacted.
 Reducing the dependency list will reduce the risk.

 Ahmet

 On Wed, Feb 12, 2020 at 12:02 PM Ahmet Altay  wrote:

> Thank you Valentyn!
>
> On Wed, Feb 12, 2020 at 11:32 AM Valentyn Tymofieiev <
> valen...@google.com> wrote:
>
>> Yes, otherwise all Python tests will continue to fail until Avro
>> comes up with a new release. Sent:
>> https://github.com/apache/beam/pull/10844
>>
>> On Wed, Feb 12, 2020 at 11:08 AM Ahmet Altay 
>> wrote:
>>
>>> Should we update Beam's setup.py to skip this avro-python3 version?
>>>
>>> On Wed, Feb 12, 2020 at 10:57 AM Alan Krumholz <
>>> alan.krumh...@betterup.co> wrote:
>>>
 makes sense. I'll add this workaround 

Re: A new reworked Elasticsearch 7+ IO module

2020-02-13 Thread Etienne Chauchot

Hi Cham, thanks for your comments !

I just sent an email to user ML with a survey link to count ES uses per 
version:


https://lists.apache.org/thread.html/rc8185afb8af86a2a032909c13f569e18bd89e75a5839894d5b5d4082%40%3Cuser.beam.apache.org%3E

Best

Etienne

On 10/02/2020 19:46, Chamikara Jayalath wrote:



On Thu, Feb 6, 2020 at 8:13 AM Etienne Chauchot  wrote:


Hi,

please see my comments inline

On 06/02/2020 16:24, Alexey Romanenko wrote:

Please, see my comments inline.


On 6 Feb 2020, at 10:50, Etienne Chauchot  wrote:



1. regarding version support: ES v2 has not been
maintained by Elastic since 2018/02, so we plan to
remove it from the IO. In the past we have already retired
versions (like spark 1.6 for instance).



My only concern here is that there might be users who use
the existing module who might not be able to easily upgrade
the Beam version if we remove it. But given that V2 is 5
versions behind the latest release this might be OK.


It seems we have a consensus on this.
I think there should be another general discussion on the long-term
support of our preferred tool IO modules.


=> yes, consensus, let's drop ESV2


We had (and still have) a similar problem with KafkaIO to support
different versions of Kafka, especially very old version 0.9. We
raised this question on user@ and it appears that there are users
who for some reason still use old Kafka versions. So, before
dropping support for any ES version, I’d suggest asking on
user@ and seeing if any people will be affected by this.

Yes, we can do a survey among users, but the question is: should we
support an ES version that is no longer supported by Elastic
themselves?


+1 for asking in the user list. I guess this is more about whether 
users need this specific version that we hope to drop support for. 
Whether we need to support unsupported versions is a more generic 
question that should prob. be addressed in the dev list. (and I 
personally don't think we should unless there's a large enough user 
base for a given version).



2. regarding the user: the aim is to unlock some new
features (listed by Ludovic) and give the user more
flexibility over their requests. For that, we need to
use the high-level java ES client in place of the
low-level REST client (which was used because it is the
only one compatible with all ES versions). We plan to
replace the API (json document in and out) with more
complete standard ES objects that contain the request
logic (insert/update, doc routing etc...) and the
data. There are already IOs like SpannerIO that use
similar objects in the input PCollection rather than pure
POJOs.



Won't this be a breaking change for all users? IMO using
POJOs in PCollections is safer since otherwise we have to worry
about changes to the underlying client library API. The exception
would be when the underlying client library offers a backwards
compatibility guarantee that we can rely on for the
foreseeable future (for example, BQ TableRow).


Agreed but actually, there will be POJOs in order to abstract
Elasticsearch's version support. The following third point
explains this.


=> indeed it will be a breaking change, hence this email to get
a consensus on that. Also I think our wrappers of ES request
objects will offer backward compatibility, as the underlying objects do


I just want to remind everyone that according to what we agreed some time
ago on dev@ (at least, for IOs), all breaking user API changes
have to be added along with deprecation of the old API, which can be
removed after 3 consecutive Beam releases. This way, users
will have time to move to the new API smoothly.


We are mostly discussing the target architecture of the new module
here, but the process of deprecation is important to recall, I
agree. When I say the DTOs are backward compatible above, I mean between
per-version sub-modules inside the new module. Anyway, sure, for
some time, both modules (the old REST-based one that supports v2-7 and
the new one that supports v5-7) will coexist and the old one will
receive the deprecation annotations.


+1 for supporting both versions for at least three minor versions to 
give users time to migrate. Also, we should try to produce a warning 
for users who use the deprecated versions.


Thanks,
Cham

Best

Etienne






Re: Python2.7 Beam End-of-Life Date

2020-02-13 Thread Ismaël Mejía
> I would suggest re-evaluating this within the next 3 months again. We
need to balance between user pain/contributor pain/our ability to
continuously test with python 2 in a shifting environment.

Good idea on re-evaluating in 3 months; at that point distributions
will probably also be phasing out python2 by default, which will definitely
help in this direction.
Thanks for updating the roadmap, Ahmet.


On Thu, Feb 13, 2020 at 2:49 AM Ahmet Altay  wrote:

>
>
> On Wed, Feb 12, 2020 at 1:29 AM Ismaël Mejía  wrote:
>
>> I am with Chad on this, we should probably extend it a bit more. Even if
>> it
>> makes us struggle a bit, at least we have some workarounds as Robert
>> suggests,
>> and as Chad said there are still many people playing the python 3 catchup
>> game,
>> so it is worth supporting those users.
>>
>
>> But maybe it is worth to evaluate the current state later in the year.
>>
>
> I would suggest re-evaluating this within the next 3 months again. We need
> to balance between user pain/contributor pain/our ability to
> continuously test with python 2 in a shifting environment.
>
>
>> In the
>> meantime can someone please update our Roadmap in the website with this
>> info and
>> where we are with Python 3 support (it looks not up to date).
>> https://beam.apache.org/roadmap/
>>
>
> I made a minor change to update that page (
> https://github.com/apache/beam/pull/10848). A more comprehensive update
> to that page and linked (
> https://beam.apache.org/roadmap/python-sdk/#python-3-support) would still
> be welcome.
>
>
>>
>> - Ismaël
>>
>>
>> On Tue, Feb 4, 2020 at 10:49 PM Robert Bradshaw 
>> wrote:
>>
>>>  On Tue, Feb 4, 2020 at 12:12 PM Chad Dombrova 
>>> wrote:
>>> >>
>>> >>  Not to mention that all the nice work for the type hints will have
>>> to be redone for 3.x.
>>> >
>>> > Note that there's a tool for automatically converting type comments to
>>> annotations: https://github.com/ilevkivskyi/com2ann
>>> >
>>> > So don't let that part bother you.
>>>
>>> +1, I wouldn't worry about what can be easily automated.
>>>
>>> > I'm curious what other features you'd like to be using in the Beam
>>> source that you cannot now.
>>>
>>> I hit things occasionally, e.g. I just ran into wanting keyword-only
>>> arguments the other day.
>>>
>>> >> It seems the faster we drop support the better.
>>> >
>>> >
>>> > I've already gone over my position on this, but a refresher for those
>>> who care:  some of the key vendors that support my industry will not offer
>>> python3-compatible versions of their software until the 4th quarter of
>>> 2020.  If Beam switches to python3-only before that point we may be forced
>>> to stop contributing features (note: I'm the guy who added the type hints
>>> :).   Every month you can give us would be greatly appreciated.
>>>
>>> As another data point, we're still 80/20 on Py2/Py3 for downloads at
>>> PyPi [1] (which I've heard should be taken with a grain of salt, but
>>> likely isn't totally off). IMHO that ratio needs to be way higher for
>>> Python 3 to consider dropping Python 2. It's pretty noisy, but say it
>>> doubles every 3 months that would put us at least mid-year before we
>>> hit a cross-over point. On the other hand Q4 2020 is probably a
>>> stretch.
>>>
>>> We could consider whether it needs to be an all-or-nothing thing as
>>> well. E.g. perhaps some features could be Python 3 only sooner than
>>> the whole codebase. (This would have to be well justified.) Another
>>> mitigation is that it is possible to mix Python 2 and Python 3 in the
>>> same pipeline with portability, so if there's a library that you need
>>> for one DoFn it doesn't mean you have to hold back your whole
>>> pipeline.
>>>
>>> - Robert
>>>
>>> [1] https://pypistats.org/packages/apache-beam , and that 20% may just
>>> be a spike.
>>>
>>


Re: daily dataflow job failing today

2020-02-13 Thread Ismaël Mejía
> I can argue for not pinning and bounding with major version ranges. This
gives flexibility to users to mix other third party libraries that share
common dependencies with Beam. Our expectation is that dependencies follow
semantic versioning and do not introduce breaking changes unless there is a
major version change. A good example of this is Beam's dependency on
"pytz>=2018.3". It is a simple wrapper around a time zone file. Latest
version of the dependency is 2019.3, it is updated a few times a year. Beam
users do not have to update Beam just to be able to use a later version of
it since Beam does not pin it.

Avro does not follow semantic versioning (the first number corresponds to
the version of the Avro binary format the release is compatible with, the
second corresponds to the MAJOR and the third to the MINOR in semver), so we
should then fix the upper bound to 1.10.0 instead of 2.0.0, considering that
1.10.x is expected before the summer and may contain breaking changes.

> There is also a middle ground, where we can pin certain dependencies if
we are not confident about their releases. And allow ranges for rest of the
dependencies. In general, we are currently following this practice.

I see your point, like many things in software it is all about tradeoffs,
and it is good to find a middle ground, do we have a robust reproducible
release experience, or do we deal with the annoyance of doing manual minor
version upgrades. Choices choices...
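
For concreteness, the two options look roughly like this in setup.py terms
(the entries below are illustrative sketches, not a concrete proposal for the
actual bounds):

    # Illustrative sketches only -- not the actual Beam setup.py entries.
    # Option 1: pin exactly; reproducible, but every upgrade needs a Beam release.
    PINNED = ['avro-python3==1.9.1; python_version >= "3.0"']

    # Option 2: bounded range; the upper bound reflects Avro's versioning
    # scheme (1.10.0 rather than 2.0.0) and known-bad releases are excluded.
    RANGED = ['avro-python3>=1.8.1,!=1.9.2,<1.10.0; python_version >= "3.0"']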




On Thu, Feb 13, 2020 at 2:26 AM Ahmet Altay  wrote:

>
>
> On Wed, Feb 12, 2020 at 12:54 PM Ismaël Mejía  wrote:
>
>> Independently of the bug in the dependency release, the fact that the Beam
>> Python
>> SDK does not have pinned, fixed dependency versions is error-prone. We may
>> continue to have this kind of problem until we fix this (with other
>> dependencies too). In the Java SDK we do not accept such dynamic
>> dependency versions, and Python should probably follow this practice to
>> avoid
>> issues like the present one.
>>
>> Why don't we just do:
>>
>> 'avro-python3==1.9.1',
>>
>> instead of the current:
>>
>> 'avro-python3>=1.8.1,!=1.9.2,<2.0.0; python_version >= "3.0"',
>>
>
> I agree this is error prone. Your argument for pinning makes sense and I
> agree with it.
>
> I can argue for not pinning and bounding with major version ranges. This
> gives flexibility to users to mix other third party libraries that share
> common dependencies with Beam. Our expectation is that dependencies follow
> semantic versioning and do not introduce breaking changes unless there is a
> major version change. A good example of this is Beam's dependency on
> "pytz>=2018.3". It is a simple wrapper around a time zone file. Latest
> version of the dependency is 2019.3, it is updated a few times a year. Beam
> users do not have to update Beam just to be able to use a later version of
> it since Beam does not pin it.
>
> There is also a middle ground, where we can pin certain dependencies if we
> are not confident about their releases. And allow ranges for rest of the
> dependencies. In general, we are currently following this practice.
>
>
>>
>>
>> On Wed, Feb 12, 2020 at 9:14 PM Ahmet Altay  wrote:
>>
>>> Related: we have dependencies on avro, avro-python3, and fastavro.
>>> fastavro supports both python 2 and 3. Could we reduce this dependency list
>>> and depend only on fastavro? If we need avro and avro-python3 for the
>>> purposes of testing only, we can move them to test only dependencies.
>>>
>>> +Chamikara Jayalath , because I vaguely remember
>>> him working on this.
>>>
>>> The reason I am calling for this is that the impact of bad dependency
>>> releases is high. All previously released Beam versions will be impacted.
>>> Reducing the dependency list will reduce the risk.
>>>
>>> Ahmet
>>>
>>> On Wed, Feb 12, 2020 at 12:02 PM Ahmet Altay  wrote:
>>>
 Thank you Valentyn!

 On Wed, Feb 12, 2020 at 11:32 AM Valentyn Tymofieiev <
 valen...@google.com> wrote:

> Yes, otherwise all Python tests will continue to fail until Avro comes
> up with a new release. Sent: https://github.com/apache/beam/pull/10844
>
> On Wed, Feb 12, 2020 at 11:08 AM Ahmet Altay  wrote:
>
>> Should we update Beam's setup.py to skip this avro-python3 version?
>>
>> On Wed, Feb 12, 2020 at 10:57 AM Alan Krumholz <
>> alan.krumh...@betterup.co> wrote:
>>
>>> makes sense. I'll add this workaround for now.
>>> Thanks so much for your help!
>>>
>>> On Wed, Feb 12, 2020 at 10:33 AM Valentyn Tymofieiev <
>>> valen...@google.com> wrote:
>>>
 Alan, Dataflow workers preinstall Beam SDK dependencies, including
 a working version of avro-python3. So after reading your email once
 again, I think in your case you were not able to install Beam SDK 
 locally.
 So a workaround for you would be to `pip install avro-python3==1.9.1` 
 or
 `pip install pycodestyle`  before