Re: [CANCEL][VOTE] Release 2.3.0, release candidate #1

2018-02-05 Thread Reuven Lax
Could this be related to any of the portability changes?

On Mon, Feb 5, 2018 at 7:51 AM, Jean-Baptiste Onofré wrote:

> Created:
>
> https://issues.apache.org/jira/browse/BEAM-3617
>
> Regards
> JB
>
> On 02/05/2018 04:42 PM, Kenneth Knowles wrote:
> > What is the Jira for direct runner perf?
> >
> > On Mon, Feb 5, 2018 at 4:35 AM, Jean-Baptiste Onofré wrote:
> >
> > Thanks !
> >
> > I cherry-pick on release-2.3.0 branch.
> >
> > I'm on the direct runner perf test in the mean time.
> >
> > Thanks again !
> >
> > Regards
> > JB
> >
> > On 02/05/2018 12:06 PM, Aljoscha Krettek wrote:
> > > I merged fixes for:
> > >  - https://issues.apache.org/jira/browse/BEAM-3186
> > >  - https://issues.apache.org/jira/browse/BEAM-3589
> > >
> > > @JB I didn't yet merge them on the 2.3.0 branch, though, but I can or you can go
> > > ahead.
> > >
> > >> On 5. Feb 2018, at 06:30, Jean-Baptiste Onofré wrote:
> > >>
> > >> Hi guys,
> > >>
> > >> Quick update on the RC2 preparation:
> > >>
> > >> * BEAM-3587 (TextIO with Flink) seems related to a custom build with Gradle (not
> > >> using artifacts created by Maven). Anyway, I will take a look today.
> > >> * BEAM-3186 has a PR. Aljoscha will do the review pretty soon.
> > >> * I'm also taking a look on the direct runner performance change.
> > >>
> > >> The goal is to cut RC2 tomorrow or the day after.
> > >>
> > >> Regards
> > >> JB
> > >>
> > >> On 02/01/2018 04:16 PM, Jean-Baptiste Onofré wrote:
> > >>> Hi guys,
> > >>>
> > >>> Especially due to BEAM-3587 & BEAM-3186 regressions, I cancel RC1.
> > >>>
> > >>> We will cherry-pick fixes on release-2.3.0 branch.
> > >>>
> > >>> I'm updating Jira right now. When the fixes will be cherry-picked, I will submit
> > >>> a RC2 to vote.
> > >>>
> > >>> Thanks !
> > >>> Regards
> > >>> JB
> > >>>
> > >>> On 01/30/2018 09:04 AM, Jean-Baptiste Onofré wrote:
> >  Hi everyone,
> > 
> >  Please review and vote on the release candidate #1 for the version 2.3.0, as
> >  follows:
> > 
> >  [ ] +1, Approve the release
> >  [ ] -1, Do not approve the release (please provide specific comments)
> > 
> > 
> >  The complete staging area is available for your review, which includes:
> >  * JIRA release notes [1],
> >  * the official Apache source release to be deployed to dist.apache.org [2],
> >  which is signed with the key with fingerprint C8282E76 [3],
> >  * all artifacts to be deployed to the Maven Central Repository [4],
> >  * source code tag "v2.3.0-RC1" [5],
> >  * website pull request listing the release and publishing the API reference
> >  manual [6].
> >  * Java artifacts were built with Maven 3.3.9 and Oracle JDK 1.8.0_111.
> >  * Python artifacts are deployed along with the source release to the
> >  dist.apache.org [2].
> > 
> >  The vote will be open for at least 72 hours. It is adopted by majority approval,
> >  with at least 3 PMC affirmative votes.
> > 
> >  Thanks,
> >  JB
> > 
> >  [1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12341608
> >  [2] https://dist.apache.org/repos/dist/dev/beam/2.3.0/
> >  [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> >  [4] https://repository.apache.org/content/repositories/orgapachebeam-1026/
> >  [5] https://github.com/apache/beam/tree/v2.3.0-RC1
> >  [6] https://github.com/apache/beam-site/pull/381
> > >>>
> > >>
> > >> --
> > >> Jean-Baptiste Onofré
> > >> jbono...@apache.org
> > >> http://blog.nanthrax.net
> > >> Talend - http://www.talend.com

Re: Schema-Aware PCollections revisited

2018-02-05 Thread Reuven Lax
On Mon, Feb 5, 2018 at 9:06 PM, Kenneth Knowles  wrote:

> Joining late, but very interested. Commented on the doc. Since there's a
> forked discussion between doc and thread, I want to say this on the thread:
>
> 1. I have used JSON schema in production for describing the structure of
> analytics events and it is OK but not great. If you are sure your data is
> only JSON, use it. For Beam the hierarchical structure is meaningful while
> the atomic pieces should be existing coders. When we integrate with SQL
> that can get more specific.
>

Even if your input data is JSON, you probably don't want Beam's internal
representation to be JSON. Experience shows that this can increase the cost
of a pipeline by an order of magnitude, and in fact this is one of the reasons
we removed source coders (users would accidentally set a JSON coder
throughout their pipeline, causing major problems).


>
> 2. Overall, I found the discussion and doc a bit short on use cases. I can
> propose a few:
>

Good call - I'll add a use-cases section.


>
>  - incoming topic of events from clients (at various levels of upgrade /
> schema adherence)
>  - async update of client and pipeline in the above
>  - archive of files that parse to a POJO of known schema, or archive of
> all of the above
>  - SQL integration / columnar operation with all of the above
>  - autogenerated UI integration with all of the above
>
> My impression is that the design will nail SQL integration and
> autogenerated UI but will leave compatibility/evolution concerns for later.
> IMO this is smart as they are much harder.
>

If we care about streaming pipelines, we need some degree of evolution
support (at least "unknown-field" support).
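
For concreteness, a minimal sketch of what "unknown-field" support could look
like - the class and method names here are hypothetical, not an existing Beam API:

  import java.util.LinkedHashMap;
  import java.util.Map;

  // Hypothetical record type: known fields are typed, unknown fields are
  // kept as raw bytes so a pipeline built against schema v1 can pass
  // through fields added in schema v2 without dropping them.
  public final class EvolvableRecord {
    private final Map<String, Object> knownFields = new LinkedHashMap<>();
    private final Map<String, byte[]> unknownFields = new LinkedHashMap<>();

    public void setKnown(String name, Object value) { knownFields.put(name, value); }
    public Object getKnown(String name) { return knownFields.get(name); }

    // Fields whose names are not in this pipeline's schema are stashed
    // verbatim at decode time and re-emitted unchanged at encode time.
    public void stashUnknown(String name, byte[] encodedValue) {
      unknownFields.put(name, encodedValue);
    }
    public Map<String, byte[]> unknownFields() { return unknownFields; }
  }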


>
> Kenn
>
> On Mon, Feb 5, 2018 at 1:55 PM, Romain Manni-Bucau wrote:
>
>> None, Json-p - the spec so no strong impl requires - as record API and a
>> custom light wrapping for schema - like https://github.com/Talend
>> /component-runtime/blob/master/component-form/component-
>> form-model/src/main/java/org/talend/sdk/component/form/
>> model/jsonschema/JsonSchema.java (note this code is used for something
>> else) or a plain JsonObject which should be sufficient.
>>
>> side note: Apache Johnzon would probably be happy to host an enriched
>> schema module based on jsonp if you feel it better this way.
>>
>>
>> On 5 Feb 2018 21:43, "Reuven Lax" wrote:
>>
>> Which json library are you thinking of? At least in Java, there's always
>> been a problem of no good standard Json library.
>>
>>
>>
>> On Mon, Feb 5, 2018 at 12:03 PM, Romain Manni-Bucau <
>> rmannibu...@gmail.com> wrote:
>>
>>>
>>>
>>> On 5 Feb 2018 19:54, "Reuven Lax" wrote:
>>>
>>> multiplying by 1.0 doesn't really solve the right problems. The number
>>> type used by Javascript (and by extension, they standard for json) only has
>>> 53 bits of precision. I've seen many, many bugs caused because of this -
>>> the input data may easily contain numbers too large for 53 bits.
>>>
>>>
>>> You have alternative than string at the end whatever schema you use so
>>> not sure it is an issue. At least if runtime is in java or mainstream
>>> languages.
>>>
>>>
>>>
>>> In addition, Beam's schema representation must be no less general than
>>> other common representations. For the case of an ETL pipeline, if input
>>> fields are integers the output fields should also be numbers. We shouldn't
>>> turn them into floats because the schema class we used couldn't distinguish
>>> between ints and floats. If anything, Avro schemas are a better fit here as
>>> they are more general.
>>>
>>>
>>> This is what previous definition does. Avro are not better for 2 reasons:
>>>
>>> 1. Their dep stack is a clear blocker and please dont even speak of yet
>>> another uncontrolled shade in the API. Until avro become an api only and
>>> not an impl this is a bad fit for beam.
>>> 2. They must be json friendly so you are back on json + metada so
>>> jsonschema+extension entry is strictly equivalent and as typed
>>>
>>>
>>>
>>> Reuven
>>>
>>> On Sun, Feb 4, 2018 at 9:31 AM, Romain Manni-Bucau <
>>> rmannibu...@gmail.com> wrote:
>>>
 You can handle integers using multipleOf: 1.0 IIRC.
 Yes limitations are still here but it is a good starting model and to
 be honest it is good enough - not a single model will work good enough even
 if you can go a little bit further with other models a bit more complex.
 That said the idea is to enrich the model with a beam object which
 would allow to complete the metadata as required when needed (never?).



 Romain Manni-Bucau
 @rmannibucau  |  Blog
  | Old Blog
  | Github
  | LinkedIn
  | Book
 

Re: Schema-Aware PCollections revisited

2018-02-05 Thread Romain Manni-Bucau
I would add a use case: a single serialization mechanism across a pipeline.
JSON allows handling generic records (JsonObject) as well as POJO
serialization, and both are compatible. Compared to Avro's built-in mechanism,
it is not intrusive in the models, which is a key feature of an API. It also
increases portability with other languages and simplifies the cluster
setup/maintenance of streams, and development - keep in mind people can
(and do) use Beam without the portable API, which has been so intrusive lately
too.

It also joins the API-driven world where we live now - and that will not change
soon ;).

On 6 Feb 2018 06:06, "Kenneth Knowles" wrote:

Joining late, but very interested. Commented on the doc. Since there's a
forked discussion between doc and thread, I want to say this on the thread:

1. I have used JSON schema in production for describing the structure of
analytics events and it is OK but not great. If you are sure your data is
only JSON, use it. For Beam the hierarchical structure is meaningful while
the atomic pieces should be existing coders. When we integrate with SQL
that can get more specific.

2. Overall, I found the discussion and doc a bit short on use cases. I can
propose a few:

 - incoming topic of events from clients (at various levels of upgrade /
schema adherence)
 - async update of client and pipeline in the above
 - archive of files that parse to a POJO of known schema, or archive of all
of the above
 - SQL integration / columnar operation with all of the above
 - autogenerated UI integration with all of the above

My impression is that the design will nail SQL integration and
autogenerated UI but will leave compatibility/evolution concerns for later.
IMO this is smart as they are much harder.

Kenn

On Mon, Feb 5, 2018 at 1:55 PM, Romain Manni-Bucau wrote:

> None, Json-p - the spec so no strong impl requires - as record API and a
> custom light wrapping for schema - like https://github.com/Talend
> /component-runtime/blob/master/component-form/component-
> form-model/src/main/java/org/talend/sdk/component/form/
> model/jsonschema/JsonSchema.java (note this code is used for something
> else) or a plain JsonObject which should be sufficient.
>
> side note: Apache Johnzon would probably be happy to host an enriched
> schema module based on jsonp if you feel it better this way.
>
>
> On 5 Feb 2018 21:43, "Reuven Lax" wrote:
>
> Which json library are you thinking of? At least in Java, there's always
> been a problem of no good standard Json library.
>
>
>
> On Mon, Feb 5, 2018 at 12:03 PM, Romain Manni-Bucau wrote:
>
>>
>>
>> On 5 Feb 2018 19:54, "Reuven Lax" wrote:
>>
>> multiplying by 1.0 doesn't really solve the right problems. The number
>> type used by Javascript (and by extension, they standard for json) only has
>> 53 bits of precision. I've seen many, many bugs caused because of this -
>> the input data may easily contain numbers too large for 53 bits.
>>
>>
>> You have alternative than string at the end whatever schema you use so
>> not sure it is an issue. At least if runtime is in java or mainstream
>> languages.
>>
>>
>>
>> In addition, Beam's schema representation must be no less general than
>> other common representations. For the case of an ETL pipeline, if input
>> fields are integers the output fields should also be numbers. We shouldn't
>> turn them into floats because the schema class we used couldn't distinguish
>> between ints and floats. If anything, Avro schemas are a better fit here as
>> they are more general.
>>
>>
>> This is what previous definition does. Avro are not better for 2 reasons:
>>
>> 1. Their dep stack is a clear blocker and please dont even speak of yet
>> another uncontrolled shade in the API. Until avro become an api only and
>> not an impl this is a bad fit for beam.
>> 2. They must be json friendly so you are back on json + metada so
>> jsonschema+extension entry is strictly equivalent and as typed
>>
>>
>>
>> Reuven
>>
>> On Sun, Feb 4, 2018 at 9:31 AM, Romain Manni-Bucau wrote:
>>
>>> You can handle integers using multipleOf: 1.0 IIRC.
>>> Yes limitations are still here but it is a good starting model and to be
>>> honest it is good enough - not a single model will work good enough even if
>>> you can go a little bit further with other models a bit more complex.
>>> That said the idea is to enrich the model with a beam object which would
>>> allow to complete the metadata as required when needed (never?).
>>>
>>>
>>>
>>> Romain Manni-Bucau
>>> @rmannibucau  |  Blog
>>>  | Old Blog
>>>  | Github
>>>  | LinkedIn
>>>  | Book
>>> 
>>>
>>> 2018-02-04 

Re: [CANCEL][VOTE] Release 2.3.0, release candidate #1

2018-02-05 Thread Jean-Baptiste Onofré
Hi,

quick update about RC2: BEAM-3617 is the only Jira pending for the release.

I'm doing a git bisect to identify the commit that caused the performance
degradation.

Depending on the result, if it's an easy fix then we will try to do it for RC2;
otherwise I will start the RC2 now.

I will keep you posted.

Regards
JB

On 02/05/2018 04:51 PM, Jean-Baptiste Onofré wrote:
> Created:
> 
> https://issues.apache.org/jira/browse/BEAM-3617
> 
> Regards
> JB
> 
> On 02/05/2018 04:42 PM, Kenneth Knowles wrote:
>> What is the Jira for direct runner perf?
>>
>> On Mon, Feb 5, 2018 at 4:35 AM, Jean-Baptiste Onofré wrote:
>>
>> Thanks !
>>
>> I cherry-pick on release-2.3.0 branch.
>>
>> I'm on the direct runner perf test in the mean time.
>>
>> Thanks again !
>>
>> Regards
>> JB
>>
>> On 02/05/2018 12:06 PM, Aljoscha Krettek wrote:
>> > I merged fixes for:
>> >  - https://issues.apache.org/jira/browse/BEAM-3186
>> >  - https://issues.apache.org/jira/browse/BEAM-3589
>> >
>> > @JB I didn't yet merge them on the 2.3.0 branch, though, but I can or you can go
>> > ahead.
>> >
>> >> On 5. Feb 2018, at 06:30, Jean-Baptiste Onofré wrote:
>> >>
>> >> Hi guys,
>> >>
>> >> Quick update on the RC2 preparation:
>> >>
>> >> * BEAM-3587 (TextIO with Flink) seems related to a custom build with Gradle (not
>> >> using artifacts created by Maven). Anyway, I will take a look today.
>> >> * BEAM-3186 has a PR. Aljoscha will do the review pretty soon.
>> >> * I'm also taking a look on the direct runner performance change.
>> >>
>> >> The goal is to cut RC2 tomorrow or the day after.
>> >>
>> >> Regards
>> >> JB
>> >>
>> >> On 02/01/2018 04:16 PM, Jean-Baptiste Onofré wrote:
>> >>> Hi guys,
>> >>>
>> >>> Especially due to BEAM-3587 & BEAM-3186 regressions, I cancel RC1.
>> >>>
>> >>> We will cherry-pick fixes on release-2.3.0 branch.
>> >>>
>> >>> I'm updating Jira right now. When the fixes will be cherry-picked, I will submit
>> >>> a RC2 to vote.
>> >>>
>> >>> Thanks !
>> >>> Regards
>> >>> JB
>> >>>
>> >>> On 01/30/2018 09:04 AM, Jean-Baptiste Onofré wrote:
>>  Hi everyone,
>> 
>>  Please review and vote on the release candidate #1 for the version 2.3.0, as
>>  follows:
>> 
>>  [ ] +1, Approve the release
>>  [ ] -1, Do not approve the release (please provide specific comments)
>> 
>> 
>>  The complete staging area is available for your review, which includes:
>>  * JIRA release notes [1],
>>  * the official Apache source release to be deployed to dist.apache.org [2],
>>  which is signed with the key with fingerprint C8282E76 [3],
>>  * all artifacts to be deployed to the Maven Central Repository [4],
>>  * source code tag "v2.3.0-RC1" [5],
>>  * website pull request listing the release and publishing the API reference
>>  manual [6].
>>  * Java artifacts were built with Maven 3.3.9 and Oracle JDK 1.8.0_111.
>>  * Python artifacts are deployed along with the source release to the
>>  dist.apache.org [2].
>> 
>>  The vote will be open for at least 72 hours. It is adopted by majority approval,
>>  with at least 3 PMC affirmative votes.
>> 
>>  Thanks,
>>  JB
>> 
>>  [1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12341608
>>  [2] https://dist.apache.org/repos/dist/dev/beam/2.3.0/
>> 
>>  [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>> 
>>  [4] https://repository.apache.org/content/repositories/orgapachebeam-1026/
>> 
>>  [5] https://github.com/apache/beam/tree/v2.3.0-RC1
>> 
>>  [6] https://github.com/apache/beam-site/pull/381
>> 
>> 
>> >>>
>> >>
>> >> --
>> >> Jean-Baptiste Onofré
>> >> jbono...@apache.org 
>> 

Re: [DISCUSS] What to do about widespread KinesisIO breakage

2018-02-05 Thread Jean-Baptiste Onofré
Yes, I don't consider this a blocker, and I think it has been failing for a while ;)

Regards
JB

On 02/06/2018 06:22 AM, Kenneth Knowles wrote:
> Thanks JB (and Alexey)! If you are confident the failures are not release
> blockers, then that's great.
> 
> Kenn
> 
> On Mon, Feb 5, 2018 at 9:17 PM, Jean-Baptiste Onofré wrote:
> 
> Hi Kenn,
> 
> I have Alexey in my team who started to work on the write part of 
> KinesisIO. I
> will ask him to work on those issues.
> 
> I think it's OK to keep KinesisIO in the distribution and work on it in 
> the mean
> time.
> 
> Regards
> JB
> 
> On 02/06/2018 04:20 AM, Kenneth Knowles wrote:
> > The flaking of KinesisIO on both Maven and Gradle executions has become very
> > bad. Multiple methods are flaky, and we've collected these Critical-severity
> > tickets:
> >
> >     (looks like https://issues.apache.org/jira/browse/BEAM-3228 is fixed?)
> >     https://issues.apache.org/jira/browse/BEAM-3317
> >     https://issues.apache.org/jira/browse/BEAM-3598
> >     https://issues.apache.org/jira/browse/BEAM-3599
> >     https://issues.apache.org/jira/browse/BEAM-3605
> >
> > There don't appear to be immediate resources for addressing these so I have
> > taken these so they can be sickbayed until there are resources to de-flake them
> > and turn them on again.
> >
> >     https://github.com/apache/beam/pull/4605
> >
> > If you are interested in KinesisIO, please volunteer and fix them properly.
> >
> > The trouble is that the flakes may indicate serious problems with KinesisIO.
> > Should we stop shipping KinesisIO until this is fixed? I'm interested in hearing
> > from folks on other ASF project what they have done.
> >
> > It also raises another issue, which is that I think we would benefit from
> > granular Jira components for connectors so issues can be routed effectively.
> > Connectors with no owner will likely rot and we'll need a plan for what to do
> > when that happens.
> >
> > Kenn
> >
> 
> --
> Jean-Baptiste Onofré
> jbono...@apache.org 
> http://blog.nanthrax.net
> Talend - http://www.talend.com
> 
> 

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [DISCUSS] What to do about widespread KinesisIO breakage

2018-02-05 Thread Kenneth Knowles
Thanks JB (and Alexey)! If you are confident the failures are not release
blockers, then that's great.

Kenn

On Mon, Feb 5, 2018 at 9:17 PM, Jean-Baptiste Onofré wrote:

> Hi Kenn,
>
> I have Alexey in my team who started to work on the write part of
> KinesisIO. I
> will ask him to work on those issues.
>
> I think it's OK to keep KinesisIO in the distribution and work on it in
> the mean
> time.
>
> Regards
> JB
>
> On 02/06/2018 04:20 AM, Kenneth Knowles wrote:
> > The flaking of KinesisIO on both Maven and Gradle executions has become
> very
> > bad. Multiple methods are flaky, and we've collected these
> Critical-severity
> > tickets:
> >
> > (looks like https://issues.apache.org/jira/browse/BEAM-3228 is fixed?)
> > https://issues.apache.org/jira/browse/BEAM-3317
> > https://issues.apache.org/jira/browse/BEAM-3598
> > https://issues.apache.org/jira/browse/BEAM-3599
> > https://issues.apache.org/jira/browse/BEAM-3605
> >
> > There don't appear to be immediate resources for addressing these so I have
> > taken these so they can be sickbayed until there are resources to de-flake them
> > and turn them on again.
> >
> > https://github.com/apache/beam/pull/4605
> >
> > If you are interested in KinesisIO, please volunteer and fix them properly.
> >
> > The trouble is that the flakes may indicate serious problems with KinesisIO.
> > Should we stop shipping KinesisIO until this is fixed? I'm interested in hearing
> > from folks on other ASF project what they have done.
> >
> > It also raises another issue, which is that I think we would benefit from
> > granular Jira components for connectors so issues can be routed effectively.
> > Connectors with no owner will likely rot and we'll need a plan for what to do
> > when that happens.
> >
> > Kenn
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: [DISCUSS] What to do about widespread KinesisIO breakage

2018-02-05 Thread Jean-Baptiste Onofré
Hi Kenn,

I have Alexey in my team who started to work on the write part of KinesisIO. I
will ask him to work on those issues.

I think it's OK to keep KinesisIO in the distribution and work on it in the
meantime.

Regards
JB

On 02/06/2018 04:20 AM, Kenneth Knowles wrote:
> The flaking of KinesisIO on both Maven and Gradle executions has become very
> bad. Multiple methods are flaky, and we've collected these Critical-severity
> tickets:
> 
>     (looks like https://issues.apache.org/jira/browse/BEAM-3228 is fixed?)
>     https://issues.apache.org/jira/browse/BEAM-3317
>     https://issues.apache.org/jira/browse/BEAM-3598
>     https://issues.apache.org/jira/browse/BEAM-3599
>     https://issues.apache.org/jira/browse/BEAM-3605
> 
> There don't appear to be immediate resources for addressing these so I have
> taken these so they can be sickbayed until there are resources to de-flake 
> them
> and turn them on again.
> 
>     https://github.com/apache/beam/pull/4605
> 
> If you are interested in KinesisIO, please volunteer and fix them properly.
> 
> The trouble is that the flakes may indicate serious problems with KinesisIO.
> Should we stop shipping KinesisIO until this is fixed? I'm interested in 
> hearing
> from folks on other ASF project what they have done.
> 
> It also raises another issue, which is that I think we would benefit from
> granular Jira components for connectors so issues can be routed effectively.
> Connectors with no owner will likely rot and we'll need a plan for what to do
> when that happens.
> 
> Kenn
> 

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Schema-Aware PCollections revisited

2018-02-05 Thread Kenneth Knowles
Joining late, but very interested. Commented on the doc. Since there's a
forked discussion between doc and thread, I want to say this on the thread:

1. I have used JSON schema in production for describing the structure of
analytics events and it is OK but not great. If you are sure your data is
only JSON, use it. For Beam the hierarchical structure is meaningful while
the atomic pieces should be existing coders. When we integrate with SQL
that can get more specific.

2. Overall, I found the discussion and doc a bit short on use cases. I can
propose a few:

 - incoming topic of events from clients (at various levels of upgrade /
schema adherence)
 - async update of client and pipeline in the above
 - archive of files that parse to a POJO of known schema, or archive of all
of the above
 - SQL integration / columnar operation with all of the above
 - autogenerated UI integration with all of the above

My impression is that the design will nail SQL integration and
autogenerated UI but will leave compatibility/evolution concerns for later.
IMO this is smart as they are much harder.

Kenn

On Mon, Feb 5, 2018 at 1:55 PM, Romain Manni-Bucau wrote:

> None, Json-p - the spec so no strong impl requires - as record API and a
> custom light wrapping for schema - like https://github.com/
> Talend/component-runtime/blob/master/component-form/
> component-form-model/src/main/java/org/talend/sdk/component/
> form/model/jsonschema/JsonSchema.java (note this code is used for
> something else) or a plain JsonObject which should be sufficient.
>
> side note: Apache Johnzon would probably be happy to host an enriched
> schema module based on jsonp if you feel it better this way.
>
>
> On 5 Feb 2018 21:43, "Reuven Lax" wrote:
>
> Which json library are you thinking of? At least in Java, there's always
> been a problem of no good standard Json library.
>
>
>
> On Mon, Feb 5, 2018 at 12:03 PM, Romain Manni-Bucau wrote:
>
>>
>>
>> On 5 Feb 2018 19:54, "Reuven Lax" wrote:
>>
>> multiplying by 1.0 doesn't really solve the right problems. The number
>> type used by Javascript (and by extension, they standard for json) only has
>> 53 bits of precision. I've seen many, many bugs caused because of this -
>> the input data may easily contain numbers too large for 53 bits.
>>
>>
>> You have alternative than string at the end whatever schema you use so
>> not sure it is an issue. At least if runtime is in java or mainstream
>> languages.
>>
>>
>>
>> In addition, Beam's schema representation must be no less general than
>> other common representations. For the case of an ETL pipeline, if input
>> fields are integers the output fields should also be numbers. We shouldn't
>> turn them into floats because the schema class we used couldn't distinguish
>> between ints and floats. If anything, Avro schemas are a better fit here as
>> they are more general.
>>
>>
>> This is what previous definition does. Avro are not better for 2 reasons:
>>
>> 1. Their dep stack is a clear blocker and please dont even speak of yet
>> another uncontrolled shade in the API. Until avro become an api only and
>> not an impl this is a bad fit for beam.
>> 2. They must be json friendly so you are back on json + metada so
>> jsonschema+extension entry is strictly equivalent and as typed
>>
>>
>>
>> Reuven
>>
>> On Sun, Feb 4, 2018 at 9:31 AM, Romain Manni-Bucau wrote:
>>
>>> You can handle integers using multipleOf: 1.0 IIRC.
>>> Yes limitations are still here but it is a good starting model and to be
>>> honest it is good enough - not a single model will work good enough even if
>>> you can go a little bit further with other models a bit more complex.
>>> That said the idea is to enrich the model with a beam object which would
>>> allow to complete the metadata as required when needed (never?).
>>>
>>>
>>>
>>> Romain Manni-Bucau
>>> @rmannibucau  |  Blog
>>>  | Old Blog
>>>  | Github
>>>  | LinkedIn
>>>  | Book
>>> 
>>>
>>> 2018-02-04 18:21 GMT+01:00 Jean-Baptiste Onofré :
>>>
 Sorry guys, I was off today. Happy to be part of the party too ;)

 Regards
 JB

 On 02/04/2018 06:19 PM, Reuven Lax wrote:
 > Romain, since you're interested maybe the two of us should put
 together a
 > proposal for how to set this things (hints, schema) on PCollections?
 I don't
 > think it'll be hard - the previous list thread on hints already
 agreed on a
 > general approach, and we would just need to flesh it out.
 >
 > BTW in the past when I looked, Json schemas seemed to have some odd
 limitations
 > inherited from Javascript 

Re: [DISCUSS] What to do about widespread KinesisIO breakage

2018-02-05 Thread Kenneth Knowles
I believe these are all unit tests running locally. The failures generally
look more like incorrect results than service problems.

On Mon, Feb 5, 2018 at 7:36 PM, Reuven Lax  wrote:

> Do these tests run locally, or are they contacting an actual Kinesis
> service?
>
> On Mon, Feb 5, 2018 at 7:20 PM, Kenneth Knowles  wrote:
>
>> The flaking of KinesisIO on both Maven and Gradle executions has become
>> very bad. Multiple methods are flaky, and we've collected these
>> Critical-severity tickets:
>>
>> (looks like https://issues.apache.org/jira/browse/BEAM-3228 is
>> fixed?)
>> https://issues.apache.org/jira/browse/BEAM-3317
>> https://issues.apache.org/jira/browse/BEAM-3598
>> https://issues.apache.org/jira/browse/BEAM-3599
>> https://issues.apache.org/jira/browse/BEAM-3605
>>
>> There don't appear to be immediate resources for addressing these so I
>> have taken these so they can be sickbayed until there are resources to
>> de-flake them and turn them on again.
>>
>> https://github.com/apache/beam/pull/4605
>>
>> If you are interested in KinesisIO, please volunteer and fix them
>> properly.
>>
>> The trouble is that the flakes may indicate serious problems with
>> KinesisIO. Should we stop shipping KinesisIO until this is fixed? I'm
>> interested in hearing from folks on other ASF project what they have done.
>>
>> It also raises another issue, which is that I think we would benefit from
>> granular Jira components for connectors so issues can be routed
>> effectively. Connectors with no owner will likely rot and we'll need a plan
>> for what to do when that happens.
>>
>> Kenn
>>
>>
>


Re: [DISCUSS] What to do about widespread KinesisIO breakage

2018-02-05 Thread Reuven Lax
Do these tests run locally, or are they contacting an actual Kinesis
service?

On Mon, Feb 5, 2018 at 7:20 PM, Kenneth Knowles  wrote:

> The flaking of KinesisIO on both Maven and Gradle executions has become
> very bad. Multiple methods are flaky, and we've collected these
> Critical-severity tickets:
>
> (looks like https://issues.apache.org/jira/browse/BEAM-3228 is fixed?)
> https://issues.apache.org/jira/browse/BEAM-3317
> https://issues.apache.org/jira/browse/BEAM-3598
> https://issues.apache.org/jira/browse/BEAM-3599
> https://issues.apache.org/jira/browse/BEAM-3605
>
> There don't appear to be immediate resources for addressing these so I
> have taken these so they can be sickbayed until there are resources to
> de-flake them and turn them on again.
>
> https://github.com/apache/beam/pull/4605
>
> If you are interested in KinesisIO, please volunteer and fix them properly.
>
> The trouble is that the flakes may indicate serious problems with
> KinesisIO. Should we stop shipping KinesisIO until this is fixed? I'm
> interested in hearing from folks on other ASF project what they have done.
>
> It also raises another issue, which is that I think we would benefit from
> granular Jira components for connectors so issues can be routed
> effectively. Connectors with no owner will likely rot and we'll need a plan
> for what to do when that happens.
>
> Kenn
>
>


[DISCUSS] What to do about widespread KinesisIO breakage

2018-02-05 Thread Kenneth Knowles
The flaking of KinesisIO on both Maven and Gradle executions has become
very bad. Multiple methods are flaky, and we've collected these
Critical-severity tickets:

(looks like https://issues.apache.org/jira/browse/BEAM-3228 is fixed?)
https://issues.apache.org/jira/browse/BEAM-3317
https://issues.apache.org/jira/browse/BEAM-3598
https://issues.apache.org/jira/browse/BEAM-3599
https://issues.apache.org/jira/browse/BEAM-3605

There don't appear to be immediate resources for addressing these so I have
taken these so they can be sickbayed until there are resources to de-flake
them and turn them on again.

https://github.com/apache/beam/pull/4605

If you are interested in KinesisIO, please volunteer and fix them properly.

The trouble is that the flakes may indicate serious problems with
KinesisIO. Should we stop shipping KinesisIO until this is fixed? I'm
interested in hearing from folks on other ASF projects what they have done.

It also raises another issue, which is that I think we would benefit from
granular Jira components for connectors so issues can be routed
effectively. Connectors with no owner will likely rot and we'll need a plan
for what to do when that happens.

Kenn


Re: KafkaIO reading from latest offset when pipeline fails on FlinkRunner

2018-02-05 Thread Raghu Angadi
Hi Sushil,

That is expected behavior. If you don't have any saved checkpoint, the
pipeline would start from scratch. It does not have any connection to the
previous run.
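
For concreteness, a minimal sketch of running a Beam pipeline on the
FlinkRunner with checkpointing enabled, so a restart can restore Kafka offsets
from the last snapshot. The interval value is illustrative, and restoring from
a savepoint happens when the Flink job is submitted, which is not shown here:

  import org.apache.beam.runners.flink.FlinkPipelineOptions;
  import org.apache.beam.runners.flink.FlinkRunner;
  import org.apache.beam.sdk.Pipeline;
  import org.apache.beam.sdk.options.PipelineOptionsFactory;

  public class CheckpointedPipeline {
    public static void main(String[] args) {
      FlinkPipelineOptions options =
          PipelineOptionsFactory.fromArgs(args).as(FlinkPipelineOptions.class);
      options.setRunner(FlinkRunner.class);
      // Without a checkpoint interval, Flink never snapshots state, so a
      // restarted job has nothing to restore and KafkaIO starts from the
      // latest offset - the behavior described above.
      options.setCheckpointingInterval(60_000L); // snapshot every minute

      Pipeline pipeline = Pipeline.create(options);
      // ... KafkaIO.read() and the rest of the pipeline go here ...
      pipeline.run();
    }
  }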

On Thu, Feb 1, 2018 at 1:29 AM, Sushil Ks  wrote:

> Hi,
>Apologies for delay in my reply,
>
> @Raghu Angadi
> This checkpoints 20 mins, as you mentioned before any
> checkpoint is created and if the pipeline restarts, it's reading from the
> latest offset.
>
> @Mingmin
> Thanks a lot for sharing your learnings, However in case of any
> *UserCodeException* while processing the element as part of ParDo after
> materializing the window, the pipeline drops the unprocessed elements and
> restarts. Is this expected from Beam?
>
>
> On Wed, Jan 17, 2018 at 2:13 AM, Kenneth Knowles  wrote:
>
>> Is there a JIRA filed for this? I think this discussion should live in a
>> ticket.
>>
>> Kenn
>>
>> On Wed, Jan 10, 2018 at 11:00 AM, Mingmin Xu  wrote:
>>
>>> @Sushil, I have several jobs running on KafkaIO+FlinkRunner, hope my
>>> experience can help you a bit.
>>>
>>> For short, `ENABLE_AUTO_COMMIT_CONFIG` doesn't meet your requirement,
>>> you need to leverage exactly-once checkpoint/savepoint in Flink. The reason
>>> is,  with `ENABLE_AUTO_COMMIT_CONFIG` KafkaIO commits offset after data is
>>> read, and once job is restarted KafkaIO reads from last_committed_offset.
>>>
>>> In my jobs, I enable external(external should be optional I think?)
>>> checkpoint on exactly-once mode in Flink cluster. When the job auto-restart
>>> on failures it doesn't lost data. In case of manually redeploy the job, I
>>> use savepoint to cancel and launch the job.
>>>
>>> Mingmin
>>>
>>> On Wed, Jan 10, 2018 at 10:34 AM, Raghu Angadi wrote:
>>>
 How often does your pipeline checkpoint/snapshot? If the failure
 happens before the first checkpoint, the pipeline could restart without any
 state, in which case KafkaIO would read from latest offset. There is
 probably some way to verify if pipeline is restarting from a checkpoint.

 On Sun, Jan 7, 2018 at 10:57 PM, Sushil Ks  wrote:

> HI Aljoscha,
>The issue is let's say I consumed 100 elements in 5
> mins Fixed Window with *GroupByKey* and later I applied *ParDO* for
> all those elements. If there is an issue while processing element 70 in
> *ParDo *and the pipeline restarts with *UserCodeException *it's
> skipping the rest 30 elements. Wanted to know if this is expected? In case
> if you still having doubt let me know will share a code snippet.
>
> Regards,
> Sushil Ks
>


>>>
>>>
>>> --
>>> 
>>> Mingmin
>>>
>>
>>
>


Re: Schema-Aware PCollections revisited

2018-02-05 Thread Romain Manni-Bucau
None; JSON-P - the spec, so no strong impl required - as the record API, and a
custom light wrapping for the schema - like
https://github.com/Talend/component-runtime/blob/master/component-form/component-form-model/src/main/java/org/talend/sdk/component/form/model/jsonschema/JsonSchema.java
(note this code is used for something else) or a plain JsonObject, which
should be sufficient.

side note: Apache Johnzon would probably be happy to host an enriched
schema module based on JSON-P if you feel it is better this way.
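
For concreteness, a minimal JSON-P (javax.json) sketch of this
record-plus-schema idea - the schema shape is illustrative, and is not the
JsonSchema class linked above:

  import javax.json.Json;
  import javax.json.JsonObject;

  public class JsonpRecordExample {
    public static void main(String[] args) {
      // A generic record as a plain JSON-P JsonObject: no Beam- or
      // Avro-specific types leak into the user's model.
      JsonObject record = Json.createObjectBuilder()
          .add("userId", 42)
          .add("name", "Ada")
          .build();

      // A light JSON Schema for the record. "multipleOf": 1.0 is the trick
      // mentioned elsewhere in this thread for constraining a JSON "number"
      // to integral values.
      JsonObject schema = Json.createObjectBuilder()
          .add("type", "object")
          .add("properties", Json.createObjectBuilder()
              .add("userId", Json.createObjectBuilder()
                  .add("type", "number")
                  .add("multipleOf", 1.0))
              .add("name", Json.createObjectBuilder()
                  .add("type", "string")))
          .build();

      System.out.println(record);
      System.out.println(schema);
    }
  }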

On 5 Feb 2018 21:43, "Reuven Lax" wrote:

Which json library are you thinking of? At least in Java, there's always
been a problem of no good standard Json library.



On Mon, Feb 5, 2018 at 12:03 PM, Romain Manni-Bucau wrote:

>
>
> On 5 Feb 2018 19:54, "Reuven Lax" wrote:
>
> multiplying by 1.0 doesn't really solve the right problems. The number
> type used by Javascript (and by extension, they standard for json) only has
> 53 bits of precision. I've seen many, many bugs caused because of this -
> the input data may easily contain numbers too large for 53 bits.
>
>
> You have alternative than string at the end whatever schema you use so not
> sure it is an issue. At least if runtime is in java or mainstream languages.
>
>
>
> In addition, Beam's schema representation must be no less general than
> other common representations. For the case of an ETL pipeline, if input
> fields are integers the output fields should also be numbers. We shouldn't
> turn them into floats because the schema class we used couldn't distinguish
> between ints and floats. If anything, Avro schemas are a better fit here as
> they are more general.
>
>
> This is what previous definition does. Avro are not better for 2 reasons:
>
> 1. Their dep stack is a clear blocker and please dont even speak of yet
> another uncontrolled shade in the API. Until avro become an api only and
> not an impl this is a bad fit for beam.
> 2. They must be json friendly so you are back on json + metada so
> jsonschema+extension entry is strictly equivalent and as typed
>
>
>
> Reuven
>
> On Sun, Feb 4, 2018 at 9:31 AM, Romain Manni-Bucau wrote:
>
>> You can handle integers using multipleOf: 1.0 IIRC.
>> Yes limitations are still here but it is a good starting model and to be
>> honest it is good enough - not a single model will work good enough even if
>> you can go a little bit further with other models a bit more complex.
>> That said the idea is to enrich the model with a beam object which would
>> allow to complete the metadata as required when needed (never?).
>>
>>
>>
>> Romain Manni-Bucau
>> @rmannibucau  |  Blog
>>  | Old Blog
>>  | Github
>>  | LinkedIn
>>  | Book
>> 
>>
>> 2018-02-04 18:21 GMT+01:00 Jean-Baptiste Onofré :
>>
>>> Sorry guys, I was off today. Happy to be part of the party too ;)
>>>
>>> Regards
>>> JB
>>>
>>> On 02/04/2018 06:19 PM, Reuven Lax wrote:
>>> > Romain, since you're interested maybe the two of us should put
>>> together a
>>> > proposal for how to set this things (hints, schema) on PCollections? I
>>> don't
>>> > think it'll be hard - the previous list thread on hints already agreed
>>> on a
>>> > general approach, and we would just need to flesh it out.
>>> >
>>> > BTW in the past when I looked, Json schemas seemed to have some odd
>>> limitations
>>> > inherited from Javascript (e.g. no distinction between integer and
>>> > floating-point types). Is that still true?
>>> >
>>> > Reuven
>>> >
>>> > On Sun, Feb 4, 2018 at 9:12 AM, Romain Manni-Bucau <
>>> rmannibu...@gmail.com
>>> > > wrote:
>>> >
>>> >
>>> >
>>> > 2018-02-04 17:53 GMT+01:00 Reuven Lax >> > >:
>>> >
>>> >
>>> >
>>> > On Sun, Feb 4, 2018 at 8:42 AM, Romain Manni-Bucau
>>> > > wrote:
>>> >
>>> >
>>> > 2018-02-04 17:37 GMT+01:00 Reuven Lax >> > >:
>>> >
>>> > I'm not sure where proto comes from here. Proto is one
>>> example
>>> > of a type that has a schema, but only one example.
>>> >
>>> > 1. In the initial prototype I want to avoid modifying
>>> the
>>> > PCollection API. So I think it's best to create a
>>> special
>>> > SchemaCoder, and pass the schema into this coder.
>>> Later we might
>>> > targeted APIs for this instead of going through a
>>> coder.
>>> > 1.a I don't see what hints have to do with this?
>>> >
>>> >
>>> > Hints are a way to replace 

Re: Schema-Aware PCollections revisited

2018-02-05 Thread Reuven Lax
Which json library are you thinking of? At least in Java, there's always
been a problem of no good standard Json library.



On Mon, Feb 5, 2018 at 12:03 PM, Romain Manni-Bucau wrote:

>
>
> On 5 Feb 2018 19:54, "Reuven Lax" wrote:
>
> multiplying by 1.0 doesn't really solve the right problems. The number
> type used by Javascript (and by extension, they standard for json) only has
> 53 bits of precision. I've seen many, many bugs caused because of this -
> the input data may easily contain numbers too large for 53 bits.
>
>
> You have alternative than string at the end whatever schema you use so not
> sure it is an issue. At least if runtime is in java or mainstream languages.
>
>
>
> In addition, Beam's schema representation must be no less general than
> other common representations. For the case of an ETL pipeline, if input
> fields are integers the output fields should also be numbers. We shouldn't
> turn them into floats because the schema class we used couldn't distinguish
> between ints and floats. If anything, Avro schemas are a better fit here as
> they are more general.
>
>
> This is what previous definition does. Avro are not better for 2 reasons:
>
> 1. Their dep stack is a clear blocker and please dont even speak of yet
> another uncontrolled shade in the API. Until avro become an api only and
> not an impl this is a bad fit for beam.
> 2. They must be json friendly so you are back on json + metada so
> jsonschema+extension entry is strictly equivalent and as typed
>
>
>
> Reuven
>
> On Sun, Feb 4, 2018 at 9:31 AM, Romain Manni-Bucau wrote:
>
>> You can handle integers using multipleOf: 1.0 IIRC.
>> Yes limitations are still here but it is a good starting model and to be
>> honest it is good enough - not a single model will work good enough even if
>> you can go a little bit further with other models a bit more complex.
>> That said the idea is to enrich the model with a beam object which would
>> allow to complete the metadata as required when needed (never?).
>>
>>
>>
>> Romain Manni-Bucau
>> @rmannibucau  |  Blog
>>  | Old Blog
>>  | Github
>>  | LinkedIn
>>  | Book
>> 
>>
>> 2018-02-04 18:21 GMT+01:00 Jean-Baptiste Onofré :
>>
>>> Sorry guys, I was off today. Happy to be part of the party too ;)
>>>
>>> Regards
>>> JB
>>>
>>> On 02/04/2018 06:19 PM, Reuven Lax wrote:
>>> > Romain, since you're interested maybe the two of us should put
>>> together a
>>> > proposal for how to set this things (hints, schema) on PCollections? I
>>> don't
>>> > think it'll be hard - the previous list thread on hints already agreed
>>> on a
>>> > general approach, and we would just need to flesh it out.
>>> >
>>> > BTW in the past when I looked, Json schemas seemed to have some odd
>>> limitations
>>> > inherited from Javascript (e.g. no distinction between integer and
>>> > floating-point types). Is that still true?
>>> >
>>> > Reuven
>>> >
>>> > On Sun, Feb 4, 2018 at 9:12 AM, Romain Manni-Bucau <
>>> rmannibu...@gmail.com
>>> > > wrote:
>>> >
>>> >
>>> >
>>> > 2018-02-04 17:53 GMT+01:00 Reuven Lax >> > >:
>>> >
>>> >
>>> >
>>> > On Sun, Feb 4, 2018 at 8:42 AM, Romain Manni-Bucau
>>> > > wrote:
>>> >
>>> >
>>> > 2018-02-04 17:37 GMT+01:00 Reuven Lax >> > >:
>>> >
>>> > I'm not sure where proto comes from here. Proto is one
>>> example
>>> > of a type that has a schema, but only one example.
>>> >
>>> > 1. In the initial prototype I want to avoid modifying
>>> the
>>> > PCollection API. So I think it's best to create a
>>> special
>>> > SchemaCoder, and pass the schema into this coder.
>>> Later we might
>>> > targeted APIs for this instead of going through a
>>> coder.
>>> > 1.a I don't see what hints have to do with this?
>>> >
>>> >
>>> > Hints are a way to replace the new API and unify the way
>>> to pass
>>> > metadata in beam instead of adding a new custom way each
>>> time.
>>> >
>>> >
>>> > I don't think schema is a hint. But I hear what your saying -
>>> hint is a
>>> > type of PCollection metadata as is schema, and we should have
>>> a unified
>>> > API for setting such metadata.
>>> >
>>> >
>>> > :), Ismael pointed me out earlier this week that "hint" had an old
>>> meaning
>>> > in beam. My usage is purely the one done in most EE spec (your
>>> "metadata" in
>>> > previous 

Re: coder evolutions?

2018-02-05 Thread Romain Manni-Bucau
Does it mean we would change the implicit resolution? Do you see it being
backward compatible? If so, that sounds like a good solution.

On 5 Feb 2018 20:36, "Kenneth Knowles" wrote:

> TL;DR: create _new_ coders is not a problem. If you have a new idea for an
> encoding, you can build it alongside and users can use it. We also need
> data migration, and this is probably the easy way to be ready for that.
>
> We made a pretty big mistake in our naming of ListCoder, SetCoder, and
> IterableLikeCoder because they make users think it is the
> only/best/canonical encoding. We did it right with e.g. VarLongCoder and
> BigEndianLongCoder. There is a default, but it is just a default.
>
> We actually already need "SetIterableLikeCoder" (aka SetCoder) and perhaps
> "LexicallySortedBytesSetCoder" so we can change coder inference to ask for
> a deterministic coder when it is needed instead of first asking for "any"
> coder and then crashing when we get the wrong type.
>
> Kenn
>
> On Mon, Feb 5, 2018 at 11:00 AM, Robert Bradshaw wrote:
>
>> Just to clarify, the issue is that for some types (byte array being
>> the simplest) one needs to know the length of the data in order to
>> decode it from the stream. In particular, the claim is that many
>> libraries out there that do encoding/decoding assume they can gather
>> this information from the end of the stream and so don't explicitly
>> record it. For nested values, someone needs to record these lengths.
>> Note that in the Fn API, nearly everything is nested, as the elements
>> are sent as a large byte stream of concatenated encoded elements.
>>
>> Your proposed solution is to require all container coders (though I
>> think your PR only considers IterableLikeCoder, there's others, and
>> there's the Elements proto itself) to prefix element encodings with
>> sizes so it can give truncated streams on decoding. I think this
>> places an undue burden (and code redundancy in) container coders, and
>> disallows optimization on those coders that don't need to be length
>> prefixed (and note that *prefixing* with length is not the only way to
>> delimit a stream, we shouldn't impose that restriction as well).
>> Instead, I'd keep thing the way they are, but offer a new Coder
>> subclass that users can subclass if they want to write an "easy" Coder
>> that does the delimiting for them (on encode and decode). We would
>> point users to this for writing custom coders in the easiest way
>> possible as a good option, and keeps the current Coder API the same.
>>
>> On Mon, Feb 5, 2018 at 10:21 AM, Romain Manni-Bucau wrote:
>> > Answered inlined but I want to highlight beam is a portable API on top
>> of
>> > well known vendors API which have friendly shortcuts. So the background
>> here
>> > is to make beam at least user friendly.
>> >
>> > Im fine if the outcome of the discussion is coder concept is wrong or
>> > something like that but Im not fine to say we dont want to solve an API
>> > issue, to not say bug, of a project which has an API as added value.
>> >
>> > I understand the perf concern which must be balanced with the fact
>> > derialization is not used for each step/transform and that currently the
>> > coder API is already intrusive and heavy for dev but also not usable by
>> most
>> > existing codecs out there. Even some jaxb or plain xml flavors dont work
>> > with it :(.
>> >
>> >
>> > On 5 Feb 2018 18:46, "Robert Bradshaw" wrote:
>> >
>> > On Sun, Feb 4, 2018 at 6:44 AM, Romain Manni-Bucau wrote:
>> >> Hi guys,
>> >>
>> >> I submitted a PR on coders to enhance 1. the user experience 2. the
>> >> determinism and handling of coders.
>> >>
>> >> 1. the user experience is linked to what i sent some days ago: close
>> >> handling of the streams from a coder code. Long story short I add a
>> >> SkipCloseCoder which can decorate a coder and just wraps the stream
>> (input
>> >> or output) in flavors skipping close() calls. This avoids to do it by
>> >> default (which had my preference if you read the related thread but not
>> >> the
>> >> one of everybody) but also makes the usage of a coder with this issue
>> easy
>> >> since the of() of the coder just wraps itself in this delagating coder.
>> >>
>> >> 2. this one is more nasty and mainly concerns IterableLikeCoders. These
>> >> ones
>> >> use this kind of algorithm (keep in mind they work on a list):
>> >>
>> >> writeSize()
>> >> for all element e {
>> >> elementCoder.write(e)
>> >> }
>> >> writeMagicNumber() // this one depends the size
>> >>
>> >> The decoding is symmetric so I bypass it here.
>> >>
>> >> Indeed all these writes (reads) are done on the same stream. Therefore
>> it
>> >> assumes you read as much bytes than you write...which is a huge
>> assumption
>> >> for a coder which should by contract assume it can read the
>> stream...as a
>> >> stream (until -1).
>> >>
>> >> The idea of 

Re: Schema-Aware PCollections revisited

2018-02-05 Thread Romain Manni-Bucau
On 5 Feb 2018 19:54, "Reuven Lax" wrote:

multiplying by 1.0 doesn't really solve the right problems. The number type
used by Javascript (and by extension, the standard for JSON) only has 53
bits of precision. I've seen many, many bugs caused because of this - the
input data may easily contain numbers too large for 53 bits.
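
For concreteness, a minimal Java sketch of the precision loss being described:

  public class DoublePrecisionDemo {
    public static void main(String[] args) {
      // 2^53 + 1 is the first long that a 64-bit IEEE double (and hence a
      // JSON/JavaScript "number") cannot represent exactly.
      long big = (1L << 53) + 1;          // 9007199254740993
      double asJsonNumber = (double) big; // rounds to 9007199254740992
      System.out.println(big == (long) asJsonNumber); // prints: false
    }
  }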


You have alternatives to string in the end, whatever schema you use, so I'm not
sure it is an issue. At least if the runtime is in Java or mainstream languages.



In addition, Beam's schema representation must be no less general than
other common representations. For the case of an ETL pipeline, if input
fields are integers the output fields should also be numbers. We shouldn't
turn them into floats because the schema class we used couldn't distinguish
between ints and floats. If anything, Avro schemas are a better fit here as
they are more general.


This is what the previous definition does. Avro is not better, for 2 reasons:

1. Their dep stack is a clear blocker, and please don't even speak of yet
another uncontrolled shade in the API. Until Avro becomes an API only and
not an impl, it is a bad fit for Beam.
2. They must be JSON-friendly, so you are back to JSON + metadata, so a
jsonschema+extension entry is strictly equivalent and just as typed.



Reuven

On Sun, Feb 4, 2018 at 9:31 AM, Romain Manni-Bucau wrote:

> You can handle integers using multipleOf: 1.0 IIRC.
> Yes limitations are still here but it is a good starting model and to be
> honest it is good enough - not a single model will work good enough even if
> you can go a little bit further with other models a bit more complex.
> That said the idea is to enrich the model with a beam object which would
> allow to complete the metadata as required when needed (never?).
>
>
>
> Romain Manni-Bucau
> @rmannibucau  |  Blog
>  | Old Blog
>  | Github
>  | LinkedIn
>  | Book
> 
>
> 2018-02-04 18:21 GMT+01:00 Jean-Baptiste Onofré :
>
>> Sorry guys, I was off today. Happy to be part of the party too ;)
>>
>> Regards
>> JB
>>
>> On 02/04/2018 06:19 PM, Reuven Lax wrote:
>> > Romain, since you're interested maybe the two of us should put together
>> a
>> > proposal for how to set this things (hints, schema) on PCollections? I
>> don't
>> > think it'll be hard - the previous list thread on hints already agreed
>> on a
>> > general approach, and we would just need to flesh it out.
>> >
>> > BTW in the past when I looked, Json schemas seemed to have some odd
>> limitations
>> > inherited from Javascript (e.g. no distinction between integer and
>> > floating-point types). Is that still true?
>> >
>> > Reuven
>> >
>> > On Sun, Feb 4, 2018 at 9:12 AM, Romain Manni-Bucau <
>> rmannibu...@gmail.com
>> > > wrote:
>> >
>> >
>> >
>> > 2018-02-04 17:53 GMT+01:00 Reuven Lax > > >:
>> >
>> >
>> >
>> > On Sun, Feb 4, 2018 at 8:42 AM, Romain Manni-Bucau
>> > > wrote:
>> >
>> >
>> > 2018-02-04 17:37 GMT+01:00 Reuven Lax > > >:
>> >
>> > I'm not sure where proto comes from here. Proto is one
>> example
>> > of a type that has a schema, but only one example.
>> >
>> > 1. In the initial prototype I want to avoid modifying
>> the
>> > PCollection API. So I think it's best to create a
>> special
>> > SchemaCoder, and pass the schema into this coder. Later
>> we might
>> > targeted APIs for this instead of going through a coder.
>> > 1.a I don't see what hints have to do with this?
>> >
>> >
>> > Hints are a way to replace the new API and unify the way to
>> pass
>> > metadata in beam instead of adding a new custom way each
>> time.
>> >
>> >
>> > I don't think schema is a hint. But I hear what your saying -
>> hint is a
>> > type of PCollection metadata as is schema, and we should have a
>> unified
>> > API for setting such metadata.
>> >
>> >
>> > :), Ismael pointed me out earlier this week that "hint" had an old
>> meaning
>> > in beam. My usage is purely the one done in most EE spec (your
>> "metadata" in
>> > previous answer). But guess we are aligned on the meaning now, just
>> wanted
>> > to be sure.
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> > 2. BeamSQL already has a generic record type which fits
>> this use
>> > case very well (though we might modify it). However as
>> mentioned
>> > in the doc, the user is never forced to 

Re: coder evolutions?

2018-02-05 Thread Kenneth Knowles
TL;DR: creating _new_ coders is not a problem. If you have a new idea for an
encoding, you can build it alongside and users can use it. We also need
data migration, and this is probably the easy way to be ready for that.

We made a pretty big mistake in our naming of ListCoder, SetCoder, and
IterableLikeCoder because they make users think it is the
only/best/canonical encoding. We did it right with e.g. VarLongCoder and
BigEndianLongCoder. There is a default, but it is just a default.

We actually already need "SetIterableLikeCoder" (aka SetCoder) and perhaps
"LexicallySortedBytesSetCoder" so we can change coder inference to ask for
a deterministic coder when it is needed instead of first asking for "any"
coder and then crashing when we get the wrong type.

Kenn
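
To make the "LexicallySortedBytesSetCoder" idea concrete, here is a minimal
sketch - the class name and shape are illustrative, not an existing Beam coder,
and it assumes Beam's org.apache.beam.sdk.util.VarInt helper:

  import java.io.ByteArrayOutputStream;
  import java.io.IOException;
  import java.io.OutputStream;
  import java.util.ArrayList;
  import java.util.List;
  import java.util.Set;
  import org.apache.beam.sdk.coders.Coder;
  import org.apache.beam.sdk.util.VarInt;

  // Encode each element, sort the encoded byte strings lexicographically,
  // then write them length-prefixed. The Set's iteration order no longer
  // matters, so the encoding is deterministic whenever the element coder is.
  class LexSortedSetEncoder {
    static <T> void encode(Set<T> set, Coder<T> elementCoder, OutputStream out)
        throws IOException {
      List<byte[]> encoded = new ArrayList<>();
      for (T value : set) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        elementCoder.encode(value, buffer);
        encoded.add(buffer.toByteArray());
      }
      encoded.sort(LexSortedSetEncoder::compareBytes);
      VarInt.encode(set.size(), out);
      for (byte[] bytes : encoded) {
        VarInt.encode(bytes.length, out);
        out.write(bytes);
      }
    }

    private static int compareBytes(byte[] a, byte[] b) {
      for (int i = 0; i < Math.min(a.length, b.length); i++) {
        int cmp = Integer.compare(a[i] & 0xff, b[i] & 0xff);
        if (cmp != 0) {
          return cmp;
        }
      }
      return Integer.compare(a.length, b.length);
    }
  }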

On Mon, Feb 5, 2018 at 11:00 AM, Robert Bradshaw wrote:

> Just to clarify, the issue is that for some types (byte array being
> the simplest) one needs to know the length of the data in order to
> decode it from the stream. In particular, the claim is that many
> libraries out there that do encoding/decoding assume they can gather
> this information from the end of the stream and so don't explicitly
> record it. For nested values, someone needs to record these lengths.
> Note that in the Fn API, nearly everything is nested, as the elements
> are sent as a large byte stream of concatenated encoded elements.
>
> Your proposed solution is to require all container coders (though I
> think your PR only considers IterableLikeCoder, there's others, and
> there's the Elements proto itself) to prefix element encodings with
> sizes so it can give truncated streams on decoding. I think this
> places an undue burden (and code redundancy in) container coders, and
> disallows optimization on those coders that don't need to be length
> prefixed (and note that *prefixing* with length is not the only way to
> delimit a stream, we shouldn't impose that restriction as well).
> Instead, I'd keep thing the way they are, but offer a new Coder
> subclass that users can subclass if they want to write an "easy" Coder
> that does the delimiting for them (on encode and decode). We would
> point users to this for writing custom coders in the easiest way
> possible as a good option, and keeps the current Coder API the same.
>
> On Mon, Feb 5, 2018 at 10:21 AM, Romain Manni-Bucau wrote:
> > Answered inlined but I want to highlight beam is a portable API on top of
> > well known vendors API which have friendly shortcuts. So the background
> here
> > is to make beam at least user friendly.
> >
> > Im fine if the outcome of the discussion is coder concept is wrong or
> > something like that but Im not fine to say we dont want to solve an API
> > issue, to not say bug, of a project which has an API as added value.
> >
> > I understand the perf concern which must be balanced with the fact
> > derialization is not used for each step/transform and that currently the
> > coder API is already intrusive and heavy for dev but also not usable by
> most
> > existing codecs out there. Even some jaxb or plain xml flavors dont work
> > with it :(.
> >
> >
> > On 5 Feb 2018 18:46, "Robert Bradshaw" wrote:
> >
> > On Sun, Feb 4, 2018 at 6:44 AM, Romain Manni-Bucau wrote:
> >> Hi guys,
> >>
> >> I submitted a PR on coders to enhance 1. the user experience 2. the
> >> determinism and handling of coders.
> >>
> >> 1. the user experience is linked to what i sent some days ago: close
> >> handling of the streams from a coder code. Long story short I add a
> >> SkipCloseCoder which can decorate a coder and just wraps the stream
> (input
> >> or output) in flavors skipping close() calls. This avoids to do it by
> >> default (which had my preference if you read the related thread but not
> >> the
> >> one of everybody) but also makes the usage of a coder with this issue
> easy
> >> since the of() of the coder just wraps itself in this delagating coder.
> >>
> >> 2. this one is nastier and mainly concerns IterableLikeCoders. These
> >> ones use this kind of algorithm (keep in mind they work on a list):
> >>
> >> writeSize()
> >> for all element e {
> >> elementCoder.write(e)
> >> }
> >> writeMagicNumber() // this one depends on the size
> >>
> >> The decoding is symmetric so I skip it here.
> >>
> >> Indeed all these writes (reads) are done on the same stream. Therefore
> >> it assumes you read as many bytes as you write... which is a huge
> >> assumption for a coder which should by contract assume it can read the
> >> stream... as a stream (until -1).
> >>
> >> The idea of the fix is to change this encoding to this kind of
> >> algorithm:
> >>
> >> writeSize()
> >> for all element e {
> >> writeElementByteCount(e)
> >> elementCoder.write(e)
> >> }
> >> writeMagicNumber() // still optional
> >
> > Regardless of the backwards incompatibility issues, I'm unconvinced
> > that 

Re: coder evolutions?

2018-02-05 Thread Eugene Kirpichov
From a brief reading of this discussion: if I understand correctly, we want
something to help deal with libraries that assume that they own the stream
(e.g. some common xml or json parsers), when using them in a context where
they don't (inside a Coder).

Setting aside the question of "why would one even use an xml or json
library in a coder" (coders should be efficient, and the wire format of a
coder is not intended to be readable by anything except this exact coder
itself), and the point that ideally users would simply never write new
coders (I'm hoping schemas can move us in that direction) - I think we just
want an adapter for input and output streams that does this, and to put
this adapter in front of both reading and writing when using the library.

One way to build such an adapter is length prefixing:
- output stream collects all bytes into a byte array, and when closed,
writes the array length-prefixed
- input stream reads the length, and then lets the consumer read only as
many bytes
This adds an extra copy (extra GC pressure + extra maximum memory usage). I
suppose there has to be some price for integrating with a misbehaving
library, but I, like some others in this thread, would not be comfortable
having this overhead unconditionally even when it's not needed.
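
As a concrete sketch of that adapter, assuming nothing beyond java.io (the
class names are made up for the example):

import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/** Buffers everything the library writes; emits it length-prefixed on close(). */
class LengthPrefixingOutputStream extends FilterOutputStream {
  private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();

  LengthPrefixingOutputStream(OutputStream target) {
    super(target);
  }

  @Override
  public void write(int b) {
    buffer.write(b); // the extra copy mentioned above lives here
  }

  @Override
  public void write(byte[] b, int off, int len) {
    buffer.write(b, off, len);
  }

  @Override
  public void close() throws IOException {
    DataOutputStream data = new DataOutputStream(out);
    data.writeInt(buffer.size()); // length prefix...
    buffer.writeTo(data);         // ...then the payload
    data.flush(); // deliberately not closing `out`: it isn't ours
  }
}

/** Reads the length, then lets the consumer read only that many bytes. */
class LengthLimitedInputStream extends InputStream {
  private final InputStream in;
  private int remaining;

  LengthLimitedInputStream(InputStream in) throws IOException {
    this.in = in;
    this.remaining = new DataInputStream(in).readInt();
  }

  @Override
  public int read() throws IOException {
    if (remaining <= 0) {
      return -1; // the library sees EOF exactly at the value boundary
    }
    int b = in.read();
    if (b >= 0) {
      remaining--;
    }
    return b;
  }
}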

There are other ways to nest streams with considerably less overhead, e.g.
chunked length-prefixing (also copying, but less) or escaping (having an
"escape" byte, e.g. 0xFF, and when the input contains this byte, writing
0xFF 0xFF). However, I think we still don't need to apply them
unconditionally - I see nothing wrong with applying them on a case-by-case
basis, only when your coder is actually using a misbehaving library.
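
A sketch of the escaping variant; the end-of-value convention used here
(0xFF followed by 0x00) is an assumption of this example, not part of the
description above. FilterOutputStream's default bulk write() funnels
through write(int), so multi-byte writes get escaped as well:

import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

class EscapingOutputStream extends FilterOutputStream {
  EscapingOutputStream(OutputStream out) {
    super(out);
  }

  @Override
  public void write(int b) throws IOException {
    if ((b & 0xff) == 0xff) {
      out.write(0xff); // a literal 0xFF in the payload is doubled
    }
    out.write(b);
  }

  /** Call when the nested value is complete (instead of close()). */
  void endValue() throws IOException {
    out.write(0xff);
    out.write(0x00); // escape byte followed by a non-escape byte = terminator
  }
}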

On Mon, Feb 5, 2018 at 11:00 AM Robert Bradshaw  wrote:

> Just to clarify, the issue is that for some types (byte array being
> the simplest) one needs to know the length of the data in order to
> decode it from the stream. In particular, the claim is that many
> libraries out there that do encoding/decoding assume they can gather
> this information from the end of the stream and so don't explicitly
> record it. For nested values, someone needs to record these lengths.
> Note that in the Fn API, nearly everything is nested, as the elements
> are sent as a large byte stream of concatenated encoded elements.
>
> Your proposed solution is to require all container coders (though I
> think your PR only considers IterableLikeCoder; there are others, and
> there's the Elements proto itself) to prefix element encodings with
> sizes so they can give truncated streams on decoding. I think this
> places an undue burden on (and code redundancy in) container coders,
> and disallows optimization for those coders that don't need to be
> length prefixed (and note that *prefixing* with length is not the only
> way to delimit a stream, so we shouldn't impose that restriction
> either). Instead, I'd keep things the way they are, but offer a new
> Coder subclass that users can subclass if they want to write an "easy"
> Coder that does the delimiting for them (on encode and decode). We
> would point users to this as a good option for writing custom coders
> in the easiest way possible, and it keeps the current Coder API the
> same.
>
> On Mon, Feb 5, 2018 at 10:21 AM, Romain Manni-Bucau
>  wrote:
> > Answered inline, but I want to highlight that Beam is a portable API on
> > top of well-known vendor APIs, which have friendly shortcuts. So the
> > background here is to make Beam at least user friendly.
> >
> > I'm fine if the outcome of the discussion is that the coder concept is
> > wrong or something like that, but I'm not fine with saying we don't
> > want to solve an API issue, not to say bug, of a project whose API is
> > its added value.
> >
> > I understand the perf concern, which must be balanced with the fact
> > that deserialization is not used for each step/transform and that
> > currently the coder API is already intrusive and heavy for devs, but
> > also not usable by most existing codecs out there. Even some JAXB or
> > plain XML flavors don't work with it :(.
> >
> >
> > On 5 Feb 2018 at 18:46, "Robert Bradshaw" wrote:
> >
> > On Sun, Feb 4, 2018 at 6:44 AM, Romain Manni-Bucau
> >  wrote:
> >> Hi guys,
> >>
> >> I submitted a PR on coders to enhance 1. the user experience 2. the
> >> determinism and handling of coders.
> >>
> >> 1. the user experience is linked to what I sent some days ago: close
> >> handling of the streams from a coder code. Long story short, I add a
> >> SkipCloseCoder which can decorate a coder and just wraps the stream
> >> (input or output) in flavors skipping close() calls. This avoids doing
> >> it by default (which had my preference if you read the related thread,
> >> but not the one of everybody) but also makes the usage of a coder with
> >> this issue easy, since the of() of the coder just wraps itself in this
> >> delegating coder.
> >>
> >> 2. this one 

Re: coder evolutions?

2018-02-05 Thread Robert Bradshaw
Just to clarify, the issue is that for some types (byte array being
the simplest) one needs to know the length of the data in order to
decode it from the stream. In particular, the claim is that many
libraries out there that do encoding/decoding assume they can gather
this information from the end of the stream and so don't explicitly
record it. For nested values, someone needs to record these lengths.
Note that in the Fn API, nearly everything is nested, as the elements
are sent as a large byte stream of concatenated encoded elements.

Your proposed solution is to require all container coders (though I
think your PR only considers IterableLikeCoder; there are others, and
there's the Elements proto itself) to prefix element encodings with
sizes so they can give truncated streams on decoding. I think this
places an undue burden on (and code redundancy in) container coders,
and disallows optimization for those coders that don't need to be
length prefixed (and note that *prefixing* with length is not the only
way to delimit a stream, so we shouldn't impose that restriction
either). Instead, I'd keep things the way they are, but offer a new
Coder subclass that users can subclass if they want to write an "easy"
Coder that does the delimiting for them (on encode and decode). We
would point users to this as a good option for writing custom coders in
the easiest way possible, and it keeps the current Coder API the same.
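
A rough sketch of what such an opt-in base class could look like. The name,
the Beam CustomCoder base, and the length-prefix framing are assumed for
illustration; chunking or escaping could equally hide behind the same two
abstract methods:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.beam.sdk.coders.CustomCoder;

public abstract class DelimitedCoder<T> extends CustomCoder<T> {

  /** Subclasses may treat this stream as fully theirs, close() included. */
  protected abstract void encodeDelimited(T value, OutputStream out) throws IOException;

  /** Subclasses may read this stream to EOF (-1); it ends at the value boundary. */
  protected abstract T decodeDelimited(InputStream in) throws IOException;

  @Override
  public final void encode(T value, OutputStream out) throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    encodeDelimited(value, buffer); // a close() on the buffer is harmless
    DataOutputStream data = new DataOutputStream(out);
    data.writeInt(buffer.size()); // length prefixing is one delimiting choice
    buffer.writeTo(data);
  }

  @Override
  public final T decode(InputStream in) throws IOException {
    DataInputStream data = new DataInputStream(in);
    byte[] bytes = new byte[data.readInt()];
    data.readFully(bytes);
    // Hand the subclass a truncated stream: reading to -1 is now safe.
    return decodeDelimited(new ByteArrayInputStream(bytes));
  }
}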

On Mon, Feb 5, 2018 at 10:21 AM, Romain Manni-Bucau
 wrote:
> Answered inline, but I want to highlight that Beam is a portable API on
> top of well-known vendor APIs, which have friendly shortcuts. So the
> background here is to make Beam at least user friendly.
>
> I'm fine if the outcome of the discussion is that the coder concept is
> wrong or something like that, but I'm not fine with saying we don't
> want to solve an API issue, not to say bug, of a project whose API is
> its added value.
>
> I understand the perf concern, which must be balanced with the fact
> that deserialization is not used for each step/transform and that
> currently the coder API is already intrusive and heavy for devs, but
> also not usable by most existing codecs out there. Even some JAXB or
> plain XML flavors don't work with it :(.
>
>
> On 5 Feb 2018 at 18:46, "Robert Bradshaw" wrote:
>
> On Sun, Feb 4, 2018 at 6:44 AM, Romain Manni-Bucau
>  wrote:
>> Hi guys,
>>
>> I submitted a PR on coders to enhance 1. the user experience 2. the
>> determinism and handling of coders.
>>
>> 1. the user experience is linked to what I sent some days ago: close
>> handling of the streams from a coder code. Long story short, I add a
>> SkipCloseCoder which can decorate a coder and just wraps the stream
>> (input or output) in flavors skipping close() calls. This avoids doing
>> it by default (which had my preference if you read the related thread,
>> but not the one of everybody) but also makes the usage of a coder with
>> this issue easy, since the of() of the coder just wraps itself in this
>> delegating coder.
>>
>> 2. this one is nastier and mainly concerns IterableLikeCoders. These
>> ones use this kind of algorithm (keep in mind they work on a list):
>>
>> writeSize()
>> for all element e {
>> elementCoder.write(e)
>> }
>> writeMagicNumber() // this one depends on the size
>>
>> The decoding is symmetric so I skip it here.
>>
>> Indeed all these writes (reads) are done on the same stream. Therefore
>> it assumes you read as many bytes as you write... which is a huge
>> assumption for a coder which should by contract assume it can read the
>> stream... as a stream (until -1).
>>
>> The idea of the fix is to change this encoding to this kind of
>> algorithm:
>>
>> writeSize()
>> for all element e {
>> writeElementByteCount(e)
>> elementCoder.write(e)
>> }
>> writeMagicNumber() // still optional
>
> Regardless of the backwards incompatibility issues, I'm unconvinced
> that prefixing every element with its length is a good idea. It can
> lead to blow-up in size (e.g. a list of ints, and it should be noted
> that containers with lots of elements bias towards having small
> elements) and also writeElementByteCount(e) could be very inefficient
> for many types e (e.g. a list of lists).
>
>
> What is your proposal then, Robert? The current restriction is clearly
> a blocker for portability, users, and determinism, and it is unsafe and
> only checkable at runtime, so not something we should seek to keep.
>
> The alternative I thought about was to forbid implicit coders, but it
> doesn't help users.
>
>
>
>> This way, on the decode side, you can wrap the stream by element to
>> enforce the limitation of the byte count.
>>
>> Side note: this indeed enforces a limitation due to Java's byte[] size
>> limit, but if you check the coder code it is already there at the
>> higher level, so it is not a big deal for now.
>>
>> In terms of implementation it uses a LengthAwareCoder which delegates
>> to another coder the encoding and just

Re: Schema-Aware PCollections revisited

2018-02-05 Thread Reuven Lax
"multipleOf: 1.0" doesn't really solve the right problem. The number type
used by JavaScript (and, by extension, the JSON standard) only has 53 bits
of precision. I've seen many, many bugs caused by this - the input data
may easily contain numbers too large for 53 bits.
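
A small Java illustration of the 53-bit cliff:

public class FiftyThreeBitsDemo {
  public static void main(String[] args) {
    long exact = 1L << 53; // 9007199254740992, the edge of exact doubles
    System.out.println((long) (double) exact == exact);           // true
    System.out.println((long) (double) (exact + 1) == exact + 1); // false:
    // 9007199254740993 round-trips through double back to 9007199254740992
  }
}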

In addition, Beam's schema representation must be no less general than
other common representations. For the case of an ETL pipeline, if input
fields are integers, the output fields should also be integers. We
shouldn't turn them into floats just because the schema class we used
couldn't distinguish between ints and floats. If anything, Avro schemas
are a better fit here, as they are more general.

Reuven

On Sun, Feb 4, 2018 at 9:31 AM, Romain Manni-Bucau 
wrote:

> You can handle integers using multipleOf: 1.0 IIRC.
> Yes, the limitations are still there, but it is a good starting model
> and, to be honest, good enough - no single model will work perfectly,
> even if you can go a little further with other, slightly more complex
> models.
> That said, the idea is to enrich the model with a Beam object which
> would allow completing the metadata as required when needed (never?).
>
>
>
> Romain Manni-Bucau
> @rmannibucau | Blog | Old Blog | Github | LinkedIn | Book
>
> 2018-02-04 18:21 GMT+01:00 Jean-Baptiste Onofré :
>
>> Sorry guys, I was off today. Happy to be part of the party too ;)
>>
>> Regards
>> JB
>>
>> On 02/04/2018 06:19 PM, Reuven Lax wrote:
>> > Romain, since you're interested, maybe the two of us should put
>> > together a proposal for how to set these things (hints, schema) on
>> > PCollections? I don't think it'll be hard - the previous list thread
>> > on hints already agreed on a general approach, and we would just need
>> > to flesh it out.
>> >
>> > BTW, in the past when I looked, JSON schemas seemed to have some odd
>> > limitations inherited from JavaScript (e.g. no distinction between
>> > integer and floating-point types). Is that still true?
>> >
>> > Reuven
>> >
>> > On Sun, Feb 4, 2018 at 9:12 AM, Romain Manni-Bucau <
>> rmannibu...@gmail.com
>> > > wrote:
>> >
>> >
>> >
>> > 2018-02-04 17:53 GMT+01:00 Reuven Lax > > >:
>> >
>> >
>> >
>> > On Sun, Feb 4, 2018 at 8:42 AM, Romain Manni-Bucau
>> > > wrote:
>> >
>> >
>> > 2018-02-04 17:37 GMT+01:00 Reuven Lax > > >:
>> >
>> > I'm not sure where proto comes from here. Proto is one
>> example
>> > of a type that has a schema, but only one example.
>> >
>> > 1. In the initial prototype I want to avoid modifying the
>> > PCollection API. So I think it's best to create a special
>> > SchemaCoder, and pass the schema into this coder. Later we might
>> > have targeted APIs for this instead of going through a coder.
>> > 1.a I don't see what hints have to do with this?
>> >
>> >
>> > Hints are a way to replace the new API and unify the way to pass
>> > metadata in Beam instead of adding a new custom way each time.
>> >
>> >
>> > I don't think schema is a hint. But I hear what you're saying - hint
>> > is a type of PCollection metadata, as is schema, and we should have a
>> > unified API for setting such metadata.
>> >
>> >
>> > :), Ismael pointed out to me earlier this week that "hint" had an old
>> > meaning in Beam. My usage is purely the one used in most EE specs
>> > (your "metadata" in the previous answer). But I guess we are aligned
>> > on the meaning now, just wanted to be sure.
>> >
>> > 2. BeamSQL already has a generic record type which fits this use
>> > case very well (though we might modify it). However, as mentioned in
>> > the doc, the user is never forced to use this generic record type.
>> >
>> >
>> > Well, yes and no. A type already exists, but 1. it is very strictly
>> > limited (flat/columns only, which is very little of what big data SQL
>> > can do) and 2. it must be aligned on the convergence of generic data
>> > the schema will bring (really read "aligned" as "dropped in favor
>> > of" - deprecated being a smooth way to do it).
>> >
>> >
>> > As I said, the existing class needs to be modified and extended, and
>> > not just for this schema use case. It

Re: coder evolutions?

2018-02-05 Thread Raghu Angadi
Could you describe the 2nd issue in a bit more detail, maybe with a short
example? LengthAwareCoder in the PR adds another buffer copy
(BufferedElementCountingOutputStream already has an extra buffer copy).

On Mon, Feb 5, 2018 at 10:34 AM, Romain Manni-Bucau 
wrote:

> Would this work for everyone? I can update the PR if so:
>
> If coder is not built in
> Prefix with byte size
> Else
> Current behavior
>
> ?
>
> On 5 Feb 2018 at 19:21, "Romain Manni-Bucau" wrote:
>
>> Answered inline, but I want to highlight that Beam is a portable API on
>> top of well-known vendor APIs, which have friendly shortcuts. So the
>> background here is to make Beam at least user friendly.
>>
>> I'm fine if the outcome of the discussion is that the coder concept is
>> wrong or something like that, but I'm not fine with saying we don't
>> want to solve an API issue, not to say bug, of a project whose API is
>> its added value.
>>
>> I understand the perf concern, which must be balanced with the fact
>> that deserialization is not used for each step/transform and that
>> currently the coder API is already intrusive and heavy for devs, but
>> also not usable by most existing codecs out there. Even some JAXB or
>> plain XML flavors don't work with it :(.
>>
>> On 5 Feb 2018 at 18:46, "Robert Bradshaw" wrote:
>>
>> On Sun, Feb 4, 2018 at 6:44 AM, Romain Manni-Bucau
>>  wrote:
>> > Hi guys,
>> >
>> > I submitted a PR on coders to enhance 1. the user experience 2. the
>> > determinism and handling of coders.
>> >
>> > 1. the user experience is linked to what I sent some days ago: close
>> > handling of the streams from a coder code. Long story short, I add a
>> > SkipCloseCoder which can decorate a coder and just wraps the stream
>> > (input or output) in flavors skipping close() calls. This avoids doing
>> > it by default (which had my preference if you read the related thread,
>> > but not the one of everybody) but also makes the usage of a coder with
>> > this issue easy, since the of() of the coder just wraps itself in this
>> > delegating coder.
>> >
>> > 2. this one is nastier and mainly concerns IterableLikeCoders. These
>> > ones use this kind of algorithm (keep in mind they work on a list):
>> >
>> > writeSize()
>> > for all element e {
>> > elementCoder.write(e)
>> > }
>> > writeMagicNumber() // this one depends on the size
>> >
>> > The decoding is symmetric so I skip it here.
>> >
>> > Indeed all these writes (reads) are done on the same stream. Therefore
>> > it assumes you read as many bytes as you write... which is a huge
>> > assumption for a coder which should by contract assume it can read the
>> > stream... as a stream (until -1).
>> >
>> > The idea of the fix is to change this encoding to this kind of
>> > algorithm:
>> >
>> > writeSize()
>> > for all element e {
>> > writeElementByteCount(e)
>> > elementCoder.write(e)
>> > }
>> > writeMagicNumber() // still optional
>>
>> Regardless of the backwards incompatibility issues, I'm unconvinced
>> that prefixing every element with its length is a good idea. It can
>> lead to blow-up in size (e.g. a list of ints, and it should be noted
>> that containers with lots of elements bias towards having small
>> elements) and also writeElementByteCount(e) could be very inefficient
>> for many types e (e.g. a list of lists).
>>
>>
>> What is your proposal then, Robert? The current restriction is clearly
>> a blocker for portability, users, and determinism, and it is unsafe and
>> only checkable at runtime, so not something we should seek to keep.
>>
>> The alternative I thought about was to forbid implicit coders, but it
>> doesn't help users.
>>
>>
>>
>> > This way, on the decode side, you can wrap the stream by element to
>> > enforce the limitation of the byte count.
>> >
>> > Side note: this indeed enforces a limitation due to Java's byte[]
>> > size limit, but if you check the coder code it is already there at
>> > the higher level, so it is not a big deal for now.
>> >
>> > In terms of implementation it uses a LengthAwareCoder which delegates
>> > the encoding to another coder and just adds the byte count before the
>> > actual serialization. Not perfect, but it should be more than enough
>> > in terms of support and perf for Beam if you think of real pipelines
>> > (we try to avoid serializations, or they are done at some well-known
>> > points where this algorithm should be enough... worst case it is not
>> > a huge overhead, mainly just some memory overhead).
>> >
>> >
>> > The PR is available at https://github.com/apache/beam/pull/4594. If
>> > you check it, you will see I put it "WIP". The main reason is that it
>> > changes the encoding format for containers (lists, iterables, ...)
>> > and therefore breaks python/go/... tests and the standard_coders.yml
>> > definition. Some help on that would be very welcome.
>> >
>> > Technical side note, if you wonder: UnownedInputStream doesn't even
>> > allow marking the

Re: coder evolutions?

2018-02-05 Thread Romain Manni-Bucau
Would this work for everyone? I can update the PR if so:

If coder is not built in
Prefix with byte size
Else
Current behavior

?
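
A sketch of that conditional wrapping; the built-in check and the wrapper
class below are assumptions standing in for whatever the PR actually does:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.coders.CustomCoder;

final class MaybePrefixed {
  static <T> Coder<T> wrap(Coder<T> coder) {
    // Built-in coders keep the current behavior; only user coders pay.
    return isBuiltIn(coder) ? coder : new LengthAware<>(coder);
  }

  private static boolean isBuiltIn(Coder<?> coder) {
    // Assumed heuristic: anything shipped under the SDK's coders package.
    return coder.getClass().getName().startsWith("org.apache.beam.sdk.coders.");
  }

  /** Minimal stand-in for the PR's LengthAwareCoder. */
  static final class LengthAware<T> extends CustomCoder<T> {
    private final Coder<T> delegate;

    LengthAware(Coder<T> delegate) {
      this.delegate = delegate;
    }

    @Override
    public void encode(T value, OutputStream out) throws IOException {
      ByteArrayOutputStream buffer = new ByteArrayOutputStream();
      delegate.encode(value, buffer);
      new DataOutputStream(out).writeInt(buffer.size()); // byte-count prefix
      buffer.writeTo(out);
    }

    @Override
    public T decode(InputStream in) throws IOException {
      DataInputStream data = new DataInputStream(in);
      byte[] bytes = new byte[data.readInt()];
      data.readFully(bytes);
      return delegate.decode(new ByteArrayInputStream(bytes));
    }
  }

  private MaybePrefixed() {}
}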

On 5 Feb 2018 at 19:21, "Romain Manni-Bucau" wrote:

> Answered inline, but I want to highlight that Beam is a portable API on
> top of well-known vendor APIs, which have friendly shortcuts. So the
> background here is to make Beam at least user friendly.
>
> I'm fine if the outcome of the discussion is that the coder concept is
> wrong or something like that, but I'm not fine with saying we don't
> want to solve an API issue, not to say bug, of a project whose API is
> its added value.
>
> I understand the perf concern, which must be balanced with the fact
> that deserialization is not used for each step/transform and that
> currently the coder API is already intrusive and heavy for devs, but
> also not usable by most existing codecs out there. Even some JAXB or
> plain XML flavors don't work with it :(.
>
> On 5 Feb 2018 at 18:46, "Robert Bradshaw" wrote:
>
> On Sun, Feb 4, 2018 at 6:44 AM, Romain Manni-Bucau
>  wrote:
> > Hi guys,
> >
> > I submitted a PR on coders to enhance 1. the user experience 2. the
> > determinism and handling of coders.
> >
> > 1. the user experience is linked to what I sent some days ago: close
> > handling of the streams from a coder code. Long story short, I add a
> > SkipCloseCoder which can decorate a coder and just wraps the stream
> > (input or output) in flavors skipping close() calls. This avoids doing
> > it by default (which had my preference if you read the related thread,
> > but not the one of everybody) but also makes the usage of a coder with
> > this issue easy, since the of() of the coder just wraps itself in this
> > delegating coder.
> >
> > 2. this one is nastier and mainly concerns IterableLikeCoders. These
> > ones use this kind of algorithm (keep in mind they work on a list):
> >
> > writeSize()
> > for all element e {
> > elementCoder.write(e)
> > }
> > writeMagicNumber() // this one depends on the size
> >
> > The decoding is symmetric so I skip it here.
> >
> > Indeed all these writes (reads) are done on the same stream. Therefore
> > it assumes you read as many bytes as you write... which is a huge
> > assumption for a coder which should by contract assume it can read the
> > stream... as a stream (until -1).
> >
> > The idea of the fix is to change this encoding to this kind of
> > algorithm:
> >
> > writeSize()
> > for all element e {
> > writeElementByteCount(e)
> > elementCoder.write(e)
> > }
> > writeMagicNumber() // still optional
>
> Regardless of the backwards incompatibility issues, I'm unconvinced
> that prefixing every element with its length is a good idea. It can
> lead to blow-up in size (e.g. a list of ints, and it should be noted
> that containers with lots of elements bias towards having small
> elements) and also writeElementByteCount(e) could be very inefficient
> for many types e (e.g. a list of lists).
>
>
> What is your proposal then, Robert? The current restriction is clearly
> a blocker for portability, users, and determinism, and it is unsafe and
> only checkable at runtime, so not something we should seek to keep.
>
> The alternative I thought about was to forbid implicit coders, but it
> doesn't help users.
>
>
>
> > This way, on the decode side, you can wrap the stream by element to
> > enforce the limitation of the byte count.
> >
> > Side note: this indeed enforces a limitation due to Java's byte[] size
> > limit, but if you check the coder code it is already there at the
> > higher level, so it is not a big deal for now.
> >
> > In terms of implementation it uses a LengthAwareCoder which delegates
> > the encoding to another coder and just adds the byte count before the
> > actual serialization. Not perfect, but it should be more than enough
> > in terms of support and perf for Beam if you think of real pipelines
> > (we try to avoid serializations, or they are done at some well-known
> > points where this algorithm should be enough... worst case it is not a
> > huge overhead, mainly just some memory overhead).
> >
> >
> > The PR is available at https://github.com/apache/beam/pull/4594. If
> > you check it, you will see I put it "WIP". The main reason is that it
> > changes the encoding format for containers (lists, iterables, ...) and
> > therefore breaks python/go/... tests and the standard_coders.yml
> > definition. Some help on that would be very welcome.
> >
> > Technical side note, if you wonder: UnownedInputStream doesn't even
> > allow marking the stream, so there is no real fast way to read the
> > stream as fast as possible with standard buffering strategies and to
> > support this automatic IterableCoder wrapping, which is implicit. In
> > other words, if Beam wants to support any coder, including those that
> > don't require writing the size of the output - most of the codecs -
> > then we need to change the way it works to something like that, which
> > does it for the

Re: coder evolutions?

2018-02-05 Thread Romain Manni-Bucau
Answered inline, but I want to highlight that Beam is a portable API on
top of well-known vendor APIs, which have friendly shortcuts. So the
background here is to make Beam at least user friendly.

I'm fine if the outcome of the discussion is that the coder concept is
wrong or something like that, but I'm not fine with saying we don't want
to solve an API issue, not to say bug, of a project whose API is its
added value.

I understand the perf concern, which must be balanced with the fact that
deserialization is not used for each step/transform and that currently
the coder API is already intrusive and heavy for devs, but also not
usable by most existing codecs out there. Even some JAXB or plain XML
flavors don't work with it :(.

On 5 Feb 2018 at 18:46, "Robert Bradshaw" wrote:

On Sun, Feb 4, 2018 at 6:44 AM, Romain Manni-Bucau
 wrote:
> Hi guys,
>
> I submitted a PR on coders to enhance 1. the user experience 2. the
> determinism and handling of coders.
>
> 1. the user experience is linked to what I sent some days ago: close
> handling of the streams from a coder code. Long story short, I add a
> SkipCloseCoder which can decorate a coder and just wraps the stream
> (input or output) in flavors skipping close() calls. This avoids doing
> it by default (which had my preference if you read the related thread,
> but not the one of everybody) but also makes the usage of a coder with
> this issue easy, since the of() of the coder just wraps itself in this
> delegating coder.
>
> 2. this one is nastier and mainly concerns IterableLikeCoders. These
> ones use this kind of algorithm (keep in mind they work on a list):
>
> writeSize()
> for all element e {
> elementCoder.write(e)
> }
> writeMagicNumber() // this one depends on the size
>
> The decoding is symmetric so I skip it here.
>
> Indeed all these writes (reads) are done on the same stream. Therefore
> it assumes you read as many bytes as you write... which is a huge
> assumption for a coder which should by contract assume it can read the
> stream... as a stream (until -1).
>
> The idea of the fix is to change this encoding to this kind of
> algorithm:
>
> writeSize()
> for all element e {
> writeElementByteCount(e)
> elementCoder.write(e)
> }
> writeMagicNumber() // still optional

Regardless of the backwards incompatibility issues, I'm unconvinced
that prefixing every element with its length is a good idea. It can
lead to blow-up in size (e.g. a list of ints, and it should be noted
that containers with lots of elements bias towards having small
elements) and also writeElementByteCount(e) could be very inefficient
for many types e (e.g. a list of lists).


What is your proposal then, Robert? The current restriction is clearly a
blocker for portability, users, and determinism, and it is unsafe and
only checkable at runtime, so not something we should seek to keep.

The alternative I thought about was to forbid implicit coders, but it
doesn't help users.



> This way, on the decode side, you can wrap the stream by element to
> enforce the limitation of the byte count.
>
> Side note: this indeed enforces a limitation due to Java's byte[] size
> limit, but if you check the coder code it is already there at the
> higher level, so it is not a big deal for now.
>
> In terms of implementation it uses a LengthAwareCoder which delegates
> the encoding to another coder and just adds the byte count before the
> actual serialization. Not perfect, but it should be more than enough in
> terms of support and perf for Beam if you think of real pipelines (we
> try to avoid serializations, or they are done at some well-known points
> where this algorithm should be enough... worst case it is not a huge
> overhead, mainly just some memory overhead).
>
>
> The PR is available at https://github.com/apache/beam/pull/4594. If you
> check it, you will see I put it "WIP". The main reason is that it
> changes the encoding format for containers (lists, iterables, ...) and
> therefore breaks python/go/... tests and the standard_coders.yml
> definition. Some help on that would be very welcome.
>
> Technical side note, if you wonder: UnownedInputStream doesn't even
> allow marking the stream, so there is no real fast way to read the
> stream as fast as possible with standard buffering strategies and to
> support this automatic IterableCoder wrapping, which is implicit. In
> other words, if Beam wants to support any coder, including those that
> don't require writing the size of the output - most of the codecs -
> then we need to change the way it works to something like that, which
> does it for the user, who doesn't know their coder got wrapped.
>
> Hope it makes sense, if not, don't hesitate to ask questions.
>
> Happy end of week-end.
>
> Romain Manni-Bucau
> @rmannibucau |  Blog | Old Blog | Github | LinkedIn | Book


Re: coder evolutions?

2018-02-05 Thread Robert Bradshaw
On Sun, Feb 4, 2018 at 6:44 AM, Romain Manni-Bucau
 wrote:
> Hi guys,
>
> I submitted a PR on coders to enhance 1. the user experience 2. the
> determinism and handling of coders.
>
> 1. the user experience is linked to what I sent some days ago: close
> handling of the streams from a coder code. Long story short, I add a
> SkipCloseCoder which can decorate a coder and just wraps the stream
> (input or output) in flavors skipping close() calls. This avoids doing
> it by default (which had my preference if you read the related thread,
> but not the one of everybody) but also makes the usage of a coder with
> this issue easy, since the of() of the coder just wraps itself in this
> delegating coder.
>
> 2. this one is nastier and mainly concerns IterableLikeCoders. These
> ones use this kind of algorithm (keep in mind they work on a list):
>
> writeSize()
> for all element e {
> elementCoder.write(e)
> }
> writeMagicNumber() // this one depends on the size
>
> The decoding is symmetric so I skip it here.
>
> Indeed all these writes (reads) are done on the same stream. Therefore
> it assumes you read as many bytes as you write... which is a huge
> assumption for a coder which should by contract assume it can read the
> stream... as a stream (until -1).
>
> The idea of the fix is to change this encoding to this kind of
> algorithm:
>
> writeSize()
> for all element e {
> writeElementByteCount(e)
> elementCoder.write(e)
> }
> writeMagicNumber() // still optional

Regardless of the backwards incompatibility issues, I'm unconvinced
that prefixing every element with its length is a good idea. It can
lead to blow-up in size (e.g. a list of ints, and it should be noted
that containers with lots of elements bias towards having small
elements) and also writeElementByteCount(e) could be very inefficient
for many types e (e.g. a list of lists).
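
A back-of-envelope illustration of the blow-up, assuming the best case of
a single-byte count per element:

public class PrefixOverheadDemo {
  public static void main(String[] args) {
    int elements = 1_000_000;
    int bytesPerElement = 1; // e.g. small ints varint-encode to one byte
    long plain = (long) elements * bytesPerElement;
    long prefixed = (long) elements * (bytesPerElement + 1); // +1 byte prefix
    System.out.printf("plain=%d prefixed=%d (+%.0f%%)%n",
        plain, prefixed, 100.0 * (prefixed - plain) / plain);
    // plain=1000000 prefixed=2000000 (+100%): the stream doubles in size
  }
}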

> This way, on the decode side, you can wrap the stream by element to
> enforce the limitation of the byte count.
>
> Side note: this indeed enforces a limitation due to Java's byte[] size
> limit, but if you check the coder code it is already there at the
> higher level, so it is not a big deal for now.
>
> In terms of implementation it uses a LengthAwareCoder which delegates
> the encoding to another coder and just adds the byte count before the
> actual serialization. Not perfect, but it should be more than enough in
> terms of support and perf for Beam if you think of real pipelines (we
> try to avoid serializations, or they are done at some well-known points
> where this algorithm should be enough... worst case it is not a huge
> overhead, mainly just some memory overhead).
>
>
> The PR is available at https://github.com/apache/beam/pull/4594. If you
> check it, you will see I put it "WIP". The main reason is that it
> changes the encoding format for containers (lists, iterables, ...) and
> therefore breaks python/go/... tests and the standard_coders.yml
> definition. Some help on that would be very welcome.
>
> Technical side note, if you wonder: UnownedInputStream doesn't even
> allow marking the stream, so there is no real fast way to read the
> stream as fast as possible with standard buffering strategies and to
> support this automatic IterableCoder wrapping, which is implicit. In
> other words, if Beam wants to support any coder, including those that
> don't require writing the size of the output - most of the codecs -
> then we need to change the way it works to something like that, which
> does it for the user, who doesn't know their coder got wrapped.
>
> Hope it makes sense, if not, don't hesitate to ask questions.
>
> Happy end of week-end.
>
> Romain Manni-Bucau
> @rmannibucau |  Blog | Old Blog | Github | LinkedIn | Book


Re: coder evolutions?

2018-02-05 Thread Lukasz Cwik
I do agree that being able to upgrade the encoding for coders between
pipelines is important and thanks for creating BEAM-3616.

Mark/reset for a coder can only be supported by either the root coder or
every leaf coder in a coder tree, unless you wrap each layer with a
byte-copying stream. If you don't wrap the stream, it is likely that there
will be an obscure bug and we will read data multiple times. Imagine we
have a List<String> coder where the List coder marks the stream and the
String coder marks the stream; since both mark the stream, the List
coder's mark will be lost. Byte copying to support mark is a non-trivial
cost.
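
A small java.io demonstration of the clobbered mark:

import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class MarkClobberDemo {
  public static void main(String[] args) throws IOException {
    InputStream in = new BufferedInputStream(
        new ByteArrayInputStream(new byte[] {1, 2, 3, 4, 5}));

    in.mark(16); // the "List coder" marks before reading an element
    in.read();   // consumes 1
    in.mark(16); // the "String coder" marks too: the outer mark is gone
    in.read();   // consumes 2
    in.reset();  // resets to the inner mark only
    System.out.println(in.read()); // prints 2, not 1: outer position lost
  }
}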

It is unknown whether length prefixing every element in a coder structure
has significant performance implications. The Python SDK does this for many
of its types beyond a handful it treats specially. A couple of degenerate
cases with a lot of encoding overhead:
Iterable<Short>: any small value like a short requires length prefixing by
one byte, which leads to a 1/x increase in the size of the stream for
x-byte elements. Worst case, iterables of byte values double in encoding
size.
KV<KV<KV<A, B>, C>, D>: nested structures need to length-prefix their
inner contents at each level.
The 2.0 Java SDK had this concept of nested and outer contexts that would
get passed into the coder to say whether the coder owned only a part of the
stream or owned the remainder of the stream. In the 2.1 or 2.2 release,
this concept was deprecated and the Java SDK moved closer to the Python
SDK. So if you see the context stuff being used in the Java SDK or access
to the stream in the Python SDK, it's likely handling some degenerate case.
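
A sketch of that old nested/outer idea, using plain java.io rather than
the real 2.0 signatures (the class and method shapes here are purely
illustrative):

import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

final class ContextStyleByteArrayCoder {
  enum Context { OUTER, NESTED }

  static void encode(byte[] value, OutputStream out, Context context) throws IOException {
    if (context == Context.OUTER) {
      out.write(value); // we own the rest of the stream: no framing needed
    } else {
      DataOutputStream data = new DataOutputStream(out);
      data.writeInt(value.length); // only pay for a prefix when nested
      data.write(value);
    }
  }

  static byte[] decode(InputStream in, Context context) throws IOException {
    if (context == Context.OUTER) {
      // Outer context: simply read until EOF (-1), the stream is all ours.
      ByteArrayOutputStream buffer = new ByteArrayOutputStream();
      byte[] chunk = new byte[4096];
      int n;
      while ((n = in.read(chunk)) != -1) {
        buffer.write(chunk, 0, n);
      }
      return buffer.toByteArray();
    }
    DataInputStream data = new DataInputStream(in);
    byte[] bytes = new byte[data.readInt()];
    data.readFully(bytes);
    return bytes;
  }

  private ContextStyleByteArrayCoder() {}
}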

On Mon, Feb 5, 2018 at 1:09 AM, Romain Manni-Bucau 
wrote:

> Thanks, created https://issues.apache.org/jira/browse/BEAM-3616
>
>
> Romain Manni-Bucau
> @rmannibucau | Blog | Old Blog | Github | LinkedIn | Book
>
> 2018-02-04 22:12 GMT+01:00 Jean-Baptiste Onofré :
>
>> Done
>>
>> Regards
>> JB
>>
>> On 02/04/2018 09:14 PM, Romain Manni-Bucau wrote:
>> > Works for me. So a jira with target version = 3.
>> >
>> > Can someone with the karma check we have a 3.0.0 in jira system please?
>> >
>> > On 4 Feb 2018 at 20:46, "Reuven Lax" wrote:
>> >
>> > Seems fine to me. At some point we might want to do an audit of
>> existing
>> > Jira issues, because I suspect there are issues that should be
>> targeted to
>> > 3.0 but are not yet tagged.
>> >
>> > On Sun, Feb 4, 2018 at 11:41 AM, Jean-Baptiste Onofré <
>> j...@nanthrax.net
>> > > wrote:
>> >
>> > I would prefer to use Jira, with "wish"/"ideas", and adding
>> Beam 3.0.0
>> > version.
>> >
>> > WDYT ?
>> >
>> > Regards
>> > JB
>> >
>> > On 02/04/2018 07:55 PM, Reuven Lax wrote:
>> > > Do we have a good place to track the items for Beam 3.0, or
>> is Jira the best
>> > > place? Romain has a good point - if this gets forgotten when
>> we do Beam 3.0,
>> > > then we're stuck waiting around till Beam 4.0.
>> > >
>> > > Reuven
>> > >
>> > > On Sun, Feb 4, 2018 at 9:27 AM, Jean-Baptiste Onofré <
>> j...@nanthrax.net 
>> > > >> wrote:
>> > >
>> > > That's a good point. In the roadmap for Beam 3, I think
>> it makes
>> > sense to add a
>> > > point about this.
>> > >
>> > > Regards
>> > > JB
>> > >
>> > > On 02/04/2018 06:18 PM, Eugene Kirpichov wrote:
>> > > > I think doing a change that would break pipeline update
>> for
>> > every single user of
>> > > > Flink and Dataflow needs to be postponed until a next
>> major
>> > version. Pipeline
>> > > > update is a very frequently used feature, especially by
>> the
>> > largest users. We've
>> > > > had those users get significantly upset even when we
>> > accidentally broke update
>> > > > compatibility for some special cases of individual
>> transforms;
>> > breaking it
>> > > > intentionally and project-wide is too extreme to be
>> justified by
>> > the benefits of
>> > > > the current change.
>> > > >
>> > > > That said, I think concerns about coder APIs are
>> reasonable, and
>> > it is
>> > > > unfortunate that we effectively can't make changes to
>> them right
>> > now. It would
>> > > > be great if in the next major version we were better

Re: [CANCEL][VOTE] Release 2.3.0, release candidate #1

2018-02-05 Thread Jean-Baptiste Onofré
Created:

https://issues.apache.org/jira/browse/BEAM-3617

Regards
JB

On 02/05/2018 04:42 PM, Kenneth Knowles wrote:
> What is the Jira for direct runner perf?
> 
> On Mon, Feb 5, 2018 at 4:35 AM, Jean-Baptiste Onofré  > wrote:
> 
> Thanks !
> 
> I cherry-pick on release-2.3.0 branch.
> 
> I'm on the direct runner perf test in the mean time.
> 
> Thanks again !
> 
> Regards
> JB
> 
> On 02/05/2018 12:06 PM, Aljoscha Krettek wrote:
> > I merged fixes for:
> >  - https://issues.apache.org/jira/browse/BEAM-3186
> 
> >  - https://issues.apache.org/jira/browse/BEAM-3589
> 
> >
> > @JB I didn't yet merge them on the 2.3.0 branch, though, but I can or 
> you can go
> > ahead.
> >
> >> On 5. Feb 2018, at 06:30, Jean-Baptiste Onofré  
> >> >> wrote:
> >>
> >> Hi guys,
> >>
> >> Quick update on the RC2 preparation:
> >>
> >> * BEAM-3587 (TextIO with Flink) seems related to a custom build with
> Gradle (not
> >> using artifacts created by Maven). Anyway, I will take a look today.
> >> * BEAM-3186 has a PR. Aljoscha will do the review pretty soon.
> >> * I'm also taking a look on the direct runner performance change.
> >>
> >> The goal is to cut RC2 tomorrow or the day after.
> >>
> >> Regards
> >> JB
> >>
> >> On 02/01/2018 04:16 PM, Jean-Baptiste Onofré wrote:
> >>> Hi guys,
> >>>
> >>> Especially due to BEAM-3587 & BEAM-3186 regressions, I cancel RC1.
> >>>
> >>> We will cherry-pick fixes on release-2.3.0 branch.
> >>>
> >>> I'm updating Jira right now. When the fixes will be cherry-picked, I
> will submit
> >>> a RC2 to vote.
> >>>
> >>> Thanks !
> >>> Regards
> >>> JB
> >>>
> >>> On 01/30/2018 09:04 AM, Jean-Baptiste Onofré wrote:
>  Hi everyone,
> 
>  Please review and vote on the release candidate #1 for the version
> 2.3.0, as
>  follows:
> 
>  [ ] +1, Approve the release
>  [ ] -1, Do not approve the release (please provide specific comments)
> 
> 
>  The complete staging area is available for your review, which 
> includes:
>  * JIRA release notes [1],
>  * the official Apache source release to be deployed to 
> dist.apache.org
> 
>   [2],
>  which is signed with the key with fingerprint C8282E76 [3],
>  * all artifacts to be deployed to the Maven Central Repository [4],
>  * source code tag "v2.3.0-RC1" [5],
>  * website pull request listing the release and publishing the API 
> reference
>  manual [6].
>  * Java artifacts were built with Maven 3.3.9 and Oracle JDK 
> 1.8.0_111.
>  * Python artifacts are deployed along with the source release to the
>  dist.apache.org   
> [2].
> 
>  The vote will be open for at least 72 hours. It is adopted by 
> majority approval,
>  with at least 3 PMC affirmative votes.
> 
>  Thanks,
>  JB
> 
>  [1]
>  
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527=12341608
> 
> 
>  [2] https://dist.apache.org/repos/dist/dev/beam/2.3.0/
> 
>  [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> 
>  [4] 
> https://repository.apache.org/content/repositories/orgapachebeam-1026/
> 
>  [5] https://github.com/apache/beam/tree/v2.3.0-RC1
> 
>  [6] https://github.com/apache/beam-site/pull/381
> 
> 
> >>>
> >>
> >> --
> >> Jean-Baptiste Onofré
> >> jbono...@apache.org 
> >
> >> http://blog.nanthrax.net
> >> Talend - http://www.talend.com
> >
> 
> --
> Jean-Baptiste Onofré
> jbono...@apache.org 
> http://blog.nanthrax.net
> Talend - http://www.talend.com
> 
> 

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [CANCEL][VOTE] Release 2.3.0, release candidate #1

2018-02-05 Thread Jean-Baptiste Onofré
Hi Kenn,

my bad, I didn't create one yet (I was busy on the TextIO with Flink runner,
now identified \o/ ;)).

Let me create it right now.

Thanks !
Regards
JB

On 02/05/2018 04:42 PM, Kenneth Knowles wrote:
> What is the Jira for direct runner perf?
> 
> On Mon, Feb 5, 2018 at 4:35 AM, Jean-Baptiste Onofré  > wrote:
> 
> Thanks !
> 
> I cherry-pick on release-2.3.0 branch.
> 
> I'm on the direct runner perf test in the mean time.
> 
> Thanks again !
> 
> Regards
> JB
> 
> On 02/05/2018 12:06 PM, Aljoscha Krettek wrote:
> > I merged fixes for:
> >  - https://issues.apache.org/jira/browse/BEAM-3186
> 
> >  - https://issues.apache.org/jira/browse/BEAM-3589
> 
> >
> > @JB I didn't yet merge them on the 2.3.0 branch, though, but I can or 
> you can go
> > ahead.
> >
> >> On 5. Feb 2018, at 06:30, Jean-Baptiste Onofré  
> >> >> wrote:
> >>
> >> Hi guys,
> >>
> >> Quick update on the RC2 preparation:
> >>
> >> * BEAM-3587 (TextIO with Flink) seems related to a custom build with
> Gradle (not
> >> using artifacts created by Maven). Anyway, I will take a look today.
> >> * BEAM-3186 has a PR. Aljoscha will do the review pretty soon.
> >> * I'm also taking a look on the direct runner performance change.
> >>
> >> The goal is to cut RC2 tomorrow or the day after.
> >>
> >> Regards
> >> JB
> >>
> >> On 02/01/2018 04:16 PM, Jean-Baptiste Onofré wrote:
> >>> Hi guys,
> >>>
> >>> Especially due to BEAM-3587 & BEAM-3186 regressions, I cancel RC1.
> >>>
> >>> We will cherry-pick fixes on release-2.3.0 branch.
> >>>
> >>> I'm updating Jira right now. When the fixes will be cherry-picked, I
> will submit
> >>> a RC2 to vote.
> >>>
> >>> Thanks !
> >>> Regards
> >>> JB
> >>>
> >>> On 01/30/2018 09:04 AM, Jean-Baptiste Onofré wrote:
>  Hi everyone,
> 
>  Please review and vote on the release candidate #1 for the version
> 2.3.0, as
>  follows:
> 
>  [ ] +1, Approve the release
>  [ ] -1, Do not approve the release (please provide specific comments)
> 
> 
>  The complete staging area is available for your review, which 
> includes:
>  * JIRA release notes [1],
>  * the official Apache source release to be deployed to 
> dist.apache.org
> 
>   [2],
>  which is signed with the key with fingerprint C8282E76 [3],
>  * all artifacts to be deployed to the Maven Central Repository [4],
>  * source code tag "v2.3.0-RC1" [5],
>  * website pull request listing the release and publishing the API 
> reference
>  manual [6].
>  * Java artifacts were built with Maven 3.3.9 and Oracle JDK 
> 1.8.0_111.
>  * Python artifacts are deployed along with the source release to the
>  dist.apache.org   
> [2].
> 
>  The vote will be open for at least 72 hours. It is adopted by 
> majority approval,
>  with at least 3 PMC affirmative votes.
> 
>  Thanks,
>  JB
> 
>  [1]
>  
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527=12341608
> 
> 
>  [2] https://dist.apache.org/repos/dist/dev/beam/2.3.0/
> 
>  [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> 
>  [4] 
> https://repository.apache.org/content/repositories/orgapachebeam-1026/
> 
>  [5] https://github.com/apache/beam/tree/v2.3.0-RC1
> 
>  [6] https://github.com/apache/beam-site/pull/381
> 
> 
> >>>
> >>
> >> --
> >> Jean-Baptiste Onofré
> >> jbono...@apache.org 
> >
> >> http://blog.nanthrax.net
> >> Talend - http://www.talend.com
> >
> 
> --
> Jean-Baptiste Onofré
> jbono...@apache.org 
> http://blog.nanthrax.net
> Talend - http://www.talend.com
> 
> 

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [CANCEL][VOTE] Release 2.3.0, release candidate #1

2018-02-05 Thread Kenneth Knowles
What is the Jira for direct runner perf?

On Mon, Feb 5, 2018 at 4:35 AM, Jean-Baptiste Onofré 
wrote:

> Thanks !
>
> I cherry-pick on release-2.3.0 branch.
>
> I'm on the direct runner perf test in the mean time.
>
> Thanks again !
>
> Regards
> JB
>
> On 02/05/2018 12:06 PM, Aljoscha Krettek wrote:
> > I merged fixes for:
> >  - https://issues.apache.org/jira/browse/BEAM-3186
> >  - https://issues.apache.org/jira/browse/BEAM-3589
> >
> > @JB I didn't yet merge them on the 2.3.0 branch, though, but I can or
> you can go
> > ahead.
> >
> >> On 5. Feb 2018, at 06:30, Jean-Baptiste Onofré  >> > wrote:
> >>
> >> Hi guys,
> >>
> >> Quick update on the RC2 preparation:
> >>
> >> * BEAM-3587 (TextIO with Flink) seems related to a custom build with
> Gradle (not
> >> using artifacts created by Maven). Anyway, I will take a look today.
> >> * BEAM-3186 has a PR. Aljoscha will do the review pretty soon.
> >> * I'm also taking a look on the direct runner performance change.
> >>
> >> The goal is to cut RC2 tomorrow or the day after.
> >>
> >> Regards
> >> JB
> >>
> >> On 02/01/2018 04:16 PM, Jean-Baptiste Onofré wrote:
> >>> Hi guys,
> >>>
> >>> Especially due to BEAM-3587 & BEAM-3186 regressions, I cancel RC1.
> >>>
> >>> We will cherry-pick fixes on release-2.3.0 branch.
> >>>
> >>> I'm updating Jira right now. When the fixes will be cherry-picked, I
> will submit
> >>> a RC2 to vote.
> >>>
> >>> Thanks !
> >>> Regards
> >>> JB
> >>>
> >>> On 01/30/2018 09:04 AM, Jean-Baptiste Onofré wrote:
>  Hi everyone,
> 
>  Please review and vote on the release candidate #1 for the version
> 2.3.0, as
>  follows:
> 
>  [ ] +1, Approve the release
>  [ ] -1, Do not approve the release (please provide specific comments)
> 
> 
>  The complete staging area is available for your review, which
> includes:
>  * JIRA release notes [1],
>  * the official Apache source release to be deployed to
> dist.apache.org
>   [2],
>  which is signed with the key with fingerprint C8282E76 [3],
>  * all artifacts to be deployed to the Maven Central Repository [4],
>  * source code tag "v2.3.0-RC1" [5],
>  * website pull request listing the release and publishing the API
> reference
>  manual [6].
>  * Java artifacts were built with Maven 3.3.9 and Oracle JDK 1.8.0_111.
>  * Python artifacts are deployed along with the source release to the
>  dist.apache.org  [2].
> 
>  The vote will be open for at least 72 hours. It is adopted by
> majority approval,
>  with at least 3 PMC affirmative votes.
> 
>  Thanks,
>  JB
> 
>  [1]
>  https://issues.apache.org/jira/secure/ReleaseNote.jspa?
> projectId=12319527=12341608
>  [2] https://dist.apache.org/repos/dist/dev/beam/2.3.0/
>  [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>  [4] https://repository.apache.org/content/repositories/
> orgapachebeam-1026/
>  [5] https://github.com/apache/beam/tree/v2.3.0-RC1
>  [6] https://github.com/apache/beam-site/pull/381
> 
> >>>
> >>
> >> --
> >> Jean-Baptiste Onofré
> >> jbono...@apache.org 
> >> http://blog.nanthrax.net
> >> Talend - http://www.talend.com
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: [CANCEL][VOTE] Release 2.3.0, release candidate #1

2018-02-05 Thread Jean-Baptiste Onofré
Thanks !

I cherry-pick on release-2.3.0 branch.

I'm on the direct runner perf test in the mean time.

Thanks again !

Regards
JB

On 02/05/2018 12:06 PM, Aljoscha Krettek wrote:
> I merged fixes for:
>  - https://issues.apache.org/jira/browse/BEAM-3186
>  - https://issues.apache.org/jira/browse/BEAM-3589
> 
> @JB I didn't yet merge them on the 2.3.0 branch, though, but I can or you can 
> go
> ahead.
> 
>> On 5. Feb 2018, at 06:30, Jean-Baptiste Onofré > > wrote:
>>
>> Hi guys,
>>
>> Quick update on the RC2 preparation:
>>
>> * BEAM-3587 (TextIO with Flink) seems related to a custom build with Gradle 
>> (not
>> using artifacts created by Maven). Anyway, I will take a look today.
>> * BEAM-3186 has a PR. Aljoscha will do the review pretty soon.
>> * I'm also taking a look on the direct runner performance change.
>>
>> The goal is to cut RC2 tomorrow or the day after.
>>
>> Regards
>> JB
>>
>> On 02/01/2018 04:16 PM, Jean-Baptiste Onofré wrote:
>>> Hi guys,
>>>
>>> Especially due to BEAM-3587 & BEAM-3186 regressions, I cancel RC1.
>>>
>>> We will cherry-pick fixes on release-2.3.0 branch.
>>>
>>> I'm updating Jira right now. When the fixes will be cherry-picked, I will 
>>> submit
>>> a RC2 to vote.
>>>
>>> Thanks !
>>> Regards
>>> JB
>>>
>>> On 01/30/2018 09:04 AM, Jean-Baptiste Onofré wrote:
 Hi everyone,

 Please review and vote on the release candidate #1 for the version 2.3.0, 
 as
 follows:

 [ ] +1, Approve the release
 [ ] -1, Do not approve the release (please provide specific comments)


 The complete staging area is available for your review, which includes:
 * JIRA release notes [1],
 * the official Apache source release to be deployed to dist.apache.org
  [2],
 which is signed with the key with fingerprint C8282E76 [3],
 * all artifacts to be deployed to the Maven Central Repository [4],
 * source code tag "v2.3.0-RC1" [5],
 * website pull request listing the release and publishing the API reference
 manual [6].
 * Java artifacts were built with Maven 3.3.9 and Oracle JDK 1.8.0_111.
 * Python artifacts are deployed along with the source release to the
 dist.apache.org  [2].

 The vote will be open for at least 72 hours. It is adopted by majority 
 approval,
 with at least 3 PMC affirmative votes.

 Thanks,
 JB

 [1]
 https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527=12341608
 [2] https://dist.apache.org/repos/dist/dev/beam/2.3.0/
 [3] https://dist.apache.org/repos/dist/release/beam/KEYS
 [4] https://repository.apache.org/content/repositories/orgapachebeam-1026/
 [5] https://github.com/apache/beam/tree/v2.3.0-RC1
 [6] https://github.com/apache/beam-site/pull/381

>>>
>>
>> -- 
>> Jean-Baptiste Onofré
>> jbono...@apache.org 
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
> 

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [CANCEL][VOTE] Release 2.3.0, release candidate #1

2018-02-05 Thread Aljoscha Krettek
I merged fixes for:
 - https://issues.apache.org/jira/browse/BEAM-3186 

 - https://issues.apache.org/jira/browse/BEAM-3589 


@JB I didn't yet merge them on the 2.3.0 branch, though, but I can or you can 
go ahead.

> On 5. Feb 2018, at 06:30, Jean-Baptiste Onofré  wrote:
> 
> Hi guys,
> 
> Quick update on the RC2 preparation:
> 
> * BEAM-3587 (TextIO with Flink) seems related to a custom build with Gradle 
> (not
> using artifacts created by Maven). Anyway, I will take a look today.
> * BEAM-3186 has a PR. Aljoscha will do the review pretty soon.
> * I'm also taking a look on the direct runner performance change.
> 
> The goal is to cut RC2 tomorrow or the day after.
> 
> Regards
> JB
> 
> On 02/01/2018 04:16 PM, Jean-Baptiste Onofré wrote:
>> Hi guys,
>> 
>> Especially due to BEAM-3587 & BEAM-3186 regressions, I cancel RC1.
>> 
>> We will cherry-pick fixes on release-2.3.0 branch.
>> 
>> I'm updating Jira right now. When the fixes will be cherry-picked, I will 
>> submit
>> a RC2 to vote.
>> 
>> Thanks !
>> Regards
>> JB
>> 
>> On 01/30/2018 09:04 AM, Jean-Baptiste Onofré wrote:
>>> Hi everyone,
>>> 
>>> Please review and vote on the release candidate #1 for the version 2.3.0, as
>>> follows:
>>> 
>>> [ ] +1, Approve the release
>>> [ ] -1, Do not approve the release (please provide specific comments)
>>> 
>>> 
>>> The complete staging area is available for your review, which includes:
>>> * JIRA release notes [1],
>>> * the official Apache source release to be deployed to dist.apache.org [2],
>>> which is signed with the key with fingerprint C8282E76 [3],
>>> * all artifacts to be deployed to the Maven Central Repository [4],
>>> * source code tag "v2.3.0-RC1" [5],
>>> * website pull request listing the release and publishing the API reference
>>> manual [6].
>>> * Java artifacts were built with Maven 3.3.9 and Oracle JDK 1.8.0_111.
>>> * Python artifacts are deployed along with the source release to the
>>> dist.apache.org [2].
>>> 
>>> The vote will be open for at least 72 hours. It is adopted by majority 
>>> approval,
>>> with at least 3 PMC affirmative votes.
>>> 
>>> Thanks,
>>> JB
>>> 
>>> [1]
>>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527=12341608
>>> [2] https://dist.apache.org/repos/dist/dev/beam/2.3.0/
>>> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>>> [4] https://repository.apache.org/content/repositories/orgapachebeam-1026/
>>> [5] https://github.com/apache/beam/tree/v2.3.0-RC1
>>> [6] https://github.com/apache/beam-site/pull/381
>>> 
>> 
> 
> -- 
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com



Build failed in Jenkins: beam_PostRelease_NightlySnapshot #17

2018-02-05 Thread Apache Jenkins Server
See 


Changes:

[klk] google-java-format

[klk] Fix empty window assignments in Nexmark

[klk] Fix empty window assignment in FlattenEvaluatorFactoryTest

[klk] Switch DataflowRunner to its own private ValueInEmptyWindows

[klk] Remove deprecated valueInEmptyWindows

[aljoscha.krettek] Remove erroneous cast in FlinkStreamingTransformTranslators

[aljoscha.krettek] [BEAM-3186] Correctly use deserialized timerService in Flink 
Runner

--
Started by timer
[EnvInject] - Loading node environment variables.
Building remotely on beam4 (beam) in workspace 

 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/apache/beam.git # timeout=10
Fetching upstream changes from https://github.com/apache/beam.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/apache/beam.git 
 > +refs/heads/*:refs/remotes/origin/* 
 > +refs/pull/${ghprbPullId}/*:refs/remotes/origin/pr/${ghprbPullId}/*
 > git rev-parse origin/master^{commit} # timeout=10
Checking out Revision 93818abf901dd996aa30445457153811b13ece43 (origin/master)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 93818abf901dd996aa30445457153811b13ece43
Commit message: "Merge pull request #4563: Fixes Flink timer service restore"
 > git rev-list 42ac62a067307173da061e7dbdbffd5e03672b96 # timeout=10
Cleaning workspace
 > git rev-parse --verify HEAD # timeout=10
Resetting working tree
 > git reset --hard # timeout=10
 > git clean -fdx # timeout=10
[EnvInject] - Executing scripts and injecting environment variables after the 
SCM step.
[EnvInject] - Injecting as environment variables the properties content 
SPARK_LOCAL_IP=127.0.0.1

[EnvInject] - Variables injected successfully.
[beam_PostRelease_NightlySnapshot] $ /bin/bash -xe 
/tmp/jenkins3032049510810763464.sh
+ cd src/release
+ groovy quickstart-java-direct.groovy
/tmp/jenkins3032049510810763464.sh: line 2: groovy: command not found
Build step 'Execute shell' marked build as failure
Not sending mail to unregistered user xuming...@users.noreply.github.com
Not sending mail to unregistered user pawel.pk.kaczmarc...@gmail.com
Not sending mail to unregistered user kirpic...@google.com
Not sending mail to unregistered user aljoscha.kret...@gmail.com
Not sending mail to unregistered user g...@telligent-data.com
Not sending mail to unregistered user eh...@google.com
Not sending mail to unregistered user dariusz.aniszew...@polidea.com
Not sending mail to unregistered user ekirpic...@gmail.com
Not sending mail to unregistered user ccla...@bluewin.ch
Not sending mail to unregistered user mari...@mariagh.svl.corp.google.com
Not sending mail to unregistered user aromanenko@gmail.com
Not sending mail to unregistered user mott...@gmail.com
Not sending mail to unregistered user w...@google.com
Not sending mail to unregistered user z...@giggles.nyc.corp.google.com
Not sending mail to unregistered user git...@alasdairhodge.co.uk
Not sending mail to unregistered user ke...@google.com
Not sending mail to unregistered user k...@google.com
Not sending mail to unregistered user joey.bar...@gmail.com


Re: coder evolutions?

2018-02-05 Thread Romain Manni-Bucau
Thanks, created https://issues.apache.org/jira/browse/BEAM-3616


Romain Manni-Bucau
@rmannibucau | Blog | Old Blog | Github | LinkedIn | Book


2018-02-04 22:12 GMT+01:00 Jean-Baptiste Onofré :

> Done
>
> Regards
> JB
>
> On 02/04/2018 09:14 PM, Romain Manni-Bucau wrote:
> > Works for me. So a jira with target version = 3.
> >
> > Can someone with the karma check that we have a 3.0.0 version in the Jira system, please?
> >
> > On Feb 4, 2018 at 20:46, "Reuven Lax" wrote:
> >
> > Seems fine to me. At some point we might want to do an audit of existing
> > Jira issues, because I suspect there are issues that should be targeted to
> > 3.0 but are not yet tagged.
> >
> > On Sun, Feb 4, 2018 at 11:41 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> >
> > I would prefer to use Jira, with "wish"/"ideas", and adding a Beam 3.0.0 version.
> >
> > WDYT ?
> >
> > Regards
> > JB
> >
> > On 02/04/2018 07:55 PM, Reuven Lax wrote:
> > > Do we have a good place to track the items for Beam 3.0, or is Jira the best
> > > place? Romain has a good point - if this gets forgotten when we do Beam 3.0,
> > > then we're stuck waiting around till Beam 4.0.
> > >
> > > Reuven
> > >
> > > On Sun, Feb 4, 2018 at 9:27 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> > >
> > > That's a good point. In the roadmap for Beam 3, I think it makes sense to
> > > add a point about this.
> > >
> > > Regards
> > > JB
> > >
> > > On 02/04/2018 06:18 PM, Eugene Kirpichov wrote:
> > > > I think doing a change that would break pipeline update for every
> > > > single user of Flink and Dataflow needs to be postponed until a next
> > > > major version. Pipeline update is a very frequently used feature,
> > > > especially by the largest users. We've had those users get significantly
> > > > upset even when we accidentally broke update compatibility for some
> > > > special cases of individual transforms; breaking it intentionally and
> > > > project-wide is too extreme to be justified by the benefits of the
> > > > current change.
> > > >
> > > > That said, I think concerns about coder APIs are reasonable, and it is
> > > > unfortunate that we effectively can't make changes to them right now. It
> > > > would be great if in the next major version we were better prepared for
> > > > evolution of coders, e.g. by having coders support a version marker or
> > > > something like that, with an API for detecting the version of data on
> > > > wire and reading or writing data of an old version. Such a change
> > > > (introducing versioning) would also, of course, be incompatible and
> > > > would need to be postponed until a major version - but, at least,
> > > > subsequent changes wouldn't.
> > > >
> > > > ...And as I was typing this email, seems that this is what the thread
> > > > already came to!
> > > >
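To make the version-marker idea above concrete, here is a minimal sketch in plain Java. This is not Beam's actual Coder API; the class name, the one-byte header, and the v1/v2 wire layouts are all assumptions invented for illustration. The coder writes a format-version byte before the payload and dispatches on that byte when decoding.

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical coder that prefixes every encoded value with a one-byte
// format version, so a newer coder can still read bytes written by an
// older one. A sketch of the idea discussed in this thread, not Beam API.
public final class VersionedStringCoder {

  private static final int CURRENT_VERSION = 2;

  public void encode(String value, OutputStream outStream) throws IOException {
    DataOutputStream out = new DataOutputStream(outStream);
    out.writeByte(CURRENT_VERSION);           // version marker goes first
    byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
    out.writeInt(bytes.length);               // v2 layout: int length prefix
    out.write(bytes);
    out.flush();
  }

  public String decode(InputStream inStream) throws IOException {
    DataInputStream in = new DataInputStream(inStream);
    int version = in.readUnsignedByte();      // dispatch on the wire version
    switch (version) {
      case 1:
        // Imagined old layout: unsigned-short length prefix.
        return readUtf8(in, in.readUnsignedShort());
      case 2:
        return readUtf8(in, in.readInt());
      default:
        throw new IOException("Unknown coder format version: " + version);
    }
  }

  private static String readUtf8(DataInputStream in, int length) throws IOException {
    byte[] buf = new byte[length];
    in.readFully(buf);                        // read exactly 'length' bytes
    return new String(buf, StandardCharsets.UTF_8);
  }
}

Under this scheme a coder author bumps CURRENT_VERSION whenever the representation changes and keeps the old decode branches around - which is essentially the convention Romain proposes below.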
> > > > On Sun, Feb 4, 2018 at 9:16 AM Romain Manni-Bucau wrote:
> > > >
> > > > I like this idea of migration support at the coder level. It would
> > > > require adding metadata to all outputs representing the version; coders
> > > > can then handle the logic properly depending on the version - we can
> > > > assume a coder dev upgrades the version when he breaks the
> > > > representation, I hope ;). With this: no runner impact at all :).
> > > >
> > > >
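Continuing the purely hypothetical sketch above, the "no runner impact" point can be seen in a small round-trip: the runner only ever moves opaque bytes, and only the coder inspects the version byte, so a v2 coder transparently reads bytes produced under the v1 layout.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Demo for the hypothetical VersionedStringCoder sketched earlier
// (assumed to be in the same package).
public final class VersionedCoderDemo {
  public static void main(String[] args) throws IOException {
    // Simulate bytes written by the old v1 coder: version byte, then
    // an unsigned-short length, then the UTF-8 payload.
    ByteArrayOutputStream legacy = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(legacy);
    byte[] payload = "hello".getBytes(StandardCharsets.UTF_8);
    out.writeByte(1);
    out.writeShort(payload.length);
    out.write(payload);

    // The current (v2) coder decodes the old format without any help
    // from the runner, which just hands over the raw bytes.
    VersionedStringCoder coder = new VersionedStringCoder();
    String decoded = coder.decode(new ByteArrayInputStream(legacy.toByteArray()));
    System.out.println(decoded); // prints "hello"
  }
}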
> > > > Romain Manni-Bucau
> > > > @rmannibucau <https://twitter.com/rmannibucau>
> > >