Re: Schema Aware PCollection stabilization

2020-02-05 Thread Alex Van Boxel
Oh, I thought it was. I've set anyone with a link can edit... as long as
there is no spam detected. Added you explicitly.

 _/
_/ Alex Van Boxel


On Thu, Feb 6, 2020 at 2:01 AM Brian Hulette  wrote:

> Can you open up the document for commenting or editing?
>
> On Wed, Feb 5, 2020 at 4:48 AM Alex Van Boxel  wrote:
>
>> I have the feeling people want to bring Schema Aware PCollection out of
>> experimental. I've started a document to follow up this track, please could
>> all interested parties add the open issues they have to the document. We
>> can discuss the open issues in the doc and resolve and/or create JIRA
>> tickets for the issues and visa-versa.
>>
>>
>> https://docs.google.com/document/d/1WseNjxFXYrpjWjbIxfWUpwHmjTL7WSfj9n2Nh3H_qqA/edit?usp=sharing
>> 
>>
>> I'll go though Jira later this day or tomorrow and complete the document
>> with those tickets as well.
>>
>> Thanks.
>>
>>  _/
>> _/ Alex Van Boxel
>>
>


Re: SQL PostCommit failure: ClassCastException: java.lang.Integer cannot be cast to java.lang.Long

2020-02-05 Thread Tomo Suzuki
(My understanding)
The test ensures the CSV data stored in GCS should be readable through
Datacatalog. It fails because an Integer value in the CSV was read as Long
as per Datacatalog.


> setting up from scratch is a good idea.

I agree. Furthermore, it would be nice if it can test different type-cast
behaviors. I’m still feeling it’s ok to read an Integer as Long. (If this
is the case, how about Long to Integer? What if the long is small enough to
fit in 32 bits? and so on)

On Wed, Feb 5, 2020 at 23:15 Kenneth Knowles  wrote:

> I think that was me... sorry!
>
> Is this a test where it is important that the data is pre-existing?
> Otherwise I would say that setting up from scratch is a good idea. Does
> anyone have context on it? I am happy to take on the small bit of coding,
> since I broke it.
>
> Kenn
>
> On Wed, Feb 5, 2020 at 1:22 PM Brian Hulette  wrote:
>
>> So it looks like the schema for `integ_test_small_csv_test_1` was updated
>> yesterday around the same time that PR#10563 went in, and it no longer
>> matches the schema we expect in the test.
>>
>> I'm just going to change it back for now. I am curious who changed it and
>> why, if the perpetrator is on this list please let us know :)
>>
>>
>> Note the updateTime:
>> ```
>> ❯ gcloud beta data-catalog entries lookup
>> '`datacatalog`.`entry`.`apache-beam-testing`.`us-central1`.`samples`.`integ_test_small_csv_test_1`'``
>> gcsFilesetSpec:
>>   filePatterns:
>>   - gs://apache-beam-samples/integration_test_small_csv/test.csv
>> linkedResource: //
>> datacatalog.googleapis.com/projects/apache-beam-testing/locations/us-central1/entryGroups/samples/entries/integ_test_small_csv_test_1
>> name:
>> projects/apache-beam-testing/locations/us-central1/entryGroups/samples/entries/integ_test_small_csv_test_1
>> schema:
>>   columns:
>>   - column: id
>> mode: NULLABLE
>> type: INT64
>>   - column: name
>> mode: NULLABLE
>> type: STRING
>>   - column: type
>> mode: NULLABLE
>> type: STRING
>> sourceSystemTimestamps:
>>   createTime: '2019-08-16T01:49:06.235Z'
>>   updateTime: '2020-02-04T17:18:17.671Z'
>> type: FILESET
>> ```
>>
> --
Regards,
Tomo


Re: Deterministic field ordering in derived schemas

2020-02-05 Thread Reuven Lax
Let's understand the use case first.

My concern was with making SchemaCoder compatible between different
invocations of a pipeline, and that's why I introduced encoding_position.
This allows the field id to change, but we can preserve the same
encoding_position. However this is internal to a pipeline.

If the worry is writing rows to a sink, how are the rows being written? I
would highly advise against using Beam's internal binary representation to
write rows external to a pipeline. That representation is meant to be an
internal detail of schemas, not a public binary format. Rows should be
converted to some public format before being written.

I wonder if a convenience method on Row - getValuesOrderedByName() - would
be sufficient for this use case?

Reuven

On Wed, Feb 5, 2020 at 8:49 PM Kenneth Knowles  wrote:

> Are we in danger of reinventing protobuf's practice of giving fields
> numbers? (this practice itself almost certainly used decades before
> protobufs creation). Could we just use the same practice?
>
> Schema fields already have integer IDs and "encoding_position" (see
> https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/schema.proto).
> Are these the same as proto field numbers? Do we need both? What is the
> expectation around how they interact? The proto needs
> comments/documentation!
>
> This does not directly address the question, but any solution related to
> how auto-generated schemas work should be specified in terms of the proto.
> For example, annotations to suggest one or both of these fields. Or,
> lacking that, sorting by name (giving up on "new fields come last"
> behavior. Or warning that the schema is unstable. Etc.
>
> Kenn
>
> On Wed, Feb 5, 2020 at 10:47 AM Luke Cwik  wrote:
>
>> The Java compiler doesn't know about whether a field was added or removed
>> when compiling source to class so there is no way for it to provide an
>> ordering that puts "new" fields at the end and the source specification
>> doesn't allow for users to state the field ordering that should be used.
>> You can ask users to annotate a field ordering[1] using custom annotations
>> but a general solution will require some type of sorting.
>>
>> 1: https://stackoverflow.com/a/1099389/4368200
>>
>> On Wed, Feb 5, 2020 at 10:31 AM Reuven Lax  wrote:
>>
>>> I have yet to figure out a way to make Schema inference
>>> deterministically ordered, because Java reflection provides no guaranteed
>>> ordering (I suspect that the JVM returns functions by iterating over a hash
>>> map, or something of that form). Ideas such as "sort all the fields"
>>> actually makes things worse, because new fields will end up in the middle
>>> of the field list.
>>>
>>> This is a problem for runners that support an "update" functionality.
>>> Currently the solution I was working on was to allow the runner to inspect
>>> the previous graph on an update, to ensure that we maintain the previous
>>> order.
>>>
>>> If you know a way to ensure deterministic ordering, I would love to
>>> know. I even went so far as to try and open the .class file to get members
>>> in the order defined there, but that is very complex, error prone, and I
>>> believe still doesn't guarantee order stability.
>>>
>>> On Wed, Feb 5, 2020 at 9:15 AM Robert Bradshaw 
>>> wrote:
>>>
 +1 to standardizing on a deterministic ordering for inference if none
 is imposed by the structure.

 On Wed, Feb 5, 2020, 8:55 AM Gleb Kanterov  wrote:

> There are Beam schema providers that use Java reflection to get fields
> for classes with fields and auto-value classes. It isn't relevant for 
> POJOs
> with "creators", because function arguments are ordered. We cache 
> instances
> of schema coders, but there is no guarantee that it's deterministic 
> between
> JVMs. As a result, I've seen cases when the construction of pipeline 
> graphs
> and output schema is non-deterministic. It's especially relevant when
> writing data to external storage, where row schema becomes a table schema.
> There is a workaround to apply a transform that would make schema
> deterministic, for instance, by ordering fields by name.
>
> I would see a benefit in making schemas deterministic by default or at
> least introducing a way to do so without writing custom code. What are 
> your
> thoughts?
>



Re: Deterministic field ordering in derived schemas

2020-02-05 Thread Kenneth Knowles
Are we in danger of reinventing protobuf's practice of giving fields
numbers? (this practice itself almost certainly used decades before
protobufs creation). Could we just use the same practice?

Schema fields already have integer IDs and "encoding_position" (see
https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/schema.proto).
Are these the same as proto field numbers? Do we need both? What is the
expectation around how they interact? The proto needs
comments/documentation!

This does not directly address the question, but any solution related to
how auto-generated schemas work should be specified in terms of the proto.
For example, annotations to suggest one or both of these fields. Or,
lacking that, sorting by name (giving up on "new fields come last"
behavior. Or warning that the schema is unstable. Etc.

Kenn

On Wed, Feb 5, 2020 at 10:47 AM Luke Cwik  wrote:

> The Java compiler doesn't know about whether a field was added or removed
> when compiling source to class so there is no way for it to provide an
> ordering that puts "new" fields at the end and the source specification
> doesn't allow for users to state the field ordering that should be used.
> You can ask users to annotate a field ordering[1] using custom annotations
> but a general solution will require some type of sorting.
>
> 1: https://stackoverflow.com/a/1099389/4368200
>
> On Wed, Feb 5, 2020 at 10:31 AM Reuven Lax  wrote:
>
>> I have yet to figure out a way to make Schema inference deterministically
>> ordered, because Java reflection provides no guaranteed ordering (I suspect
>> that the JVM returns functions by iterating over a hash map, or something
>> of that form). Ideas such as "sort all the fields" actually makes things
>> worse, because new fields will end up in the middle of the field list.
>>
>> This is a problem for runners that support an "update" functionality.
>> Currently the solution I was working on was to allow the runner to inspect
>> the previous graph on an update, to ensure that we maintain the previous
>> order.
>>
>> If you know a way to ensure deterministic ordering, I would love to know.
>> I even went so far as to try and open the .class file to get members in the
>> order defined there, but that is very complex, error prone, and I believe
>> still doesn't guarantee order stability.
>>
>> On Wed, Feb 5, 2020 at 9:15 AM Robert Bradshaw 
>> wrote:
>>
>>> +1 to standardizing on a deterministic ordering for inference if none is
>>> imposed by the structure.
>>>
>>> On Wed, Feb 5, 2020, 8:55 AM Gleb Kanterov  wrote:
>>>
 There are Beam schema providers that use Java reflection to get fields
 for classes with fields and auto-value classes. It isn't relevant for POJOs
 with "creators", because function arguments are ordered. We cache instances
 of schema coders, but there is no guarantee that it's deterministic between
 JVMs. As a result, I've seen cases when the construction of pipeline graphs
 and output schema is non-deterministic. It's especially relevant when
 writing data to external storage, where row schema becomes a table schema.
 There is a workaround to apply a transform that would make schema
 deterministic, for instance, by ordering fields by name.

 I would see a benefit in making schemas deterministic by default or at
 least introducing a way to do so without writing custom code. What are your
 thoughts?

>>>


Re: SQL PostCommit failure: ClassCastException: java.lang.Integer cannot be cast to java.lang.Long

2020-02-05 Thread Kenneth Knowles
I think that was me... sorry!

Is this a test where it is important that the data is pre-existing?
Otherwise I would say that setting up from scratch is a good idea. Does
anyone have context on it? I am happy to take on the small bit of coding,
since I broke it.

Kenn

On Wed, Feb 5, 2020 at 1:22 PM Brian Hulette  wrote:

> So it looks like the schema for `integ_test_small_csv_test_1` was updated
> yesterday around the same time that PR#10563 went in, and it no longer
> matches the schema we expect in the test.
>
> I'm just going to change it back for now. I am curious who changed it and
> why, if the perpetrator is on this list please let us know :)
>
>
> Note the updateTime:
> ```
> ❯ gcloud beta data-catalog entries lookup
> '`datacatalog`.`entry`.`apache-beam-testing`.`us-central1`.`samples`.`integ_test_small_csv_test_1`'``
> gcsFilesetSpec:
>   filePatterns:
>   - gs://apache-beam-samples/integration_test_small_csv/test.csv
> linkedResource: //
> datacatalog.googleapis.com/projects/apache-beam-testing/locations/us-central1/entryGroups/samples/entries/integ_test_small_csv_test_1
> name:
> projects/apache-beam-testing/locations/us-central1/entryGroups/samples/entries/integ_test_small_csv_test_1
> schema:
>   columns:
>   - column: id
> mode: NULLABLE
> type: INT64
>   - column: name
> mode: NULLABLE
> type: STRING
>   - column: type
> mode: NULLABLE
> type: STRING
> sourceSystemTimestamps:
>   createTime: '2019-08-16T01:49:06.235Z'
>   updateTime: '2020-02-04T17:18:17.671Z'
> type: FILESET
> ```
>


Re: Schema Aware PCollection stabilization

2020-02-05 Thread Brian Hulette
Can you open up the document for commenting or editing?

On Wed, Feb 5, 2020 at 4:48 AM Alex Van Boxel  wrote:

> I have the feeling people want to bring Schema Aware PCollection out of
> experimental. I've started a document to follow up this track, please could
> all interested parties add the open issues they have to the document. We
> can discuss the open issues in the doc and resolve and/or create JIRA
> tickets for the issues and visa-versa.
>
>
> https://docs.google.com/document/d/1WseNjxFXYrpjWjbIxfWUpwHmjTL7WSfj9n2Nh3H_qqA/edit?usp=sharing
> 
>
> I'll go though Jira later this day or tomorrow and complete the document
> with those tickets as well.
>
> Thanks.
>
>  _/
> _/ Alex Van Boxel
>


[PROPOSAL] Add licenses and notices to SDK docker images

2020-02-05 Thread Hannah Jiang
Hello

I wrote a design document about adding licenses and notices for third party
dependencies to SDK docker images.
I reviewed several tools for this purpose, please recommend other tools if
anything in your mind, I am happy to review those as well.
Link: https://s.apache.org/eauq6

Any kind of comments are welcome.

Thanks,
Hannah


Re: [DISCUSS] Autoformat python code with Black

2020-02-05 Thread Robert Bradshaw
No, perhaps not. I agree there's consensus, just wondering what the
next steps should be to get this in. (The presubmits look like they're
all passing, with the exception of some breakage in java that should
be completely unrelated. Of course there's already merge conflicts...)

On Wed, Feb 5, 2020 at 3:55 PM Ahmet Altay  wrote:
>
> Do we need a formal vote? There is consensus on this thread and on the PR.
>
> On Wed, Feb 5, 2020 at 3:37 PM Robert Bradshaw  wrote:
>>
>> The PR is looking good. Should we call a vote?
>>
>> On Mon, Jan 27, 2020 at 11:03 AM Robert Bradshaw  wrote:
>> >
>> > Thanks. I commented on the PR. I think if we're going this route we
>> > should add a pre-commit, plus instructions on how to run the tool
>> > (similar to spotless).
>> >
>> > On Mon, Jan 27, 2020 at 10:00 AM Udi Meiri  wrote:
>> > >
>> > > I've done a pass on the PR on code I'm familiar with.
>> > > Please make a pass and add your suggestions on the PR.
>> > >
>> > > On Fri, Jan 24, 2020 at 7:15 AM Ismaël Mejía  wrote:
>> > >>
>> > >> Java build fails on any unformatted code so python probably should be 
>> > >> like that.
>> > >> We have to ensure however that it fails early on that.
>> > >> As Robert said time to debate the knobs :)
>> > >>
>> > >> On Fri, Jan 24, 2020 at 3:19 PM Kamil Wasilewski 
>> > >>  wrote:
>> > >>>
>> > >>> PR is ready: https://github.com/apache/beam/pull/10684. Please share 
>> > >>> your comments ;-) I've managed to reduce the impact a bit:
>> > >>> 501 files changed, 18245 insertions(+), 19495 deletions(-)
>> > >>>
>> > >>> We still need to consider how to enforce the usage of autoformatter. 
>> > >>> Pre-commit sounds like a nice addition, but it still needs to be 
>> > >>> installed manually by a developer. On the other hand, Jenkins 
>> > >>> precommit job that fails if any unformatted code is detected looks 
>> > >>> like too strict. What do you think?
>> > >>>
>> > >>> On Thu, Jan 23, 2020 at 8:37 PM Robert Bradshaw  
>> > >>> wrote:
>> > 
>> >  Thanks! Now we get to debate what knobs to twiddle :-P
>> > 
>> >  FYI, I did a simple run (just pushed to
>> >  https://github.com/apache/beam/compare/master...robertwb:yapf) to see
>> >  the impact. The diff is
>> > 
>> >  $ git diff --stat master
>> >  ...
>> >   547 files changed, 22118 insertions(+), 21129 deletions(-)
>> > 
>> >  For reference
>> > 
>> >  $ find sdks/python/apache_beam -name '*.py' | xargs wc
>> >  ...
>> >  200424  612002 7431637 total
>> > 
>> >  which means a little over 10% of lines get touched. I think there are
>> >  some options, such as SPLIT_ALL_TOP_LEVEL_COMMA_SEPARATED_VALUES and
>> >  COALESCE_BRACKETS, that will conform more to the style we are already
>> >  (mostly) following.
>> > 
>> > 
>> >  On Thu, Jan 23, 2020 at 1:59 AM Kamil Wasilewski
>> >   wrote:
>> >  >
>> >  > Thank you Michał for creating the ticket. I have some free time and 
>> >  > I'd like to volunteer myself for this task.
>> >  > Indeed, it looks like there's consensus for `yapf`, so I'll try 
>> >  > `yapf` first.
>> >  >
>> >  > Best,
>> >  > Kamil
>> >  >
>> >  >
>> >  > On Thu, Jan 23, 2020 at 10:37 AM Michał Walenia 
>> >  >  wrote:
>> >  >>
>> >  >> Hi all,
>> >  >> I created a JIRA issue for this and summarized the available tools
>> >  >>
>> >  >> https://issues.apache.org/jira/browse/BEAM-9175
>> >  >>
>> >  >> Cheers,
>> >  >> Michal
>> >  >>
>> >  >> On Thu, Jan 23, 2020 at 1:49 AM Udi Meiri  wrote:
>> >  >>>
>> >  >>> Sorry, backing off on this due to time constraints.
>> >  >>>
>> >  >>> On Wed, Jan 22, 2020 at 3:39 PM Udi Meiri  
>> >  >>> wrote:
>> >  
>> >   It sounds like there's a consensus for yapf. I volunteer to take 
>> >   this on
>> >  
>> >   On Wed, Jan 22, 2020, 10:31 Udi Meiri  wrote:
>> >  >
>> >  > +1 to autoformatting
>> >  >
>> >  > On Wed, Jan 22, 2020 at 9:57 AM Luke Cwik  
>> >  > wrote:
>> >  >>
>> >  >> +1 to autoformatters. Also the Beam Java SDK went through a 
>> >  >> one time pass to apply the spotless formatting.
>> >  >>
>> >  >> On Tue, Jan 21, 2020 at 9:52 PM Ahmet Altay  
>> >  >> wrote:
>> >  >>>
>> >  >>> +1 to autoformatters and yapf. It appears to be a well 
>> >  >>> maintained project. I do support making a one time pass to 
>> >  >>> apply formatting the whole code base.
>> >  >>>
>> >  >>> On Tue, Jan 21, 2020 at 5:38 PM Chad Dombrova 
>> >  >>>  wrote:
>> >  >
>> >  >
>> >  > It'd be good if there was a way to only apply to violating 
>> >  > (or at
>> >  > least changed) lines.
>> >  
>> >  
>> 

Re: Jenkins jobs not running for my PR 10438

2020-02-05 Thread Rui Wang
Tried to trigger tests in the PR. Let's continue following up there.



-Rui

On Wed, Feb 5, 2020 at 4:07 PM Tomo Suzuki  wrote:

> Hi Beam Committers,
>
> Would you run the precommit checks for
> https://github.com/apache/beam/pull/10769
> with the following 6 additional commands (one command per comment) ?
>
> Run Java PostCommit
> Run Java HadoopFormatIO Performance Test
> Run BigQueryIO Streaming Performance Test Java
> Run Dataflow ValidatesRunner
> Run Spark ValidatesRunner
> Run SQL Postcommit
>
> --
> Regards,
> Tomo
>


Re: Jenkins jobs not running for my PR 10438

2020-02-05 Thread Tomo Suzuki
Hi Beam Committers,

Would you run the precommit checks for https://github.com/apache/beam/pull/10769
with the following 6 additional commands (one command per comment) ?

Run Java PostCommit
Run Java HadoopFormatIO Performance Test
Run BigQueryIO Streaming Performance Test Java
Run Dataflow ValidatesRunner
Run Spark ValidatesRunner
Run SQL Postcommit

-- 
Regards,
Tomo


Re: [DISCUSS] Autoformat python code with Black

2020-02-05 Thread Ahmet Altay
Do we need a formal vote? There is consensus on this thread and on the PR.

On Wed, Feb 5, 2020 at 3:37 PM Robert Bradshaw  wrote:

> The PR is looking good. Should we call a vote?
>
> On Mon, Jan 27, 2020 at 11:03 AM Robert Bradshaw 
> wrote:
> >
> > Thanks. I commented on the PR. I think if we're going this route we
> > should add a pre-commit, plus instructions on how to run the tool
> > (similar to spotless).
> >
> > On Mon, Jan 27, 2020 at 10:00 AM Udi Meiri  wrote:
> > >
> > > I've done a pass on the PR on code I'm familiar with.
> > > Please make a pass and add your suggestions on the PR.
> > >
> > > On Fri, Jan 24, 2020 at 7:15 AM Ismaël Mejía 
> wrote:
> > >>
> > >> Java build fails on any unformatted code so python probably should be
> like that.
> > >> We have to ensure however that it fails early on that.
> > >> As Robert said time to debate the knobs :)
> > >>
> > >> On Fri, Jan 24, 2020 at 3:19 PM Kamil Wasilewski <
> kamil.wasilew...@polidea.com> wrote:
> > >>>
> > >>> PR is ready: https://github.com/apache/beam/pull/10684. Please
> share your comments ;-) I've managed to reduce the impact a bit:
> > >>> 501 files changed, 18245 insertions(+), 19495 deletions(-)
> > >>>
> > >>> We still need to consider how to enforce the usage of autoformatter.
> Pre-commit sounds like a nice addition, but it still needs to be installed
> manually by a developer. On the other hand, Jenkins precommit job that
> fails if any unformatted code is detected looks like too strict. What do
> you think?
> > >>>
> > >>> On Thu, Jan 23, 2020 at 8:37 PM Robert Bradshaw 
> wrote:
> > 
> >  Thanks! Now we get to debate what knobs to twiddle :-P
> > 
> >  FYI, I did a simple run (just pushed to
> >  https://github.com/apache/beam/compare/master...robertwb:yapf) to
> see
> >  the impact. The diff is
> > 
> >  $ git diff --stat master
> >  ...
> >   547 files changed, 22118 insertions(+), 21129 deletions(-)
> > 
> >  For reference
> > 
> >  $ find sdks/python/apache_beam -name '*.py' | xargs wc
> >  ...
> >  200424  612002 7431637 total
> > 
> >  which means a little over 10% of lines get touched. I think there
> are
> >  some options, such as SPLIT_ALL_TOP_LEVEL_COMMA_SEPARATED_VALUES and
> >  COALESCE_BRACKETS, that will conform more to the style we are
> already
> >  (mostly) following.
> > 
> > 
> >  On Thu, Jan 23, 2020 at 1:59 AM Kamil Wasilewski
> >   wrote:
> >  >
> >  > Thank you Michał for creating the ticket. I have some free time
> and I'd like to volunteer myself for this task.
> >  > Indeed, it looks like there's consensus for `yapf`, so I'll try
> `yapf` first.
> >  >
> >  > Best,
> >  > Kamil
> >  >
> >  >
> >  > On Thu, Jan 23, 2020 at 10:37 AM Michał Walenia <
> michal.wale...@polidea.com> wrote:
> >  >>
> >  >> Hi all,
> >  >> I created a JIRA issue for this and summarized the available
> tools
> >  >>
> >  >> https://issues.apache.org/jira/browse/BEAM-9175
> >  >>
> >  >> Cheers,
> >  >> Michal
> >  >>
> >  >> On Thu, Jan 23, 2020 at 1:49 AM Udi Meiri 
> wrote:
> >  >>>
> >  >>> Sorry, backing off on this due to time constraints.
> >  >>>
> >  >>> On Wed, Jan 22, 2020 at 3:39 PM Udi Meiri 
> wrote:
> >  
> >   It sounds like there's a consensus for yapf. I volunteer to
> take this on
> >  
> >   On Wed, Jan 22, 2020, 10:31 Udi Meiri 
> wrote:
> >  >
> >  > +1 to autoformatting
> >  >
> >  > On Wed, Jan 22, 2020 at 9:57 AM Luke Cwik 
> wrote:
> >  >>
> >  >> +1 to autoformatters. Also the Beam Java SDK went through a
> one time pass to apply the spotless formatting.
> >  >>
> >  >> On Tue, Jan 21, 2020 at 9:52 PM Ahmet Altay <
> al...@google.com> wrote:
> >  >>>
> >  >>> +1 to autoformatters and yapf. It appears to be a well
> maintained project. I do support making a one time pass to apply formatting
> the whole code base.
> >  >>>
> >  >>> On Tue, Jan 21, 2020 at 5:38 PM Chad Dombrova <
> chad...@gmail.com> wrote:
> >  >
> >  >
> >  > It'd be good if there was a way to only apply to
> violating (or at
> >  > least changed) lines.
> >  
> >  
> >   I assumed the first thing we’d do is convert all of the
> code in one go, since it’s a very safe operation. Did you have something
> else in mind?
> >  
> >   -chad
> >  
> >  
> >  
> >  
> >  >
> >  >
> >  > On Tue, Jan 21, 2020 at 1:56 PM Chad Dombrova <
> chad...@gmail.com> wrote:
> >  > >
> >  > > +1 to autoformatting
> >  > >
> >  > > Let me add some nuance to that.
> >  

Re: [DISCUSS] Autoformat python code with Black

2020-02-05 Thread Robert Bradshaw
The PR is looking good. Should we call a vote?

On Mon, Jan 27, 2020 at 11:03 AM Robert Bradshaw  wrote:
>
> Thanks. I commented on the PR. I think if we're going this route we
> should add a pre-commit, plus instructions on how to run the tool
> (similar to spotless).
>
> On Mon, Jan 27, 2020 at 10:00 AM Udi Meiri  wrote:
> >
> > I've done a pass on the PR on code I'm familiar with.
> > Please make a pass and add your suggestions on the PR.
> >
> > On Fri, Jan 24, 2020 at 7:15 AM Ismaël Mejía  wrote:
> >>
> >> Java build fails on any unformatted code so python probably should be like 
> >> that.
> >> We have to ensure however that it fails early on that.
> >> As Robert said time to debate the knobs :)
> >>
> >> On Fri, Jan 24, 2020 at 3:19 PM Kamil Wasilewski 
> >>  wrote:
> >>>
> >>> PR is ready: https://github.com/apache/beam/pull/10684. Please share your 
> >>> comments ;-) I've managed to reduce the impact a bit:
> >>> 501 files changed, 18245 insertions(+), 19495 deletions(-)
> >>>
> >>> We still need to consider how to enforce the usage of autoformatter. 
> >>> Pre-commit sounds like a nice addition, but it still needs to be 
> >>> installed manually by a developer. On the other hand, Jenkins precommit 
> >>> job that fails if any unformatted code is detected looks like too strict. 
> >>> What do you think?
> >>>
> >>> On Thu, Jan 23, 2020 at 8:37 PM Robert Bradshaw  
> >>> wrote:
> 
>  Thanks! Now we get to debate what knobs to twiddle :-P
> 
>  FYI, I did a simple run (just pushed to
>  https://github.com/apache/beam/compare/master...robertwb:yapf) to see
>  the impact. The diff is
> 
>  $ git diff --stat master
>  ...
>   547 files changed, 22118 insertions(+), 21129 deletions(-)
> 
>  For reference
> 
>  $ find sdks/python/apache_beam -name '*.py' | xargs wc
>  ...
>  200424  612002 7431637 total
> 
>  which means a little over 10% of lines get touched. I think there are
>  some options, such as SPLIT_ALL_TOP_LEVEL_COMMA_SEPARATED_VALUES and
>  COALESCE_BRACKETS, that will conform more to the style we are already
>  (mostly) following.
> 
> 
>  On Thu, Jan 23, 2020 at 1:59 AM Kamil Wasilewski
>   wrote:
>  >
>  > Thank you Michał for creating the ticket. I have some free time and 
>  > I'd like to volunteer myself for this task.
>  > Indeed, it looks like there's consensus for `yapf`, so I'll try `yapf` 
>  > first.
>  >
>  > Best,
>  > Kamil
>  >
>  >
>  > On Thu, Jan 23, 2020 at 10:37 AM Michał Walenia 
>  >  wrote:
>  >>
>  >> Hi all,
>  >> I created a JIRA issue for this and summarized the available tools
>  >>
>  >> https://issues.apache.org/jira/browse/BEAM-9175
>  >>
>  >> Cheers,
>  >> Michal
>  >>
>  >> On Thu, Jan 23, 2020 at 1:49 AM Udi Meiri  wrote:
>  >>>
>  >>> Sorry, backing off on this due to time constraints.
>  >>>
>  >>> On Wed, Jan 22, 2020 at 3:39 PM Udi Meiri  wrote:
>  
>   It sounds like there's a consensus for yapf. I volunteer to take 
>   this on
>  
>   On Wed, Jan 22, 2020, 10:31 Udi Meiri  wrote:
>  >
>  > +1 to autoformatting
>  >
>  > On Wed, Jan 22, 2020 at 9:57 AM Luke Cwik  wrote:
>  >>
>  >> +1 to autoformatters. Also the Beam Java SDK went through a one 
>  >> time pass to apply the spotless formatting.
>  >>
>  >> On Tue, Jan 21, 2020 at 9:52 PM Ahmet Altay  
>  >> wrote:
>  >>>
>  >>> +1 to autoformatters and yapf. It appears to be a well 
>  >>> maintained project. I do support making a one time pass to apply 
>  >>> formatting the whole code base.
>  >>>
>  >>> On Tue, Jan 21, 2020 at 5:38 PM Chad Dombrova 
>  >>>  wrote:
>  >
>  >
>  > It'd be good if there was a way to only apply to violating (or 
>  > at
>  > least changed) lines.
>  
>  
>   I assumed the first thing we’d do is convert all of the code in 
>   one go, since it’s a very safe operation. Did you have 
>   something else in mind?
>  
>   -chad
>  
>  
>  
>  
>  >
>  >
>  > On Tue, Jan 21, 2020 at 1:56 PM Chad Dombrova 
>  >  wrote:
>  > >
>  > > +1 to autoformatting
>  > >
>  > > Let me add some nuance to that.
>  > >
>  > > The way I see it there are 2 varieties of formatters:  those 
>  > > which take the original formatting into consideration 
>  > > (autopep8) and those which disregard it (yapf, black).
>  > >
>  > > I much prefer 

Re: Enabling a new Jenkins job

2020-02-05 Thread Heejong Lee
Fixed. Seed job was overridden by another scheduled seed job. Thanks, Udi!


On Wed, Feb 5, 2020 at 2:04 PM Heejong Lee  wrote:

> I created a new Jenkins job in my PR[1] and the new job shows "This
> project is currently disabled"[2]. Does anybody know how to enable the
> new job?
>
> [1]: https://github.com/apache/beam/pull/10758
> [2]: https://builds.apache.org/job/beam_PostCommit_XVR_Spark/
>


Enabling a new Jenkins job

2020-02-05 Thread Heejong Lee
I created a new Jenkins job in my PR[1] and the new job shows "This project
is currently disabled"[2]. Does anybody know how to enable the new job?

[1]: https://github.com/apache/beam/pull/10758
[2]: https://builds.apache.org/job/beam_PostCommit_XVR_Spark/


Re: SQL PostCommit failure: ClassCastException: java.lang.Integer cannot be cast to java.lang.Long

2020-02-05 Thread Brian Hulette
So it looks like the schema for `integ_test_small_csv_test_1` was updated
yesterday around the same time that PR#10563 went in, and it no longer
matches the schema we expect in the test.

I'm just going to change it back for now. I am curious who changed it and
why, if the perpetrator is on this list please let us know :)


Note the updateTime:
```
❯ gcloud beta data-catalog entries lookup
'`datacatalog`.`entry`.`apache-beam-testing`.`us-central1`.`samples`.`integ_test_small_csv_test_1`'``
gcsFilesetSpec:
  filePatterns:
  - gs://apache-beam-samples/integration_test_small_csv/test.csv
linkedResource: //
datacatalog.googleapis.com/projects/apache-beam-testing/locations/us-central1/entryGroups/samples/entries/integ_test_small_csv_test_1
name:
projects/apache-beam-testing/locations/us-central1/entryGroups/samples/entries/integ_test_small_csv_test_1
schema:
  columns:
  - column: id
mode: NULLABLE
type: INT64
  - column: name
mode: NULLABLE
type: STRING
  - column: type
mode: NULLABLE
type: STRING
sourceSystemTimestamps:
  createTime: '2019-08-16T01:49:06.235Z'
  updateTime: '2020-02-04T17:18:17.671Z'
type: FILESET
```


Re: SQL PostCommit failure: ClassCastException: java.lang.Integer cannot be cast to java.lang.Long

2020-02-05 Thread Tomo Suzuki
Brian,

Thank you! (I don't need access as long as it's resolved)


On Wed, Feb 5, 2020 at 4:05 PM Brian Hulette  wrote:
>
> I can take a look at this.
>
> I think the access issue (and your problem running locally) is because you 
> need access to apache-beam-testing. I'm not sure if we have any formal 
> process for that.
>
> Brian
>
> On Wed, Feb 5, 2020 at 5:31 AM Tomo Suzuki  wrote:
>>
>> HI Beam developers,
>>
>> Can somebody help this SQL PostCommit integration test failure?
>> https://issues.apache.org/jira/browse/BEAM-9253
>> (since https://github.com/apache/beam/pull/10563)
>>
>> SQL PostCommit failure: ClassCastException: java.lang.Integer cannot
>> be cast to java.lang.Long
>>
>> (Why not?)
>>
>> The difficulty is that the integration test requires access to Google
>> Cloud Storage objects and thus Alexey (PR#10563 author) or I cannot
>> test locally.
>>
>> --
>> Regards,
>> Tomo



-- 
Regards,
Tomo


Re: SQL PostCommit failure: ClassCastException: java.lang.Integer cannot be cast to java.lang.Long

2020-02-05 Thread Brian Hulette
I can take a look at this.

I think the access issue (and your problem running locally) is because you
need access to apache-beam-testing. I'm not sure if we have any formal
process for that.

Brian

On Wed, Feb 5, 2020 at 5:31 AM Tomo Suzuki  wrote:

> HI Beam developers,
>
> Can somebody help this SQL PostCommit integration test failure?
> https://issues.apache.org/jira/browse/BEAM-9253
> (since https://github.com/apache/beam/pull/10563)
>
> SQL PostCommit failure: ClassCastException: java.lang.Integer cannot
> be cast to java.lang.Long
>
> (Why not?)
>
> The difficulty is that the integration test requires access to Google
> Cloud Storage objects and thus Alexey (PR#10563 author) or I cannot
> test locally.
>
> --
> Regards,
> Tomo
>


Re: A new reworked Elasticsearch 7+ IO module

2020-02-05 Thread Chamikara Jayalath
On Wed, Feb 5, 2020 at 6:35 AM Etienne Chauchot 
wrote:

> Still there is something I don't agree with is that IOs can be tested on
> mock. We don't really test IO behavior with mocks: there is always special
> behaviors that cannot be reproduced in mocks (split, load, with corner
> cases etc...). There was in the past IOs that were tested using mocks and
> that happened to be nonfunctional.
>
> Regarding ITests we have very few comparing to UTests and they are not as
> closely observed as UTests.
>
> Etienne
> On 05/02/2020 11:32, Jean-Baptiste Onofre wrote:
>
> Hi,
>
> We talked in the past about multiple/single module.
>
> IMHO the always preferred goal is to have a single module. However, it’s
> tricky when we have such difference, including on the user facing API. So,
> I would go with module per version, or use a specified version for a target
> Beam release.
>
> For the test, we should distinguish utest from itest. Utest can be done
> with mock, the purpose is really to test the IO behavior. Then we can have
> itest using concrete ES instance.
>
> Anyway, I’m OK with the proposal and I would like to work on this IO (I
> have other improvements coming on other IOs anyway) with you guys (and
> Ludovic especially).
>
> Regards
> JB
>
> Le 5 févr. 2020 à 10:44, Etienne Chauchot  a écrit :
>
> Hi all,
>
> We had a long discussion with Ludovic about this IO. I'd like to share
> with you to keep you informed and also gather your opinions
>
> 1. regarding version support: ES v2 is no more maintained by Elastic since
> 2018/02 so we plan to remove it from the IO. In the past we already retired
> versions (like spark 1.6 for instance).
>
>
My only concern here is that there might be users who use the existing
module who might not be able to easily upgrade the Beam version if we
remove it. But given that V2 is 5 versions behind the latest release this
might be OK.


> 2. regarding the user: the aim is to unlock some new features (listed by
> Ludovic) and give the user more flexibility on his request. For that, it
> requires to use high level java ES client in place of the low level REST
> client (that was used because it is the only one compatible with all ES
> versions). We plan to replace the API (json document in and out) by more
> complete standard ES objects that contain de request logic (insert/update,
> doc routing etc...) and the data. There are already IOs like SpannerIO that
> use similar objects in input PCollection rather than pure POJOs.
>
>
Won't this be a breaking change for all users ? IMO using POJOs in
PCollections is safer since we have to worry about changes to the
underlying client library API. Exception would be when underlying client
library offers a backwards compatibility guarantee that we can rely on for
the foreseeable future (for example, BQ TableRow).


> 3. regarding multiple/single module: the aim is to have only one
> production code to ease the maintenance.  The problem is that using high
> level client makes the code dependent to an ES lib version. We would like
> to make it invisible to the user. He should select only one jar and the IO
> should decide the lib to use behind the scene. We are thinking about using
> one module and sub-modules per version and use relocation, wrappers and a
> factory that detects the version the IO actually points to to instantiate
> the correct client version. It would also require to have DTOs in the IO
> because the high level ES java objects are not exactly the same among the
> ES versions.
>
> +1 for adding a level of indirection to make this easy for users.

> 4. regarding tests: the aim is always to target real ES backends to have
> relevant tests (for reasons I already explained in another thread). The
> problem is that es-test-framework used today is version dependent and is a
> pain to use. We plan on using test containers per version (validated by ES
> dev advocate) and launching them as part of the UTests. Obviously we will
> launch only one container at the time per version and do all the test with
> it to avoid paying the cost of launch too much. And the tests will be
> shipped in per-version sub-modules and not in test dedicated modules like
> it is now.
>
>
Using a real ES backend for unit tests can be expensive ? Ideally we should
use a Fake (if one available) or mocking (test test out functionality) and
use real backend for IT tests that can be expensive. If this is a local
container that can be shared between unit tests with reasonable efficiency
that is OK. I'm mainly worried about introducing flakes into unit tests due
to network errors or slowness.

Thanks,
Cham


> WDYT ?
>
> Best !
>
> Etienne
> On 30/01/2020 17:55, Alexey Romanenko wrote:
>
> I’m second for this question. We have a similar (maybe a bit less painful)
> issue for KafkaIO and it would be useful to have a general strategy for
> such cases about how to deal with that.
>
> On 24 Jan 2020, at 21:54, Kenneth Knowles  wrote:
>
> Would it make sense to have 

Re: Deterministic field ordering in derived schemas

2020-02-05 Thread Luke Cwik
The Java compiler doesn't know about whether a field was added or removed
when compiling source to class so there is no way for it to provide an
ordering that puts "new" fields at the end and the source specification
doesn't allow for users to state the field ordering that should be used.
You can ask users to annotate a field ordering[1] using custom annotations
but a general solution will require some type of sorting.

1: https://stackoverflow.com/a/1099389/4368200

On Wed, Feb 5, 2020 at 10:31 AM Reuven Lax  wrote:

> I have yet to figure out a way to make Schema inference deterministically
> ordered, because Java reflection provides no guaranteed ordering (I suspect
> that the JVM returns functions by iterating over a hash map, or something
> of that form). Ideas such as "sort all the fields" actually makes things
> worse, because new fields will end up in the middle of the field list.
>
> This is a problem for runners that support an "update" functionality.
> Currently the solution I was working on was to allow the runner to inspect
> the previous graph on an update, to ensure that we maintain the previous
> order.
>
> If you know a way to ensure deterministic ordering, I would love to know.
> I even went so far as to try and open the .class file to get members in the
> order defined there, but that is very complex, error prone, and I believe
> still doesn't guarantee order stability.
>
> On Wed, Feb 5, 2020 at 9:15 AM Robert Bradshaw 
> wrote:
>
>> +1 to standardizing on a deterministic ordering for inference if none is
>> imposed by the structure.
>>
>> On Wed, Feb 5, 2020, 8:55 AM Gleb Kanterov  wrote:
>>
>>> There are Beam schema providers that use Java reflection to get fields
>>> for classes with fields and auto-value classes. It isn't relevant for POJOs
>>> with "creators", because function arguments are ordered. We cache instances
>>> of schema coders, but there is no guarantee that it's deterministic between
>>> JVMs. As a result, I've seen cases when the construction of pipeline graphs
>>> and output schema is non-deterministic. It's especially relevant when
>>> writing data to external storage, where row schema becomes a table schema.
>>> There is a workaround to apply a transform that would make schema
>>> deterministic, for instance, by ordering fields by name.
>>>
>>> I would see a benefit in making schemas deterministic by default or at
>>> least introducing a way to do so without writing custom code. What are your
>>> thoughts?
>>>
>>


Re: Deterministic field ordering in derived schemas

2020-02-05 Thread Reuven Lax
I have yet to figure out a way to make Schema inference deterministically
ordered, because Java reflection provides no guaranteed ordering (I suspect
that the JVM returns functions by iterating over a hash map, or something
of that form). Ideas such as "sort all the fields" actually makes things
worse, because new fields will end up in the middle of the field list.

This is a problem for runners that support an "update" functionality.
Currently the solution I was working on was to allow the runner to inspect
the previous graph on an update, to ensure that we maintain the previous
order.

If you know a way to ensure deterministic ordering, I would love to know. I
even went so far as to try and open the .class file to get members in the
order defined there, but that is very complex, error prone, and I believe
still doesn't guarantee order stability.

On Wed, Feb 5, 2020 at 9:15 AM Robert Bradshaw  wrote:

> +1 to standardizing on a deterministic ordering for inference if none is
> imposed by the structure.
>
> On Wed, Feb 5, 2020, 8:55 AM Gleb Kanterov  wrote:
>
>> There are Beam schema providers that use Java reflection to get fields
>> for classes with fields and auto-value classes. It isn't relevant for POJOs
>> with "creators", because function arguments are ordered. We cache instances
>> of schema coders, but there is no guarantee that it's deterministic between
>> JVMs. As a result, I've seen cases when the construction of pipeline graphs
>> and output schema is non-deterministic. It's especially relevant when
>> writing data to external storage, where row schema becomes a table schema.
>> There is a workaround to apply a transform that would make schema
>> deterministic, for instance, by ordering fields by name.
>>
>> I would see a benefit in making schemas deterministic by default or at
>> least introducing a way to do so without writing custom code. What are your
>> thoughts?
>>
>


Re: Deterministic field ordering in derived schemas

2020-02-05 Thread Robert Bradshaw
+1 to standardizing on a deterministic ordering for inference if none is
imposed by the structure.

On Wed, Feb 5, 2020, 8:55 AM Gleb Kanterov  wrote:

> There are Beam schema providers that use Java reflection to get fields for
> classes with fields and auto-value classes. It isn't relevant for POJOs
> with "creators", because function arguments are ordered. We cache instances
> of schema coders, but there is no guarantee that it's deterministic between
> JVMs. As a result, I've seen cases when the construction of pipeline graphs
> and output schema is non-deterministic. It's especially relevant when
> writing data to external storage, where row schema becomes a table schema.
> There is a workaround to apply a transform that would make schema
> deterministic, for instance, by ordering fields by name.
>
> I would see a benefit in making schemas deterministic by default or at
> least introducing a way to do so without writing custom code. What are your
> thoughts?
>


Deterministic field ordering in derived schemas

2020-02-05 Thread Gleb Kanterov
There are Beam schema providers that use Java reflection to get fields for
classes with fields and auto-value classes. It isn't relevant for POJOs
with "creators", because function arguments are ordered. We cache instances
of schema coders, but there is no guarantee that it's deterministic between
JVMs. As a result, I've seen cases when the construction of pipeline graphs
and output schema is non-deterministic. It's especially relevant when
writing data to external storage, where row schema becomes a table schema.
There is a workaround to apply a transform that would make schema
deterministic, for instance, by ordering fields by name.

I would see a benefit in making schemas deterministic by default or at
least introducing a way to do so without writing custom code. What are your
thoughts?


Re: A new reworked Elasticsearch 7+ IO module

2020-02-05 Thread Etienne Chauchot
Still there is something I don't agree with is that IOs can be tested on 
mock. We don't really test IO behavior with mocks: there is always 
special behaviors that cannot be reproduced in mocks (split, load, with 
corner cases etc...). There was in the past IOs that were tested using 
mocks and that happened to be nonfunctional.


Regarding ITests we have very few comparing to UTests and they are not 
as closely observed as UTests.


Etienne

On 05/02/2020 11:32, Jean-Baptiste Onofre wrote:

Hi,

We talked in the past about multiple/single module.

IMHO the always preferred goal is to have a single module. However, 
it’s tricky when we have such difference, including on the user facing 
API. So, I would go with module per version, or use a specified 
version for a target Beam release.


For the test, we should distinguish utest from itest. Utest can be 
done with mock, the purpose is really to test the IO behavior. Then we 
can have itest using concrete ES instance.


Anyway, I’m OK with the proposal and I would like to work on this IO 
(I have other improvements coming on other IOs anyway) with you guys 
(and Ludovic especially).


Regards
JB

Le 5 févr. 2020 à 10:44, Etienne Chauchot > a écrit :


Hi all,

We had a long discussion with Ludovic about this IO. I'd like to 
share with you to keep you informed and also gather your opinions


1. regarding version support: ES v2 is no more maintained by Elastic 
since 2018/02 so we plan to remove it from the IO. In the past we 
already retired versions (like spark 1.6 for instance).


2. regarding the user: the aim is to unlock some new features (listed 
by Ludovic) and give the user more flexibility on his request. For 
that, it requires to use high level java ES client in place of the 
low level REST client (that was used because it is the only one 
compatible with all ES versions). We plan to replace the API (json 
document in and out) by more complete standard ES objects that 
contain de request logic (insert/update, doc routing etc...) and the 
data. There are already IOs like SpannerIO that use similar objects 
in input PCollection rather than pure POJOs.


3. regarding multiple/single module: the aim is to have only one 
production code to ease the maintenance.  The problem is that using 
high level client makes the code dependent to an ES lib version. We 
would like to make it invisible to the user. He should select only 
one jar and the IO should decide the lib to use behind the scene. We 
are thinking about using one module and sub-modules per version and 
use relocation, wrappers and a factory that detects the version the 
IO actually points to to instantiate the correct client version. It 
would also require to have DTOs in the IO because the high level ES 
java objects are not exactly the same among the ES versions.


4. regarding tests: the aim is always to target real ES backends to 
have relevant tests (for reasons I already explained in another 
thread). The problem is that es-test-framework used today is version 
dependent and is a pain to use. We plan on using test containers per 
version (validated by ES dev advocate) and launching them as part of 
the UTests. Obviously we will launch only one container at the time 
per version and do all the test with it to avoid paying the cost of 
launch too much. And the tests will be shipped in per-version 
sub-modules and not in test dedicated modules like it is now.


WDYT ?

Best !

Etienne

On 30/01/2020 17:55, Alexey Romanenko wrote:
I’m second for this question. We have a similar (maybe a bit less 
painful) issue for KafkaIO and it would be useful to have a general 
strategy for such cases about how to deal with that.


On 24 Jan 2020, at 21:54, Kenneth Knowles > wrote:


Would it make sense to have different version-specialized 
connectors with a common core library and common API package?


On Fri, Jan 24, 2020 at 11:52 AM Chamikara Jayalath 
mailto:chamik...@google.com>> wrote:


Thanks for the contribution. I agree with Alexey that we should
try to add any new features brought in with the new PR into
existing connector instead of trying to maintain two
implementations.

Thanks,
Cham

On Fri, Jan 24, 2020 at 9:01 AM Alexey Romanenko
mailto:aromanenko@gmail.com>> wrote:

Hi Ludovic,

Thank you for working on this and sharing the details with
us. This is really great job!

As I recall, we already have some support of Elasticsearch7
in current ElasticsearchIO (afaik, at least they are
compatible), thanks to Zhong Chen and Etienne Chauchot, who
were working on adding this [1][2] and it should be
released in Beam 2.19.

Would you think you can leverage this in your work on
adding new Elasticsearch7 features? IMHO, supporting two
different related IOs can be quite tough task and I‘d
rather raise my hand to add a new 

SQL PostCommit failure: ClassCastException: java.lang.Integer cannot be cast to java.lang.Long

2020-02-05 Thread Tomo Suzuki
HI Beam developers,

Can somebody help this SQL PostCommit integration test failure?
https://issues.apache.org/jira/browse/BEAM-9253
(since https://github.com/apache/beam/pull/10563)

SQL PostCommit failure: ClassCastException: java.lang.Integer cannot
be cast to java.lang.Long

(Why not?)

The difficulty is that the integration test requires access to Google
Cloud Storage objects and thus Alexey (PR#10563 author) or I cannot
test locally.

-- 
Regards,
Tomo


Schema Aware PCollection stabilization

2020-02-05 Thread Alex Van Boxel
I have the feeling people want to bring Schema Aware PCollection out of
experimental. I've started a document to follow up this track, please could
all interested parties add the open issues they have to the document. We
can discuss the open issues in the doc and resolve and/or create JIRA
tickets for the issues and visa-versa.

https://docs.google.com/document/d/1WseNjxFXYrpjWjbIxfWUpwHmjTL7WSfj9n2Nh3H_qqA/edit?usp=sharing


I'll go though Jira later this day or tomorrow and complete the document
with those tickets as well.

Thanks.

 _/
_/ Alex Van Boxel


Re: [PROPOSAL] Beam Schema Options

2020-02-05 Thread Alex Van Boxel
I would appreciate if someone would look at the following PR and get it to
master:

https://github.com/apache/beam/pull/10413#

a lot of work needs to follow, but if we have the base already on master
the next layers can follow. As a reminder, this is the base proposal:
https://docs.google.com/document/d/1yCCRU5pViVQIO8-YAb66VRh3I-kl0F7bMez616tgM8Q/edit?usp=sharing

I've also looked for prior work, and saw that Spark actually has something
comparable:
https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/sql/Row.html

but when the options are finished it will be far more powerful as it is not
limited on fields.

 _/
_/ Alex Van Boxel


On Wed, Jan 29, 2020 at 4:55 AM Kenneth Knowles  wrote:

> Using schema types for the metadata values is a nice touch.
>
> Are the options expected to be common across many fields? Perhaps the name
> should be a URN to make it clear to be careful about collisions? (just a
> synonym for "name" in practice, but with different connotation)
>
> I generally like this... but the examples (all but one) are weird things
> that I don't really understand how they are done or who is responsible for
> them.
>
> One way to go is this: if options are maybe not understood by all
> consumers, then they can't really change behavior. Kind of like how URN and
> payload on a composite transform can be ignored and just the expansion used.
>
> Kenn
>
> On Sun, Jan 26, 2020 at 8:27 AM Alex Van Boxel  wrote:
>
>> Hi everyone,
>>
>> I'm proud to announce my first real proposal. The proposal describes Beam
>> Schema Options. This is an extension to the Schema API to add typed meta
>> data to to Rows, Field and Logical Types:
>>
>>
>> https://docs.google.com/document/d/1yCCRU5pViVQIO8-YAb66VRh3I-kl0F7bMez616tgM8Q/edit?usp=sharing
>>
>> To give you some context where this proposal comes from: We've been using
>> dynamic meta driven pipelines for a while, but till now in an awkward and
>> hacky way (see my talks at the previous Beam Summits). This proposal would
>> bring a good way to work with metadata on the metadata :-).
>>
>> The proposal points to 2 pull requests with the implementation, one for
>> the API another for translating proto options to beam options.
>>
>> Thanks to Brian Hulette and Reuven Lax for the initial feedback. All
>> feedback is welcome.
>>
>>  _/
>> _/ Alex Van Boxel
>>
>


Re: A new reworked Elasticsearch 7+ IO module

2020-02-05 Thread Jean-Baptiste Onofre
Hi,

We talked in the past about multiple/single module.

IMHO the always preferred goal is to have a single module. However, it’s tricky 
when we have such difference, including on the user facing API. So, I would go 
with module per version, or use a specified version for a target Beam release.

For the test, we should distinguish utest from itest. Utest can be done with 
mock, the purpose is really to test the IO behavior. Then we can have itest 
using concrete ES instance.

Anyway, I’m OK with the proposal and I would like to work on this IO (I have 
other improvements coming on other IOs anyway) with you guys (and Ludovic 
especially).

Regards
JB

> Le 5 févr. 2020 à 10:44, Etienne Chauchot  a écrit :
> 
> Hi all, 
> 
> We had a long discussion with Ludovic about this IO. I'd like to share with 
> you to keep you informed and also gather your opinions
> 
> 1. regarding version support: ES v2 is no more maintained by Elastic since 
> 2018/02 so we plan to remove it from the IO. In the past we already retired 
> versions (like spark 1.6 for instance).
> 
> 2. regarding the user: the aim is to unlock some new features (listed by 
> Ludovic) and give the user more flexibility on his request. For that, it 
> requires to use high level java ES client in place of the low level REST 
> client (that was used because it is the only one compatible with all ES 
> versions). We plan to replace the API (json document in and out) by more 
> complete standard ES objects that contain de request logic (insert/update, 
> doc routing etc...) and the data. There are already IOs like SpannerIO that 
> use similar objects in input PCollection rather than pure POJOs. 
> 
> 3. regarding multiple/single module: the aim is to have only one production 
> code to ease the maintenance.  The problem is that using high level client 
> makes the code dependent to an ES lib version. We would like to make it 
> invisible to the user. He should select only one jar and the IO should decide 
> the lib to use behind the scene. We are thinking about using one module and 
> sub-modules per version and use relocation, wrappers and a factory that 
> detects the version the IO actually points to to instantiate the correct 
> client version. It would also require to have DTOs in the IO because the high 
> level ES java objects are not exactly the same among the ES versions.
> 
> 4. regarding tests: the aim is always to target real ES backends to have 
> relevant tests (for reasons I already explained in another thread). The 
> problem is that es-test-framework used today is version dependent and is a 
> pain to use. We plan on using test containers per version (validated by ES 
> dev advocate) and launching them as part of the UTests. Obviously we will 
> launch only one container at the time per version and do all the test with it 
> to avoid paying the cost of launch too much. And the tests will be shipped in 
> per-version sub-modules and not in test dedicated modules like it is now.
> 
> WDYT ?
> 
> Best !
> 
> Etienne
> 
> On 30/01/2020 17:55, Alexey Romanenko wrote:
>> I’m second for this question. We have a similar (maybe a bit less painful) 
>> issue for KafkaIO and it would be useful to have a general strategy for such 
>> cases about how to deal with that.
>> 
>>> On 24 Jan 2020, at 21:54, Kenneth Knowles >> > wrote:
>>> 
>>> Would it make sense to have different version-specialized connectors with a 
>>> common core library and common API package?
>>> 
>>> On Fri, Jan 24, 2020 at 11:52 AM Chamikara Jayalath >> > wrote:
>>> Thanks for the contribution. I agree with Alexey that we should try to add 
>>> any new features brought in with the new PR into existing connector instead 
>>> of trying to maintain two implementations.  
>>> 
>>> Thanks,
>>> Cham
>>> 
>>> On Fri, Jan 24, 2020 at 9:01 AM Alexey Romanenko >> > wrote:
>>> Hi Ludovic,
>>> 
>>> Thank you for working on this and sharing the details with us. This is 
>>> really great job!
>>> 
>>> As I recall, we already have some support of Elasticsearch7 in current 
>>> ElasticsearchIO (afaik, at least they are compatible), thanks to Zhong Chen 
>>> and Etienne Chauchot, who were working on adding this [1][2] and it should 
>>> be released in Beam 2.19.
>>> 
>>> Would you think you can leverage this in your work on adding new 
>>> Elasticsearch7 features? IMHO, supporting two different related IOs can be 
>>> quite tough task and I‘d rather raise my hand to add a new functionality 
>>> into existing IO than creating a new one, if it’s possible.
>>> 
>>> [1] https://issues.apache.org/jira/browse/BEAM-5192 
>>> 
>>> [2] https://github.com/apache/beam/pull/10433 
>>> 
>>> 
 On 22 Jan 2020, at 19:23, Ludovic Boutros >>> > wrote:
 
 Dear all,
 
 I have written a 

Re: A new reworked Elasticsearch 7+ IO module

2020-02-05 Thread Etienne Chauchot

Hi all,

We had a long discussion with Ludovic about this IO. I'd like to share 
with you to keep you informed and also gather your opinions


1. regarding version support: ES v2 is no more maintained by Elastic 
since 2018/02 so we plan to remove it from the IO. In the past we 
already retired versions (like spark 1.6 for instance).


2. regarding the user: the aim is to unlock some new features (listed by 
Ludovic) and give the user more flexibility on his request. For that, it 
requires to use high level java ES client in place of the low level REST 
client (that was used because it is the only one compatible with all ES 
versions). We plan to replace the API (json document in and out) by more 
complete standard ES objects that contain de request logic 
(insert/update, doc routing etc...) and the data. There are already IOs 
like SpannerIO that use similar objects in input PCollection rather than 
pure POJOs.


3. regarding multiple/single module: the aim is to have only one 
production code to ease the maintenance.  The problem is that using high 
level client makes the code dependent to an ES lib version. We would 
like to make it invisible to the user. He should select only one jar and 
the IO should decide the lib to use behind the scene. We are thinking 
about using one module and sub-modules per version and use relocation, 
wrappers and a factory that detects the version the IO actually points 
to to instantiate the correct client version. It would also require to 
have DTOs in the IO because the high level ES java objects are not 
exactly the same among the ES versions.


4. regarding tests: the aim is always to target real ES backends to have 
relevant tests (for reasons I already explained in another thread). The 
problem is that es-test-framework used today is version dependent and is 
a pain to use. We plan on using test containers per version (validated 
by ES dev advocate) and launching them as part of the UTests. Obviously 
we will launch only one container at the time per version and do all the 
test with it to avoid paying the cost of launch too much. And the tests 
will be shipped in per-version sub-modules and not in test dedicated 
modules like it is now.


WDYT ?

Best !

Etienne

On 30/01/2020 17:55, Alexey Romanenko wrote:
I’m second for this question. We have a similar (maybe a bit less 
painful) issue for KafkaIO and it would be useful to have a general 
strategy for such cases about how to deal with that.


On 24 Jan 2020, at 21:54, Kenneth Knowles > wrote:


Would it make sense to have different version-specialized connectors 
with a common core library and common API package?


On Fri, Jan 24, 2020 at 11:52 AM Chamikara Jayalath 
mailto:chamik...@google.com>> wrote:


Thanks for the contribution. I agree with Alexey that we should
try to add any new features brought in with the new PR into
existing connector instead of trying to maintain two
implementations.

Thanks,
Cham

On Fri, Jan 24, 2020 at 9:01 AM Alexey Romanenko
mailto:aromanenko@gmail.com>> wrote:

Hi Ludovic,

Thank you for working on this and sharing the details with
us. This is really great job!

As I recall, we already have some support of Elasticsearch7
in current ElasticsearchIO (afaik, at least they are
compatible), thanks to Zhong Chen and Etienne Chauchot, who
were working on adding this [1][2] and it should be released
in Beam 2.19.

Would you think you can leverage this in your work on adding
new Elasticsearch7 features? IMHO, supporting two different
related IOs can be quite tough task and I‘d rather raise my
hand to add a new functionality into existing IO than
creating a new one, if it’s possible.

[1] https://issues.apache.org/jira/browse/BEAM-5192
[2] https://github.com/apache/beam/pull/10433


On 22 Jan 2020, at 19:23, Ludovic Boutros
mailto:boutr...@gmail.com>> wrote:

Dear all,

I have written a completely reworked Elasticsearch 7+ IO module.
It can be found here:

https://github.com/ludovic-boutros/beam/tree/fresh-reworked-elasticsearch-io-v7/sdks/java/io/elasticsearch7

This is a quite advance WIP work but I'm a quite new user of
Apache Beam and I would like to get some help on this :)

I can create a JIRA issue now but I prefer to wait for your
wise avises first.

_Why a new module ?_

The current module was compliant with Elasticsearch 2.x, 5.x
and 6.x. This seems to be a good point but so many things
have been changed since Elasticsearch 2.x.



Probably this is not correct anymore due to
https://github.com/apache/beam/pull/10433 ?


Elasticsearch 7.x is now partially supported (document type
are removed, occ, updates...).

A fresh new module, only compliant with the last