Re: [ANNOUNCE] New committer: Reza Ardeshir Rokni

2020-09-10 Thread Connell O'Callaghan
Excellent - well done, Reza!!!

On Thu, Sep 10, 2020 at 7:35 PM Austin Bennett 
wrote:

> Thanks and congrats, Reza!
>
> On Thu, Sep 10, 2020 at 5:48 PM Heejong Lee  wrote:
>
>> Congratulations!
>>
>> On Thu, Sep 10, 2020 at 4:42 PM Robert Bradshaw 
>> wrote:
>>
>>> Thank you and welcome, Reza!
>>>
>>> On Thu, Sep 10, 2020 at 4:00 PM Ahmet Altay  wrote:
>>>
 Congratulations Reza! And thank you for your contributions!

 On Thu, Sep 10, 2020 at 3:59 PM Chamikara Jayalath <
 chamik...@google.com> wrote:

> Congrats Reza!
>
> On Thu, Sep 10, 2020 at 10:35 AM Kenneth Knowles 
> wrote:
>
>> Hi all,
>>
>> Please join me and the rest of the Beam PMC in welcoming a new
>> committer: Reza Ardeshir Rokni.
>>
>> Reza has been part of the Beam community since 2017! Reza has
>> spearheaded advanced Beam examples [1], blogged and presented at multiple
>> Beam Summits. Reza helps out users on the mailing lists [2] and
>> StackOverflow [3]. When Reza's work uncovers a missing feature in Beam, 
>> he
>> adds it [4]. Considering these contributions, the Beam PMC trusts Reza 
>> with
>> the responsibilities of a Beam committer [5].
>>
>> Thank you, Reza, for your contributions.
>>
>> Kenn
>>
>> [1] https://github.com/apache/beam/pull/3961
>> [2]
>> https://lists.apache.org/list.html?u...@beam.apache.org:gte=0d:reza%20rokni
>> [3] https://stackoverflow.com/tags/apache-beam/topusers
>> [4] https://github.com/apache/beam/pull/11929
>> [5]
>> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>>
>>
>>
>
>


>>>
>>>
>>
>>
>
> --
Your feedback welcomed for Connello!!!



Re: [ANNOUNCE] New committer: Reza Ardeshir Rokni

2020-09-10 Thread Austin Bennett
Thanks and congrats, Reza!

On Thu, Sep 10, 2020 at 5:48 PM Heejong Lee  wrote:

> Congratulations!
>
> On Thu, Sep 10, 2020 at 4:42 PM Robert Bradshaw 
> wrote:
>
>> Thank you and welcome, Reza!
>>
>> On Thu, Sep 10, 2020 at 4:00 PM Ahmet Altay  wrote:
>>
>>> Congratulations Reza! And thank you for your contributions!
>>>
>>> On Thu, Sep 10, 2020 at 3:59 PM Chamikara Jayalath 
>>> wrote:
>>>
 Congrats Reza!

 On Thu, Sep 10, 2020 at 10:35 AM Kenneth Knowles 
 wrote:

> Hi all,
>
> Please join me and the rest of the Beam PMC in welcoming a new
> committer: Reza Ardeshir Rokni.
>
> Reza has been part of the Beam community since 2017! Reza has
> spearheaded advanced Beam examples [1], blogged and presented at multiple
> Beam Summits. Reza helps out users on the mailing lists [2] and
> StackOverflow [3]. When Reza's work uncovers a missing feature in Beam, he
> adds it [4]. Considering these contributions, the Beam PMC trusts Reza 
> with
> the responsibilities of a Beam committer [5].
>
> Thank you, Reza, for your contributions.
>
> Kenn
>
> [1] https://github.com/apache/beam/pull/3961
> [2]
> https://lists.apache.org/list.html?u...@beam.apache.org:gte=0d:reza%20rokni
> [3] https://stackoverflow.com/tags/apache-beam/topusers
> [4] https://github.com/apache/beam/pull/11929
> [5]
> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>



Contributor permission for Beam Jira tickets

2020-09-10 Thread Monica Song
Hi all--

I'm Monica, a SWE at Google! Can someone add me as a contributor for Beam's
Jira issue tracker? My username is monicadsong.

I would like to create/assign tickets for my work. I have 3 so
far: [BEAM-10824] [BEAM-7654] [BEAM-10855].

Thanks,
Monica


Re: [ANNOUNCE] New committer: Reza Ardeshir Rokni

2020-09-10 Thread Heejong Lee
Congratulations!

On Thu, Sep 10, 2020 at 4:42 PM Robert Bradshaw  wrote:

> Thank you and welcome, Reza!
>
> On Thu, Sep 10, 2020 at 4:00 PM Ahmet Altay  wrote:
>
>> Congratulations Reza! And thank you for your contributions!
>>
>> On Thu, Sep 10, 2020 at 3:59 PM Chamikara Jayalath 
>> wrote:
>>
>>> Congrats Reza!
>>>
>>> On Thu, Sep 10, 2020 at 10:35 AM Kenneth Knowles 
>>> wrote:
>>>
 Hi all,

 Please join me and the rest of the Beam PMC in welcoming a new
 committer: Reza Ardeshir Rokni.

 Reza has been part of the Beam community since 2017! Reza has
 spearheaded advanced Beam examples [1], blogged and presented at multiple
 Beam Summits. Reza helps out users on the mailing lists [2] and
 StackOverflow [3]. When Reza's work uncovers a missing feature in Beam, he
 adds it [4]. Considering these contributions, the Beam PMC trusts Reza with
 the responsibilities of a Beam committer [5].

 Thank you, Reza, for your contributions.

 Kenn

 [1] https://github.com/apache/beam/pull/3961
 [2]
 https://lists.apache.org/list.html?u...@beam.apache.org:gte=0d:reza%20rokni
 [3] https://stackoverflow.com/tags/apache-beam/topusers
 [4] https://github.com/apache/beam/pull/11929
 [5]
 https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer

>>>


Re: Go SDK & Dataflow

2020-09-10 Thread Robert Burke
This is the default limit on the receive buffer on the SDK side.

We set it to its maximum on July 30th. It's likely your copy doesn't have
that change yet.

https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/util/grpcx/dial.go#L39

The Go SDK is experimental and not meaningfully versioned, so syncing to
head to pick up the change isn't unreasonable.

Further, Dataflow doesn't officially support the Go SDK at this time, so it
is currently use-at-your-own-risk. It may or may not work, and isn't
sufficiently tested on Dataflow for all cases (in particular the IOs).
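
For illustration only (a sketch, not the SDK's actual code): the knob in
question is gRPC's maximum receive message length. In Python's grpc package
the equivalent channel option would look like the following; the endpoint
here is hypothetical.

import grpc

# 52428800 bytes = 50 MiB, the old SDK-side cap visible in the error below.
# A value of -1 removes gRPC's receive-size limit entirely.
options = [("grpc.max_receive_message_length", -1)]
channel = grpc.insecure_channel("localhost:12371", options=options)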

On Thu, Sep 10, 2020, 5:09 PM Mitchell, Patrick 
wrote:

> Beam team,
>
> I’m currently making use of the Beam Go SDK to construct a pipeline for
> creating and inserting large text data files into GCS.
>
> The pipeline is relatively simple: prepare a bracket of source data,
> transform it into fleshed-out data sets, and write to GCS with the textio
> package.
>
> func main() {
>flag.Parse()
>ctx := context.Background()
>beam.Init()
>p := beam.NewPipeline()
>s := p.Root()
>// Get a PCollection of account numbers
>accNumCol := beam.CreateList(s, makeRange(*accountStart, *accountEnd))
>// Transform the PCollection of account numbers into a collection of
> strings that each represent a group of account data
>accountCol := beam.ParDo(s, accountNumToAccountWithRows, accNumCol)
>// Write all account blocks to a single file in GCS
>textio.Write(s, fmt.Sprintf("gs://files/CC%d-ACC%d.txt", *accountStart,
> *accountEnd), accountCol)
>if err := beamx.Run(context.Background(), p); err != nil {
>   log.Exitf(ctx, "Failed to execute job: %v", err)
>}
> }
>
> Currently, for smaller files (e.g., 13MB or 30,000 lines of text) we are
> not experiencing any issues with the pipeline when executed in Dataflow.
> When preparing larger datasets that would generate medium-sized files of
> approx. 130MB, executing the write-to-GCS step in the Dataflow pipeline
> fails consistently with the following kinds of errors:
>
> Error message from worker: process bundle failed for instruction
> process_bundle-3 using plan process-bundle-descriptor-47 : while executing
> Process for Plan[process-bundle-descriptor-47]:
> 2: ParDo[textio.writeFileFn] Out:[]
> 1: DataSource[S[ptransform-46@localhost:12371], 0]
> Coder:W;coder-63>!GWC
> Out:2
>  caused by:
> source failed
>  caused by:
> rpc error: code = ResourceExhausted desc = grpc: received message larger
> than max (104858536 vs. 52428800)
> &
>   "jsonPayload": {
> "worker": "go-job-1-1599375664191334-09060001-zizk-harness-ptm6",
> "job": "2020-09-06_00_01_25-2750019264422346896",
> "work": "process_bundle-1",
> "message": "DataChannel.read localhost:12371 bad: rpc error: code =
> ResourceExhausted desc = grpc: received message larger than max (104938332
> vs. 52428800)",
> "logger": /vendor/
> github.com/apache/beam/sdks/go/pkg/beam/core/runtime/harness/datamgr.go:261
> ",
> "portability_worker_id": "1"
> }
>
> When executing the pipeline locally without GCS integration, i.e. writing
> to example.txt instead of gs://example.txt, no issues are apparent and
> files of arbitrary size can be generated. Running the pipeline locally
> with GCS integration also seems to work with files of any size, though the
> upload can be quite slow. It is only when executing the pipeline on
> Dataflow that writing the files to GCS fails. It seems odd that we would
> encounter issues uploading files of relatively medium size to GCS with
> such a simple pipeline. The errors seem to indicate that something is
> receiving a message larger than 52MB (52428800 bytes), but I am unable to
> determine what, and where this magic number is coming from.
>
> I would appreciate any insight into this issue with our pipeline's
> integration with GCS. Has anyone seen this before?
>
> Regards,
> Patrick Mitchell
> "This e-mail and any attachments to it (the "Communication") is, unless
> otherwise stated, confidential, may contain copyright material and is for
> the use only of the intended recipient. If you receive the Communication in
> error, please notify the sender immediately by return e-mail, delete the
> Communication and the return e-mail, and do not read, copy, retransmit or
> otherwise deal with it. Any views expressed in the Communication are those
> of the individual sender only, unless expressly stated to be those of
> Australia and New Zealand Banking Group Limited ABN 11 005 357 522, or any
> of its related entities including ANZ Bank New Zealand Limited (together
> "ANZ"). ANZ does not accept liability in connection with the integrity of
> or errors in the Communication, computer virus, data corruption,
> interference or delay arising from or in respect of the Communication."
>


Go SDK & Dataflow

2020-09-10 Thread Mitchell, Patrick
Beam team,

I’m currently making use of the Beam Go SDK to construct a pipeline for 
creating and inserting large text data files into GCS.

The pipeline is relatively simple: prepare a bracket of source data, transform 
it into fleshed-out data sets, and write to GCS with the textio package.

func main() {
   flag.Parse()
   ctx := context.Background()
   beam.Init()
   p := beam.NewPipeline()
   s := p.Root()
   // Get a PCollection of account numbers
   accNumCol := beam.CreateList(s, makeRange(*accountStart, *accountEnd))
   // Transform the PCollection of account numbers into a collection of
   // strings that each represent a group of account data
   accountCol := beam.ParDo(s, accountNumToAccountWithRows, accNumCol)
   // Write all account blocks to a single file in GCS
   textio.Write(s, fmt.Sprintf("gs://files/CC%d-ACC%d.txt", *accountStart, *accountEnd), accountCol)
   // Reuse the context created above when running the pipeline
   if err := beamx.Run(ctx, p); err != nil {
      log.Exitf(ctx, "Failed to execute job: %v", err)
   }
}

Currently, for smaller files (e.g., 13MB or 30,000 lines of text) we are not 
experiencing any issues with the pipeline when executed in Dataflow. When 
preparing larger datasets that would generate medium-sized files of approx. 
130MB, executing the write-to-GCS step in the Dataflow pipeline fails 
consistently with the following kinds of errors:

Error message from worker: process bundle failed for instruction 
process_bundle-3 using plan process-bundle-descriptor-47 : while executing 
Process for Plan[process-bundle-descriptor-47]:
2: ParDo[textio.writeFileFn] Out:[]
1: DataSource[S[ptransform-46@localhost:12371], 0] 
Coder:W;coder-63>!GWC
 Out:2
 caused by:
source failed
 caused by:
rpc error: code = ResourceExhausted desc = grpc: received message larger than 
max (104858536 vs. 52428800)
&
  "jsonPayload": {
"worker": "go-job-1-1599375664191334-09060001-zizk-harness-ptm6",
"job": "2020-09-06_00_01_25-2750019264422346896",
"work": "process_bundle-1",
"message": "DataChannel.read localhost:12371 bad: rpc error: code = 
ResourceExhausted desc = grpc: received message larger than max (104938332 vs. 
52428800)",
"logger": 
/vendor/github.com/apache/beam/sdks/go/pkg/beam/core/runtime/harness/datamgr.go:261",
"portability_worker_id": "1"
}

When executing the pipeline locally without GCS integration, i.e. writing to 
example.txt instead of gs://example.txt, no issues are apparent and files of 
arbitrary size can be generated. Running the pipeline locally with GCS 
integration also seems to work with files of any size, though the upload can be 
quite slow. It is only when executing the pipeline on Dataflow that writing the 
files to GCS fails. It seems odd that we would encounter issues uploading files 
of relatively medium size to GCS with such a simple pipeline. The errors seem 
to indicate that something is receiving a message larger than 52MB (52428800 
bytes), but I am unable to determine what, and where this magic number is 
coming from.

I would appreciate any insight into this issue with our pipeline's integration 
with GCS. Has anyone seen this before?

Regards,
Patrick Mitchell
"This e-mail and any attachments to it (the "Communication") is, unless 
otherwise stated, confidential, may contain copyright material and is for the 
use only of the intended recipient. If you receive the Communication in error, 
please notify the sender immediately by return e-mail, delete the Communication 
and the return e-mail, and do not read, copy, retransmit or otherwise deal with 
it. Any views expressed in the Communication are those of the individual sender 
only, unless expressly stated to be those of Australia and New Zealand Banking 
Group Limited ABN 11 005 357 522, or any of its related entities including ANZ 
Bank New Zealand Limited (together "ANZ"). ANZ does not accept liability in 
connection with the integrity of or errors in the Communication, computer 
virus, data corruption, interference or delay arising from or in respect of the 
Communication."


Re: [VOTE] Release 2.24.0, release candidate #3

2020-09-10 Thread Thomas Weise
+1 (binding)

Rebased my fork and ran internal performance tests.

While doing so, I ran into the unit test issue below with the fn_runner
(Python direct runner), which did not occur with 2.21 [1]. The fact that
processing-time timers are not supported wasn't an issue previously, because
the timer, though declared, wasn't exercised in the unit test.

Is there a plan/JIRA to support processing-time timers with the direct
runner?

Thanks,
Thomas


[1]
https://gist.github.com/tweise/6f8ca6341711f579b0ed9943b8f25138#file-synthetic_stateful-py-L250
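
For context, a minimal sketch of the pattern that now trips this check
(assuming Beam's Python userstate API; names here are illustrative):
declaring a processing-time timer in a stateful DoFn is enough, even if it
never fires, because the timer spec lands in the pipeline's ParDo payload.

import time

import apache_beam as beam
from apache_beam.transforms.timeutil import TimeDomain
from apache_beam.transforms.userstate import TimerSpec, on_timer

class FlushFn(beam.DoFn):
  # REAL_TIME == processing time; the fn_runner only accepts EVENT_TIME.
  FLUSH_TIMER = TimerSpec('flush', TimeDomain.REAL_TIME)

  def process(self, element, timer=beam.DoFn.TimerParam(FLUSH_TIMER)):
    timer.set(time.time() + 10)  # fire ~10s later in processing time
    yield element

  @on_timer(FLUSH_TIMER)
  def flush(self):
    pass

# Applied to a keyed PCollection, e.g.:
#   p | beam.Create([('key', 1)]) | beam.ParDo(FlushFn())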

/code/venvs/venv/lib/python3.6/site-packages/apache_beam/pipeline.py:555:
in __exit__
self.result = self.run()
/code/venvs/venv/lib/python3.6/site-packages/apache_beam/pipeline.py:521: in run
allow_proto_holders=True).run(False)
/code/venvs/venv/lib/python3.6/site-packages/apache_beam/pipeline.py:534: in run
return self.runner.run_pipeline(self, self._options)
/code/venvs/venv/lib/python3.6/site-packages/apache_beam/runners/direct/direct_runner.py:119:
in run_pipeline
return runner.run_pipeline(pipeline, options)
/code/venvs/venv/lib/python3.6/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py:176:
in run_pipeline
pipeline.to_runner_api(default_environment=self._default_environment))
/code/venvs/venv/lib/python3.6/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py:182:
in run_via_runner_api
self._check_requirements(pipeline_proto)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = 
pipeline_proto = components {
  transforms {
key: "ref_AppliedPTransform_AppliedPTransform_1"
value {
  subtransforms: "ref...}
}
root_transform_ids: "ref_AppliedPTransform_AppliedPTransform_1"
requirements: "beam:requirement:pardo:stateful:v1"


def _check_requirements(self, pipeline_proto):
  """Check that this runner can satisfy all pipeline requirements."""
  supported_requirements = set(self.supported_requirements())
  for requirement in pipeline_proto.requirements:
if requirement not in supported_requirements:
  raise ValueError(
  'Unable to run pipeline with requirement: %s' % requirement)
  for transform in pipeline_proto.components.transforms.values():
if transform.spec.urn == common_urns.primitives.TEST_STREAM.urn:
  raise NotImplementedError(transform.spec.urn)
elif transform.spec.urn in translations.PAR_DO_URNS:
  payload = proto_utils.parse_Bytes(
  transform.spec.payload, beam_runner_api_pb2.ParDoPayload)
  for timer in payload.timer_family_specs.values():
if timer.time_domain != beam_runner_api_pb2.TimeDomain.EVENT_TIME:
> raise NotImplementedError(timer.time_domain)
E NotImplementedError: 2

/code/venvs/venv/lib/python3.6/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py:283:
NotImplementedError



On Thu, Sep 10, 2020 at 4:41 PM Robert Bradshaw  wrote:

> Given the additional information, I am upgrading my vote to +1 (binding)
> based on my prior analysis.
>
> On Thu, Sep 10, 2020 at 4:14 PM Kyle Weaver  wrote:
>
>> +1 (non-binding)
>>
>> Validated wordcount with Python 3.7.8 and Flink 1.10.0 (both loopback and
>> using the Docker image). Also Python 3.7.8 loopback with an embedded Spark
>> cluster.
>>
>> On Thu, Sep 10, 2020 at 2:32 PM Daniel Oliveira 
>> wrote:
>>
>>> By the way, most of the validation so far has covered Direct runner and
>>> Dataflow, but Flink and Spark still have little validation, so if anyone
>>> can help with those it will help speed up the release.
>>>
>>> On Thu, Sep 10, 2020 at 2:12 PM Daniel Oliveira 
>>> wrote:
>>>
 So I tracked the --temp_location issue down to
 https://github.com/apache/beam/pull/12203 and asked @Pablo Estrada
  and @Chamikara Jayalath  about
 it. It's not exactly a bug, but an intended change in requirements for
 WriteToBigQuery, so the only fix I'll need to do is update the test script
 with the appropriate flag, which should be easy. It also won't require
 building a new release candidate.

 There is a possibility that user pipelines will break if they're using
 BigQuery with the Python Direct Runner, so I'll add a note to the changelog
 about it, but I don't think the change is significant enough to need
 anything beyond that.

 On Thu, Sep 10, 2020 at 1:47 PM Chamikara Jayalath <
 chamik...@google.com> wrote:

> +1 (non-binding)
>
> Thanks,
> Cham
>
> On Thu, Sep 10, 2020 at 11:26 AM Ahmet Altay  wrote:
>
>> +1 - validated py3 quickstarts. The problem I mentioned earlier is
>> resolved.
>>
>> On Wed, Sep 9, 2020 at 7:46 PM Daniel Oliveira <
>> danolive...@google.com> wrote:
>>
>>> Good news: According to
>>> https://ci-beam.apache.org/job/beam_PostRelease_Python_Candidate/188/consoleFull
>>>  the
>>> Streaming Wordcount 

Re: [ANNOUNCE] New committer: Reza Ardeshir Rokni

2020-09-10 Thread Robert Bradshaw
Thank you and welcome, Reza!

On Thu, Sep 10, 2020 at 4:00 PM Ahmet Altay  wrote:

> Congratulations Reza! And thank you for your contributions!
>
> On Thu, Sep 10, 2020 at 3:59 PM Chamikara Jayalath 
> wrote:
>
>> Congrats Reza!
>>
>> On Thu, Sep 10, 2020 at 10:35 AM Kenneth Knowles  wrote:
>>
>>> Hi all,
>>>
>>> Please join me and the rest of the Beam PMC in welcoming a new
>>> committer: Reza Ardeshir Rokni.
>>>
>>> Reza has been part of the Beam community since 2017! Reza has
>>> spearheaded advanced Beam examples [1], blogged and presented at multiple
>>> Beam Summits. Reza helps out users on the mailing lists [2] and
>>> StackOverflow [3]. When Reza's work uncovers a missing feature in Beam, he
>>> adds it [4]. Considering these contributions, the Beam PMC trusts Reza with
>>> the responsibilities of a Beam committer [5].
>>>
>>> Thank you, Reza, for your contributions.
>>>
>>> Kenn
>>>
>>> [1] https://github.com/apache/beam/pull/3961
>>> [2]
>>> https://lists.apache.org/list.html?u...@beam.apache.org:gte=0d:reza%20rokni
>>> [3] https://stackoverflow.com/tags/apache-beam/topusers
>>> [4] https://github.com/apache/beam/pull/11929
>>> [5]
>>> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>>>
>>


Re: [VOTE] Release 2.24.0, release candidate #3

2020-09-10 Thread Robert Bradshaw
Given the additional information, I am upgrading my vote to +1 (binding)
based on my prior analysis.

On Thu, Sep 10, 2020 at 4:14 PM Kyle Weaver  wrote:

> +1 (non-binding)
>
> Validated wordcount with Python 3.7.8 and Flink 1.10.0 (both loopback and
> using the Docker image). Also Python 3.7.8 loopback with an embedded Spark
> cluster.
>
> On Thu, Sep 10, 2020 at 2:32 PM Daniel Oliveira 
> wrote:
>
>> By the way, most of the validation so far has covered Direct runner and
>> Dataflow, but Flink and Spark still have little validation, so if anyone
>> can help with those it will help speed up the release.
>>
>> On Thu, Sep 10, 2020 at 2:12 PM Daniel Oliveira 
>> wrote:
>>
>>> So I tracked the --temp_location issue down to
>>> https://github.com/apache/beam/pull/12203 and asked @Pablo Estrada
>>>  and @Chamikara Jayalath  about
>>> it. It's not exactly a bug, but an intended change in requirements for
>>> WriteToBigQuery, so the only fix I'll need to do is update the test script
>>> with the appropriate flag, which should be easy. It also won't require
>>> building a new release candidate.
>>>
>>> There is a possibility that user pipelines will break if they're using
>>> BigQuery with the Python Direct Runner, so I'll add a note to the changelog
>>> about it, but I don't think the change is significant enough to need
>>> anything beyond that.
>>>
>>> On Thu, Sep 10, 2020 at 1:47 PM Chamikara Jayalath 
>>> wrote:
>>>
 +1 (non-binding)

 Thanks,
 Cham

 On Thu, Sep 10, 2020 at 11:26 AM Ahmet Altay  wrote:

> +1 - validated py3 quickstarts. The problem I mentioned earlier is
> resolved.
>
> On Wed, Sep 9, 2020 at 7:46 PM Daniel Oliveira 
> wrote:
>
>> Good news: According to
>> https://ci-beam.apache.org/job/beam_PostRelease_Python_Candidate/188/consoleFull
>>  the
>> Streaming Wordcount quickstart works for Dataflow with Python 2.7. So it
>> looks like the container issue might be fixed.
>>
>> Bad news: That same Jenkins job failed on "Running HourlyTeamScore
>> example with DirectRunner" because it's missing a --temp_location flag,
>> despite using the DirectRunner. This looks like a bug, but I'm still
>> investigating whether it'll need another cherry-pick and RC to fix or if
>> the validation script just needs to be updated. I'll update the thread 
>> if I
>> find anything.
>>
>
> Probably it does not require a cherry-pick. We have not validated that
> workflow in the past few releases.
>
>
>>
>> On Wed, Sep 9, 2020 at 4:58 PM Daniel Oliveira <
>> danolive...@google.com> wrote:
>>
>>> The Dataflow Python Batch worker issue should be fixed now. I tried
>>> verifying it myself via the rc validation script, but I've been having 
>>> some
>>> trouble with the GCP authentication so if someone else can validate it,
>>> that would be a big help.
>>>
>>> On Tue, Sep 8, 2020 at 5:51 PM Robert Bradshaw 
>>> wrote:
>>>
 I verified the signatures and all the artifacts are correct, and
 tested a wheel in a fresh virtual environment. It'd be good to see the
 Dataflow issue confirmed as fixed though.

 On Tue, Sep 8, 2020 at 5:17 PM Valentyn Tymofieiev <
 valen...@google.com> wrote:

> This error comes from the Dataflow Python Batch worker.
>
> Streaming workflows use sdk worker, which is provided by
> apache-beam library, so the versions will match.
>
> The error should be fixed by setting the correct Dataflow worker
> version in Dataflow containers, and does not affect Beam RC.
>
> On Tue, Sep 8, 2020 at 4:52 PM Ahmet Altay 
> wrote:
>
>> -1 - I validated py3 quickstarts on dataflow and direct runner. I
>> ran into 1 issue with batch workflows on dataflow:
>>
>> "RuntimeError: Beam SDK base version 2.24.0 does not match
>> Dataflow Python worker version 2.24.0.dev. Please check Dataflow
>> worker startup logs and make sure that correct version of Beam SDK is
>> installed."
>>
>> It seems like the batch worker needs to be rebuilt. Not sure why
>> the streaming worker did not fail (does it have the correct version? 
>> or
>> does it not have the same check?)
>>
>> Ahmet
>>
>> On Fri, Sep 4, 2020 at 1:33 PM Valentyn Tymofieiev <
>> valen...@google.com> wrote:
>>
>>> Dataflow containers are also available now.
>>>
>>> On Thu, Sep 3, 2020 at 11:47 PM Daniel Oliveira <
>>> danolive...@google.com> wrote:
>>>
 This should fix the BigQueryIO regression that Pablo caught.

 As before, Dataflow containers are not yet ready. I or someone
 else will chime in on the thread once it's 

Re: [VOTE] Release 2.24.0, release candidate #3

2020-09-10 Thread Kyle Weaver
+1 (non-binding)

Validated wordcount with Python 3.7.8 and Flink 1.10.0 (both loopback and
using the Docker image). Also Python 3.7.8 loopback with an embedded Spark
cluster.

On Thu, Sep 10, 2020 at 2:32 PM Daniel Oliveira 
wrote:

> By the way, most of the validation so far has covered Direct runner and
> Dataflow, but Flink and Spark still have little validation, so if anyone
> can help with those it will help speed up the release.
>
> On Thu, Sep 10, 2020 at 2:12 PM Daniel Oliveira 
> wrote:
>
>> So I tracked the --temp_location issue down to
>> https://github.com/apache/beam/pull/12203 and asked @Pablo Estrada
>>  and @Chamikara Jayalath  about
>> it. It's not exactly a bug, but an intended change in requirements for
>> WriteToBigQuery, so the only fix I'll need to do is update the test script
>> with the appropriate flag, which should be easy. It also won't require
>> building a new release candidate.
>>
>> There is a possibility that user pipelines will break if they're using
>> BigQuery with the Python Direct Runner, so I'll add a note to the changelog
>> about it, but I don't think the change is significant enough to need
>> anything beyond that.
>>
>> On Thu, Sep 10, 2020 at 1:47 PM Chamikara Jayalath 
>> wrote:
>>
>>> +1 (non-binding)
>>>
>>> Thanks,
>>> Cham
>>>
>>> On Thu, Sep 10, 2020 at 11:26 AM Ahmet Altay  wrote:
>>>
 +1 - validated py3 quickstarts. The problem I mentioned earlier is
 resolved.

 On Wed, Sep 9, 2020 at 7:46 PM Daniel Oliveira 
 wrote:

> Good news: According to
> https://ci-beam.apache.org/job/beam_PostRelease_Python_Candidate/188/consoleFull
>  the
> Streaming Wordcount quickstart works for Dataflow with Python 2.7. So it
> looks like the container issue might be fixed.
>
> Bad news: That same Jenkins job failed on "Running HourlyTeamScore
> example with DirectRunner" because it's missing a --temp_location flag,
> despite using the DirectRunner. This looks like a bug, but I'm still
> investigating whether it'll need another cherry-pick and RC to fix or if
> the validation script just needs to be updated. I'll update the thread if 
> I
> find anything.
>

 Probably it does not require a cherry-pick. We have not validated that
 workflow in the past few releases.


>
> On Wed, Sep 9, 2020 at 4:58 PM Daniel Oliveira 
> wrote:
>
>> The Dataflow Python Batch worker issue should be fixed now. I tried
>> verifying it myself via the rc validation script, but I've been having 
>> some
>> trouble with the GCP authentication so if someone else can validate it,
>> that would be a big help.
>>
>> On Tue, Sep 8, 2020 at 5:51 PM Robert Bradshaw 
>> wrote:
>>
>>> I verified the signatures and all the artifacts are correct, and
>>> tested a wheel in a fresh virtual environment. It'd be good to see the
>>> Dataflow issue confirmed as fixed though.
>>>
>>> On Tue, Sep 8, 2020 at 5:17 PM Valentyn Tymofieiev <
>>> valen...@google.com> wrote:
>>>
 This error comes from the Dataflow Python Batch worker.

 Streaming workflows use sdk worker, which is provided by
 apache-beam library, so the versions will match.

 The error should be fixed by setting the correct Dataflow worker
 version in Dataflow containers, and does not affect Beam RC.

 On Tue, Sep 8, 2020 at 4:52 PM Ahmet Altay 
 wrote:

> -1 - I validated py3 quickstarts on dataflow and direct runner. I
> ran into 1 issue with batch workflows on dataflow:
>
> "RuntimeError: Beam SDK base version 2.24.0 does not match
> Dataflow Python worker version 2.24.0.dev. Please check Dataflow
> worker startup logs and make sure that correct version of Beam SDK is
> installed."
>
> It seems like the batch worker needs to be rebuilt. Not sure why
> the streaming worker did not fail (does it have the correct version? 
> or
> does it not have the same check?)
>
> Ahmet
>
> On Fri, Sep 4, 2020 at 1:33 PM Valentyn Tymofieiev <
> valen...@google.com> wrote:
>
>> Dataflow containers are also available now.
>>
>> On Thu, Sep 3, 2020 at 11:47 PM Daniel Oliveira <
>> danolive...@google.com> wrote:
>>
>>> This should fix the BigQueryIO regression that Pablo caught.
>>>
>>> As before, Dataflow containers are not yet ready. I or someone
>>> else will chime in on the thread once it's ready.
>>>
>>> On Thu, Sep 3, 2020 at 11:39 PM Daniel Oliveira <
>>> danolive...@google.com> wrote:
>>>
 Hi everyone,
 Please review and vote on the release candidate #3 for the
 version 2.24.0, as follows:
 [ ] +1, Approve the 

Support for file version/generation in GcsPath

2020-09-10 Thread Cody Herring
Hello,

I added an issue and PR for adding support for file generation to GcsPath:

https://issues.apache.org/jira/browse/BEAM-10867
https://github.com/apache/beam/pull/12805

This is already supported by GoogleCloudStorageReadChannel.java (I added that 
support in 2019), so I think we should support it here.
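
For anyone unfamiliar with generations: a generation number pins an exact
version of a GCS object. As a rough illustration (Python google-cloud-storage
client, hypothetical bucket/object names; the PR itself is about the Java
GcsPath):

from google.cloud import storage

client = storage.Client()
# Read one specific historical version of the object, not the live one.
blob = client.bucket("my-bucket").blob("data.txt", generation=1234567890)
data = blob.download_as_bytes()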


Re: [DISCUSS] Move Avro dependency out of core Beam

2020-09-10 Thread Brian Hulette
On Tue, Sep 8, 2020 at 9:18 AM Robert Bradshaw  wrote:

> IIRC Dataflow (and perhaps others) implicitly depend on Avro to write
> out intermediate files (e.g. for non-shuffle Fusion breaks). Would
> this break if we just removed it?
>

I think Dataflow would just need to declare a dependency on the new
extension.


>
> On Thu, Sep 3, 2020 at 10:51 PM Reuven Lax  wrote:
> >
> > As for 2, maybe it's time to remove @Experimental from SchemaCoder?
> >
>

Probably worth a separate thread about dropping `@Experimental` on
SchemaCoder. I'd be OK with that; the only breaking change I have in mind
is that I think we should deprecate and remove the DATETIME primitive type,
replacing it with a logical type.


> > 1 is tricky though. Changes like this have caused a lot of trouble for
> users in the past, and I think some users still have unpleasant memories of
> being told "you just have to change some package names and imports."
> >
>

We could mitigate this by first adding the new extension module and
deprecating the core Beam counterpart for a release (or multiple releases).


> > On Thu, Sep 3, 2020 at 6:18 PM Brian Hulette 
> wrote:
> >>
> >> Hi everyone,
> >> The fact that core Beam has a dependency on Avro has led to a lot of
> headaches when users (or runners) are using a different version. zeidoo [1]
> was generous enough to put up a WIP PR [2] that moves everything that
> depends on Avro (primarily AvroCoder and the Avro SchemaProvider I believe)
> out of core Beam and into a separate extensions module. This way we could
> have multiple extensions for different versions of Avro in the future.
> >>
> >> As I understand it, the downsides to making this change are:
> >> 1) It's a breaking change, users with AvroCoder in their pipeline will
> need to change their build dependencies and import statements.
> >> 2) AvroCoder is the only (non-experimental) coder in core Beam that can
> encode complex user types. So new users will need to dabble with the
> Experimental SchemaCoder or add a second dependency to build a pipeline
> with their own types.
> >>
> >> I think these costs are outweighed by the benefit of removing the
> dependency in core Beam, but I wanted to reach out to the community to see
> if there are any objections.
> >>
> >> Brian
> >>
> >> [1] github.com/zeidoo
> >> [2] https://github.com/apache/beam/pull/12748
>


Re: [VOTE] Release 2.24.0, release candidate #3

2020-09-10 Thread Daniel Oliveira
By the way, most of the validation so far has covered Direct runner and
Dataflow, but Flink and Spark still have little validation, so if anyone
can help with those it will help speed up the release.
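
For anyone picking this up, one quick check is the wordcount example against
the portable Flink runner in loopback mode. A sketch (hypothetical local
paths; flags as in the portable-runner options):

from apache_beam.examples import wordcount

wordcount.run(argv=[
    '--input=/tmp/kinglear.txt',
    '--output=/tmp/counts',
    '--runner=FlinkRunner',
    '--flink_version=1.10',
    '--environment_type=LOOPBACK',
])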

On Thu, Sep 10, 2020 at 2:12 PM Daniel Oliveira 
wrote:

> So I tracked the --temp_location issue down to
> https://github.com/apache/beam/pull/12203 and asked @Pablo Estrada
>  and @Chamikara Jayalath  about
> it. It's not exactly a bug, but an intended change in requirements for
> WriteToBigQuery, so the only fix I'll need to do is update the test script
> with the appropriate flag, which should be easy. It also won't require
> building a new release candidate.
>
> There is a possibility that user pipelines will break if they're using
> BigQuery with the Python Direct Runner, so I'll add a note to the changelog
> about it, but I don't think the change is significant enough to need
> anything beyond that.
>
> On Thu, Sep 10, 2020 at 1:47 PM Chamikara Jayalath 
> wrote:
>
>> +1 (non-binding)
>>
>> Thanks,
>> Cham
>>
>> On Thu, Sep 10, 2020 at 11:26 AM Ahmet Altay  wrote:
>>
>>> +1 - validated py3 quickstarts. The problem I mentioned earlier is
>>> resolved.
>>>
>>> On Wed, Sep 9, 2020 at 7:46 PM Daniel Oliveira 
>>> wrote:
>>>
 Good news: According to
 https://ci-beam.apache.org/job/beam_PostRelease_Python_Candidate/188/consoleFull
  the
 Streaming Wordcount quickstart works for Dataflow with Python 2.7. So it
 looks like the container issue might be fixed.

 Bad news: That same Jenkins job failed on "Running HourlyTeamScore
 example with DirectRunner" because it's missing a --temp_location flag,
 despite using the DirectRunner. This looks like a bug, but I'm still
 investigating whether it'll need another cherry-pick and RC to fix or if
 the validation script just needs to be updated. I'll update the thread if I
 find anything.

>>>
>>> Probably it does not require a cherry-pick. We have not validated that
>>> workflow in the past few releases.
>>>
>>>

 On Wed, Sep 9, 2020 at 4:58 PM Daniel Oliveira 
 wrote:

> The Dataflow Python Batch worker issue should be fixed now. I tried
> verifying it myself via the rc validation script, but I've been having 
> some
> trouble with the GCP authentication so if someone else can validate it,
> that would be a big help.
>
> On Tue, Sep 8, 2020 at 5:51 PM Robert Bradshaw 
> wrote:
>
>> I verified the signatures and all the artifacts are correct, and
>> tested a wheel in a fresh virtual environment. It'd be good to see the
>> Dataflow issue confirmed as fixed though.
>>
>> On Tue, Sep 8, 2020 at 5:17 PM Valentyn Tymofieiev <
>> valen...@google.com> wrote:
>>
>>> This error comes from the Dataflow Python Batch worker.
>>>
>>> Streaming workflows use sdk worker, which is provided by apache-beam
>>> library, so the versions will match.
>>>
>>> The error should be fixed by setting the correct Dataflow worker
>>> version in Dataflow containers, and does not affect Beam RC.
>>>
>>> On Tue, Sep 8, 2020 at 4:52 PM Ahmet Altay  wrote:
>>>
 -1 - I validated py3 quickstarts on dataflow and direct runner. I
 ran into 1 issue with batch workflows on dataflow:

 "RuntimeError: Beam SDK base version 2.24.0 does not match Dataflow
 Python worker version 2.24.0.dev. Please check Dataflow worker
 startup logs and make sure that correct version of Beam SDK is 
 installed."

 It seems like the batch worker needs to be rebuilt. Not sure why
 the streaming worker did not fail (does it have the correct version? or
 does it not have the same check?)

 Ahmet

 On Fri, Sep 4, 2020 at 1:33 PM Valentyn Tymofieiev <
 valen...@google.com> wrote:

> Dataflow containers are also available now.
>
> On Thu, Sep 3, 2020 at 11:47 PM Daniel Oliveira <
> danolive...@google.com> wrote:
>
>> This should fix the BigQueryIO regression that Pablo caught.
>>
>> As before, Dataflow containers are not yet ready. I or someone
>> else will chime in on the thread once it's ready.
>>
>> On Thu, Sep 3, 2020 at 11:39 PM Daniel Oliveira <
>> danolive...@google.com> wrote:
>>
>>> Hi everyone,
>>> Please review and vote on the release candidate #3 for the
>>> version 2.24.0, as follows:
>>> [ ] +1, Approve the release
>>> [ ] -1, Do not approve the release (please provide specific
>>> comments)
>>>
>>>
>>> The complete staging area is available for your review, which
>>> includes:
>>> * JIRA release notes [1],
>>> * the official Apache source release to be deployed to
>>> dist.apache.org [2], which is signed with the 

Re: [VOTE] Release 2.24.0, release candidate #3

2020-09-10 Thread Daniel Oliveira
So I tracked the --temp_location issue down to
https://github.com/apache/beam/pull/12203 and asked @Pablo Estrada
 and @Chamikara Jayalath  about
it. It's not exactly a bug, but an intended change in requirements for
WriteToBigQuery, so the only fix I'll need to do is update the test script
with the appropriate flag, which should be easy. It also won't require
building a new release candidate.

There is a possibility that user pipelines will break if they're using
BigQuery with the Python Direct Runner, so I'll add a note to the changelog
about it, but I don't think the change is significant enough to need
anything beyond that.
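
To illustrate the new requirement (a sketch with hypothetical project,
bucket, and table names; this is not the validation script itself): a
pipeline using WriteToBigQuery now needs a GCS temp location even on the
DirectRunner.

import apache_beam as beam

with beam.Pipeline(argv=[
    '--runner=DirectRunner',
    '--project=my-project',
    '--temp_location=gs://my-bucket/tmp',  # now required for the BQ sink
]) as p:
  (p
   | beam.Create([{'word': 'beam', 'count': 1}])
   | beam.io.WriteToBigQuery(
       'my-project:my_dataset.my_table',
       schema='word:STRING,count:INTEGER'))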

On Thu, Sep 10, 2020 at 1:47 PM Chamikara Jayalath 
wrote:

> +1 (non-binding)
>
> Thanks,
> Cham
>
> On Thu, Sep 10, 2020 at 11:26 AM Ahmet Altay  wrote:
>
>> +1 - validated py3 quickstarts. The problem I mentioned earlier is
>> resolved.
>>
>> On Wed, Sep 9, 2020 at 7:46 PM Daniel Oliveira 
>> wrote:
>>
>>> Good news: According to
>>> https://ci-beam.apache.org/job/beam_PostRelease_Python_Candidate/188/consoleFull
>>>  the
>>> Streaming Wordcount quickstart works for Dataflow with Python 2.7. So it
>>> looks like the container issue might be fixed.
>>>
>>> Bad news: That same Jenkins job failed on "Running HourlyTeamScore
>>> example with DirectRunner" because it's missing a --temp_location flag,
>>> despite using the DirectRunner. This looks like a bug, but I'm still
>>> investigating whether it'll need another cherry-pick and RC to fix or if
>>> the validation script just needs to be updated. I'll update the thread if I
>>> find anything.
>>>
>>
>> Probably it does not require a cherry-pick. We have not validated that
>> workflow in the past few releases.
>>
>>
>>>
>>> On Wed, Sep 9, 2020 at 4:58 PM Daniel Oliveira 
>>> wrote:
>>>
 The Dataflow Python Batch worker issue should be fixed now. I tried
 verifying it myself via the rc validation script, but I've been having some
 trouble with the GCP authentication so if someone else can validate it,
 that would be a big help.

 On Tue, Sep 8, 2020 at 5:51 PM Robert Bradshaw 
 wrote:

> I verified the signatures and all the artifacts are correct, and
> tested a wheel in a fresh virtual environment. It'd be good to see the
> Dataflow issue confirmed as fixed though.
>
> On Tue, Sep 8, 2020 at 5:17 PM Valentyn Tymofieiev <
> valen...@google.com> wrote:
>
>> This error comes from the Dataflow Python Batch worker.
>>
>> Streaming workflows use sdk worker, which is provided by apache-beam
>> library, so the versions will match.
>>
>> The error should be fixed by setting the correct Dataflow worker
>> version in Dataflow containers, and does not affect Beam RC.
>>
>> On Tue, Sep 8, 2020 at 4:52 PM Ahmet Altay  wrote:
>>
>>> -1 - I validated py3 quickstarts on dataflow and direct runner. I
>>> ran into 1 issue with batch workflows on dataflow:
>>>
>>> "RuntimeError: Beam SDK base version 2.24.0 does not match Dataflow
>>> Python worker version 2.24.0.dev. Please check Dataflow worker
>>> startup logs and make sure that correct version of Beam SDK is 
>>> installed."
>>>
>>> It seems like the batch worker needs to be rebuilt. Not sure why the
>>> streaming worker did not fail (does it have the correct version? or 
>>> does it
>>> not have the same check?)
>>>
>>> Ahmet
>>>
>>> On Fri, Sep 4, 2020 at 1:33 PM Valentyn Tymofieiev <
>>> valen...@google.com> wrote:
>>>
 Dataflow containers are also available now.

 On Thu, Sep 3, 2020 at 11:47 PM Daniel Oliveira <
 danolive...@google.com> wrote:

> This should fix the BigQueryIO regression that Pablo caught.
>
> As before, Dataflow containers are not yet ready. I or someone
> else will chime in on the thread once it's ready.
>
> On Thu, Sep 3, 2020 at 11:39 PM Daniel Oliveira <
> danolive...@google.com> wrote:
>
>> Hi everyone,
>> Please review and vote on the release candidate #3 for the
>> version 2.24.0, as follows:
>> [ ] +1, Approve the release
>> [ ] -1, Do not approve the release (please provide specific
>> comments)
>>
>>
>> The complete staging area is available for your review, which
>> includes:
>> * JIRA release notes [1],
>> * the official Apache source release to be deployed to
>> dist.apache.org [2], which is signed with the key with
>> fingerprint D0E7B69D911ADA3C0482BAA1C4E6B2F8C71D742F [3],
>> * all artifacts to be deployed to the Maven Central Repository
>> [4],
>> * source code tag "v2.24.0-RC3" [5],
>> * website pull request listing the release [6], publishing the
>> API reference manual [7], and the blog post [8].
>> * Java artifacts were built 

Re: [VOTE] Release 2.24.0, release candidate #3

2020-09-10 Thread Chamikara Jayalath
+1 (non-binding)

Thanks,
Cham

On Thu, Sep 10, 2020 at 11:26 AM Ahmet Altay  wrote:

> +1 - validated py3 quickstarts. The problem I mentioned earlier is
> resolved.
>
> On Wed, Sep 9, 2020 at 7:46 PM Daniel Oliveira 
> wrote:
>
>> Good news: According to
>> https://ci-beam.apache.org/job/beam_PostRelease_Python_Candidate/188/consoleFull
>>  the
>> Streaming Wordcount quickstart works for Dataflow with Python 2.7. So it
>> looks like the container issue might be fixed.
>>
>> Bad news: That same Jenkins job failed on "Running HourlyTeamScore
>> example with DirectRunner" because it's missing a --temp_location flag,
>> despite using the DirectRunner. This looks like a bug, but I'm still
>> investigating whether it'll need another cherry-pick and RC to fix or if
>> the validation script just needs to be updated. I'll update the thread if I
>> find anything.
>>
>
> Probably it does not require a cherry-pick. We have not validated that
> workflow in the past few releases.
>
>
>>
>> On Wed, Sep 9, 2020 at 4:58 PM Daniel Oliveira 
>> wrote:
>>
>>> The Dataflow Python Batch worker issue should be fixed now. I tried
>>> verifying it myself via the rc validation script, but I've been having some
>>> trouble with the GCP authentication so if someone else can validate it,
>>> that would be a big help.
>>>
>>> On Tue, Sep 8, 2020 at 5:51 PM Robert Bradshaw 
>>> wrote:
>>>
 I verified the signatures and all the artifacts are correct, and tested
 a wheel in a fresh virtual environment. It'd be good to see the Dataflow
 issue confirmed as fixed though.

 On Tue, Sep 8, 2020 at 5:17 PM Valentyn Tymofieiev 
 wrote:

> This error comes from the Dataflow Python Batch worker.
>
> Streaming workflows use sdk worker, which is provided by apache-beam
> library, so the versions will match.
>
> The error should be fixed by setting the correct Dataflow worker
> version in Dataflow containers, and does not affect Beam RC.
>
> On Tue, Sep 8, 2020 at 4:52 PM Ahmet Altay  wrote:
>
>> -1 - I validated py3 quickstarts on dataflow and direct runner. I ran
>> into 1 issue with batch workflows on dataflow:
>>
>> "RuntimeError: Beam SDK base version 2.24.0 does not match Dataflow
>> Python worker version 2.24.0.dev. Please check Dataflow worker
>> startup logs and make sure that correct version of Beam SDK is 
>> installed."
>>
>> It seems like the batch worker needs to be rebuilt. Not sure why the
>> streaming worker did not fail (does it have the correct version? or does 
>> it
>> not have the same check?)
>>
>> Ahmet
>>
>> On Fri, Sep 4, 2020 at 1:33 PM Valentyn Tymofieiev <
>> valen...@google.com> wrote:
>>
>>> Dataflow containers are also available now.
>>>
>>> On Thu, Sep 3, 2020 at 11:47 PM Daniel Oliveira <
>>> danolive...@google.com> wrote:
>>>
 This should fix the BigQueryIO regression that Pablo caught.

 As before, Dataflow containers are not yet ready. I or someone else
 will chime in on the thread once it's ready.

 On Thu, Sep 3, 2020 at 11:39 PM Daniel Oliveira <
 danolive...@google.com> wrote:

> Hi everyone,
> Please review and vote on the release candidate #3 for the version
> 2.24.0, as follows:
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific
> comments)
>
>
> The complete staging area is available for your review, which
> includes:
> * JIRA release notes [1],
> * the official Apache source release to be deployed to
> dist.apache.org [2], which is signed with the key with
> fingerprint D0E7B69D911ADA3C0482BAA1C4E6B2F8C71D742F [3],
> * all artifacts to be deployed to the Maven Central Repository [4],
> * source code tag "v2.24.0-RC3" [5],
> * website pull request listing the release [6], publishing the API
> reference manual [7], and the blog post [8].
> * Java artifacts were built with Maven 3.6.3 and OpenJDK 1.8.0.
> * Python artifacts are deployed along with the source release to
> the dist.apache.org [2].
> * Validation sheet with a tab for 2.24.0 release to help with
> validation [9].
> * Docker images published to Docker Hub [10].
>
> The vote will be open for at least 72 hours. It is adopted by
> majority approval, with at least 3 PMC affirmative votes.
>
> Thanks,
> Release Manager
>
> [1]
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527=12347146
> [2] https://dist.apache.org/repos/dist/dev/beam/2.24.0/
> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> [4]
> 

Re: Ability to link to "latest" of python docs

2020-09-10 Thread Lynn Root
Hey Brian - 100% this is what I was looking for! It works with intersphinx
mapping just fine (looks like it handles redirects). Thanks so much!
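
For anyone else wiring this up, a minimal conf.py sketch (the 'beam' key is
an arbitrary local name):

# conf.py
extensions = ['sphinx.ext.intersphinx']

intersphinx_mapping = {
    # Sphinx follows the redirect to the most recent released pydoc.
    'beam': ('https://beam.apache.org/releases/pydoc/current/', None),
}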

On Thu, Sep 10, 2020 at 1:50 PM Brian Hulette  wrote:

> There's https://beam.apache.org/releases/pydoc/current (and
> https://beam.apache.org/releases/javadoc/current) which redirect to the
> most recent release. These get updated somewhere in the release process.
>
> They're not very discoverable since it's a redirect and the URL changes
> when you click on it, but maybe that can still work for intersphinx mapping?
>
> Brian
>
> On Tue, Sep 8, 2020 at 7:54 PM Austin Bennett 
> wrote:
>
>> +dev 
>>
>> Lynn,
>>
>> Seems totally doable.  If others don't speak up with a good way to do
>> this (or in opposition), I'm sure we can sort something out to accomplish
>> this (will dig into intersphinx mapping tomorrow).
>>
>> Cheers,
>> Austin
>>
>>
>>
>>
>> On Tue, Sep 8, 2020, 5:19 PM Lynn Root  wrote:
>>
>>> Hey folks -
>>>
>>> I'm wondering if there's a way to link to the latest SDK version of the
>>> Python documentation. I see that if I go here, it lists all the available
>>> documented SDK versions. But it'd be really nice to go to a link like
>>> "https://beam.apache.org/releases/pydoc/latest" and be automatically
>>> pointed to the latest one. This is particularly handy for documenting
>>> libraries that use beam via intersphinx mapping.
>>>
>>> Thanks!
>>>
>>> --
>>> Lynn Root
>>> Staff Engineer, Spotify
>>>
>>

-- 
Lynn Root
Staff Engineer, DI NYC | FOSS Board


Re: Adding Twister2 documentation

2020-09-10 Thread Pulasthi Supun Wickramasinghe
@Ismaël Mejía 

Would you have some cycles to validate these changes?

Best Regards,
Pulasthi

On Mon, Aug 31, 2020 at 10:49 PM Pulasthi Supun Wickramasinghe <
pulasthi...@gmail.com> wrote:

> Hi Devs.
>
> I created two PRs to add the missing Twister2 runner documentation [1]
> [2]; however, I would need some guidance on how to fill out the
> compatibility matrix.
>
>
>1. Is there some mapping between test classes and entries in the
>compatibility matrix that I can use to cross-check and tick off boxes in
>the matrix?
>2. Since the Twister2 runner only supports batch mode at the moment,
>how would I reflect that in the compatibility matrix?
>
> [1] https://github.com/apache/beam/pull/12731
> [2] https://github.com/apache/beam/pull/12744
>
> Best Regards,
> Pulasthi
> --
> Pulasthi S. Wickramasinghe
> PhD Candidate  | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
> cell: 224-386-9035
>


-- 
Pulasthi S. Wickramasinghe
PhD Candidate  | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington
cell: 224-386-9035


[ANNOUNCE] New committer: Reza Ardeshir Rokni

2020-09-10 Thread Kenneth Knowles
Hi all,

Please join me and the rest of the Beam PMC in welcoming a new committer:
Reza Ardeshir Rokni.

Reza has been part of the Beam community since 2017! Reza has spearheaded
advanced Beam examples [1], blogged and presented at multiple Beam Summits.
Reza helps out users on the mailing lists [2] and StackOverflow [3]. When
Reza's work uncovers a missing feature in Beam, he adds it [4]. Considering
these contributions, the Beam PMC trusts Reza with the responsibilities of
a Beam committer [5].

Thank you, Reza, for your contributions.

Kenn

[1] https://github.com/apache/beam/pull/3961
[2]
https://lists.apache.org/list.html?u...@beam.apache.org:gte=0d:reza%20rokni
[3] https://stackoverflow.com/tags/apache-beam/topusers
[4] https://github.com/apache/beam/pull/11929
[5]
https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer


Notification of analysis on publicly available project data

2020-09-10 Thread Griselda Cuevas
Dear PMC,


I’m contacting you because your project has been selected by the ASF D&I
committee, which is leading a research project to evaluate and understand
the current state of diversity in our community [1]. As part of this
research, we will analyze publicly available data about your project such
as Git logs, Jira boards and mailing lists, to better understand the state
of diversity in Apache projects and to complement the findings we obtained
from the Community Survey that was run this year [2].


This analysis will be performed by Bitergia [3], a vendor specializing in
researching open source projects and foundations. The results will be
published in a report similar to the OpenStack Foundation Analysis
published in 2018 [4].


The analysis will be done only on aggregated data at the project level during
and after processing, ensuring we do not report anything that could
identify a single individual. The data we analyze will be deleted right
after the research is done and won’t be retained by either the researcher
or the ASF.


If you have any concerns or questions, please raise them to the diversity
committee (d...@diversity.apache.org) and/or to the data privacy committee (
priv...@apache.org).


Regards,

Griselda Cuevas

V.P. of Diversity and Inclusion

Apache Software Foundation


[1]
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=127405614

[2] https://youtu.be/4Mr1CRtKqUI

[3] https://bitergia.com/bitergia-analytics/

[4] https://superuser.openstack.org/articles/2018-gender-diversity-report/


Re: Suggestion to let KafkaIO support the deserializer API with headers

2020-09-10 Thread Lourens Naude
Thanks Luke,

I took a stab at this in https://issues.apache.org/jira/browse/BEAM-10865,
where I also outlined extensively the avenues explored with the range of
Kafka APIs to support, and how record headers and then eventually the
(Extended)Deserializer APIs evolved. Tested with kafka-clients 1.0.0
through latest.

Best,
Lourens

On Fri, Aug 21, 2020 at 5:06 PM Luke Cwik  wrote:

> Sounds good.
>
> Note that you'll also want to update ReadFromKafkaDoFn[1] and provide
> tests that cover both to make sure we don't regress and stop providing
> headers.
>
> 1:
> https://github.com/apache/beam/blob/cfa448d121297398312d09c531258a72b413488b/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/ReadFromKafkaDoFn.java#L309
>
> On Fri, Aug 21, 2020 at 8:29 AM Lourens Naude 
> wrote:
>
>> Hi everyone,
>>
>> We bumped into an API issue with the deserializer called when constructing
>> KafkaRecord instances in the KafkaIO module.
>>
>> I wanted to float this past the mailing list for discussion first before
>> exploring further.
>>
>> The callsite referenced: KafkaIO only calls the deserializer with the
>> simplified API that does not include Kafka record headers (even though they
>> are available to pass as an argument):
>> https://github.com/apache/beam/blob/release-2.20.0/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaUnboundedReader.java#L202-L203
>>
>> Our SerDes implementation relies on Kafka headers support, which was added
>> to Kafka records via a KIP as a means to include metadata cleanly and not
>> abuse keys or values for such purposes.
>>
>> It is also a valid Deserializer API as per the official Kafka interface:
>>
>> * https://github.com/apache/kafka/blob/35a0de32ee3823dfb548a1cd5d5faf4f7c99e4e0/clients/src/main/java/org/apache/kafka/common/serialization/Deserializer.java#L59-L61
>> * It delegates to the simplified version as its default implementation
>> (which requires a formal implementation) in
>> https://github.com/apache/kafka/blob/35a0de32ee3823dfb548a1cd5d5faf4f7c99e4e0/clients/src/main/java/org/apache/kafka/common/serialization/Deserializer.java#L60
>> * The default behaviour is thus backwards compatible, with a preference
>> for the header-specific API.
>>
>> We've used the custom SerDes without issues in a complex Connect and
>> Streams pipeline, but bumped into this API divergence: KafkaIO does not
>> prefer the header-aware deserializer API as the primary deserialization
>> mechanism.
>>
>> The same API is used elsewhere:
>>
>> * It's the default for the stock Java consumer:
>> https://github.com/apache/kafka/blob/4cd2396db31418c90005c998d9107ad40df055b2/clients/src/main/java/org/apache/kafka/clients/consumer/internals/Fetcher.java#L1362
>> (the header-enabled variant calls the simplified API)
>> * Ditto Kafka Connect:
>> https://github.com/apache/kafka/blob/b399a731a39c28bdd89998edc7c9fd732c56eee1/connect/api/src/main/java/org/apache/kafka/connect/storage/Converter.java#L48-L64
>> * And Kafka Streams:
>> https://github.com/apache/kafka/blob/92828d53b18703000159f4dd7dc8b3170667db25/streams/src/main/java/org/apache/kafka/streams/processor/internals/RecordDeserializer.java#L65-L66
>>
>> Any thoughts on the proposed change with the additional headers argument
>> passed on deserialization?
>>
>> Best,
>> Lourens
>>
>


Re: [ANNOUNCE] New committer: Heejong Lee

2020-09-10 Thread Andrew Psaltis
Congratulations!

On Thu, Sep 10, 2020 at 4:41 PM Reza Rokni  wrote:

> Congratulations!
>
> On Thu, Sep 10, 2020 at 12:20 AM Alexey Romanenko <
> aromanenko@gmail.com> wrote:
>
>> Congratulations!
>>
>> On 9 Sep 2020, at 09:50, Jan Lukavský  wrote:
>>
>> Congratulations!
>> On 9/9/20 1:00 AM, Chamikara Jayalath wrote:
>>
>> Congrats Heejong!
>>
>> On Tue, Sep 8, 2020 at 1:55 PM Yichi Zhang  wrote:
>>
>>> Congratulations Heejong!
>>>
>>> On Tue, Sep 8, 2020 at 1:42 PM Ankur Goenka  wrote:
>>>
 Congratulations Heejong!

 On Tue, Sep 8, 2020 at 1:40 PM Udi Meiri  wrote:

> Congrats Heejong!
>
> On Tue, Sep 8, 2020 at 12:33 PM Tyson Hamilton 
> wrote:
>
>> Congratulations!
>>
>> On Tue, Sep 8, 2020, 12:10 PM Robert Bradshaw 
>> wrote:
>>
>>> Congratulations, Heejong!
>>>
>>> On Tue, Sep 8, 2020 at 11:41 AM Rui Wang  wrote:
>>>
 Congrats, Heejong!


 -Rui

 On Tue, Sep 8, 2020 at 11:26 AM Robin Qiu 
 wrote:

> Congrats, Heejong!
>
> On Tue, Sep 8, 2020 at 11:23 AM Valentyn Tymofieiev <
> valen...@google.com> wrote:
>
>> Congratulations, Heejong!
>>
>> On Tue, Sep 8, 2020 at 11:14 AM Ahmet Altay 
>> wrote:
>>
>>> Hi everyone,
>>>
>>> Please join me and the rest of the Beam PMC in welcoming
>>> a new committer: Heejong Lee .
>>>
>>> Heejong has been active in the community for more than 2 years,
>>> worked on various IOs (parquet, kafka, file, pubsub) and most 
>>> recently
>>> worked on adding the cross-language transforms feature to Beam [1].
>>>
>>> In consideration of his contributions, the Beam PMC trusts him
>>> with the responsibilities of a Beam committer [2].
>>>
>>> Thank you for your contributions Heejong!
>>>
>>> -Ahmet, on behalf of the Apache Beam PMC
>>>
>>> [1]
>>> https://issues.apache.org/jira/browse/BEAM-10634?jql=project%20%3D%20BEAM%20AND%20assignee%20in%20(heejong)%20ORDER%20BY%20resolved%20DESC%2C%20affectedVersion%20ASC%2C%20priority%20DESC%2C%20updated%20DESC
>>> [2]
>>> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>>>
>>
>>


Re: [ANNOUNCE] New committer: Heejong Lee

2020-09-10 Thread Reza Rokni
Congratulations!

On Thu, Sep 10, 2020 at 12:20 AM Alexey Romanenko 
wrote:

> Congratulations!
>
> On 9 Sep 2020, at 09:50, Jan Lukavský  wrote:
>
> Congratulations!
> On 9/9/20 1:00 AM, Chamikara Jayalath wrote:
>
> Congrats Heejong!
>
> On Tue, Sep 8, 2020 at 1:55 PM Yichi Zhang  wrote:
>
>> Congratulations Heejong!
>>
>> On Tue, Sep 8, 2020 at 1:42 PM Ankur Goenka  wrote:
>>
>>> Congratulations Heejong!
>>>
>>> On Tue, Sep 8, 2020 at 1:40 PM Udi Meiri  wrote:
>>>
 Congrats Heejong!

 On Tue, Sep 8, 2020 at 12:33 PM Tyson Hamilton 
 wrote:

> Congratulations!
>
> On Tue, Sep 8, 2020, 12:10 PM Robert Bradshaw 
> wrote:
>
>> Congratulations, Heejong!
>>
>> On Tue, Sep 8, 2020 at 11:41 AM Rui Wang  wrote:
>>
>>> Congrats, Heejong!
>>>
>>>
>>> -Rui
>>>
>>> On Tue, Sep 8, 2020 at 11:26 AM Robin Qiu 
>>> wrote:
>>>
 Congrats, Heejong!

 On Tue, Sep 8, 2020 at 11:23 AM Valentyn Tymofieiev <
 valen...@google.com> wrote:

> Congratulations, Heejong!
>
> On Tue, Sep 8, 2020 at 11:14 AM Ahmet Altay 
> wrote:
>
>> Hi everyone,
>>
>> Please join me and the rest of the Beam PMC in welcoming
>> a new committer: Heejong Lee .
>>
>> Heejong has been active in the community for more than 2 years,
>> worked on various IOs (parquet, kafka, file, pubsub) and most 
>> recently
>> worked on adding the cross-language transforms feature to Beam [1].
>>
>> In consideration of his contributions, the Beam PMC trusts him
>> with the responsibilities of a Beam committer [2].
>>
>> Thank you for your contributions Heejong!
>>
>> -Ahmet, on behalf of the Apache Beam PMC
>>
>> [1]
>> https://issues.apache.org/jira/browse/BEAM-10634?jql=project%20%3D%20BEAM%20AND%20assignee%20in%20(heejong)%20ORDER%20BY%20resolved%20DESC%2C%20affectedVersion%20ASC%2C%20priority%20DESC%2C%20updated%20DESC
>> [2]
>> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>>
>
>


Re: [PROPOSAL] Preparing for Beam 2.25.0 release

2020-09-10 Thread Reza Rokni
Thanx Robin!

On Thu, Sep 10, 2020 at 2:30 AM Ahmet Altay  wrote:

> Thank you Robin!
>
> On Wed, Sep 9, 2020 at 10:23 AM Rui Wang  wrote:
>
>> Thanks Robin for working on this!
>>
>>
>> -Rui
>>
>> On Wed, Sep 9, 2020 at 10:11 AM Robin Qiu  wrote:
>>
>>> Hello everyone,
>>>
>>> The next Beam release (2.25.0) is scheduled to be cut on September 23
>>> according to the release calendar [1].
>>>
>>> I'd like to volunteer myself to handle this release. I plan on cutting
>>> the branch on that date and cherry-picking in release-blocking fixes
>>> afterwards. So unresolved release blocking JIRA issues should have
>>> their "Fix Version/s" marked as "2.25.0".
>>>
>>> Any comments or objections?
>>>
>>> Thanks,
>>> Robin Qiu
>>>
>>> [1]
>>> https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com
>>>
>>