Re: [PROPOSAL] Preparing for Beam 2.23.0 release

2020-06-23 Thread Valentyn Tymofieiev
Friendly reminder that the release cut is slated next week.

If you are aware of *release-blocking* issues, please open a JIRA and set
the "Fix version" to be 2.23.0.

Please do not set "Fix version" for open non-blocking issues; instead, set
"Fix version" once the issue is actually resolved.

Thanks,
Valentyn

On Mon, Jun 15, 2020 at 1:14 PM Rui Wang  wrote:

> Thank you Valentyn!
>
> On Mon, Jun 15, 2020 at 1:08 PM Ahmet Altay  wrote:
>
>> Thank you Valentyn!
>>
>> On Mon, Jun 15, 2020 at 12:46 PM Ankur Goenka  wrote:
>>
>>> Thanks Valentyn!
>>>
>>> On Mon, Jun 15, 2020 at 12:41 PM Kyle Weaver 
>>> wrote:
>>>
 Sounds good, thanks Valentyn!

 On Mon, Jun 15, 2020 at 12:31 PM Valentyn Tymofieiev <
 valen...@google.com> wrote:

> Hi all,
>
> According to the Beam release calendar [1], the next (2.23.0) release
> branch cut is scheduled for July 1.
>
> I would be happy to help with this release and volunteer myself to be
> the next release manager.
>
> As usual, the plan is to cut the branch on that date, and cherry-pick
> release-blocking fixes afterwards, if any.
>
> Any unresolved release blocking JIRA issues for 2.23.0 should have
> their "Fix Version/s" marked as "2.23.0".
>
> Any comments or objections?
>
> [1]
> https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com
>
>


Re: Request for Java PR review

2020-06-23 Thread Chamikara Jayalath
Thanks. I'm taking a look.

On Tue, Jun 23, 2020 at 3:07 AM Niel Markwick  wrote:

> Hey devs...
>
> I have 3 PRs waiting for a code review to fix potential bugs (and
> improve memory use) in SpannerIO. 2 small, and one quite large -- I would
> really like these to be in 2.23...
>
> https://github.com/apache/beam/pulls/nielm
>
> Would someone be willing to have a look?
>
> Thanks!
>
> --
> 
> •  Niel Markwick
> •  Cloud Solutions Architect
> •  Google Belgium
> •  ni...@google.com
> •  +32 2 894 6771
>
>
> Google Belgium NV/SA, Steenweg op Etterbeek 180, 1040 Brussel, Belgie. RPR: 
> 0878.065.378
>
> If you have received this communication by mistake, please don't forward
> it to anyone else (it may contain confidential or privileged information),
> please erase all copies of it, including all attachments, and please let
> the sender know it went to the wrong person. Thanks
>


Re: JIRA contributor permissions

2020-06-23 Thread Robert Burke
Welcome! I was going to suggest this when we finished up with your current
PR. One of the PMC members will grant your permissions.

As you've already discovered, please mention me (@lostluck) on GitHub for Go
SDK changes, as I have the context for Go at present, and can merge PRs
when they're ready.

You're naturally free to help wherever you like, file JIRA issues for other
work you'd like to take on, etc. If you need direction, I'm happy to help
out with a suggestion or two.

Looking forward to your contributions!

Robert Burke


On Tue, Jun 23, 2020, 4:57 PM Brian Michalski  wrote:

> Greetings!
>
> I'm wading my way through a few small Go SDK tickets. Can I have contributor
> permissions on JIRA?  My username is bamnet.
>
> Thanks,
> ~Brian M
>


JIRA contributor permissions

2020-06-23 Thread Brian Michalski
Greetings!

I'm wading my way through a few small Go SDK tickets. Can I have contributor
permissions on JIRA?  My username is bamnet.

Thanks,
~Brian M


Re: Running Beam pipeline using Spark on YARN

2020-06-23 Thread Kyle Weaver
> So hopefully setting --spark-master-url to be yarn will work too.

This is not supported.

On Tue, Jun 23, 2020 at 2:58 PM Xinyu Liu  wrote:

> I am doing some prototyping on this too. I used spark-submit script
> instead of the rest api. In my simple setup, I ran
> SparkJobServerDriver.main() directly in the AM as a spark job, which
> will submit the python job to the default spark master url pointing to
> "local". I also use --files in the spark-submit script to upload the python
> packages and boot script. On the python side, I was using the following
> pipeline options for submission (thanks to Thomas):
>
> pipeline_options = PipelineOptions([
>     "--runner=PortableRunner",
>     "--job_endpoint=your-job-server:8099",
>     "--environment_type=PROCESS",
>     "--environment_config={\"command\": \"./boot\"}",
> ])
>
> I used my own boot script for customized python packaging. With this setup
> I was able to get a simple hello-world program running. I haven't tried to
> run the job server separately from the AM yet. So hopefully setting
> --spark-master-url to be yarn will work too.
>
> Thanks,
> Xinyu
>
> On Tue, Jun 23, 2020 at 12:18 PM Kyle Weaver  wrote:
>
>> Hi Kamil, there is a JIRA for this:
>> https://issues.apache.org/jira/browse/BEAM-8970 It's theoretically
>> possible but remains untested as far as I know :)
>>
>> As I indicated in a comment, you can set --output_executable_path to
>> create a jar that you can then submit to yarn via spark-submit.
>>
>> If you can get this working, I'd additionally like to script the jar
>> submission in python to save users the extra step.
>>
>> Thanks,
>> Kyle
>>
>> On Tue, Jun 23, 2020 at 9:16 AM Kamil Wasilewski <
>> kamil.wasilew...@polidea.com> wrote:
>>
>>> Hi all,
>>>
>>> I'm trying to run a Beam pipeline using Spark on YARN. My pipeline is
>>> written in Python, so I need to use a portable runner. Does anybody know
>>> how I should configure job server parameters, especially
>>> --spark-master-url?  Is there anything else I need to be aware of while
>>> using such setup?
>>>
>>> If it makes a difference, I use Google Dataproc.
>>>
>>> Best,
>>> Kamil
>>>
>>


Re: Canceling Jenkins builds when the update to PR makes prior build irrelevant

2020-06-23 Thread Kenneth Knowles
+1 to Andrew's analysis

On Tue, Jun 23, 2020 at 12:13 PM Ahmet Altay  wrote:

> Would it be possible to cancel any running _Phrase or _Commit variants, if
> either one of them is triggered?
>
> On Tue, Jun 23, 2020 at 10:41 AM Andrew Pilloud 
> wrote:
>
>> I believe we split _Commit and _Phrase to work around a bug with job
>> filtering. For example, when you make a python change only the python tests
>> are run based on the commit. We still want to be able to run the java jobs
>> by trigger phrase if needed. There are also performance tests (Nexmark for
>> example) that have different jobs to ensure PR runs don't end up published
>> in the performance dashboard, but I think those have a split of _Phrase and
>> _Cron.
>>
>> As for canceling jobs, don't forget that the github status APIs are keyed
>> on commit hash and job name (not PR). It is possible for a commit to be on
>> multiple PRs and it is possible for a single PR to have multiple commits.
>> There are workflows that will be broken if you are keying off of a PR to
>> automatically cancel jobs.
>>
>> On Tue, Jun 23, 2020 at 9:59 AM Tyson Hamilton 
>> wrote:
>>
>>> +1 the ability to cancel in-flight jobs is worth deduplicating _Phrase
>>> and _Commit. I don't see a benefit for having both.
>>>
>>> On Tue, Jun 23, 2020 at 9:02 AM Luke Cwik  wrote:
>>>
 I think this is a great improvement to prevent the Jenkins queue from
 growing too large and has been suggested in the past, but we were unable to
 do so due to difficulty with the version of the ghprb plugin that was used at
 the time.

 I know that we created different variants of the tests because we
 wanted to track metrics based upon whether something was a post commit
 (_Cron suffix) vs precommits but don't know why we split _Phrase and
 _Commit.

 On Tue, Jun 23, 2020 at 3:35 AM Tobiasz Kędzierski <
 tobiasz.kedzier...@polidea.com> wrote:

> Hi everyone,
>
> I was investigating the possibility of canceling Jenkins builds when
> the update to PR makes prior build irrelevant. (related to
> https://issues.apache.org/jira/browse/BEAM-3105)
> In the GitHub Pull Request Builder Jenkins plugin (ghprb-plugin)
> there is a hidden option `Cancel build on update` that seems to work fine.
> e.g.
>
>    1. I make a PR
>    2. ghprb-plugin triggers beam_PreCommit_PythonLint_Commit
>    3. I make a new commit to the PR
>    4. ghprb-plugin aborts the previous
>       `beam_PreCommit_PythonLint_Commit` and adds to the queue the new
>       one with updated sha1.
>
>
>
> This option seems to significantly improve the experience with build
> triggering and we are planning to enable it shortly.
>
> However, putting a phrase “Run PythonLint PreCommit” in the comment
> triggers new `beam_PreCommit_PythonLint_Phrase` build, but does not
> touch already queued or running `beam_PreCommit_PythonLint_Commit` builds,
> that are technically speaking, different jobs.
>
> For testing purposes I made a single job which was a “_Commit” job
> with added “Trigger phrase” and it works well (commit builds cancelled
> after putting phrase comment in PR)
>
> Hence my question: do we need separate “_Phrase” and “_Commit” jobs?
>
> BR
> Tobiasz
>



Re: Running Beam pipeline using Spark on YARN

2020-06-23 Thread Xinyu Liu
I am doing some prototyping on this too. I used spark-submit script instead
of the rest api. In my simple setup, I ran SparkJobServerDriver.main()
directly in the AM as a spark job, which will submit the python job to the
default spark master url pointing to "local". I also use --files in the
spark-submit script to upload the python packages and boot script. On the
python side, I was using the following pipeline options for submission
(thanks to Thomas):

pipeline_options = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=your-job-server:8099",
    "--environment_type=PROCESS",
    "--environment_config={\"command\": \"./boot\"}",
])

I used my own boot script for customized python packaging. With this setup
I was able to get a simple hello-world program running. I haven't tried to
run the job server separately from the AM yet. So hopefully setting
--spark-master-url to be yarn will work too.
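(A side note on the options above: the hand-escaped JSON in
--environment_config is easy to get wrong. A small sketch building the same
flag with json.dumps instead — the endpoint and boot-script path are just the
placeholders from the snippet above, not verified values:)

```python
import json

# Hypothetical values, mirroring the flags in the snippet above.
env_config = {"command": "./boot"}

# Building --environment_config with json.dumps avoids hand-escaping quotes.
pipeline_args = [
    "--runner=PortableRunner",
    "--job_endpoint=your-job-server:8099",
    "--environment_type=PROCESS",
    "--environment_config=" + json.dumps(env_config),
]
```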

Thanks,
Xinyu

On Tue, Jun 23, 2020 at 12:18 PM Kyle Weaver  wrote:

> Hi Kamil, there is a JIRA for this:
> https://issues.apache.org/jira/browse/BEAM-8970 It's theoretically
> possible but remains untested as far as I know :)
>
> As I indicated in a comment, you can set --output_executable_path to
> create a jar that you can then submit to yarn via spark-submit.
>
> If you can get this working, I'd additionally like to script the jar
> submission in python to save users the extra step.
>
> Thanks,
> Kyle
>
> On Tue, Jun 23, 2020 at 9:16 AM Kamil Wasilewski <
> kamil.wasilew...@polidea.com> wrote:
>
>> Hi all,
>>
>> I'm trying to run a Beam pipeline using Spark on YARN. My pipeline is
>> written in Python, so I need to use a portable runner. Does anybody know
>> how I should configure job server parameters, especially
>> --spark-master-url?  Is there anything else I need to be aware of while
>> using such setup?
>>
>> If it makes a difference, I use Google Dataproc.
>>
>> Best,
>> Kamil
>>
>


Re: Running Beam pipeline using Spark on YARN

2020-06-23 Thread Kyle Weaver
Hi Kamil, there is a JIRA for this:
https://issues.apache.org/jira/browse/BEAM-8970 It's theoretically possible
but remains untested as far as I know :)

As I indicated in a comment, you can set --output_executable_path to create
a jar that you can then submit to yarn via spark-submit.

If you can get this working, I'd additionally like to script the jar
submission in python to save users the extra step.
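(Scripting that last step could look roughly like the sketch below. It is
only an illustration: the jar path, the "yarn" master value, and the exact
spark-submit flag set are placeholder assumptions, not a tested recipe.)

```python
import subprocess

def build_spark_submit_cmd(jar_path, master="yarn", extra_args=()):
    # Illustrative flag set; real clusters may need additional options.
    return ["spark-submit", "--master", master, jar_path, *extra_args]

def submit_jar(jar_path):
    # Hypothetical wrapper: hand the jar produced with
    # --output_executable_path to spark-submit on the cluster.
    subprocess.run(build_spark_submit_cmd(jar_path), check=True)
```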

Thanks,
Kyle

On Tue, Jun 23, 2020 at 9:16 AM Kamil Wasilewski <
kamil.wasilew...@polidea.com> wrote:

> Hi all,
>
> I'm trying to run a Beam pipeline using Spark on YARN. My pipeline is
> written in Python, so I need to use a portable runner. Does anybody know
> how I should configure job server parameters, especially
> --spark-master-url?  Is there anything else I need to be aware of while
> using such setup?
>
> If it makes a difference, I use Google Dataproc.
>
> Best,
> Kamil
>


Re: Canceling Jenkins builds when the update to PR makes prior build irrelevant

2020-06-23 Thread Ahmet Altay
Would it be possible to cancel any running _Phrase or _Commit variants, if
either one of them is triggered?

On Tue, Jun 23, 2020 at 10:41 AM Andrew Pilloud  wrote:

> I believe we split _Commit and _Phrase to work around a bug with job
> filtering. For example, when you make a python change only the python tests
> are run based on the commit. We still want to be able to run the java jobs
> by trigger phrase if needed. There are also performance tests (Nexmark for
> example) that have different jobs to ensure PR runs don't end up published
> in the performance dashboard, but I think those have a split of _Phrase and
> _Cron.
>
> As for canceling jobs, don't forget that the github status APIs are keyed
> on commit hash and job name (not PR). It is possible for a commit to be on
> multiple PRs and it is possible for a single PR to have multiple commits.
> There are workflows that will be broken if you are keying off of a PR to
> automatically cancel jobs.
>
> On Tue, Jun 23, 2020 at 9:59 AM Tyson Hamilton  wrote:
>
>> +1 the ability to cancel in-flight jobs is worth deduplicating _Phrase
>> and _Commit. I don't see a benefit for having both.
>>
>> On Tue, Jun 23, 2020 at 9:02 AM Luke Cwik  wrote:
>>
>>> I think this is a great improvement to prevent the Jenkins queue from
>>> growing too large and has been suggested in the past, but we were unable to
>>> do so due to difficulty with the version of the ghprb plugin that was used at
>>> the time.
>>>
>>> I know that we created different variants of the tests because we wanted
>>> to track metrics based upon whether something was a post commit (_Cron
>>> suffix) vs precommits but don't know why we split _Phrase and _Commit.
>>>
>>> On Tue, Jun 23, 2020 at 3:35 AM Tobiasz Kędzierski <
>>> tobiasz.kedzier...@polidea.com> wrote:
>>>
 Hi everyone,

 I was investigating the possibility of canceling Jenkins builds when
 the update to PR makes prior build irrelevant. (related to
 https://issues.apache.org/jira/browse/BEAM-3105)
 In the GitHub Pull Request Builder Jenkins plugin (ghprb-plugin) there
 is a hidden option `Cancel build on update` that seems to work fine.
 e.g.

    1. I make a PR
    2. ghprb-plugin triggers beam_PreCommit_PythonLint_Commit
    3. I make a new commit to the PR
    4. ghprb-plugin aborts the previous `beam_PreCommit_PythonLint_Commit`
       and adds to the queue the new one with updated sha1.



 This option seems to significantly improve the experience with build
 triggering and we are planning to enable it shortly.

 However, putting a phrase “Run PythonLint PreCommit” in the comment
 triggers new `beam_PreCommit_PythonLint_Phrase` build, but does not
 touch already queued or running `beam_PreCommit_PythonLint_Commit` builds,
 that are technically speaking, different jobs.

 For testing purposes I made a single job which was a “_Commit” job with
 added “Trigger phrase” and it works well (commit builds cancelled after
 putting phrase comment in PR)

 Hence my question: do we need separate “_Phrase” and “_Commit” jobs?

 BR
 Tobiasz

>>>


Re: Seasons of Technical Communications Project

2020-06-23 Thread Kyle Weaver
Hi Vikas,


Thank you for the introduction and your interest in working on Apache Beam
documentation with Season of Docs. To participate in the program you need
to follow the guides here [1] [2]. If you are new to the program, we
suggest:

   1. Start by studying our proposed project ideas and expected deliverables
      for each of them [3].
   2. Explore more in depth the existing related Beam documentation for each
      project idea. We provided links to the background material, known
      issues and current documentation for both project ideas [4] [5].
      Choose one project you like the most.
   3. Start drafting a proposal with the gaps you have found and ideas for
      improvement, and how you would present the new/updated/full
      documentation. Here are more tips on how to make your proposal
      stronger [6]. Please follow the guides and make sure you cover all
      points.
   4. Submit the project proposal to the Google program administrators
      during the technical writer application phase. It opens on June 9,
      2020. If you want any feedback for your initial draft, consider using
      Google Docs and share on dev@beam.apache.org, so the community members
      can leave their comments and suggestions (please check the access to
      the doc before sending to the mailing list).

If you have any ideas that you want to brainstorm about, don’t hesitate to
start a discussion on the community list or reach out on the Slack channel to
discuss issues related to GSoD documentation [7]. Once you create an
account, join the #beam-gsod channel.

Project administrators will assess proposals based on these guidelines [8].

Hope it helps. Let us know if you have more questions.

Thanks,

Beam GSoD team


[1] https://developers.google.com/season-of-docs/docs/tech-writer-guide

[2] https://developers.google.com/season-of-docs/terms/tech-writer-terms

[3] https://cwiki.apache.org/confluence/display/BEAM/Google+Season+of+Docs

[4]
https://cwiki.apache.org/confluence/display/BEAM/Google+Season+of+Docs#GoogleSeasonofDocs-1.DeploymentofaFlinkandSparkClusterswithPortableBeam

[5]
https://cwiki.apache.org/confluence/display/BEAM/Google+Season+of+Docs#GoogleSeasonofDocs-2.Updateoftherunnercomparisonpage/capabilitymatrix

[6]
https://developers.google.com/season-of-docs/docs/tech-writer-application-hints

[7]
https://join.slack.com/t/seasonofdocs/shared_invite/enQtNTc0NDgyOTQ5Nzc4LTliZjVlZjRmNmU5ZmFiNmViNmNiMTM5NTdlNTJiYzIzZTk3MDlhMjM3NzE2MzIxNzIxNmZiMmMzNzRmZTI4NmU

[8]
https://developers.google.com/season-of-docs/docs/project-selection#assess-proposal



On Tue, Jun 23, 2020 at 11:10 AM Vikas Wadhwa  wrote:

>
> Hi, Aizhamal:
> Good Morning!
> Through Google's initiative of Seasons of Docs 2020, I would like to take
> this opportunity to introduce myself and my intent to take up this
> 'Technical Communications' project with your organization.  At a high
> level, I went through the initial details about your project ideas and the
> specific requirements, and I would like to give it a shot.
> https://cwiki.apache.org/confluence/display/BEAM/Google+Season+of+Docs
>
> From an outlook, I'm interested in taking up Docker container specific
> content.
> Please let me know if we can discuss and explore further arenas.  :)
>
> Regards,
> Vikas
> https://www.linkedin.com/in/vikaswadhwa/
> --
> Attitude is a way of Living!
>
>


Seasons of Technical Communications Project

2020-06-23 Thread Vikas Wadhwa
Hi, Aizhamal:
Good Morning!
Through Google's initiative of Seasons of Docs 2020, I would like to take
this opportunity to introduce myself and my intent to take up this
'Technical Communications' project with your organization.  At a high
level, I went through the initial details about your project ideas and the
specific requirements, and I would like to give it a shot.
https://cwiki.apache.org/confluence/display/BEAM/Google+Season+of+Docs

From an outlook, I'm interested in taking up Docker container-specific
content.
Please let me know if we can discuss and explore further arenas.  :)

Regards,
Vikas
https://www.linkedin.com/in/vikaswadhwa/
--
Attitude is a way of Living!



Watermark-based trigger doesn't fire for 10+ minutes after message is received from Pub/Sub source

2020-06-23 Thread Alex Mordkovich
Hi Beam folks!

I'm running a simple Java Beam pipeline on DirectRunner. The pipeline
reads in messages from a Pub/Sub topic and
aggregates them into windows: by processing time and by event time. The
custom timestamp option isn't used, so the event time should be the
message's publish time.

What I'm observing is that the watermark-based trigger doesn't fire for 10+
minutes after the message is received by the pipeline:
https://pastebin.com/zgfDy5ej. I realize that PubsubIO.java's support for
watermarks is somewhat limited and relies on heuristics. However, looking
at how PubsubIO.Read computes watermarks, it looks like it will advance
the watermark to now() if there are no
incoming messages for over a minute. So it doesn't look like the watermark
should get stuck. In fact, the comment there says that the watermark
might get *ahead* of the true watermark, which
seems to be consistent with the code.
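(The heuristic as read above — advance the watermark to now() after a
minute with no incoming messages, otherwise hold it at the oldest
unprocessed timestamp — can be modeled with a simplified sketch. This is a
toy model for reasoning about the question, not the actual PubsubIO code:)

```python
from datetime import datetime, timedelta

IDLE_THRESHOLD = timedelta(minutes=1)  # per the reading of PubsubIO above

def estimate_watermark(min_unprocessed_ts, last_receive_ts, now):
    """Simplified model: if no messages arrived for over a minute,
    advance the watermark to now; otherwise hold it at the oldest
    unprocessed timestamp."""
    if last_receive_ts is None or now - last_receive_ts > IDLE_THRESHOLD:
        return now
    return min_unprocessed_ts
```

Under this model the watermark should not stay stuck for 10+ minutes on an
idle source, which is why the observed behavior looks surprising.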

Is the behavior I'm observing actually expected currently? And if so, how
come?

Thanks in advance,
Alex


Re: Canceling Jenkins builds when the update to PR makes prior build irrelevant

2020-06-23 Thread Andrew Pilloud
I believe we split _Commit and _Phrase to work around a bug with job
filtering. For example, when you make a python change only the python tests
are run based on the commit. We still want to be able to run the java jobs
by trigger phrase if needed. There are also performance tests (Nexmark for
example) that have different jobs to ensure PR runs don't end up published
in the performance dashboard, but I think those have a split of _Phrase and
_Cron.

As for canceling jobs, don't forget that the github status APIs are keyed
on commit hash and job name (not PR). It is possible for a commit to be on
multiple PRs and it is possible for a single PR to have multiple commits.
There are workflows that will be broken if you are keying off of a PR to
automatically cancel jobs.
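(A toy model of that failure mode, with all sha values and job names
purely illustrative: statuses keyed by (commit sha, job name), so canceling
"by PR" also clobbers the build another PR shares via the same commit:)

```python
# Toy model: GitHub-style statuses are keyed by (sha, job), not by PR.
statuses = {}

def report(sha, job, state):
    statuses[(sha, job)] = state

def cancel_by_pr(prs, pr_id):
    # Naive cancellation keyed off a PR: aborts every running build for
    # every commit on that PR -- including commits other PRs share.
    for sha in prs[pr_id]:
        for (s, job), state in list(statuses.items()):
            if s == sha and state == "running":
                statuses[(s, job)] = "aborted"

# The same commit can appear on two PRs, as noted above.
prs = {1: ["abc123"], 2: ["abc123"]}
report("abc123", "beam_PreCommit_PythonLint_Commit", "running")
cancel_by_pr(prs, 1)  # PR 2 loses its running build too
```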

On Tue, Jun 23, 2020 at 9:59 AM Tyson Hamilton  wrote:

> +1 the ability to cancel in-flight jobs is worth deduplicating _Phrase and
> _Commit. I don't see a benefit for having both.
>
> On Tue, Jun 23, 2020 at 9:02 AM Luke Cwik  wrote:
>
>> I think this is a great improvement to prevent the Jenkins queue from
>> growing too large and has been suggested in the past, but we were unable to
>> do so due to difficulty with the version of the ghprb plugin that was used at
>> the time.
>>
>> I know that we created different variants of the tests because we wanted
>> to track metrics based upon whether something was a post commit (_Cron
>> suffix) vs precommits but don't know why we split _Phrase and _Commit.
>>
>> On Tue, Jun 23, 2020 at 3:35 AM Tobiasz Kędzierski <
>> tobiasz.kedzier...@polidea.com> wrote:
>>
>>> Hi everyone,
>>>
>>> I was investigating the possibility of canceling Jenkins builds when the
>>> update to PR makes prior build irrelevant. (related to
>>> https://issues.apache.org/jira/browse/BEAM-3105)
>>> In the GitHub Pull Request Builder Jenkins plugin (ghprb-plugin) there
>>> is a hidden option `Cancel build on update` that seems to work fine.
>>> e.g.
>>>
>>>    1. I make a PR
>>>    2. ghprb-plugin triggers beam_PreCommit_PythonLint_Commit
>>>    3. I make a new commit to the PR
>>>    4. ghprb-plugin aborts the previous `beam_PreCommit_PythonLint_Commit`
>>>       and adds to the queue the new one with updated sha1.
>>>
>>>
>>>
>>> This option seems to significantly improve the experience with build
>>> triggering and we are planning to enable it shortly.
>>>
>>> However, putting a phrase “Run PythonLint PreCommit” in the comment
>>> triggers new `beam_PreCommit_PythonLint_Phrase` build, but does not
>>> touch already queued or running `beam_PreCommit_PythonLint_Commit` builds,
>>> that are technically speaking, different jobs.
>>>
>>> For testing purposes I made a single job which was a “_Commit” job with
>>> added “Trigger phrase” and it works well (commit builds cancelled after
>>> putting phrase comment in PR)
>>>
>>> Hence my question: do we need separate “_Phrase” and “_Commit” jobs?
>>>
>>> BR
>>> Tobiasz
>>>
>>


Re: Canceling Jenkins builds when the update to PR makes prior build irrelevant

2020-06-23 Thread Tyson Hamilton
+1 the ability to cancel in-flight jobs is worth deduplicating _Phrase and
_Commit. I don't see a benefit for having both.

On Tue, Jun 23, 2020 at 9:02 AM Luke Cwik  wrote:

> I think this is a great improvement to prevent the Jenkins queue from
> growing too large and has been suggested in the past, but we were unable to
> do so due to difficulty with the version of the ghprb plugin that was used at
> the time.
>
> I know that we created different variants of the tests because we wanted
> to track metrics based upon whether something was a post commit (_Cron
> suffix) vs precommits but don't know why we split _Phrase and _Commit.
>
> On Tue, Jun 23, 2020 at 3:35 AM Tobiasz Kędzierski <
> tobiasz.kedzier...@polidea.com> wrote:
>
>> Hi everyone,
>>
>> I was investigating the possibility of canceling Jenkins builds when the
>> update to PR makes prior build irrelevant. (related to
>> https://issues.apache.org/jira/browse/BEAM-3105)
>> In the GitHub Pull Request Builder Jenkins plugin (ghprb-plugin) there
>> is a hidden option `Cancel build on update` that seems to work fine.
>> e.g.
>>
>>    1. I make a PR
>>    2. ghprb-plugin triggers beam_PreCommit_PythonLint_Commit
>>    3. I make a new commit to the PR
>>    4. ghprb-plugin aborts the previous `beam_PreCommit_PythonLint_Commit`
>>       and adds to the queue the new one with updated sha1.
>>
>>
>>
>> This option seems to significantly improve the experience with build
>> triggering and we are planning to enable it shortly.
>>
>> However, putting a phrase “Run PythonLint PreCommit” in the comment
>> triggers new `beam_PreCommit_PythonLint_Phrase` build, but does not
>> touch already queued or running `beam_PreCommit_PythonLint_Commit` builds,
>> that are technically speaking, different jobs.
>>
>> For testing purposes I made a single job which was a “_Commit” job with
>> added “Trigger phrase” and it works well (commit builds cancelled after
>> putting phrase comment in PR)
>>
>> Hence my question: do we need separate “_Phrase” and “_Commit” jobs?
>>
>> BR
>> Tobiasz
>>
>


Running Beam pipeline using Spark on YARN

2020-06-23 Thread Kamil Wasilewski
Hi all,

I'm trying to run a Beam pipeline using Spark on YARN. My pipeline is
written in Python, so I need to use a portable runner. Does anybody know
how I should configure job server parameters, especially
--spark-master-url?  Is there anything else I need to be aware of while
using such setup?

If it makes a difference, I use Google Dataproc.

Best,
Kamil


Re: Match_Recognize Design Documentation

2020-06-23 Thread Rui Wang
Thank you Qihang.

I have been hearing some confusion offline, so to highlight: Qihang's
design doc is a commentable doc available at
https://s.apache.org/beam-sql-pattern-recognization.


-Rui

On Wed, Jun 17, 2020 at 5:39 AM Qihang Zeng  wrote:

> Dear Beam development community,
>
> Hi! I am writing to share my design documentation for my proposed project:
> Implementing Pattern Recognition Function in Beam SQL.
> Here is the link to my design documentation:
> https://s.apache.org/beam-sql-pattern-recognization
>
> @Rui Wang helped me finish the documentation (I am
> very grateful for his help). I would very much appreciate it if there are
> additional suggestions!
>
> Many thanks,
> Qihang
>


Re: Canceling Jenkins builds when the update to PR makes prior build irrelevant

2020-06-23 Thread Luke Cwik
I think this is a great improvement to prevent the Jenkins queue from
growing too large and has been suggested in the past, but we were unable to
do so due to difficulty with the version of the ghprb plugin that was used at
the time.

I know that we created different variants of the tests because we wanted to
track metrics based upon whether something was a post commit (_Cron suffix)
vs precommits but don't know why we split _Phrase and _Commit.

On Tue, Jun 23, 2020 at 3:35 AM Tobiasz Kędzierski <
tobiasz.kedzier...@polidea.com> wrote:

> Hi everyone,
>
> I was investigating the possibility of canceling Jenkins builds when the
> update to PR makes prior build irrelevant. (related to
> https://issues.apache.org/jira/browse/BEAM-3105)
> In the GitHub Pull Request Builder Jenkins plugin (ghprb-plugin) there is
> a hidden option `Cancel build on update` that seems to work fine.
> e.g.
>
>    1. I make a PR
>    2. ghprb-plugin triggers beam_PreCommit_PythonLint_Commit
>    3. I make a new commit to the PR
>    4. ghprb-plugin aborts the previous `beam_PreCommit_PythonLint_Commit`
>       and adds to the queue the new one with updated sha1.
>
>
>
> This option seems to significantly improve the experience with build
> triggering and we are planning to enable it shortly.
>
> However, putting a phrase “Run PythonLint PreCommit” in the comment
> triggers new `beam_PreCommit_PythonLint_Phrase` build, but does not touch
> already queued or running `beam_PreCommit_PythonLint_Commit` builds, that
> are technically speaking, different jobs.
>
> For testing purposes I made a single job which was a “_Commit” job with
> added “Trigger phrase” and it works well (commit builds cancelled after
> putting phrase comment in PR)
>
> Hence my question: do we need separate “_Phrase” and “_Commit” jobs?
>
> BR
> Tobiasz
>


Re: On Auto-creating GCS buckets on behalf of users

2020-06-23 Thread David Cavazos
I like the idea of simplifying the user experience by automating part of
the initial setup. On the other hand, I see why silently creating billed
resources like a GCS bucket could be an issue. I don't think creating an
empty bucket is an issue since it doesn't incur any charges yet, but at
least logging that it was created by the script on the user's behalf would
be useful. There could be a logging message saying that it either found the
bucket with that name and it's using it, or that it didn't find it and it
created it.

If it were to be creating a resource that could incur potentially unwanted
charges (like a Bigtable database), then I would make a prompt before
creating it to make the users confirm they want that created. But for a GCS
bucket I don't think that's necessary, I think as long as there's an
explicit message saying it was created should be enough. That way it's not
a surprise when they see that bucket in their project, or at least they
know where it came from.
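(A get-or-create helper along those lines might look like the sketch below.
It takes an injected client object rather than calling the real
google-cloud-storage API, so the method names here are assumptions for
illustration only:)

```python
import logging

def get_or_create_bucket(client, name):
    """Return the named bucket, creating it if absent, and log which path
    was taken so the bucket's origin is never a surprise to the user.

    `client` is any object with lookup(name) -> bucket-or-None and
    create(name) -> bucket; a real implementation would wrap the GCS API.
    """
    bucket = client.lookup(name)
    if bucket is not None:
        logging.info("Found bucket %s; using it.", name)
        return bucket
    logging.info("Bucket %s not found; creating it on your behalf.", name)
    return client.create(name)
```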

If the user wants more control over what their temp bucket needs, like
encryption or anything else, they can still pass an explicit
`--temp_location` parameter and they can use whatever bucket they provide.

I also like that it would be consistent with how the Java SDK works.

On Mon, Jun 22, 2020 at 3:35 PM Ahmet Altay  wrote:

> I do not have a strong opinion about this either way. I think this is
> fundamentally a UX tradeoff between making it easier to get started and
> potentially creating unwanted/misconfigured items. I do not have data about
> what would be more preferable for most users. I believe either option would
> be fine as long as we are clear with our messaging, logs, errors.
>
> On Mon, Jun 22, 2020 at 1:48 PM Luke Cwik  wrote:
>
>> I think creating the bucket makes sense since it is an improvement in the
>> user experience and simplifies first-time users' setup needs. We should be
>> clear to tell users that we are doing this on their behalf.
>>
>> On Mon, Jun 22, 2020 at 1:26 PM Pablo Estrada  wrote:
>>
>>> Hi everyone,
>>> I've gotten around to making this change, and Udi has been gracious to
>>> review it[1].
>>>
>>> I figured we have not fully answered the larger question of whether we
>>> would truly like to make this change. Here are some thoughts giving me
>>> pause:
>>>
>>> 1. Appropriate defaults - We are not sure we can select appropriate
>>> defaults on behalf of users. (We are erroring out in case of KMS keys, but
>>> how about other properties?)
>>> 2. Users have been using Beam's Python SDK the way it is for a long time
>>> now: Supplying temp_location when running on Dataflow, without a problem.
>>> 3. This has billing implications that users may not be fully aware of
>>>
>>> The behavior in [1] matches the behavior of the Java SDK (create a
>>> bucket when none is supplied AND running on Dataflow); but it still doesn't
>>> solve the problem of ReadFromBQ/WriteToBQ from non-Dataflow runners (this
>>> can be done in a follow up change using the Default Bucket functionality).
>>>
>>> My bias in this case is: If it isn't broken, why fix it? I do not know
>>> of anyone complaining about the required temp_location flag on Dataflow.
>>>
>>> I think we can create a default bucket when dealing with BQ outside of
>>> Dataflow, but for Dataflow, I think we don't need to fix what's not broken.
>>> What do others think?
>>>
>>> Best
>>> -P.
>>>
>>> [1] https://github.com/apache/beam/pull/11982
>>>
>>> On Tue, Jul 23, 2019 at 5:02 PM Ahmet Altay  wrote:
>>>
 I agree with the benefits of auto-creating buckets from an ease-of-use
 perspective. My counter-argument is that the auto-created buckets may not
 have the right settings for the users. A bucket has multiple settings, some
 required (name, storage class) and some optional (ACL policy,
 encryption, retention policy, labels). As the number of options increases,
 our chances of having a good-enough default go down. For example, if a
 user wants to enable CMEK mode for encryption, they will enable it for
 their sources and sinks, and will instruct the Dataflow runner to encrypt its
 in-flight data. Creating a default (non-encrypted) temp bucket for this
 user would go against the user's intentions. We would not be able to create a
 bucket either, because we would not know what encryption keys to use for
 such a bucket. Our options would be either to not create a bucket at all,
 or to fail if a temporary bucket was not specified and a CMEK mode is enabled.
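The fail-fast option described above could look roughly like this. The option names (`dataflow_kms_key`, `temp_location`) are used here only for illustration; the point is the policy, not the exact flags: if a CMEK key is configured but no temp bucket is given, we cannot know which key an auto-created bucket should use, so we refuse instead of guessing.

```python
def validate_temp_bucket_options(options):
    """Fail fast rather than auto-create a bucket the user did not intend.

    Hypothetical option names, sketching the policy discussed on the list:
    a CMEK key with no temp bucket means any auto-created default bucket
    would silently be non-encrypted, so refuse and ask for an explicit one.
    """
    kms_key = options.get("dataflow_kms_key")
    temp_location = options.get("temp_location")
    if kms_key and not temp_location:
        raise ValueError(
            "--temp_location must be specified when a KMS key is used; "
            "refusing to auto-create a (non-encrypted) default bucket.")
    # May be None: the caller may then fall back to creating a default
    # bucket, which is safe because no encryption key was requested.
    return temp_location
```

This keeps the convenient default for the common case while never silently downgrading the encryption posture a user asked for.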

 There is a similar issue with the region flag. If unspecified it
 defaults to us-central1. This is convenient for new users, but not making
 that flag required will expose a larger proportion of Dataflow users to
 incidents in that specific region.

 Robert's suggestion of having a flag for opt-in to a default set of GCP
 convenience flags sounds reasonable. At least users will explicitly
 acknowledge that certain things are auto managed 

Canceling Jenkins builds when the update to PR makes prior build irrelevant

2020-06-23 Thread Tobiasz Kędzierski
Hi everyone,

I was investigating the possibility of canceling Jenkins builds when an
update to a PR makes the prior build irrelevant. (related to
https://issues.apache.org/jira/browse/BEAM-3105)
In the GitHub Pull Request Builder Jenkins plugin (ghprb-plugin) there is
a hidden option, `Cancel build on update`, that seems to work fine.
e.g.

   1. I make a PR.
   2. ghprb-plugin triggers `beam_PreCommit_PythonLint_Commit`.
   3. I make a new commit to the PR.
   4. ghprb-plugin aborts the previous `beam_PreCommit_PythonLint_Commit`
      build and adds a new one, with the updated sha1, to the queue.

This option seems to significantly improve the experience with build
triggering and we are planning to enable it shortly.
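The cancel-on-update behaviour boils down to a simple rule: a build becomes irrelevant when its PR has received a newer commit. A toy model of that rule, purely for illustration (the `active_builds` dict structure is invented here, not the plugin's actual API), might look like:

```python
def builds_to_cancel(active_builds, updated_pr, new_sha):
    """Toy model of ghprb's 'Cancel build on update' option.

    `active_builds` is a list of dicts like
    {"pr": 123, "sha": "abc", "job": "beam_PreCommit_PythonLint_Commit"};
    this structure is invented for illustration only.
    A build is stale when it belongs to the updated PR but was triggered
    for an older commit than the one just pushed.
    """
    return [b for b in active_builds
            if b["pr"] == updated_pr and b["sha"] != new_sha]
```

Note that in this model cancellation keys only on the PR and commit sha, not on the job name, which is exactly why separately-named "_Phrase" jobs escape it.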

However, putting the phrase “Run PythonLint PreCommit” in a comment
triggers a new `beam_PreCommit_PythonLint_Phrase` build, but does not touch
already queued or running `beam_PreCommit_PythonLint_Commit` builds, which
are, technically speaking, different jobs.

For testing purposes I made a single job: a “_Commit” job with a “Trigger
phrase” added, and it works well (commit builds are cancelled after a
phrase comment is posted on the PR).

Hence my question: do we need separate “_Phrase” and “_Commit” jobs?

BR
Tobiasz


Request for Java PR review

2020-06-23 Thread Niel Markwick
Hey devs...

I have 3 PRs waiting for a code review to fix potential bugs (and
improve memory use) in SpannerIO: two small, and one quite large. I would
really like these to be in 2.23...

https://github.com/apache/beam/pulls/nielm

Would someone be willing to have a look?

Thanks!

-- 
Niel Markwick
 •  Cloud Solutions Architect
 •  Google Belgium
 •  ni...@google.com
 •  +32 2 894 6771

Google Belgium NV/SA, Steenweg op Etterbeek 180, 1040 Brussel, Belgie.
RPR: 0878.065.378
