Re: Beam/Samza Ensuring At Least Once semantics

2019-07-09 Thread Benenson, Mikhail
Hi

I have run a few experiments to verify whether 'at least once' processing is 
guaranteed on Beam 2.13.0 with the Samza Runner 1.1.0.

The Beam application is a slightly modified streaming Word Count from the Beam examples:

  *   read strings from the input Kafka topic, print (topic, partition, offset, 
value)
  *   convert values to pairs (value, 1)
  *   group in Fixed Windows with a duration of 30 sec
  *   sum per key
  *   throw an exception if the key starts with 'm'
  *   write (key, sum) to the output Kafka topic

I tried KafkaIO.read() with and without commitOffsetsInFinalize(); there is no 
difference in the results.

Please see the attached source code.
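
Since the attachment is not reproduced in the archive, here is a minimal sketch 
of the pipeline as described above (broker address, topic names, and class name 
are assumptions, not the attached code):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.Values;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.kafka.common.serialization.LongSerializer;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;
import org.joda.time.Duration;

public class OnceDemoSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    p.apply(KafkaIO.<String, String>read()
            .withBootstrapServers("localhost:9092")   // local kafka (assumed)
            .withTopic("beam-input")                  // pre-created, 1 partition
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            .commitOffsetsInFinalize()                // also tried without
            .withoutMetadata())
        .apply(Values.<String>create())
        // convert values to pairs (value, 1)
        .apply(MapElements.via(new SimpleFunction<String, KV<String, Long>>() {
          @Override public KV<String, Long> apply(String v) { return KV.of(v, 1L); }
        }))
        // group in 30 sec fixed windows, then sum per key
        .apply(Window.<KV<String, Long>>into(FixedWindows.of(Duration.standardSeconds(30))))
        .apply(Sum.<String>longsPerKey())
        // throw if the key starts with 'm'
        .apply(ParDo.of(new DoFn<KV<String, Long>, KV<String, Long>>() {
          @ProcessElement
          public void process(ProcessContext c) {
            if (c.element().getKey().startsWith("m")) {
              throw new RuntimeException("simulated failure on " + c.element().getKey());
            }
            c.output(c.element());
          }
        }))
        .apply(KafkaIO.<String, Long>write()
            .withBootstrapServers("localhost:9092")
            .withTopic("beam-output")                 // pre-created, 1 partition
            .withKeySerializer(StringSerializer.class)
            .withValueSerializer(LongSerializer.class));
    p.run().waitUntilFinish();
  }
}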

Environment:

  *   Run with local zk & kafka, pre-create input & output topics with 1 
partition.
  *   samza.properties contains "task.commit.ms=2000". According to samza doc 
"this property determines how often a checkpoint is written. The value is the 
time between checkpoints, in milliseconds". See 
http://samza.apache.org/learn/documentation/latest/jobs/samza-configurations.html#checkpointing.
 Please see the attached samza config file and run script.
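
The attached properties file is not reproduced in the archive; based on the 
settings quoted later in this thread, it presumably contained at least the 
following (a reconstruction, not the verbatim file):

# Kafka-backed checkpointing; with task.commit.ms=2000 a checkpoint is
# written every 2 seconds.
task.checkpoint.factory=org.apache.samza.checkpoint.kafka.KafkaCheckpointManagerFactory
task.checkpoint.system=kafka
task.commit.ms=2000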


Scenario 1: Exception in transformation

Run

  *   Write 'a', 'b', 'c', 'm', 'd', 'e' into the input topic (e.g., with the 
console producer shown below)
  *   start the Beam app
  *   verify that the app log contains "read from topic=XXX, part=0, offset=100, 
val: e". Because the input topic has only one partition, this means all data has 
been read from Kafka.
  *   wait until the app terminates because of the exception thrown while 
processing 'm'
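
One way to feed the input topic, assuming Kafka's CLI tools are on the PATH and 
the topic is named beam-input (both assumptions):

printf '%s\n' a b c m d e | \
  kafka-console-producer.sh --broker-list localhost:9092 --topic beam-input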

Expectation
The order of processing after grouping is not specified, so some data could be 
written to the output topic before the application terminates, but I expect that 
value=m with offset 98 and all later records must NOT be marked as processed, so 
if I restart the Beam app, I expect it to throw the exception again when 
processing value=m.
Comment: throwing an exception in a transformation is not a good idea, but such 
an exception could be the result of an application error. So the expectation is 
that after fixing the error and restarting the Beam app, it should re-process 
the record that caused the error.

Results
After I restarted the app, it does NOT re-process value 'm' and does not throw 
an exception. If I add a new value 'f' into the input topic, I see "read from 
topic=XXX, part=0, offset=101, val: f", and after some time I see 'f' in the 
output topic. So, the record with value 'm' is NOT processed.


Scenario 2: App termination

Run

  *   Write 'g', 'h', 'i', 'j' into the input topic
  *   start the Beam app
  *   verify that the app log contains "read from topic=XXX, part=0, offset=105, 
val: j". Because the input topic has only one partition, this means that all data 
has been read from Kafka.
  *   wait about 10 sec, then terminate the Beam app. The idea is to terminate 
the app while 'g', 'h', 'i', 'j' are still waiting in the 30 sec Fixed Window, 
but after task.commit.ms=2000 has passed, so the offsets have been committed.

Expectation
As records 'g', 'h', 'i', 'j' are NOT processed, I expect that after the app is 
restarted, it again reads 'g', 'h', 'i', 'j' from the input topic and processes 
these records.

Results
After I restarted app, it does NOT re-process  ‘g’, ‘h’, ‘I’, ‘j’ values. If I 
add new value ‘k’ into input topic, I see  “read from topic=XXX, part=0, 
offset=106, val: k”, and after some time I see ‘k’ in the output topic. So, the 
records with values ‘g’, ‘h’, ‘I’, ‘j’ are NOT processed.


Based on these results I'm inclined to conclude that Beam with the Samza runner 
does NOT provide an 'at least once' processing guarantee.

Have I missed something?

--
Michael Benenson


From: "LeVeck, Matt" 
Date: Monday, July 1, 2019 at 5:28 PM
To: "Deshpande, Omkar" , "Benenson, Mikhail" 
, Xinyu Liu , Xinyu Liu 
, Samarth Shetty , "Audo, Nicholas" 

Subject: Beam/Samza Ensuring At Least Once semantics

We’re seeing some behavior when using Beam’s KafkaIO and Samza as the runner 
that suggests checkpoints are getting committed even when an error gets thrown 
in the Beam pipeline while processing a batch. Do you all have a recommended 
set of settings/patterns for using Beam with Samza to ensure that checkpoints 
are only updated after successful processing (i.e. the transforms succeed and 
the message is sent to the Beam pipeline’s final output sink)?

Our current settings for Samza are:
task.checkpoint.factory=org.apache.samza.checkpoint.kafka.KafkaCheckpointManagerFactory
task.checkpoint.system=kafka
task.shutdown.ms=1
task.commit.ms=2000

Nothing is specified with regards to checkpointing at the Beam level.

Thanks,
Matt


#!/bin/bash

java -cp \
  "target/data-strmprocess-samza-driver-0.1.0.jar:target/lib:target/lib/*" \
  com.intuit.strmprocess.once.OnceDemoWordCount01 \
  --runner=SamzaRunner \
  --jobName=driver-once \
  --jobInstance=001 \
  --maxSourceParallelism=10 \
  --samzaExecutionEnvironment=STANDALONE \
  --configFilePath=src/main/resources/samza-once-local.properties


samza-once-local.properties
Description: samza-once-local.properties


OnceDemoWordCount01.java
Description: OnceDemoWordCount01.java

Re: [DISCUSS] Contributor guidelines for iterating on PRs: when to squash commits.

2019-07-09 Thread Kenneth Knowles
My opinion: what is important is that we have a policy for what goes into
the master commit history. This is very simple IMO: each commit should
clearly do something that it states, and a commit should do just one thing.
Personally, I don't feel a need to set a rule for who does the squashing
(or non-squashing) or other actions necessary to maintain a clear history.

In PRs I review, the question of who should squash has never come up as an
issue. Most PRs are either a bunch of random commits obviously meant for
squashing, or carefully managed commits with good messages using the
git-supported "fixup!" syntax or clear "fixup:" commit messages. The split is
polarized, which is a good thing in this case, as it makes it very clear how to
merge.
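
For contributors unfamiliar with the convention, the git side of this looks
roughly like the following (the commit hash and branch name are placeholders):

# attach a fix to the commit it amends; the message becomes "fixup! <subject>"
git commit --fixup=abc1234
git push origin my-branch        # plain push, so review history is preserved

# at merge time, fold the fixup commits back into their targets
git rebase -i --autosquash master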

Your original concern was authors force pushing during review making it
hard to review. For your point "3. After a force-push, comments made by
reviewers on earlier commit are hard to find." I thought GitHub had fixed
that. These comments used to vanish entirely, but now they are still on the
main PR page IIRC. If it is not fixed, it would make sense to add this to
the contribution guide, and even to the PR template.

Kenn

On Tue, Jul 9, 2019 at 2:18 PM Valentyn Tymofieiev 
wrote:

> Ok, I think if authors mark fixup commits with "fixup" prefix and
> committers routinely fixup commits before the merge without asking the
> contributors to do so, the authors should not have a particular reason to
> fixup/squash + force-push all changes into one commit after addressing
> review comments. This will make the review easier, however committers will
> have to take responsibility for merging fixup commits.
>
> Currently both committer guide[1] and contributor guide[2] assume that it
> is the author's responsibility to merge fixup commit.
>
> The reviewer should give the LGTM and then request that the author of the
>> pull request rebase, squash, split, etc, the commit
>
>
> "After review is complete and the PR accepted, multiple commits should be
>> squashed (see Git workflow tips)".
>
>
> Should we explicitly make squashing review-related commits a
> responsibility of committers?
>
> [1] https://beam.apache.org/contribute/committer-guide
> 
> [2] https://beam.apache.org/contribute/
>
>
> On Tue, Jul 9, 2019 at 12:22 PM Rui Wang  wrote:
>
>> "allow maintainers to edit" by default is enabled. Then the proposed
>> workflow looks reasonable to me now.
>>
>>
>> -Rui
>>
>> On Tue, Jul 9, 2019 at 11:26 AM Kenneth Knowles  wrote:
>>
>>> If you "allow maintainers to edit" the PR, it is easy for any committer
>>> to fix up the commits and merge. They should not have to ask you to do it,
>>> unless it is not obvious what to do.
>>>
>>> Kenn
>>>
>>> On Tue, Jul 9, 2019 at 11:05 AM Rui Wang  wrote:
>>>
 At least for me, because I usually don't know when PR review is done,
 in order to make PR to be merged into Beam repo faster, I keep squashing
 commits every time so that committers can review and then merge at a time,
 otherwise committers could approve a PR but then ask squashing commits,
 which leads to another ping and wait round.

 Thus I prefer committers do squash and merge, which will reduce PR
 authors' load during PR review process.


 -Rui


 On Mon, Jul 8, 2019 at 5:44 PM Valentyn Tymofieiev 
 wrote:

> Rui, committer guide[1] does say that all commits are standalone
> changes:
>
> We prefer small independent, incremental PRs with descriptive,
>> isolated commits. Each commit is a single clear change.
>>
>
> However in my opinion, this recommendation applies to moments when a
> PR is first sent for review, and when a PR is being merged. Committer 
> guide
> also mentions that during review iterations authors may add review-related
> commits.
>
> the pull request may have a collection of review-related commits that
>> are not meaningful to preserve in the history. The reviewer should give 
>> the
>> LGTM and then request that the author of the pull request rebase, squash,
>> split, etc, the commits, so that the history is most useful.
>
>
> Review-related commits don't have to be isolated independent changes,
> and perhaps committer guide and contributor guide [2] should spell out
> clearly that authors should not feel pressure to make review commits look
> like meaningful changes of their own when it does not make sense to do.  
> By
> the end of the review, review commits should be squashed by a committer or
> by the author.
>
> I think there are some incentives to always squash-and-force-push:
> - Committer will not ask the author to squash commits if there is only
> one commit.
> - We don't have to wait for another round of tests to pass on the
> final  PR.
>
> Both concerns are addressed if a committer follows squash-and-merge
> workflow.

[VOTE] Vendored Dependencies Release

2019-07-09 Thread Lukasz Cwik
Please review the release of the following artifacts that we vendor:
 * beam-vendor-grpc_1_21_0
 * beam-vendor-guava-26_0-jre

Hi everyone,
Please review and vote on the release candidate #2 for the
org.apache.beam:beam-vendor-grpc_1_21_0:0.1 and
org.apache.beam:beam-vendor-guava-26_0-jre:0.1, as follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)


The complete staging area is available for your review, which includes:
* the official Apache source release to be deployed to dist.apache.org [1],
which is signed with the key with fingerprint
EAD5DE293F4A03DD2E77565589E68A56E371CCA2 [2],
* all artifacts to be deployed to the Maven Central Repository [3],
* commit hash "b4efbb23cc5dec80b8bbd8745c62efecdadfa236" [4],

The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PMC affirmative votes.

Thanks,
Luke

[1] https://dist.apache.org/repos/dist/dev/beam/vendor/
[2] https://dist.apache.org/repos/dist/release/beam/KEYS
[3] https://repository.apache.org/content/repositories/orgapachebeam-1076/
[4]
https://github.com/apache/beam/commit/b4efbb23cc5dec80b8bbd8745c62efecdadfa236


Re: [VOTE] Vendored dependencies release process

2019-07-09 Thread Lukasz Cwik
Thanks all for validating the process and voting.

I'm closing the vote with 5 binding +1s and 4 non-binding +1s. Binding
votes were provided by:
Maximilian Michels
Pablo Estrada
Ismaël Mejía
Lukasz Cwik
Kenneth Knowles

I'll work on getting the doc turned into a PR to update our website's release
guide, and also work on creating RC #2 for the Guava 26.0 upgrade, following
the guide.


On Tue, Jul 9, 2019 at 2:12 PM Ismaël Mejía  wrote:

> +1
>
> On Tue, Jul 9, 2019 at 10:53 PM Pablo Estrada  wrote:
> >
> > Thanks Luke for writing a thorough guide for this.
> > +1
> >
> > On Tue, Jul 9, 2019 at 1:27 PM Maximilian Michels 
> wrote:
> >>
> >> +1
> >>
> >> On 09.07.19 22:16, Udi Meiri wrote:
> >> > +1 LGTM
> >> >
> >> > On Mon, Jul 8, 2019 at 4:54 PM Lukasz Cwik wrote:
> >> >
> >> > Thanks for taking a look. I followed up on your questions.
> >> >
> >> > On Mon, Jul 8, 2019 at 3:58 PM Udi Meiri wrote:
> >> >
> >> > I left some comments. Being new to the Beam releasing process,
> >> > my question might be trivial to someone actually performing
> the
> >> > release.
> >> >
> >> > On Tue, Jul 2, 2019 at 4:49 PM Lukasz Cwik wrote:
> >> >
> >> > Please vote based on the vendored dependencies release
> >> > process as discussed[1] and documented[2].
> >> >
> >> > Please vote as follows:
> >> > +1: Adopt the vendored dependency release process
> >> > -1: The vendored release process needs to change because
> ...
> >> >
> >> > Since many people in the US may be out due to the holiday
> >> > schedule, I'll try to close the vote and tally the results
> >> > on July 9th so please vote before then.
> >> >
> >> > 1:
> https://lists.apache.org/thread.html/e2c49a5efaee2ad416b083fbf3b9b6db60fdb04750208bfc34cecaf0@%3Cdev.beam.apache.org%3E
> >> > 2: https://s.apache.org/beam-release-vendored-artifacts
> >> >
> >>
>


Re: pipeline timeout

2019-07-09 Thread Hannah Jiang
Hi Chaim

If you want to cancel the job from airflow, here is how you can set
timeout. (Using *execution_timeout*)

aggregate_db_message_job = BashOperator(
task_id='aggregate_db_message_job',
execution_timeout=timedelta(hours=3),
pool='ep_data_pipeline_db_msg_agg',
bash_command=aggregate_db_message_job_cmd,
dag=dag)

Hope this helps.
Hannah

On Tue, Jul 9, 2019 at 11:15 AM Rui Wang  wrote:

> Just curious: how do you use python to kick off a Java pipeline in
> airflow?
>
> And, could you go to a direction that you have an airflow operator(maybe
> a  bash operator to execute a Jar?) to run such Java code:
>
> result = p.run()
> result.wait_until_finish(duration=TIMEOUT_SEC)
> if not PipelineState.is_terminal(result.state):
>   result.cancel()
>
> ?
>
> On Tue, Jul 9, 2019 at 1:14 AM Chaim Turkel  wrote:
>
>> sorry for not being explicit. My pipeline is in java and i am
>> executing it using python in airflow.
>> I would like from airflow to cancel the pipeline if running for more
>> than x minutes.
>> Currently i am doing this using the cli, but it is not optimal
>>
>> chaim
>>
>> On Mon, Jul 8, 2019 at 7:24 PM Mark Liu  wrote:
>> >
>> > Hi Chaim,
>> >
>> > You can checkout PipelineResult class and do something like:
>> >
>> > result = p.run()
>> > result.wait_until_finish(duration=TIMEOUT_SEC)
>> > if not PipelineState.is_terminal(result.state):
>> >   result.cancel()
>> >
>> > The implementation of PipelineResult depends on what runner you choose.
>> And you may find more useful functions in its subclass.
>> >
>> > Mark
>> >
>> >
>> > On Sun, Jul 7, 2019 at 12:59 AM Chaim Turkel  wrote:
>> >>
>> >> Hi,
>> >>   I have a pipeline that usually takes 15-30 minutes. Sometimes things
>> >> get stuck (from 3rd party side). I would like to know if there is a
>> >> way to cancel the job if it is running for more than x minutes? I know
>> >> there is a cli command but i would like it either on the pipeline
>> >> config or in the python sdk.
>> >> Any ideas?
>> >>
>> >> Chaim Turkel
>> >>
>>
>


Re: [DISCUSS] Contributor guidelines for iterating on PRs: when to squash commits.

2019-07-09 Thread Valentyn Tymofieiev
Ok, I think if authors mark fixup commits with "fixup" prefix and
committers routinely fixup commits before the merge without asking the
contributors to do so, the authors should not have a particular reason to
fixup/squash + force-push all changes into one commit after addressing
review comments. This will make the review easier, however committers will
have to take responsibility for merging fixup commits.

Currently both committer guide[1] and contributor guide[2] assume that it
is the author's responsibility to merge fixup commit.

The reviewer should give the LGTM and then request that the author of the
> pull request rebase, squash, split, etc, the commit


"After review is complete and the PR accepted, multiple commits should be
> squashed (see Git workflow tips)".


Should we explicitly make squashing review-related commits a responsibility
of committers?

[1] https://beam.apache.org/contribute/committer-guide

[2] https://beam.apache.org/contribute/


On Tue, Jul 9, 2019 at 12:22 PM Rui Wang  wrote:

> "allow maintainers to edit" by default is enabled. Then the proposed
> workflow looks reasonable to me now.
>
>
> -Rui
>
> On Tue, Jul 9, 2019 at 11:26 AM Kenneth Knowles  wrote:
>
>> If you "allow maintainers to edit" the PR, it is easy for any committer
>> to fix up the commits and merge. They should not have to ask you to do it,
>> unless it is not obvious what to do.
>>
>> Kenn
>>
>> On Tue, Jul 9, 2019 at 11:05 AM Rui Wang  wrote:
>>
>>> At least for me, because I usually don't know when PR review is done, in
>>> order to make PR to be merged into Beam repo faster, I keep squashing
>>> commits every time so that committers can review and then merge at a time,
>>> otherwise committers could approve a PR but then ask squashing commits,
>>> which leads to another ping and wait round.
>>>
>>> Thus I prefer committers do squash and merge, which will reduce PR
>>> authors' load during PR review process.
>>>
>>>
>>> -Rui
>>>
>>>
>>> On Mon, Jul 8, 2019 at 5:44 PM Valentyn Tymofieiev 
>>> wrote:
>>>
 Rui, committer guide[1] does say that all commits are standalone
 changes:

 We prefer small independent, incremental PRs with descriptive, isolated
> commits. Each commit is a single clear change.
>

 However in my opinion, this recommendation applies to moments when a PR
 is first sent for review, and when a PR is being merged. Committer guide
 also mentions that during review iterations authors may add review-related
 commits.

 the pull request may have a collection of review-related commits that
> are not meaningful to preserve in the history. The reviewer should give 
> the
> LGTM and then request that the author of the pull request rebase, squash,
> split, etc, the commits, so that the history is most useful.


 Review-related commits don't have to be isolated independent changes,
 and perhaps committer guide and contributor guide [2] should spell out
 clearly that authors should not feel pressure to make review commits look
 like meaningful changes of their own when it does not make sense to do.  By
 the end of the review, review commits should be squashed by a committer or
 by the author.

 I think there are some incentives to always squash-and-force-push:
 - Committer will not ask the author to squash commits if there is only
 one commit.
 - We don't have to wait for another round of tests to pass on the
 final  PR.

 Both concerns are addressed if a committer follows squash-and-merge
 workflow.

 [1] https://beam.apache.org/contribute/committer-guide
 
 [2] https://beam.apache.org/contribute/

 On Mon, Jul 8, 2019 at 11:33 AM Rui Wang  wrote:

> Myself usually follows the pattern of "authors force-push their
> changes during every review iteration". The reason is after reading [1], I
> found it's hard to maintain a multiple commits PR as it's hard to create
> isolated commits for different logical pieces of code in practice.
> Therefore in practice I keep squash commits (and then have to force-push)
> to create at least a single isolated commit.
>
>
>
> [1]
> https://beam.apache.org/contribute/committer-guide/#pull-request-review-objectives
>
> -Rui
>
> On Mon, Jul 8, 2019 at 11:25 AM Udi Meiri  wrote:
>
>> I think there are already some guidelines here:
>> https://beam.apache.org/contribute/committer-guide/#pull-request-review-objectives
>>  (maybe
>> we could point to them from the PR template?)
>> Yes, it is acceptable to ask for squash or if it's ok to squash to a
>> single commit.
>>
>> On Mon, Jul 8, 2019 at 11:14 AM Valentyn Tymofieiev <
>> valen...@google.com> wrote:

Re: [VOTE] Vendored dependencies release process

2019-07-09 Thread Ismaël Mejía
+1

On Tue, Jul 9, 2019 at 10:53 PM Pablo Estrada  wrote:
>
> Thanks Luke for writing a thorough guide for this.
> +1
>
> On Tue, Jul 9, 2019 at 1:27 PM Maximilian Michels  wrote:
>>
>> +1
>>
>> On 09.07.19 22:16, Udi Meiri wrote:
>> > +1 LGTM
>> >
>> > On Mon, Jul 8, 2019 at 4:54 PM Lukasz Cwik wrote:
>> >
>> > Thanks for taking a look. I followed up on your questions.
>> >
>> > On Mon, Jul 8, 2019 at 3:58 PM Udi Meiri wrote:
>> >
>> > I left some comments. Being new to the Beam releasing process,
>> > my question might be trivial to someone actually performing the
>> > release.
>> >
>> > On Tue, Jul 2, 2019 at 4:49 PM Lukasz Cwik wrote:
>> >
>> > Please vote based on the vendored dependencies release
>> > process as discussed[1] and documented[2].
>> >
>> > Please vote as follows:
>> > +1: Adopt the vendored dependency release process
>> > -1: The vendored release process needs to change because ...
>> >
>> > Since many people in the US may be out due to the holiday
>> > schedule, I'll try to close the vote and tally the results
>> > on July 9th so please vote before then.
>> >
>> > 1: 
>> > https://lists.apache.org/thread.html/e2c49a5efaee2ad416b083fbf3b9b6db60fdb04750208bfc34cecaf0@%3Cdev.beam.apache.org%3E
>> > 2: https://s.apache.org/beam-release-vendored-artifacts
>> >
>>


Re: [VOTE] Vendored dependencies release process

2019-07-09 Thread Pablo Estrada
Thanks Luke for writing a thorough guide for this.
+1

On Tue, Jul 9, 2019 at 1:27 PM Maximilian Michels  wrote:

> +1
>
> On 09.07.19 22:16, Udi Meiri wrote:
> > +1 LGTM
> >
> > On Mon, Jul 8, 2019 at 4:54 PM Lukasz Cwik wrote:
> >
> > Thanks for taking a look. I followed up on your questions.
> >
> > On Mon, Jul 8, 2019 at 3:58 PM Udi Meiri wrote:
> >
> > I left some comments. Being new to the Beam releasing process,
> > my question might be trivial to someone actually performing the
> > release.
> >
> > On Tue, Jul 2, 2019 at 4:49 PM Lukasz Cwik wrote:
> >
> > Please vote based on the vendored dependencies release
> > process as discussed[1] and documented[2].
> >
> > Please vote as follows:
> > +1: Adopt the vendored dependency release process
> > -1: The vendored release process needs to change because ...
> >
> > Since many people in the US may be out due to the holiday
> > schedule, I'll try to close the vote and tally the results
> > on July 9th so please vote before then.
> >
> > 1:
> https://lists.apache.org/thread.html/e2c49a5efaee2ad416b083fbf3b9b6db60fdb04750208bfc34cecaf0@%3Cdev.beam.apache.org%3E
> > 2: https://s.apache.org/beam-release-vendored-artifacts
> >
>
>


Re: Spotless exclusions

2019-07-09 Thread Maximilian Michels
Thanks for asking here on the mailing list. Just saw the PR.

The PR breaks Spotless for the Flink Runner, even though all of its source is 
under `src/`. The Flink Runner uses multiple source directories to support 
different Flink versions, and not all of them seem to be recognized anymore:

  runners/flink/src/... <-- Spotless does not work
  runners/flink/1.5/src/... <-- Spotless works

I prefer Anton's changes over the previous solution, but I couldn't get it to 
work with the Flink project layout. I'll experiment but otherwise I would 
propose to revert to the old solution.

Cheers,
Max

On 27.06.19 01:50, Lukasz Cwik wrote:
>
> On Wed, Jun 26, 2019 at 4:22 PM Anton Kedin wrote:
>
>     Currently our spotless is configured globally [1] (for java at
>     least) to include all source files by '**/*.java'. And then we
>     exclude things explicitly. Don't know why, but these exclusions are
>     ignored for me sometimes, for example `./gradlew
>     :sdks:java:core:spotlessJavaCheck` always fails when checking the
>     generated files under
>     
> `.../build/generated-src/antlr/main/org/apache/beam/sdk/schemas/parser/generated`.
>
>     Few questions:
>      * can someone point me to a discussion or a jira about this behavior?
>
>
> BEAM-6399 and BEAM-7366 allude to something wonky going on.
>  
>
>      * do we actually have a use case of checking the source files that
>     are not under 'src'?
>
>
> No
>  
>
>      * if not, can we switch the config to only check for sources under
>     'src' [2]?
>
>
> Yes
>  
>
>      * alternatively, would it make sense to introduce project-specific
>     overrides?
>
>
> All src should be under src/ so it is unlikely to be useful.
>  
>
>
>     [1] 
> https://github.com/apache/beam/blob/af9362168606df9ec11319fe706b72466413798c/buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy#L819
>     [2] https://github.com/apache/beam/pull/8954
>
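
A src-rooted Spotless configuration along the lines proposed might look like
the sketch below (illustrative only, against the Spotless Gradle plugin; per
Max's note above, every extra source root, such as the Flink runner's
versioned directories, would need to be matched as well):

spotless {
  java {
    // only check sources under src/; generated files under build/ no longer match
    target 'src/**/*.java'
  }
}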



Re: [Discuss] Create stackoverflow tags for python, java and go SDKs?

2019-07-09 Thread Robert Burke
The new tagging also helps if they identify the runner separately (flink,
spark, google-cloud-dataflow, etc.)

On Tue, Jul 9, 2019, 10:16 AM Rui Wang  wrote:

> I also agree `apache-beam` + `python` or `java` is an option. I can try to
> tag SO questions by this option when there is a need to see how it will
> work.
>
> Thanks for all of your participation to this discussion!
>
> -Rui
>
> On Tue, Jul 9, 2019 at 9:49 AM Kenneth Knowles  wrote:
>
>> With the new filtering UI, I think separate `apache-beam` tag and
>> `python` or `java` tag seems perfect. No users are using apache-beam-java,
>> apache-beam-python, or apache-beam-go so I don't think we need to make it
>> up.
>>
>> Kenn
>>
>> On Tue, Jul 9, 2019 at 9:38 AM Pablo Estrada  wrote:
>>
>>> I am able to create the tags if we decide to go that way:)
>>>
>>> I have seen users relying on the 'python' tag, along with the
>>> 'apache-beam' tag to specify an SDK. Maybe that's enough? What advantage
>>> would come from a dedicated beam tag?
>>>
>>> Perhaps users will be more easily accustomed to use the
>>> apache-beam-(SDK) tags, and in that case, it makes sense to add them.
>>>
>>> Thoughts?
>>> ,-P.
>>>
>>>
>>>
>>>
>>> On Tue, Jul 9, 2019, 8:52 AM Lukasz Cwik  wrote:
>>>
 That sounds like a good idea to me.

 On Wed, Jul 3, 2019 at 10:45 AM Rui Wang  wrote:

> Hi Community,
>
> When reading apache-beam related questions in stackoverflow, it
> happens that some questions only mention version number(e.g. 2.8.0) but 
> not
> mention which SDK related. Sometimes I can tell which SDK it is from code
> snippets, sometime I cannot as there is no code snippet. So in order to
> answer those questions I need to first comment and ask which SDK.
>
> I noticed that there is no tag for a specific SDK for apache beam.
> Adding such tags will be helpful when
> 1. Questions with such tag tell which SDK it is talking about.
> 2. If Questions do not mention SDK and without such tag, I can (or
> anyone else) help tag them.
>
> Note that creating tags is a privilege in SO that requires >1500
> reputation[1]. If people generally are ok with this idea, we will need to
> ask for help in the community to see who could be able to create tags.
>
>
> [1]: https://stackoverflow.com/help/privileges/create-tags
>
> Rui
>



Re: [VOTE] Vendored dependencies release process

2019-07-09 Thread Maximilian Michels
+1

On 09.07.19 22:16, Udi Meiri wrote:
> +1 LGTM
>
> On Mon, Jul 8, 2019 at 4:54 PM Lukasz Cwik wrote:
>
>     Thanks for taking a look. I followed up on your questions.
>
>     On Mon, Jul 8, 2019 at 3:58 PM Udi Meiri wrote:
>
>     I left some comments. Being new to the Beam releasing process,
>     my question might be trivial to someone actually performing the
>     release.
>
>     On Tue, Jul 2, 2019 at 4:49 PM Lukasz Cwik wrote:
>
>     Please vote based on the vendored dependencies release
>     process as discussed[1] and documented[2].
>
>     Please vote as follows:
>     +1: Adopt the vendored dependency release process
>     -1: The vendored release process needs to change because ...
>
>     Since many people in the US may be out due to the holiday
>     schedule, I'll try to close the vote and tally the results
>     on July 9th so please vote before then.
>
>     1: 
> https://lists.apache.org/thread.html/e2c49a5efaee2ad416b083fbf3b9b6db60fdb04750208bfc34cecaf0@%3Cdev.beam.apache.org%3E
>     2: https://s.apache.org/beam-release-vendored-artifacts
>



Re: Hazelcast Jet Runner

2019-07-09 Thread Maximilian Michels
We should fork the discussion around removing instances of @Experimental, but 
it was good to mention it here.
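
For context, the annotation under discussion is applied roughly like this (a
sketch; the class name and the Kind argument are made up for illustration):

import org.apache.beam.sdk.annotations.Experimental;

// Marks an API whose surface or behavior may still change between releases.
@Experimental(Experimental.Kind.SOURCE_SINK)
public class SomeNewIO { /* ... */ }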

As for the Jet runner, I can only second Ismaël: the Jet runner is the first 
runner I can think of that came with ValidatesRunner and Nexmark out of the 
box. Of course that doesn't mean the runner is "battle-tested", but we do not 
have other means to test its maturity.

For the future, we could come up with other criteria, e.g. a "probation 
period", but enforcing this now seems arbitrary.

If the authors of the Runners decide that it is experimental, so be it. 
Otherwise I would leave it to the user to decide (it might be helpful to list 
the inception date of each runner). That said, I value your concern Kenn. I can 
see that we establish a consistent onboarding of new runners which may involve 
marking them experimental for a while.

-Max

On 01.07.19 22:20, Kenneth Knowles wrote:
>
>
> On Wed, Jun 12, 2019 at 2:32 AM Ismaël Mejía wrote:
>
>     Seems the discussion moved a bit of my original intent that was to
>     make the Jet runner directory to be just called runners/jet in the
>     directory and mark the 'experimental' part of it in documentation as
>     we do for all other things in Beam.
>
>
> Thanks for returning to the one question at hand. We don't have to make
> an overall decision about all "experimental" things.
>  
>
>     Can we do this or is there still any considerable argument to not do it?
>
>
> I think we actually have some competing goals:
>
>     I agree 100% on the arguments, but let’s think in the reverse terms,
>     highlighting lack of maturity can play against the intended goal of
>     use and adoption even if for a noble reason. It is basic priming 101
>     [1].
>
>
> _My_ goal is exactly to highlight lack of maturity so that users are not
> harmed by either (1) necessary breaking changes or (2) permanent low
> quality. Only users who are willing to follow along with the project and
> update their own code regularly should use experimental features.
>
> Evaluating the Jet runner I am convinced by your arguments, because
> looking at the two dangers:
> (1) necessary breaking changes -- runners don't really have their own
> APIs to break, except their own small set of APIs and pipeline options
> (2) permanent low quality -- because there is no API design possible,
> there's no risk of permanent low quality except by fundamental
> mismatches. Plus as you mention the testing is already quite good.
>
> So I am OK to not call it experimental. But I have a slight remaining
> concern that it did not really go through what other runners went
> through. I hope this just means it is more mature. I hope it does not
> indicate that we are reducing rigor.
>
> Kenn
>  
>
>     On Wed, May 29, 2019 at 3:02 PM Reza Rokni wrote:
>     >
>     > Hi,
>     >
>     > Over 800 usages under java, might be worth doing a few PR...
>     >
>     > Also suggest we use a very light review process: First round go
>     for low hanging fruit, if anyone does a -1 against a change then we
>     leave that for round two.
>     >
>     > Thoughts?
>     >
>     > Cheers
>     >
>     > Reza
>     >
>     > On Wed, 29 May 2019 at 12:05, Kenneth Knowles wrote:
>     >>
>     >>
>     >>
>     >> On Mon, May 27, 2019 at 4:05 PM Reza Rokni wrote:
>     >>>
>     >>> "Many APIs that have been in place for years and are used by
>     most Beam users are still marked Experimental."
>     >>>
>     >>> Should there be a formal process in place to start 'graduating'
>     features out of @Experimental? Perhaps even target an up coming
>     release with a PR to remove the annotation from well established API's?
>     >>
>     >>
>     >> Good idea. I think a PR like this would be an opportunity to
>     discuss whether the feature is non-experimental. Probably many of
>     them are ready. It would help to address Ismael's very good point
>     that this new practice could make users think the old Experimental
>     stuff is not experimental. Maybe it is true that it is not really
>     still Experimental.
>     >>
>     >> Kenn
>     >>
>     >>
>     >>>
>     >>> On Tue, 28 May 2019 at 06:44, Reuven Lax wrote:
>     
>      We generally use Experimental for two different things, which
>     leads to confusion.
>        1. Features that work stably, but where we think we might
>     still make some changes to the API.
>        2. New features that we think might not yet be stable.
>     
>      This dual usage leads to a lot of confusion IMO. The fact that
>     we tend to forget to remove the @Experimental tag also makes it
>     somewhat useless. Many APIs that have been in place for years and
>     are used by most Beam users are still marked Experimental.
>     
>      Reuven
>     
>      On Mon, May 

Re: [VOTE] Vendored dependencies release process

2019-07-09 Thread Udi Meiri
+1 LGTM

On Mon, Jul 8, 2019 at 4:54 PM Lukasz Cwik  wrote:

> Thanks for taking a look. I followed up on your questions.
>
> On Mon, Jul 8, 2019 at 3:58 PM Udi Meiri  wrote:
>
>> I left some comments. Being new to the Beam releasing process, my
>> question might be trivial to someone actually performing the release.
>>
>> On Tue, Jul 2, 2019 at 4:49 PM Lukasz Cwik  wrote:
>>
>>> Please vote based on the vendored dependencies release process as
>>> discussed[1] and documented[2].
>>>
>>> Please vote as follows:
>>> +1: Adopt the vendored dependency release process
>>> -1: The vendored release process needs to change because ...
>>>
>>> Since many people in the US may be out due to the holiday schedule, I'll
>>> try to close the vote and tally the results on July 9th so please vote
>>> before then.
>>>
>>> 1:
>>> https://lists.apache.org/thread.html/e2c49a5efaee2ad416b083fbf3b9b6db60fdb04750208bfc34cecaf0@%3Cdev.beam.apache.org%3E
>>> 2: https://s.apache.org/beam-release-vendored-artifacts
>>>
>>




Re: Apache Beam issue | Reading Avro files and pushing to Bigquery

2019-07-09 Thread Lukasz Cwik
+user  (please use user@ for questions about using
the product and restrict to dev@ for questions related to developing the
product).

Can you provide the rest of the failing reason (and any stacktraces from
the workers related to the failures)?

On Tue, Jul 9, 2019 at 11:04 AM Dhiraj Sardana 
wrote:

> Hello,
>
>
>
> We’re using Apache Beam with Google dataflow. We’ve a pipeline which reads
> data from google storage (Avro files), transforms the same and pushes to
> Bigquery.
>
>
>
> Error we’re getting: Pipelines processes the data for some time and then
> stuck and in the ends, fails (and data does not reach the Bigquery):
>
>
>
> Here is error snippet:
>
> Workflow failed. Causes: S92:AvroIO.ReadAll/Read all via
> FileBasedSource/Reshuffle/Reshuffle/GroupBy...
>
> Workflow failed. Causes: S92:AvroIO.ReadAll/Read all via
> FileBasedSource/Reshuffle/Reshuffle/GroupByKey/Read+AvroIO.ReadAll/Read all
> via
> FileBasedSource/Reshuffle/Reshuffle/GroupByKey/GroupByWindow+AvroIO.ReadAll/Read
> all via
> FileBasedSource/Reshuffle/Reshuffle/ExpandIterable+AvroIO.ReadAll/Read all
> via FileBasedSource/Reshuffle/Values/Values/Map+AvroIO.ReadAll/Read all via
> FileBasedSource/Read ….
>
>
>
> Code snippet:
>
> PCollection<String> records =
>     pipeline.apply(Create.of(fileList)).setCoder(StringUtf8Coder.of());
> PCollection<GenericRecord> events =
>     records.apply(AvroIO.readAllGenericRecords(userDefinedSchema));
>
>
>
> Apache Beam version: We tried with 2.12.0 & 2.8.0 both, but same error.
>
>
>
> It would be really helpful if we can get some hints/solution to this
> problem and let us know for more info.
>
>
>
>
>
> Regards,
>
> Dhiraj
>
>
> Dhiraj Sardana
> JEE Developer
> 
> NetEnt | Better Gaming™
> T: +46 760 024 812, M: +46 760 024 812
> dhiraj.sard...@netent.com, www.netent.com
> Address: NetEnt AB (publ), Vasagatan 16, 111 20, Stockholm, SE
>


Re: [DISCUSS] Contributor guidelines for iterating on PRs: when to squash commits.

2019-07-09 Thread Rui Wang
"allow maintainers to edit" by default is enabled. Then the proposed
workflow looks reasonable to me now.


-Rui

On Tue, Jul 9, 2019 at 11:26 AM Kenneth Knowles  wrote:

> If you "allow maintainers to edit" the PR, it is easy for any committer to
> fix up the commits and merge. They should not have to ask you to do it,
> unless it is not obvious what to do.
>
> Kenn
>
> On Tue, Jul 9, 2019 at 11:05 AM Rui Wang  wrote:
>
>> At least for me, because I usually don't know when PR review is done, in
>> order to make PR to be merged into Beam repo faster, I keep squashing
>> commits every time so that committers can review and then merge at a time,
>> otherwise committers could approve a PR but then ask squashing commits,
>> which leads to another ping and wait round.
>>
>> Thus I prefer committers do squash and merge, which will reduce PR
>> authors' load during PR review process.
>>
>>
>> -Rui
>>
>>
>> On Mon, Jul 8, 2019 at 5:44 PM Valentyn Tymofieiev 
>> wrote:
>>
>>> Rui, committer guide[1] does say that all commits are standalone changes:
>>>
>>> We prefer small independent, incremental PRs with descriptive, isolated
 commits. Each commit is a single clear change.

>>>
>>> However in my opinion, this recommendation applies to moments when a PR
>>> is first sent for review, and when a PR is being merged. Committer guide
>>> also mentions that during review iterations authors may add review-related
>>> commits.
>>>
>>> the pull request may have a collection of review-related commits that
 are not meaningful to preserve in the history. The reviewer should give the
 LGTM and then request that the author of the pull request rebase, squash,
 split, etc, the commits, so that the history is most useful.
>>>
>>>
>>> Review-related commits don't have to be isolated independent changes,
>>> and perhaps committer guide and contributor guide [2] should spell out
>>> clearly that authors should not feel pressure to make review commits look
>>> like meaningful changes of their own when it does not make sense to do.  By
>>> the end of the review, review commits should be squashed by a committer or
>>> by the author.
>>>
>>> I think there are some incentives to always squash-and-force-push:
>>> - Committer will not ask the author to squash commits if there is only
>>> one commit.
>>> - We don't have to wait for another round of tests to pass on the final
>>> PR.
>>>
>>> Both concerns are addressed if a committer follows squash-and-merge
>>> workflow.
>>>
>>> [1] https://beam.apache.org/contribute/committer-guide
>>> 
>>> [2] https://beam.apache.org/contribute/
>>>
>>> On Mon, Jul 8, 2019 at 11:33 AM Rui Wang  wrote:
>>>
 Myself usually follows the pattern of "authors force-push their changes
 during every review iteration". The reason is after reading [1], I found
 it's hard to maintain a multiple commits PR as it's hard to create isolated
 commits for different logical pieces of code in practice. Therefore in
 practice I keep squash commits (and then have to force-push) to create at
 least a single isolated commit.



 [1]
 https://beam.apache.org/contribute/committer-guide/#pull-request-review-objectives

 -Rui

 On Mon, Jul 8, 2019 at 11:25 AM Udi Meiri  wrote:

> I think there are already some guidelines here:
> https://beam.apache.org/contribute/committer-guide/#pull-request-review-objectives
>  (maybe
> we could point to them from the PR template?)
> Yes, it is acceptable to ask for squash or if it's ok to squash to a
> single commit.
>
> On Mon, Jul 8, 2019 at 11:14 AM Valentyn Tymofieiev <
> valen...@google.com> wrote:
>
>> I have observed a pattern where authors force-push their changes
>> during every review iteration, so that a pull request always contains one
>> commit. This creates the following problems:
>>
>> 1. It is hard to see what has changed between review iterations.
>> 2. Sometimes authors  make changes in parts of pull requests that the
>> reviewer did not comment on, and such changes may be unnoticed by the
>> reviewer.
>> 3. After a force-push, comments made by reviewers on earlier commit
>> are hard to find.
>>
>> A better workflow may be to:
>> 1. Between review iterations authors push changes in new commit(s),
>> but also keep the original commit.
>> 2. If a follow-up commit does not constitute a meaningful change of
>> its own, it should be prefixed with "fixup: ".
>> 3. Once review has finished either:
>> - Authors squash fixup commits after all reviewers have approved the
>> PR per request of a reviewer.
>> - Committers squash fixup commits during merge.
>>
>> I am curious what thoughts or suggestions others have. In particular:
>> 1. Should we document guidelines for iterating on PRs in our contributor guide?
>

Re: [DISCUSS] Contributor guidelines for iterating on PRs: when to squash commits.

2019-07-09 Thread Kenneth Knowles
If you "allow maintainers to edit" the PR, it is easy for any committer to
fix up the commits and merge. They should not have to ask you to do it,
unless it is not obvious what to do.

Kenn

On Tue, Jul 9, 2019 at 11:05 AM Rui Wang  wrote:

> At least for me, because I usually don't know when PR review is done, in
> order to make PR to be merged into Beam repo faster, I keep squashing
> commits every time so that committers can review and then merge at a time,
> otherwise committers could approve a PR but then ask squashing commits,
> which leads to another ping and wait round.
>
> Thus I prefer committers do squash and merge, which will reduce PR
> authors' load during PR review process.
>
>
> -Rui
>
>
> On Mon, Jul 8, 2019 at 5:44 PM Valentyn Tymofieiev 
> wrote:
>
>> Rui, committer guide[1] does say that all commits are standalone changes:
>>
>> We prefer small independent, incremental PRs with descriptive, isolated
>>> commits. Each commit is a single clear change.
>>>
>>
>> However in my opinion, this recommendation applies to moments when a PR
>> is first sent for review, and when a PR is being merged. Committer guide
>> also mentions that during review iterations authors may add review-related
>> commits.
>>
>> the pull request may have a collection of review-related commits that are
>>> not meaningful to preserve in the history. The reviewer should give the
>>> LGTM and then request that the author of the pull request rebase, squash,
>>> split, etc, the commits, so that the history is most useful.
>>
>>
>> Review-related commits don't have to be isolated independent changes, and
>> perhaps committer guide and contributor guide [2] should spell out clearly
>> that authors should not feel pressure to make review commits look like
>> meaningful changes of their own when it does not make sense to do.  By the
>> end of the review, review commits should be squashed by a committer or by
>> the author.
>>
>> I think there are some incentives to always squash-and-force-push:
>> - Committer will not ask the author to squash commits if there is only
>> one commit.
>> - We don't have to wait for another round of tests to pass on the final
>> PR.
>>
>> Both concerns are addressed if a committer follows squash-and-merge
>> workflow.
>>
>> [1] https://beam.apache.org/contribute/committer-guide
>> 
>> [2] https://beam.apache.org/contribute/
>>
>> On Mon, Jul 8, 2019 at 11:33 AM Rui Wang  wrote:
>>
>>> Myself usually follows the pattern of "authors force-push their changes
>>> during every review iteration". The reason is after reading [1], I found
>>> it's hard to maintain a multiple commits PR as it's hard to create isolated
>>> commits for different logical pieces of code in practice. Therefore in
>>> practice I keep squash commits (and then have to force-push) to create at
>>> least a single isolated commit.
>>>
>>>
>>>
>>> [1]
>>> https://beam.apache.org/contribute/committer-guide/#pull-request-review-objectives
>>>
>>> -Rui
>>>
>>> On Mon, Jul 8, 2019 at 11:25 AM Udi Meiri  wrote:
>>>
 I think there are already some guidelines here:
 https://beam.apache.org/contribute/committer-guide/#pull-request-review-objectives
  (maybe
 we could point to them from the PR template?)
 Yes, it is acceptable to ask for squash or if it's ok to squash to a
 single commit.

 On Mon, Jul 8, 2019 at 11:14 AM Valentyn Tymofieiev <
 valen...@google.com> wrote:

> I have observed a pattern where authors force-push their changes
> during every review iteration, so that a pull request always contains one
> commit. This creates the following problems:
>
> 1. It is hard to see what has changed between review iterations.
> 2. Sometimes authors  make changes in parts of pull requests that the
> reviewer did not comment on, and such changes may be unnoticed by the
> reviewer.
> 3. After a force-push, comments made by reviewers on earlier commit
> are hard to find.
>
> A better workflow may be to:
> 1. Between review iterations authors push changes in new commit(s),
> but also keep the original commit.
> 2. If a follow-up commit does not constitute a meaningful change of
> its own, it should be prefixed with "fixup: ".
> 3. Once review has finished either:
> - Authors squash fixup commits after all reviewers have approved the
> PR per request of a reviewer.
> - Committers squash fixup commits during merge.
>
> I am curious what thoughts or suggestions others have. In particular:
> 1. Should we document guidelines for iterating on PRs in our
> contributor guide?
> 2. Is it acceptable for a reviewer to ask the author to rebase
> squashed changes that were force-pushed to address review feedback onto
> their original commits to simplify the rest of the review?
>
> Thanks.
>
> Related discussion:
> [1] Committer Guidelines / Hygene before merging PRs
> https://lists.apache.org/thread.html/6d922820d6fc352479f88e5c8737f2c8893ddb706a1e578b50d28948@%3Cdev.beam.apache.org%3E

Re: pipeline timeout

2019-07-09 Thread Rui Wang
Just curious: how do you use python to kick off a Java pipeline in airflow?

And, could you go in the direction of having an airflow operator (maybe a
bash operator that executes a jar?) run Java code like this:

result = p.run()
result.wait_until_finish(duration=TIMEOUT_SEC)
if not PipelineState.is_terminal(result.state):
  result.cancel()

?
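
For completeness, a rough Java equivalent of the snippet above, since the
pipeline in question is in Java (a sketch; the class and method names are
made up):

import java.io.IOException;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.joda.time.Duration;

class TimeoutRunner {
  // Cancels the pipeline if it has not reached a terminal state in time.
  // waitUntilFinish(Duration) may return null if the timeout expires first.
  static void runWithTimeout(Pipeline p, Duration timeout) throws IOException {
    PipelineResult result = p.run();
    PipelineResult.State state = result.waitUntilFinish(timeout);
    if (state == null || !state.isTerminal()) {
      result.cancel();
    }
  }
}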

On Tue, Jul 9, 2019 at 1:14 AM Chaim Turkel  wrote:

> sorry for not being explicit. My pipeline is in java and i am
> executing it using python in airflow.
> I would like from airflow to cancel the pipeline if running for more
> than x minutes.
> Currently i am doing this using the cli, but it is not optimal
>
> chaim
>
> On Mon, Jul 8, 2019 at 7:24 PM Mark Liu  wrote:
> >
> > Hi Chaim,
> >
> > You can checkout PipelineResult class and do something like:
> >
> > result = p.run()
> > result.wait_until_finish(duration=TIMEOUT_SEC)
> > if not PipelineState.is_terminal(result.state):
> >   result.cancel()
> >
> > The implementation of PipelineResult depends on what runner you choose.
> And you may find more useful functions in its subclass.
> >
> > Mark
> >
> >
> > On Sun, Jul 7, 2019 at 12:59 AM Chaim Turkel  wrote:
> >>
> >> Hi,
> >>   I have a pipeline that usually takes 15-30 minutes. Sometimes things
> >> get stuck (from 3rd party side). I would like to know if there is a
> >> way to cancel the job if it is running for more than x minutes? I know
> >> there is a cli command but i would like it either on the pipeline
> >> config or in the python sdk.
> >> Any ideas?
> >>
> >> Chaim Turkel
> >>
>


Re: Unable to start BEAM sql shell

2019-07-09 Thread Lukasz Cwik
Thanks for the fixes. I have reviewed both and merged them.

On Tue, Jul 9, 2019 at 10:59 AM Kyle Weaver  wrote:

> I would also make sure that you are running the command from the root of
> the repo.
>
> Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com
> | +1650203
>
>
> On Tue, Jul 9, 2019 at 10:29 AM Rui Wang  wrote:
>
>> By applying the change above on a fresh clone from master, the command to
>> bundle SQL shell can pass on my mac. My Gradle is also 5.4.1 and it's Java
>> 8. A PR  was cut to apply this
>> change to Beam repo.
>>
>> I also found that the command to run SQL shell binary is also broken so
>> sending another PR to update SQL shell doc:
>> https://github.com/apache/beam/pull/9021
>>
>>
>> -Rui
>> 
>>
>>
>>
>> On Mon, Jul 8, 2019 at 8:15 PM Thomas K.  wrote:
>>
>>> Hi,
>>>
>>> Thanks for the advice. I made a fresh clone from master and also removed
>>> the configuration: shadow line from the build.gradle file. However. I'm
>>> still getting the same error message.
>>>
>>> A bit more about my env : I'm running Gradle 5.4.1 and Java 8 on Windows.
>>>
>>> Any thoughts?
>>>
>>> On Tue, Jul 9, 2019 at 1:44 AM Rui Wang  wrote:
>>>
 Indeed it's broken on shadow configuration.  I found that if removing
 configuration: "shadow" from [1], this command will pass.


 [1]:
 https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/shell/build.gradle#L31
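
The dependency declaration at that line presumably looked something like the
sketch below; the described fix is dropping the configuration argument
(illustrative only, not the verbatim file content):

// before: pins the vendored "shadow" configuration, which breaks the build
compile project(path: ":runners:flink:1.5", configuration: "shadow")
// after:
compile project(path: ":runners:flink:1.5")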

 On Mon, Jul 8, 2019 at 1:10 PM Kyle Weaver  wrote:

> Hi Thomas,
>
> You probably need to make sure your clone of the Beam repo is up to
> date. Also, it looks like there's another bug so I filed a Jira:
> https://issues.apache.org/jira/browse/BEAM-7708
>
> Thanks,
> Kyle
>
> Kyle Weaver | Software Engineer | github.com/ibzib |
> kcwea...@google.com | +1650203
>
>
> On Mon, Jul 8, 2019 at 11:59 AM Thomas K. 
> wrote:
>
>> Hi,
>>
>> I'm following the instructions on this page -
>>
>> https://beam.apache.org/documentation/dsls/sql/shell/
>>
>> and running the following command
>>
>> gradlew -p sdks/java/extensions/sql/shell 
>> -Pbeam.sql.shell.bundled=':runners:flink:1.5,:sdks:java:io:kafka' 
>> installDist
>>
>>
>>
>>
>>
>> However, it fails with the error:
>> A problem occurred evaluating project
>> ':sdks:java:extensions:sql:shell'.
>> > Project with path '':runners:flink:1.5' could not be found in
>> project ':sdks:java:extensions:sql:shell'.
>>
>>
>>
>> How do I get it to  recognize all the extensions so that I can run
>> the SQL shell?
>>
>> Thanks.
>>
>>


Re: [DISCUSS] Contributor guidelines for iterating on PRs: when to squash commits.

2019-07-09 Thread Rui Wang
At least for me, because I usually don't know when PR review is done, in
order to make PR to be merged into Beam repo faster, I keep squashing
commits every time so that committers can review and then merge at a time,
otherwise committers could approve a PR but then ask squashing commits,
which leads to another ping and wait round.

Thus I prefer committers do squash and merge, which will reduce PR authors'
load during PR review process.


-Rui


On Mon, Jul 8, 2019 at 5:44 PM Valentyn Tymofieiev 
wrote:

> Rui, committer guide[1] does say that all commits are standalone changes:
>
> We prefer small independent, incremental PRs with descriptive, isolated
>> commits. Each commit is a single clear change.
>>
>
> However in my opinion, this recommendation applies to moments when a PR is
> first sent for review, and when a PR is being merged. Committer guide also
> mentions that during review iterations authors may add review-related
> commits.
>
> the pull request may have a collection of review-related commits that are
>> not meaningful to preserve in the history. The reviewer should give the
>> LGTM and then request that the author of the pull request rebase, squash,
>> split, etc, the commits, so that the history is most useful.
>
>
> Review-related commits don't have to be isolated independent changes, and
> perhaps committer guide and contributor guide [2] should spell out clearly
> that authors should not feel pressure to make review commits look like
> meaningful changes of their own when it does not make sense to do.  By the
> end of the review, review commits should be squashed by a committer or by
> the author.
>
> I think there are some incentives to always squash-and-force-push:
> - Committer will not ask the author to squash commits if there is only one
> commit.
> - We don't have to wait for another round of tests to pass on the final
> PR.
>
> Both concerns are addressed if a committer follows squash-and-merge
> workflow.
>
> [1] https://beam.apache.org/contribute/committer-guide
> 
> [2] https://beam.apache.org/contribute/
>
> On Mon, Jul 8, 2019 at 11:33 AM Rui Wang  wrote:
>
>> Myself usually follows the pattern of "authors force-push their changes
>> during every review iteration". The reason is after reading [1], I found
>> it's hard to maintain a multiple commits PR as it's hard to create isolated
>> commits for different logical pieces of code in practice. Therefore in
>> practice I keep squash commits (and then have to force-push) to create at
>> least a single isolated commit.
>>
>>
>>
>> [1]
>> https://beam.apache.org/contribute/committer-guide/#pull-request-review-objectives
>>
>> -Rui
>>
>> On Mon, Jul 8, 2019 at 11:25 AM Udi Meiri  wrote:
>>
>>> I think there are already some guidelines here:
>>> https://beam.apache.org/contribute/committer-guide/#pull-request-review-objectives
>>>  (maybe
>>> we could point to them from the PR template?)
>>> Yes, it is acceptable to ask for squash or if it's ok to squash to a
>>> single commit.
>>>
>>> On Mon, Jul 8, 2019 at 11:14 AM Valentyn Tymofieiev 
>>> wrote:
>>>
 I have observed a pattern where authors force-push their changes during
 every review iteration, so that a pull request always contains one commit.
 This creates the following problems:

 1. It is hard to see what has changed between review iterations.
 2. Sometimes authors  make changes in parts of pull requests that the
 reviewer did not comment on, and such changes may be unnoticed by the
 reviewer.
 3. After a force-push, comments made by reviewers on earlier commit are
 hard to find.

 A better workflow may be to:
 1. Between review iterations authors push changes in new commit(s), but
 also keep the original commit.
 2. If a follow-up commit does not constitute a meaningful change of its
 own, it should be prefixed with "fixup: ".
 3. Once review has finished either:
 - Authors squash fixup commits after all reviewers have approved the PR
 per request of a reviewer.
 - Committers squash fixup commits during merge.

 I am curious what thoughts or suggestions others have. In particular:
 1. Should we document guidelines for iterating on PRs in our
 contributor guide?
 2. Is it acceptable for a reviewer to ask the author to rebase squashed
 changes that were force-pushed to address review feedback onto their
 original commits to simplify the rest of the review?

 Thanks.

 Related discussion:
 [1] Committer Guidelines / Hygene before merging PRs
 https://lists.apache.org/thread.html/6d922820d6fc352479f88e5c8737f2c8893ddb706a1e578b50d28948@%3Cdev.beam.apache.org%3E

>>>


Apache Beam issue | Reading Avro files and pushing to Bigquery

2019-07-09 Thread Dhiraj Sardana
Hello,

We're using Apache Beam with Google Dataflow. We have a pipeline which reads
data from Google Cloud Storage (Avro files), transforms it, and pushes it to
BigQuery.

The error we're getting: the pipeline processes the data for some time, then
gets stuck, and in the end fails (the data does not reach BigQuery):

Here is the error snippet:
Workflow failed. Causes: S92:AvroIO.ReadAll/Read all via 
FileBasedSource/Reshuffle/Reshuffle/GroupByKey/Read+AvroIO.ReadAll/Read all via 
FileBasedSource/Reshuffle/Reshuffle/GroupByKey/GroupByWindow+AvroIO.ReadAll/Read
 all via FileBasedSource/Reshuffle/Reshuffle/ExpandIterable+AvroIO.ReadAll/Read 
all via FileBasedSource/Reshuffle/Values/Values/Map+AvroIO.ReadAll/Read all via 
FileBasedSource/Read ….

Code snippet:
PCollection<String> records =
    pipeline.apply(Create.of(fileList)).setCoder(StringUtf8Coder.of());
PCollection<GenericRecord> events =
    records.apply(AvroIO.readAllGenericRecords(userDefinedSchema));

Apache Beam version: we tried with both 2.12.0 and 2.8.0, but got the same error.

It would be really helpful if we could get some hints or a solution to this
problem. Let us know if you need more info.


Regards,
Dhiraj




Re: Unable to start BEAM sql shell

2019-07-09 Thread Kyle Weaver
I would also make sure that you are running the command from the root of
the repo.

Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com |
+1650203


On Tue, Jul 9, 2019 at 10:29 AM Rui Wang  wrote:

> By applying the change above on a fresh clone from master, the command to
> bundle the SQL shell passes on my mac. My Gradle is also 5.4.1 and it's Java
> 8. A PR was cut to apply this change to the Beam repo.
>
> I also found that the command to run SQL shell binary is also broken so
> sending another PR to update SQL shell doc:
> https://github.com/apache/beam/pull/9021
>
>
> -Rui
> 
>
>
>
> On Mon, Jul 8, 2019 at 8:15 PM Thomas K.  wrote:
>
>> Hi,
>>
>> Thanks for the advice. I made a fresh clone from master and also removed
>> the configuration: shadow line from the build.gradle file. However, I'm
>> still getting the same error message.
>>
>> A bit more about my env: I'm running Gradle 5.4.1 and Java 8 on Windows.
>>
>> Any thoughts?
>>
>> On Tue, Jul 9, 2019 at 1:44 AM Rui Wang  wrote:
>>
>>> Indeed, it's broken on the shadow configuration. I found that if you remove
>>> configuration: "shadow" from [1], the command passes.
>>>
>>>
>>> [1]:
>>> https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/shell/build.gradle#L31
>>>
>>> On Mon, Jul 8, 2019 at 1:10 PM Kyle Weaver  wrote:
>>>
 Hi Thomas,

 You probably need to make sure your clone of the Beam repo is up to
 date. Also, it looks like there's another bug so I filed a Jira:
 https://issues.apache.org/jira/browse/BEAM-7708

 Thanks,
 Kyle

 Kyle Weaver | Software Engineer | github.com/ibzib |
 kcwea...@google.com | +1650203


 On Mon, Jul 8, 2019 at 11:59 AM Thomas K. 
 wrote:

> Hi,
>
> I'm following the instructions on this page -
>
> https://beam.apache.org/documentation/dsls/sql/shell/
>
> and running the following command
>
> gradlew -p sdks/java/extensions/sql/shell 
> -Pbeam.sql.shell.bundled=':runners:flink:1.5,:sdks:java:io:kafka' 
> installDist
>
>
>
>
>
> However, it fails with the error:
> A problem occurred evaluating project
> ':sdks:java:extensions:sql:shell'.
> > Project with path '':runners:flink:1.5' could not be found in
> project ':sdks:java:extensions:sql:shell'.
>
>
>
> How do I get it to recognize all the extensions so that I can run the
> SQL shell?
>
> Thanks.
>
>


Re: Unable to start BEAM sql shell

2019-07-09 Thread Rui Wang
By applying the change above on a fresh clone from master, the command to
bundle the SQL shell passes on my mac. My Gradle is also 5.4.1 and it's Java
8. A PR was cut to apply this change to the Beam repo.

I also found that the command to run SQL shell binary is also broken so
sending another PR to update SQL shell doc:
https://github.com/apache/beam/pull/9021


-Rui




On Mon, Jul 8, 2019 at 8:15 PM Thomas K.  wrote:

> Hi,
>
> Thanks for the advice. I made a fresh clone from master and also removed
> the configuration: shadow line from the build.gradle file. However, I'm
> still getting the same error message.
>
> A bit more about my env: I'm running Gradle 5.4.1 and Java 8 on Windows.
>
> Any thoughts?
>
> On Tue, Jul 9, 2019 at 1:44 AM Rui Wang  wrote:
>
>> Indeed, it's broken on the shadow configuration. I found that if you remove
>> configuration: "shadow" from [1], the command passes.
>>
>>
>> [1]:
>> https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/shell/build.gradle#L31
>>
>> On Mon, Jul 8, 2019 at 1:10 PM Kyle Weaver  wrote:
>>
>>> Hi Thomas,
>>>
>>> You probably need to make sure your clone of the Beam repo is up to
>>> date. Also, it looks like there's another bug so I filed a Jira:
>>> https://issues.apache.org/jira/browse/BEAM-7708
>>>
>>> Thanks,
>>> Kyle
>>>
>>> Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com
>>> | +1650203
>>>
>>>
>>> On Mon, Jul 8, 2019 at 11:59 AM Thomas K.  wrote:
>>>
 Hi,

 I'm following the instructions on this page -

 https://beam.apache.org/documentation/dsls/sql/shell/

 and running the following command

 gradlew -p sdks/java/extensions/sql/shell 
 -Pbeam.sql.shell.bundled=':runners:flink:1.5,:sdks:java:io:kafka' 
 installDist





 However, it fails with the error:
 A problem occurred evaluating project ':sdks:java:extensions:sql:shell'.
 > Project with path '':runners:flink:1.5' could not be found in project
 ':sdks:java:extensions:sql:shell'.



 How do I get it to recognize all the extensions so that I can run the
 SQL shell?

 Thanks.




Re: [Discuss] Create stackoverflow tags for python, java and go SDKs?

2019-07-09 Thread Rui Wang
I also agree `apache-beam` + `python` or `java` is an option. I can try
tagging SO questions this way when the need arises, to see how it works.

Thanks for all of your participation in this discussion!

-Rui

On Tue, Jul 9, 2019 at 9:49 AM Kenneth Knowles  wrote:

> With the new filtering UI, I think separate `apache-beam` tag and `python`
> or `java` tag seems perfect. No users are using apache-beam-java,
> apache-beam-python, or apache-beam-go so I don't think we need to make it
> up.
>
> Kenn
>
> On Tue, Jul 9, 2019 at 9:38 AM Pablo Estrada  wrote:
>
>> I am able to create the tags if we decide to go that way :)
>>
>> I have seen users relying on the 'python' tag, along with the
>> 'apache-beam' tag, to specify an SDK. Maybe that's enough? What advantage
>> would come from a dedicated beam tag?
>>
>> Perhaps users will be more easily accustomed to using the apache-beam-(SDK)
>> tags, and in that case, it makes sense to add them.
>>
>> Thoughts?
>> ,-P.
>>
>>
>>
>>
>> On Tue, Jul 9, 2019, 8:52 AM Lukasz Cwik  wrote:
>>
>>> That sounds like a good idea to me.
>>>
>>> On Wed, Jul 3, 2019 at 10:45 AM Rui Wang  wrote:
>>>
 Hi Community,

 When reading apache-beam related questions on stackoverflow, it often
 happens that a question only mentions a version number (e.g. 2.8.0) but
 does not mention which SDK it relates to. Sometimes I can tell which SDK it
 is from code snippets; sometimes I cannot, as there is no code snippet. So
 in order to answer those questions I first need to comment and ask which
 SDK.

 I noticed that there is no tag for a specific SDK for apache beam.
 Adding such tags will be helpful because:
 1. Questions with such a tag indicate which SDK they are about.
 2. If questions do not mention the SDK and lack such a tag, I (or
 anyone else) can help tag them.

 Note that creating tags is a privilege in SO that requires >1500
 reputation [1]. If people are generally ok with this idea, we will need to
 ask around the community to see who is able to create tags.


 [1]: https://stackoverflow.com/help/privileges/create-tags

 Rui

>>>


Re: [Discuss] Create stackoverflow tags for python, java and go SDKs?

2019-07-09 Thread Kenneth Knowles
With the new filtering UI, I think separate `apache-beam` tag and `python`
or `java` tag seems perfect. No users are using apache-beam-java,
apache-beam-python, or apache-beam-go so I don't think we need to make it
up.

Kenn

On Tue, Jul 9, 2019 at 9:38 AM Pablo Estrada  wrote:

> I am able to create the tags if we decide to go that way :)
>
> I have seen users relying on the 'python' tag, along with the
> 'apache-beam' tag, to specify an SDK. Maybe that's enough? What advantage
> would come from a dedicated beam tag?
>
> Perhaps users will be more easily accustomed to using the apache-beam-(SDK)
> tags, and in that case, it makes sense to add them.
>
> Thoughts?
> ,-P.
>
>
>
>
> On Tue, Jul 9, 2019, 8:52 AM Lukasz Cwik  wrote:
>
>> That sounds like a good idea to me.
>>
>> On Wed, Jul 3, 2019 at 10:45 AM Rui Wang  wrote:
>>
>>> Hi Community,
>>>
>>> When reading apache-beam related questions on stackoverflow, it often
>>> happens that a question only mentions a version number (e.g. 2.8.0) but
>>> does not mention which SDK it relates to. Sometimes I can tell which SDK it
>>> is from code snippets; sometimes I cannot, as there is no code snippet. So
>>> in order to answer those questions I first need to comment and ask which
>>> SDK.
>>>
>>> I noticed that there is no tag for a specific SDK for apache beam.
>>> Adding such tags will be helpful because:
>>> 1. Questions with such a tag indicate which SDK they are about.
>>> 2. If questions do not mention the SDK and lack such a tag, I (or
>>> anyone else) can help tag them.
>>>
>>> Note that creating tags is a privilege in SO that requires >1500
>>> reputation [1]. If people are generally ok with this idea, we will need to
>>> ask around the community to see who is able to create tags.
>>>
>>>
>>> [1]: https://stackoverflow.com/help/privileges/create-tags
>>>
>>> Rui
>>>
>>


Re: [Discuss] Create stackoverflow tags for python, java and go SDKs?

2019-07-09 Thread Pablo Estrada
I am able to create the tags if we decide to go that way :)

I have seen users relying on the 'python' tag, along with the 'apache-beam'
tag, to specify an SDK. Maybe that's enough? What advantage would come from
a dedicated beam tag?

Perhaps users will be more easily accustomed to using the apache-beam-(SDK)
tags, and in that case, it makes sense to add them.

Thoughts?
,-P.




On Tue, Jul 9, 2019, 8:52 AM Lukasz Cwik  wrote:

> That sounds like a good idea to me.
>
> On Wed, Jul 3, 2019 at 10:45 AM Rui Wang  wrote:
>
>> Hi Community,
>>
>> When reading apache-beam related questions on stackoverflow, it often
>> happens that a question only mentions a version number (e.g. 2.8.0) but
>> does not mention which SDK it relates to. Sometimes I can tell which SDK it
>> is from code snippets; sometimes I cannot, as there is no code snippet. So
>> in order to answer those questions I first need to comment and ask which
>> SDK.
>>
>> I noticed that there is no tag for a specific SDK for apache beam.
>> Adding such tags will be helpful because:
>> 1. Questions with such a tag indicate which SDK they are about.
>> 2. If questions do not mention the SDK and lack such a tag, I (or
>> anyone else) can help tag them.
>>
>> Note that creating tags is a privilege in SO that requires >1500
>> reputation [1]. If people are generally ok with this idea, we will need to
>> ask around the community to see who is able to create tags.
>>
>>
>> [1]: https://stackoverflow.com/help/privileges/create-tags
>>
>> Rui
>>
>


Re: Change of Behavior - JDBC Set Command

2019-07-09 Thread Alireza Samadian
Thank you for the clarification. I closed the current pull request; I will
create a new Jira issue for the proposed methods.

Best,
Alireza

On Mon, Jul 8, 2019 at 4:38 PM Lukasz Cwik  wrote:

> Ok, since you want parsing to be exactly the same, it does make sense
> to expose some part of the PipelineOptionsFactory to do the parsing.
>
> From the API point of view, it would make sense to break up the reflection
> portion to be based upon the object types inside of the PipelineOptions
> interface (backed by the proxy) and the following methods:
> <T> T get(String propertyName);  // returns the value of type T, or the
> default value if unset, based upon default value logic.
> open questions:
> * what happens to the JSON -> T conversion for PipelineOptions
> subinterfaces that have yet to be registered/bound?
> * should we have a <T> T get(String propertyName, Class<T> type)? If yes, how
> do we ensure that Class<T> matches the expected subinterface type that is
> bound in the future?
>
> <T> boolean set(String propertyName, T value);  // returns true if the value
> was previously set
> open questions:
> * does set return true if overriding a default?
> * how do we ensure that the subinterface type that is bound in the future
> is valid and matches the type of T?
> * should this be <T> T set(String propertyName, T value), where the previous
> value, if set, is returned?
>
> boolean unset(String propertyName);  // unsets the property if it has been
> set in the past; returns true if it had been set
> open questions:
> * should default values be considered set?
> * should there be a <T> T unset(String propertyName, Class<T> type) which
> returns the previously set value?
>
> Set<String> properties(); // returns the set of properties that are set
> open questions:
> * should this return property names for default values?
>
> Should we have separate true/false methods that tell you whether something
> is set or is the default based upon the property name?
>
> After the above change, we can expose the parsing logic in
> PipelineOptionsFactory, and the SQL library can then compose the two
> capabilities to parse objects and reflectively set them.
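>
> Put together, the surface could look roughly like this (a sketch only, not
> the actual Beam API; the interface name follows the
> PipelineOptionsReflectionSetter mentioned below, and the generics are my
> reading of the signatures above):
>
> import java.util.Set;
>
> interface PipelineOptionsReflectionSetter {
>   // value of type T, or the default value if unset
>   <T> T get(String propertyName);
>
>   // returns true if the value was previously set
>   <T> boolean set(String propertyName, T value);
>
>   // unsets the property; returns true if it had been set
>   boolean unset(String propertyName);
>
>   // names of the properties that are currently set
>   Set<String> properties();
> }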
>
> On Fri, Jul 5, 2019 at 3:23 AM Andrew Pilloud  wrote:
>
>> We have definitely tried to rework this a few times in the past. The
>> biggest problem is that some changes to pipeline options require multiple
>> values to change at once. For example, changing the runner might require
>> some options to be set and others reset before the options are valid.
>>
>> I'll try to dig up the past threads when I get back from vacation on
>> Monday, but I think there were a few PRs trying to do this before.
>>
>> Andrew
>>
>> On Wed, Jul 3, 2019, 9:54 AM Alireza Samadian 
>> wrote:
>>
>>> Before this PR, the set command was using a map to store values and then
>>> it was using PipelineOptionsFactory#fromArgs to parse those values.
>>> Therefore, by using PipelineOptionsFactory#parseObjects, we keep the same
>>> value-parsing behavior for the SET command as before. Using
>>> PipelineOptionsFactory for parsing the objects also has two more
>>> advantages: it prevents code duplication for parsing objects, and
>>> PipelineOptionsFactory does some extra checks (for instance, it checks
>>> whether the runner is a valid type of runner). Thus, I think parsing the
>>> values the same way as PipelineOptionsFactory#parseObjects is the better
>>> option.
>>>
>>>
>>> On Tue, Jul 2, 2019 at 3:50 PM Lukasz Cwik  wrote:
>>>
 I see, in the current PR it seems like we are trying to adapt the
 parsing logic of PipelineOptions command-line value parsing to all SQL
 use cases, since we are exposing the parseOption method to be used in
 PipelineOptionsReflectionSetter#setOption.

 I should have asked in my earlier e-mail whether we wanted string to
 value parsing to match what we do inside of the PipelineOptionsFactory. If
 no, then PipelineOptionsReflectionSetter#setOption should take an Object
 type for value instead of String.

 On Tue, Jul 2, 2019 at 9:39 AM Anton Kedin  wrote:

> The proposed API assumes you already have a property name and a value
> parsed somehow, and now want to update a field on a pre-existing options
> object with that value, so there is no assumption about parsing being the
> same or not. E.g. if you set a property called `runner` to a string value
> `DirectRunner`, it should behave the same way whether it came from command
> line args, SQL SET command, JDBC connection args, or anywhere else.
>
> That said, we parse the SET command differently from command-line args
> [1]. We also parse the pipeline options from the connection args [2], which
> have a different syntax as well. I don't know whether we can easily deal
> with this aspect at this point (and whether we should), but if a value can
> get parsed, the idea is that it should work the same way after that.
>
> [1]
> https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/sr

Re: [Discuss] Create stackoverflow tags for python, java and go SDKs?

2019-07-09 Thread Lukasz Cwik
That sounds like a good idea to me.

On Wed, Jul 3, 2019 at 10:45 AM Rui Wang  wrote:

> Hi Community,
>
> When reading apache-beam related questions on stackoverflow, it often
> happens that a question only mentions a version number (e.g. 2.8.0) but
> does not mention which SDK it relates to. Sometimes I can tell which SDK it
> is from code snippets; sometimes I cannot, as there is no code snippet. So
> in order to answer those questions I first need to comment and ask which
> SDK.
>
> I noticed that there is no tag for a specific SDK for apache beam.
> Adding such tags will be helpful because:
> 1. Questions with such a tag indicate which SDK they are about.
> 2. If questions do not mention the SDK and lack such a tag, I (or anyone
> else) can help tag them.
>
> Note that creating tags is a privilege in SO that requires >1500
> reputation [1]. If people are generally ok with this idea, we will need to
> ask around the community to see who is able to create tags.
>
>
> [1]: https://stackoverflow.com/help/privileges/create-tags
>
> Rui
>


Re: Looping timer blog

2019-07-09 Thread Reza Rokni
Raised a PR for the edit to the blog sample:
https://github.com/apache/beam/pull/9010

Jan, I will try to get some time later this month to have a look at
@RequiresTimeSortedInput.

On Thu, 27 Jun 2019 at 15:29, Jan Lukavský  wrote:

> Hi Reza,
>
> cool, I have put together a PR [1], which is still not completely ready.
> There are at least some tests still missing - probably @ValidatesRunner -
> and then fixing runners that won't pass them. It also misses a few features
> described in the design doc, but those could probably be fixed later
> (support for allowedLateness and a user-supplied sorting criterion). Would
> you like to test this on some of your code? It might suffice to put
> @RequiresTimeSortedInput on the @ProcessElement method, and the input
> should start being sorted (it should work at least for DirectRunner,
> FlinkRunner (stream and batch) and SparkRunner (batch)).
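>
> I.e., something like this (a sketch; the annotation comes from the PR
> below, so its final package and shape may differ):
>
> new DoFn<KV<String, Foo>, Bar>() {
>   @RequiresTimeSortedInput
>   @ProcessElement
>   public void process(ProcessContext c) {
>     // elements now arrive ordered by event timestamp within each key
>   }
> };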
>
> [1] https://github.com/apache/beam/pull/8774
> On 6/27/19 6:16 AM, Reza Rokni wrote:
>
>
>
> On Tue, 25 Jun 2019 at 21:20, Jan Lukavský  wrote:
>
>>
>> On 6/25/19 1:43 PM, Reza Rokni wrote:
>>
>>
>>
>> On Tue, 25 Jun 2019 at 18:12, Jan Lukavský  wrote:
>>
>>> > The TTL check would be in the same Timer rather than a separate
>>> Timer. The max value processed in each OnTimer call would be stored in
>>> ValueState and used as a base to know how long it has been since the
>>> pipeline has seen an external value for that key.
>>>
>>> OK, that seems to work, if you store the maximum timestamp in a value state
>>> (that is, you basically generate a per-key watermark).
>>>
>>> > You can store it in ValueState rather than BagState, but yes you store
>>> that value in State ready for the next OnTimer() fire.
>>>
>>> In my understanding of the problem, I'd say that this approach seems a
>>> little suboptimal. Consider the following, when trying to generate the OHLC
>>> data (open, high, low, close - that is, moving the last closing price to the
>>> next window's opening price):
>>>
>>>  - suppose we have four times T1 < T2 < T3 < T4, where times T2 and T4
>>> denote "end of windows" (closing times)
>>>
>>>  - first (in processing time), we receive a value for time T3; we cache it
>>> in ValueState and set a timer for T4
>>>
>>>  - next, we receive a value for T1 - but we cannot overwrite the value
>>> already written for T3, right? What to do then? And will we reset the timer
>>> to time T3?
>>>
>>>  - basically, because we received *two* values, both of which are needed
>>> and no timer could have fired in between, we need both values stored
>>> to know which value to emit when the timer fires. And consider that in
>>> batch, the timer fires only after we see all data (without any ordering).
>>>
>> I assume you are referring to late data rather than out-of-order data
>> (the latter being taken care of by the in-memory sort). As discussed in the
>> talk, late data is a sharp edge; one solution for late data is to branch it
>> away before the GlobalWindow + State DoFn. This can then be output from the
>> pipeline into a sink with a marker to initiate a manual procedure for
>> correction. Essentially a manual redaction.
>>
>> Which in-memory sort do you refer to? I'm pretty sure there must be
>> sorting involved for this to work, but I'm not quite sure where exactly it
>> is in your implementation. You said that you can put data in ValueState
>> rather than BagState, so do you have a List as the value in the ValueState?
>> Or do you wrap the stateful ParDo with some other sorting logic? And if
>> so, how does this work on batch? I suppose that it all has to fit in
>> memory. I think this all goes around the @RequiresTimeSortedInput
>> annotation that I propose. Maybe we can cooperate on that? :)
>>
> Hu... nice, this chat made me notice a bug in the looping timer example
> code that we missed, thanx :-). The ValueState timerRunning should
> actually be a ValueState minTimestamp, and the check to set the timer
> needs to be: if (null, or element.Timestamp < timer.Timestamp). This
> also requires the use of the timer-read pattern, as we don't have
> timer.read():
> https://stackoverflow.com/questions/55912522/setting-a-timer-to-the-minimum-timestamp-seen/55912542#55912542.
> I will fix and put up a PR to correct the blog.
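>
> Roughly, the corrected check would be (a sketch with assumed names, not
> the blog's exact code):
>
> Instant min = minTimestamp.read();  // ValueState<Instant>
> if (min == null || c.timestamp().isBefore(min)) {
>   // track the minimum element timestamp seen and (re)set the looping
>   // timer relative to it
>   minTimestamp.write(c.timestamp());
>   loopingTimer.set(c.timestamp().plus(INTERVAL));
> }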
>
> For the hold and propagate pattern (for those following the original
> thread: the pattern is not covered in the blog example code, but was
> discussed at the summit; a sketch follows below):
> OnProcess()
> - You can drop the accumulators into BagState.
> - A timer is set at the minimum time interval.
> OnTimer()
> - The list is sorted in memory. For a lot of timeseries use cases (for
> example OHLC) the memory issues are heavily mitigated, as we can use Fixed
> Window partial aggregations before the GlobalWindow stage. (Partial
> because they don't have the correct Open value set until they flow into the
> Global Window.) Of course, how big the window is dictates the compression
> you would get.
> - The current timer is set again to fire in the next interval window.
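>
> A sketch of the OnTimer side (the names and the OHLC element type are
> made up for illustration):
>
> @OnTimer("loopingTimer")
> public void onTimer(OnTimerContext c,
>     @StateId("partials") BagState<TimestampedValue<OHLC>> partials,
>     @TimerId("loopingTimer") Timer loopingTimer) {
>   List<TimestampedValue<OHLC>> sorted = new ArrayList<>();
>   partials.read().forEach(sorted::add);
>   sorted.sort(Comparator.comparing(TimestampedValue::getTimestamp));
>   // walk the sorted partials, carrying each Close forward as the next
>   // Open, emit the completed aggregates, then re-arm the timer
>   loopingTimer.set(c.timestamp().plus(INTERVAL));
> }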
>
> @RequiresTimeSortedInput looks s

Re: pipeline timeout

2019-07-09 Thread Chaim Turkel
Sorry for not being explicit. My pipeline is in Java, and I am
executing it using Python in Airflow.
I would like Airflow to cancel the pipeline if it has been running for more
than x minutes.
Currently I am doing this using the CLI, but it is not optimal.

chaim
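
Since the pipeline itself is Java, the same pattern Mark shows below also
exists on the Java side (a sketch; TIMEOUT_MINUTES and the exception
handling are my own, not an Airflow integration):

PipelineResult result = p.run();
// org.joda.time.Duration; returns the terminal state, or null on timeout
result.waitUntilFinish(Duration.standardMinutes(TIMEOUT_MINUTES));
if (!result.getState().isTerminal()) {
  result.cancel();  // declared to throw IOException
}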

On Mon, Jul 8, 2019 at 7:24 PM Mark Liu  wrote:
>
> Hi Chaim,
>
> You can check out the PipelineResult class and do something like:
>
> from apache_beam.runners.runner import PipelineState
>
> result = p.run()
> # note: the duration argument is in milliseconds
> result.wait_until_finish(duration=TIMEOUT_SEC)
> if not PipelineState.is_terminal(result.state):
>   result.cancel()
>
> The implementation of PipelineResult depends on what runner you choose, and
> you may find more useful functions in its subclasses.
>
> Mark
>
>
> On Sun, Jul 7, 2019 at 12:59 AM Chaim Turkel  wrote:
>>
>> Hi,
>>   I have a pipeline that usually takes 15-30 minutes. Sometimes things
>> get stuck (from 3rd party side). I would like to know if there is a
>> way to cancel the job if it is running for more than x minutes? I know
>> there is a cli command but i would like it either on the pipeline
>> config or in the python sdk.
>> Any ideas?
>>
>> Chaim Turkel
>>
>> --
>>
>>
>> Loans are funded by
>> FinWise Bank, a Utah-chartered bank located in Sandy,
>> Utah, member FDIC, Equal
>> Opportunity Lender. Merchant Cash Advances are
>> made by Behalf. For more
>> information on ECOA, click here
>> . For important information about
>> opening a new
>> account, review Patriot Act procedures here
>> .
>> Visit Legal
>>  to
>> review our comprehensive program terms,
>> conditions, and disclosures.

-- 


Loans are funded by
FinWise Bank, a Utah-chartered bank located in Sandy, 
Utah, member FDIC, Equal
Opportunity Lender. Merchant Cash Advances are 
made by Behalf. For more
information on ECOA, click here 
. For important information about 
opening a new
account, review Patriot Act procedures here 
.
Visit Legal 
 to
review our comprehensive program terms, 
conditions, and disclosures.