Apache Pulsar connector for Beam

2019-10-23 Thread Taher Koitawala
Hi All,
I've been wanting to know whether we have a Pulsar connector for Beam.
Pulsar is another message queue, like Kafka, and I would like to build a
streaming pipeline with it. Any help would be appreciated.


Regards,
Taher Koitawala


Re: Question related to running unit tests in IDE

2019-10-23 Thread Saikat Maitra
Thank you for your email Ryan. I will try to reimport and share my findings.

Regards,
Saikat

On Wed, Oct 23, 2019 at 11:32 AM Ryan Skraba  wrote:

> Just for info -- I managed to get a pretty good state using IntelliJ
> 2019.2.3 (Fedora) and a plain gradle import!
>
> There's a slack channel at https://s.apache.org/beam-slack-channel
> (see https://beam.apache.org/community/contact-us/) called
> #beam-intellij.  It's pretty low-traffic, but you might be able to get
> some real-time help there if you need it.
>
> Ryan
>
> On Wed, Oct 23, 2019 at 2:11 AM Saikat Maitra 
> wrote:
> >
> > Hi Michal, Alexey
> >
> > Thank you for your email. I am using macOS Catalina and JDK 8 with
> IntelliJ IDEA 2019.1
> >
> > I will try to setup IntelliJ from scratch and see if the error resolves.
> >
> > Regards,
> > Saikat
> >
> >
> >
> >
> > On Tue, Oct 22, 2019 at 7:05 AM Alexey Romanenko <
> aromanenko@gmail.com> wrote:
> >>
> >> Hi,
> >>
> >> Thank you for your interest in contributing!
> >>
> >> Did you properly import the project (as explained on page [1]), and were
> all deps resolved successfully?
> >>
> >> [1]
> https://cwiki.apache.org/confluence/display/BEAM/Set+up+IntelliJ+from+scratch
> >>
> >> On 22 Oct 2019, at 02:28, Saikat Maitra 
> wrote:
> >>
> >> Hi,
> >>
> >> I am interested in contributing to this issue
> >>
> >> https://issues.apache.org/jira/browse/BEAM-3658
> >>
> >> I have followed the contribution guide and was able to build the
> project locally using gradlew commands.
> >>
> >> I wanted to debug and trace the issue further by running the tests
> locally using IntelliJ IDEA, but I am getting the following errors. I looked
> up the docs related to running tests (
> https://cwiki.apache.org/confluence/display/BEAM/Run+a+single+unit+test)
> and common IDE errors (
> https://cwiki.apache.org/confluence/display/BEAM/%28FAQ%29+Recovering+from+common+IDE+errors)
> but have not found similar errors.
> >>
> >> Error:(632, 17) java: cannot find symbol
> >>   symbol:   method
> apply(org.apache.beam.sdk.transforms.Values)
> >>   location: interface org.apache.beam.sdk.values.POutput
> >>
> >> Error:(169, 26) java: cannot find symbol
> >>   symbol:   class PCollection
> >>   location: class org.apache.beam.sdk.transforms.Watch
> >>
> >> Error:(169, 59) java: cannot find symbol
> >>   symbol:   class KV
> >>   location: class org.apache.beam.sdk.transforms.Watch
> >>
> >> Please let me know if you have feedback.
> >>
> >> Regards,
> >> Saikat
> >>
> >>
>


Re: FileIO and windowed writes

2019-10-23 Thread Koprivica,Preston Blake
I currently only have quick access to test the DirectRunner and the 
FlinkRunner.  This only manifests in the FlinkRunner.

From: Reuven Lax 
Reply-To: "dev@beam.apache.org" 
Date: Wednesday, October 23, 2019 at 4:14 PM
To: dev 
Subject: Re: FileIO and windowed writes

Is this only in the Flink runner?


Re: FileIO and windowed writes

2019-10-23 Thread Reuven Lax
Is this only in the Flink runner?



Re: Strict timer ordering in Samza and Portable Flink Runners

2019-10-23 Thread Xinyu Liu
Hi, Jan,

Thanks for reporting this. I assigned BEAM-8459 to myself and will take a
look soon.

Thanks,
Xinyu

On Wed, Oct 23, 2019 at 2:54 AM Jan Lukavský  wrote:

> Hi,
>
> as part of [1] a new set of validatesRunner tests has been introduced.
> These tests (currently marked as category UsesStrictTimerOrdering)
> verify that runners fire timers in increasing timestamp order under all
> circumstances. After adding these validatesRunner tests, Samza [2] and
> Portable Flink [3] started to fail these tests. I have created the
> tracking issues for that, because that behavior should be fixed (timers
> fired in the wrong order can cause erratic behavior and/or data loss).
>
> I'm writing to anyone interested in solving these issues.
>
> Cheers,
>
>   Jan
>
> [1] https://issues.apache.org/jira/browse/BEAM-7520
>
> [2] https://issues.apache.org/jira/browse/BEAM-8459
>
> [3] https://issues.apache.org/jira/browse/BEAM-8460
>
>


Re: FileIO and windowed writes

2019-10-23 Thread Koprivica,Preston Blake
I’ve tried different windowing functions and all result in the same behavior.  
The one in the previous email used the global window and a processing time 
based repeated trigger.  The filename policy used real system time to timestamp 
the outgoing files.

Here are a couple other window+trigger combos I’ve tried (with window based 
filename strategies):

Window.into(FixedWindows.of(windowDur))
    .withAllowedLateness(Duration.standardHours(24))
    .discardingFiredPanes()
    .triggering(
        Repeatedly.forever(
            AfterProcessingTime.pastFirstElementInPane().plusDelayOf(windowDur)))
ie a fixed window with processing time based trigger + ~infinite lateness


pipeline
    .apply(
        format("ReadSQS(%s)", options.getQueueUrl()),
        SqsIO.read().withQueueUrl(options.getQueueUrl()))
    .apply(WithTimestamps.of((Message m) -> Instant.now()))
    .apply(
        format("Window(%s)", options.getWindowDuration()),
        Window.into(FixedWindows.of(windowDur)))
ie map event time to processing time and use default trigger (ie close of 
window, no lateness)
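As a plain-Java sketch of the FixedWindows semantics used above (an illustration only — this is not Beam code, and the class and method names are made up): the window start is simply the element timestamp rounded down to a multiple of the window duration.

```java
import java.time.Instant;

/**
 * Plain-Java sketch (not Beam code) of how FixedWindows.of(windowDur)
 * assigns an element timestamp to a window. Assumes a zero window offset.
 */
class FixedWindowSketch {

    /** Start (in millis) of the fixed window containing timestampMillis. */
    static long windowStart(long timestampMillis, long windowDurMillis) {
        // Round the timestamp down to the nearest multiple of the duration.
        return timestampMillis - Math.floorMod(timestampMillis, windowDurMillis);
    }

    public static void main(String[] args) {
        long windowDur = 60_000L; // 1-minute windows
        long ts = Instant.parse("2019-10-23T14:35:42Z").toEpochMilli();
        long start = windowStart(ts, windowDur);
        // The element lands in [14:35:00, 14:36:00)
        System.out.println(Instant.ofEpochMilli(start));             // 2019-10-23T14:35:00Z
        System.out.println(Instant.ofEpochMilli(start + windowDur)); // 2019-10-23T14:36:00Z
    }
}
```

Beam's actual FixedWindows also supports a non-zero offset; this sketch assumes offset zero.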

These all resulted in the same behavior – data gets hung up in a temp file 
somewhere and the finalize file logic never seems to run.

Thanks,
-Preston

From: Reuven Lax 
Reply-To: "dev@beam.apache.org" 
Date: Wednesday, October 23, 2019 at 2:35 PM
To: dev 
Subject: Re: FileIO and windowed writes

What WindowFn are you using?



Re: FileIO and windowed writes

2019-10-23 Thread Reuven Lax
What WindowFn are you using?



Re: [UPDATE] Preparing for Beam 2.17.0 release

2019-10-23 Thread Kenneth Knowles
I opened https://github.com/apache/beam/pull/9862 to raise the documentation
of Fix Version to the top level. It also includes the write-up of Jira
priorities, to make clear that "Blocker" priority does not refer to release
blocking.



Re: JIRA priorities explaination

2019-10-23 Thread Kenneth Knowles
I finally got around to writing some of this up. It is minimal. Feedback is
welcome, especially if what I have written does not accurately represent
the community's approach.

https://github.com/apache/beam/pull/9862

Kenn

On Mon, Feb 11, 2019 at 3:21 PM Daniel Oliveira 
wrote:

> Ah, sorry, I missed that Alex was just quoting from our Jira installation
> (didn't read his email closely enough). Also I wasn't aware about those
> pages on our website.
>
> Seeing as we do have definitions for our priorities, I guess my main
> request would be that they be made more discoverable somehow. I don't think
> the tooltips are reliable, and the pages on the website are informative,
> but hard to find. Since it feels a bit lazy to say "this isn't discoverable
> enough" without suggesting any improvements, I'd like to propose these two
> changes:
>
> 1. We should write a Beam Jira Guide with basic information about our
> Jira. I think the bug priorities should go in here, but also anything else
> we would want someone to know before filing any Jira issues, like how our
> components are organized or what the different issue types mean. This guide
> could either be written in the website or the wiki, but I think it should
> definitely be linked in https://beam.apache.org/contribute/ so that
> newcomers read it before getting their Jira account approved. The goal here
> being to have a reference for the basics of our Jira since at the moment it
> doesn't seem like we have anything for this.
>
> 2. The existing info on Post-commit and pre-commit policies doesn't seem
> very discoverable to someone monitoring the Pre/Post-commits. I've reported
> a handful of test-failures already and haven't seen this link mentioned
> much. We should try to find a way to funnel people towards this link when
> there's an issue, the same way we try to funnel people towards the
> contribution guide when they write a PR. As a note, while writing this
> email I remembered this link that someone gave me before (
> https://s.apache.org/beam-test-failure
> ).
> That mentions the Post-commit policies page, so maybe it's just a matter of
> pasting that all over our Jenkins builds whenever we have a failing test?
>
> PS: I'm also definitely for SLOs, but I figure it's probably better
> discussed in a separate thread so I'm trying to stick to the subject of
> priority definitions.
>
> On Mon, Feb 11, 2019 at 9:17 AM Scott Wegner  wrote:
>
>> Thanks for driving this discussion. I also was not aware of these
>> existing definitions. Once we agree on the terms, let's add them to our
>> Contributor Guide and start using them.
>>
>> +1 in general; I like both Alex and Kenn's definitions; Additional
>> wordsmithing could be moved to a Pull Request. Can we make the definitions
>> useful for both the person filing a bug, and the assignee, i.e.
>>
>> : .
>> 
>>
>> On Sun, Feb 10, 2019 at 7:49 PM Kenneth Knowles  wrote:
>>
>>> The content that Alex posted* is the definition from our Jira
>>> installation anyhow.
>>>
>>> I just searched around, and there's
>>> https://community.atlassian.com/t5/Jira-questions/According-to-Jira-What-is-Blocker-Critical-Major-Minor-and/qaq-p/668774
>>> which makes clear that this is really user-defined, since Jira has many
>>> deployments with their own configs.
>>>
>>> I guess what I want to know about this thread is what action is being
>>> proposed?
>>>
>>> Previously, there was a thread that resulted in
>>> https://beam.apache.org/contribute/precommit-policies/ and
>>> https://beam.apache.org/contribute/postcommits-policies/. These have
>>> test failures and flakes as Critical. I agree with Alex that these should
>>> be Blocker. They disrupt the work of the entire community, so we need to
>>> drop everything and get green again.
>>>
>>> Other than that, I think what you - Daniel - are suggesting is that the
>>> definition might be best expressed as SLOs. I asked on
>>> u...@infra.apache.org about how we could have those and the answer is
>>> the homebrew
>>> https://svn.apache.org/repos/infra/infrastructure/trunk/projects/status/sla/jira/.
>>> If anyone has time to dig into that and see if it can work for us, that
>>> would be cool.
>>>
>>> Kenn
>>>
>>> *Blocker: Blocks development and/or testing work, production could not
>>> run
>>> Critical: Crashes, loss of data, severe memory leak.
>>> Major (Default): Major loss of function.
>>> Minor: Minor loss of function, or other problem where easy workaround is
>>> present.
>>> Trivial: Cosmetic problem, like misspelled words or misaligned text.
>>>
>>>
>>> On Sun, Feb 10, 2019 at 7:20 PM Daniel Oliveira 
>>> wrote:
>>>
 Are there existing meanings for the priorities in Jira already? I
 wasn't able to find any info on either the Beam website or wiki about it,
 so I've just been prioritizing issues based on gut feeling. If not, I think
 having some 

FileIO and windowed writes

2019-10-23 Thread Koprivica,Preston Blake
Hi guys,

I’m currently working on a simple system where the intention is to ingest data
from a realtime stream – in this case Amazon SQS – and write the output in an
incremental fashion to a durable filesystem (i.e., S3).  It’s easy to think of
this as a lo-fi journaling system.  We need to make sure that data that’s
written to the source queue eventually makes it to S3.  We are utilizing
FileIO windowed writes with a custom naming policy to partition the files by
their event time.  Because SQS can’t guarantee order, we do have to allow late
messages.  Moreover, we need a further guarantee that a message is written in
a timely manner – we’re thinking some constant multiple of the windowing
duration.  As a first pass, we were thinking of a processing-time trigger
that fires on some regular interval.  For context, here’s an example of the
pipeline:

ReadSQS -> Window + Processing Time Trigger -> Convert to Avro Message -> Write 
to Avro

  pipeline
      .apply(SqsIO.read().withQueueUrl(options.getQueueUrl()))
      .apply(
          Window.configure()
              .discardingFiredPanes()
              .triggering(
                  Repeatedly.forever(
                      AfterProcessingTime.pastFirstElementInPane()
                          .plusDelayOf(windowDur))))
      .apply(ParDo.of(new SqsMsgToAvroDoFn<>(recordClass, options)))
      .setCoder(AvroCoder.of(recordClass))
      .apply(
          AvroIO.write(recordClass)
              .withWindowedWrites()
              .withTempDirectory(options.getTempDir())
              .withNumShards(options.getShards())
              .to(new WindowedFilenamePolicy(options.getOutputPrefix(), "avro")));

This all seemed fairly straightforward.  I have not yet observed lost data with
this pipeline, but I am seeing an issue with timeliness.  Things seem to get
hung up on finalizing file output, but I have yet to truly pinpoint the issue.
To really highlight it, I can set up a test where I send a single message
to the source queue.  If nothing else happens, the data never makes it to its
final output using the FlinkRunner (beam-2.15.0, flink-1.8).  Has anyone seen
this behavior before?  Is the expectation of eventual consistency wrong?

Thanks,
-Preston
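To make the trigger's timing concrete, the semantics of
Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(delay))
can be sketched in plain Java. This is an illustration only, not Beam code: it assumes
arrivals are given in processing-time order, and it ignores watermarks and pane
finalization (which is exactly the part that appears to hang on the FlinkRunner).

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Plain-Java sketch (not Beam code) of a repeated processing-time trigger:
 * each pane fires `delay` millis after its first element arrives, then the
 * trigger resets and waits for the next element to open a new pane.
 */
class ProcessingTimeTriggerSketch {

    /** Pane fire times for elements arriving at the given processing times (sorted). */
    static List<Long> fireTimes(List<Long> arrivalsMillis, long delayMillis) {
        List<Long> fires = new ArrayList<>();
        Long paneStart = null; // arrival time of the first element in the current pane
        for (long t : arrivalsMillis) {
            if (paneStart != null && t >= paneStart + delayMillis) {
                fires.add(paneStart + delayMillis); // pane fired before this element arrived
                paneStart = null;
            }
            if (paneStart == null) {
                paneStart = t; // this element opens a new pane
            }
        }
        if (paneStart != null) {
            fires.add(paneStart + delayMillis); // the last open pane fires later
        }
        return fires;
    }

    public static void main(String[] args) {
        // Elements at t=0s, 1s, 12s with a 10s delay: panes fire at 10s and 22s.
        System.out.println(fireTimes(List.of(0L, 1_000L, 12_000L), 10_000L));
    }
}
```

The sketch shows why data should become visible within one trigger delay of arrival; if output only appears when more input arrives, the problem is downstream of the trigger (e.g., in file finalization).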




CONFIDENTIALITY NOTICE This message and any included attachments are from 
Cerner Corporation and are intended only for the addressee. The information 
contained in this message is confidential and may constitute inside or 
non-public information under international, federal, or state securities laws. 
Unauthorized forwarding, printing, copying, distribution, or use of such 
information is strictly prohibited and may be unlawful. If you are not the 
addressee, please promptly delete this message and notify the sender of the 
delivery error by e-mail or you may call Cerner's corporate offices in Kansas 
City, Missouri, U.S.A at (+1) (816)221-1024.


Re: [UPDATE] Preparing for Beam 2.17.0 release

2019-10-23 Thread Kenneth Knowles
I've gone over the tickets and removed Fix Version from many of them that
do not seem to be critical defects. If I removed Fix Version from a ticket
you care about, please feel free to add it back. I am not trying to decide
what is in/out of the release, just trying to triage the Jira data to match
expected practices.

It should probably be documented somewhere outside of the release guide. As
far as I can tell, the fact that we triage them down to zero is the only
place we mention that it is used to indicate release blockers and not used
for feature targets.

Kenn

On Wed, Oct 23, 2019 at 10:40 AM Kenneth Knowles  wrote:

>  Wow, 28 release blocking tickets! That is the most I've ever seen, by
> far. Many appear to be feature requests, not release-blocking defects. I
> believe this is not according to our normal best practice. The release
> cadence should not wait for features in progress, with exceptions discussed
> on dev@. As a matter of best practice, I think we should triage feature
> requests to not have Fix Version set until it has been discussed on dev@.
>
> Kenn
>
> On Wed, Oct 23, 2019 at 9:55 AM Mikhail Gryzykhin 
> wrote:
>
>> Hi all,
>>
>> Beam 2.17 release branch cut is scheduled today (2019/10/23) according
>> to the release calendar [1].  I'll start working on the branch cutoff
>> and later work on cherry picking blocker fixes.
>>
>> If you have release blocking issues for 2.17 please mark their "Fix
>> Version" as 2.17.0 [2]. This tag is already created in JIRA in case you
>> would like to move any non-blocking issues to that version.
>>
>> There is a decent number of open bugs to be resolved in 2.17.0 [2], and
>> only 4 [3] are marked as blockers. Please review whether these bugs actually
>> need to be resolved in 2.17.0 and prioritize fixes if possible.
>>
>> Any thoughts, comments, objections?
>>
>> Regards.
>> Mikhail.
>>
>>
>> [1]
>> https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com
>> [2]
>> https://issues.apache.org/jira/browse/BEAM-8457?jql=project%20%3D%20BEAM%20AND%20status%20in%20(Reopened%2C%20Open%2C%20%22In%20Progress%22%2C%20%22Under%20Discussion%22%2C%20%22In%20Implementation%22%2C%20%22Triage%20Needed%22)%20AND%20fixVersion%20%3D%202.17.0
>> [3]
>> https://issues.apache.org/jira/browse/BEAM-8457?jql=project%20%3D%20BEAM%20AND%20status%20in%20(Reopened%2C%20Open%2C%20%22In%20Progress%22%2C%20%22Under%20Discussion%22%2C%20%22In%20Implementation%22%2C%20%22Triage%20Needed%22)%20AND%20priority%20%3D%20Blocker%20AND%20fixVersion%20%3D%202.17.0
>>
>


Re: [UPDATE] Preparing for Beam 2.17.0 release

2019-10-23 Thread Kenneth Knowles
 Wow, 28 release blocking tickets! That is the most I've ever seen, by far.
Many appear to be feature requests, not release-blocking defects. I believe
this is not according to our normal best practice. The release cadence
should not wait for features in progress, with exceptions discussed on dev@.
As a matter of best practice, I think we should triage feature requests to
not have Fix Version set until it has been discussed on dev@.

Kenn

On Wed, Oct 23, 2019 at 9:55 AM Mikhail Gryzykhin  wrote:

> Hi all,
>
> Beam 2.17 release branch cut is scheduled today (2019/10/23) according to
> the release calendar [1].  I'll start working on the branch cutoff and
> later work on cherry picking blocker fixes.
>
> If you have release blocking issues for 2.17 please mark their "Fix
> Version" as 2.17.0 [2]. This tag is already created in JIRA in case you
> would like to move any non-blocking issues to that version.
>
> There is a decent number of open bugs to be resolved in 2.17.0 [2], and
> only 4 [3] are marked as blockers. Please review whether these bugs actually
> need to be resolved in 2.17.0 and prioritize fixes if possible.
>
> Any thoughts, comments, objections?
>
> Regards.
> Mikhail.
>
>
> [1]
> https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com
> [2]
> https://issues.apache.org/jira/browse/BEAM-8457?jql=project%20%3D%20BEAM%20AND%20status%20in%20(Reopened%2C%20Open%2C%20%22In%20Progress%22%2C%20%22Under%20Discussion%22%2C%20%22In%20Implementation%22%2C%20%22Triage%20Needed%22)%20AND%20fixVersion%20%3D%202.17.0
> [3]
> https://issues.apache.org/jira/browse/BEAM-8457?jql=project%20%3D%20BEAM%20AND%20status%20in%20(Reopened%2C%20Open%2C%20%22In%20Progress%22%2C%20%22Under%20Discussion%22%2C%20%22In%20Implementation%22%2C%20%22Triage%20Needed%22)%20AND%20priority%20%3D%20Blocker%20AND%20fixVersion%20%3D%202.17.0
>


[UPDATE] Preparing for Beam 2.17.0 release

2019-10-23 Thread Mikhail Gryzykhin
Hi all,

Beam 2.17 release branch cut is scheduled today (2019/10/23) according to
the release calendar [1].  I'll start working on the branch cutoff and
later work on cherry picking blocker fixes.

If you have release blocking issues for 2.17 please mark their "Fix
Version" as 2.17.0 [2]. This tag is already created in JIRA in case you
would like to move any non-blocking issues to that version.

There is a decent number of open bugs to be resolved in 2.17.0 [2], and only
4 [3] are marked as blockers. Please review whether these bugs actually need
to be resolved in 2.17.0 and prioritize fixes if possible.

Any thoughts, comments, objections?

Regards.
Mikhail.


[1]
https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com
[2]
https://issues.apache.org/jira/browse/BEAM-8457?jql=project%20%3D%20BEAM%20AND%20status%20in%20(Reopened%2C%20Open%2C%20%22In%20Progress%22%2C%20%22Under%20Discussion%22%2C%20%22In%20Implementation%22%2C%20%22Triage%20Needed%22)%20AND%20fixVersion%20%3D%202.17.0
[3]
https://issues.apache.org/jira/browse/BEAM-8457?jql=project%20%3D%20BEAM%20AND%20status%20in%20(Reopened%2C%20Open%2C%20%22In%20Progress%22%2C%20%22Under%20Discussion%22%2C%20%22In%20Implementation%22%2C%20%22Triage%20Needed%22)%20AND%20priority%20%3D%20Blocker%20AND%20fixVersion%20%3D%202.17.0


Re: Question related to running unit tests in IDE

2019-10-23 Thread Ryan Skraba
Just for info -- I managed to get a pretty good state using IntelliJ
2019.2.3 (Fedora) and a plain gradle import!

There's a slack channel at https://s.apache.org/beam-slack-channel
(see https://beam.apache.org/community/contact-us/) called
#beam-intellij. It's pretty low-traffic, but you might be able to get
some real-time help there if you need it.

Ryan

On Wed, Oct 23, 2019 at 2:11 AM Saikat Maitra  wrote:
>
> Hi Michal, Alexey
>
> Thank you for your email. I am using macOS Catalina and JDK 8 with IntelliJ 
> IDEA 2019.1
>
> I will try to setup IntelliJ from scratch and see if the error resolves.
>
> Regards,
> Saikat
>
>
>
>
> On Tue, Oct 22, 2019 at 7:05 AM Alexey Romanenko  
> wrote:
>>
>> Hi,
>>
>> Thank you for your interest in contributing!
>>
>> Did you properly import the project (as explained on page [1]), and were all 
>> deps resolved successfully?
>>
>> [1] 
>> https://cwiki.apache.org/confluence/display/BEAM/Set+up+IntelliJ+from+scratch
>>
>> On 22 Oct 2019, at 02:28, Saikat Maitra  wrote:
>>
>> Hi,
>>
>> I am interested to contribute to this issue
>>
>> https://issues.apache.org/jira/browse/BEAM-3658
>>
>> I have followed the contribution guide and was able to build the project 
>> locally using gradlew commands.
>>
>> I wanted to debug and trace the issue further by running the tests locally 
>> using IntelliJ IDEA, but I am getting the following errors. I looked up the docs 
>> related to running tests 
>> (https://cwiki.apache.org/confluence/display/BEAM/Run+a+single+unit+test) 
>> and common IDE errors 
>> (https://cwiki.apache.org/confluence/display/BEAM/%28FAQ%29+Recovering+from+common+IDE+errors)
>>  but have not found similar errors.
>>
>> Error:(632, 17) java: cannot find symbol
>>   symbol:   method 
>> apply(org.apache.beam.sdk.transforms.Values)
>>   location: interface org.apache.beam.sdk.values.POutput
>>
>> Error:(169, 26) java: cannot find symbol
>>   symbol:   class PCollection
>>   location: class org.apache.beam.sdk.transforms.Watch
>>
>> Error:(169, 59) java: cannot find symbol
>>   symbol:   class KV
>>   location: class org.apache.beam.sdk.transforms.Watch
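
[Editor's note: the wiki page linked above describes running a single test from
the command line; as a cross-check against IDE classpath problems, an
invocation along these lines usually works. The module path and test class
below are illustrative placeholders, not taken from this thread:]

./gradlew :sdks:java:core:test --tests "org.apache.beam.sdk.transforms.WatchTest"

[If the same test passes under Gradle but fails to compile in the IDE, the
symptom usually points at the IDE not seeing generated sources or resolved
dependencies, rather than at the code itself.]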
>>
>> Please let me know if you have feedback.
>>
>> Regards,
>> Saikat
>>
>>


Re: Java PortableRunner package name

2019-10-23 Thread Ismaël Mejía
+Ankur Goenka

Related JIRA. Maybe Ankur can chime in with more details on this and
other things he may have already thought.
https://issues.apache.org/jira/browse/BEAM-7303


On Tue, Oct 22, 2019 at 7:11 PM Maximilian Michels  wrote:
>
> +1 for moving. This is just a left-over from the first "reference" runner
> implementation for portability.
>
> On 22.10.19 16:59, Łukasz Gajowy wrote:
> > +1 for moving/renaming. I agree with Kyle and Michał - there indeed
> > seems to be some confusion. The name "runners/reference" suggests that
> > it's a not production-ready "Runner" (it seems to be neither of those).
> > If possible, maybe sdks/java/portability is a good place for this?
> >
> > Łukasz
> >
> > On Tue, Oct 22, 2019 at 4:41 PM Kyle Weaver wrote:
> >
> > I agree this should be moved. PortableRunner.java is analogous to
> > portable_runner.py, which resides under
> > sdks/python/apache_beam/runners/portability. Maybe
> > PortableRunner.java should be moved to somewhere under sdks/java, as
> > it's not actually a runner itself. The nomenclature is
> > confusing; PortableRunner could be more aptly named something like
> > `PortableRunnerClient` or `JobClient` to better illustrate its
> > relationship with `JobServer`.
> >
> > On Tue, Oct 22, 2019 at 4:11 PM Michał Walenia wrote:
> >
> > Hi,
> >
> > I found the Java PortableRunner class in
> > org.apache.beam.runners.reference package, where ReferenceRunner
> > used to reside prior to its deletion. The PortableRunner
> > implementation, however, is one that can be used with real
> > JobServers in production code.
> >
> >
> > It seems that this class shouldn’t be in the reference package
> > but somewhere else. I’d like to rename the package from
> > org.apache.beam.runners.reference to
> > org.apache.beam.runners.portability, as it contains only classes
> > related to the portable runner operation.
> >
> >
> > What do you think? If nobody is strongly against the change,
> > I’ll make a pull request with the refactor.
> >
> >
> > Have a good day,
> >
> > Michal
> >
> >
> >
> >
> > --
> >
> > Michał Walenia
> > Polidea  | Software Engineer
> >
> > M: +48 791 432 002 
> > E: michal.wale...@polidea.com 
> >
> > Unique Tech
> > Check out our projects! 
> >


Strict timer ordering in Samza and Portable Flink Runners

2019-10-23 Thread Jan Lukavský

Hi,

as part of [1] a new set of validatesRunner tests has been introduced. 
These tests (currently marked as category UsesStrictTimerOrdering) 
verify that runners fire timers in increasing timestamp order under all 
circumstances. After adding these validatesRunner tests, Samza [2] and 
Portable Flink [3] started to fail these tests. I have created the 
tracking issues for that, because that behavior should be fixed (timers 
in wrong order can cause erratic behavior and/or data loss).
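
A toy model of the property these tests check: no matter what order timers are
set in, they must fire in increasing timestamp order. The sketch below is
plain Java, not the Beam timer API, and all names in it are illustrative
assumptions:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Toy model of a runner's timer service: timers registered in arbitrary
// order are buffered in a min-heap and fired in increasing timestamp
// order, which is what the UsesStrictTimerOrdering tests assert.
public class StrictTimerQueue {
    // Each entry is {timestamp, timerId}; ordered by timestamp.
    private final PriorityQueue<long[]> pending =
        new PriorityQueue<>(Comparator.comparingLong((long[] t) -> t[0]));

    void set(long timestamp, long timerId) {
        pending.add(new long[] {timestamp, timerId});
    }

    // Drain the heap, yielding timestamps in non-decreasing order.
    List<Long> fireAll() {
        List<Long> firedTimestamps = new ArrayList<>();
        while (!pending.isEmpty()) {
            firedTimestamps.add(pending.poll()[0]);
        }
        return firedTimestamps;
    }

    public static void main(String[] args) {
        StrictTimerQueue q = new StrictTimerQueue();
        q.set(30, 1);
        q.set(10, 2);
        q.set(20, 3);
        System.out.println(q.fireAll()); // [10, 20, 30]
    }
}
```

A runner that instead fired timers in registration order ([30, 10, 20] here)
would let a later-timestamp timer observe state before an earlier one runs,
which is the erratic behavior the message above warns about.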


I'm writing to anyone interested in solving these issues.

Cheers,

 Jan

[1] https://issues.apache.org/jira/browse/BEAM-7520

[2] https://issues.apache.org/jira/browse/BEAM-8459

[3] https://issues.apache.org/jira/browse/BEAM-8460