Re: Dataflow worker overview graphs

2019-08-20 Thread Reza Rokni
Very nice resource, thanks Mikhail.



On Fri, 9 Aug 2019 at 05:22, Mikhail Gryzykhin  wrote:

> Unfortunately no, I don't have those for streaming explicitly.
>
> However, most of the code is shared between streaming and batch, with the
> main difference being in initialization. The same goes for the boilerplate
> parts of legacy vs FnApi.
>
> If you happen to create anything similar for streaming, please update the
> page and let me know. I'll also update this page with relevant changes once
> I get back to the worker.
>
> --Mikhail
>
> On Thu, Aug 8, 2019 at 2:13 PM Ankur Goenka  wrote:
>
>> Thanks Mikhail. This is really useful.
>> Do you also have something similar for the streaming use case? More
>> specifically, for portable (fn_api) based streaming pipelines.
>>
>>
>> On Thu, Aug 8, 2019 at 2:08 PM Mikhail Gryzykhin 
>> wrote:
>>
>>> Hello everybody,
>>>
>>> Just wanted to share that I have found some graphs for the Dataflow worker
>>> that I created while starting to work on it. They cover specific scenarios,
>>> but may be useful for newcomers, so I put them into this wiki page.
>>>
>>> If you feel they belong to some other location, please let me know.
>>>
>>> Regards,
>>> Mikhail.
>>>
>>

-- 

This email may be confidential and privileged. If you received this
communication by mistake, please don't forward it to anyone else, please
erase all copies and attachments, and please let me know that it has gone
to the wrong person.

The above terms reflect a potential business arrangement, are provided
solely as a basis for further discussion, and are not intended to be and do
not constitute a legally binding obligation. No legally binding obligations
will be created, implied, or inferred until an agreement in final form is
executed in writing by all parties involved.


Re: Write-through-cache in State logic

2019-08-20 Thread Thomas Weise
Commenting here vs. on the PR since this is related to the overall approach.

Wouldn't it be simpler to have the runner just track a unique ID for each
worker and use that to communicate if the cache is valid or not?

* When the bundle is started, the runner tells the worker if the cache has
become invalid (since it knows if another worker has mutated state)
* When the worker sends mutation requests to the runner, it includes its
own ID (or the runner already has it as contextual information). No need to
wait for a response.
* When the bundle is finished, the runner records the last writer (only if
a change occurred)

Whenever the current worker ID and the last writer ID don't match, the cache
is invalid.
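
A rough sketch of the bookkeeping this implies on the runner side (the class
and method names below are illustrative only, not existing Beam APIs):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CacheValidityTracker {
  // Last worker that mutated a given state cell (e.g. key + state id).
  private final Map<String, String> lastWriterByStateCell = new ConcurrentHashMap<>();

  // Called by the runner when a bundle finishes, only if a mutation occurred.
  public void recordLastWriter(String stateCell, String workerId) {
    lastWriterByStateCell.put(stateCell, workerId);
  }

  // Called when starting a bundle: the worker's cached value is valid only if
  // that worker was the last writer (or nothing has been written yet).
  public boolean isCacheValid(String stateCell, String workerId) {
    String lastWriter = lastWriterByStateCell.get(stateCell);
    return lastWriter == null || lastWriter.equals(workerId);
  }
}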

Thomas


On Tue, Aug 20, 2019 at 11:42 AM Lukasz Cwik  wrote:

> Having cache tokens per key would be very expensive indeed and I believe
> we should go with a single cache token "per" bundle.
>
> On Mon, Aug 19, 2019 at 11:36 AM Maximilian Michels 
> wrote:
>
>> Maybe a Beam Python expert can chime in for Rakesh's question?
>>
>> Luke, I was assuming cache tokens to be per key and state id. While
>> implementing initial support on the Runner side, I realized that we
>> probably want cache tokens to only be per state id. Note that if we had
>> per-key cache tokens, the number of cache tokens would approach the
>> total number of keys in an application.
>>
>> If anyone wants to have a look, here is a first version of the Runner
>> side for cache tokens. Note that I only implemented cache tokens for
>> BagUserState for now, but it can be easily added for side inputs as well.
>>
>> https://github.com/apache/beam/pull/9374
>>
>> -Max
>>
>>
>>


Re: [DISCUSS] Backwards compatibility of @Experimental features

2019-08-20 Thread Kenneth Knowles
I think this is a great basis for guidelines on the blog post at a minimum.

A real changelog, gathered in one place on the web page, is more useful
than scanning the blog.

The Jira version summaries are pretty close, but we would need to be much
more serious about making good titles and getting the types right. And we
may want a quick & easy way to suppress things that are not substantial.

Kenn

On Tue, Aug 13, 2019 at 1:50 AM Ismaël Mejía  wrote:

> I recently stumbled into this specification for changelogs; maybe we can
> follow it, or at least use some of their sections for further blog posts
> about releases.
> https://keepachangelog.com/en/1.0.0/
>
> On Mon, Aug 12, 2019 at 6:08 PM Anton Kedin  wrote:
>
>> Concrete user feedback:
>> https://stackoverflow.com/questions/57453473/was-the-beamrecord-type-removed-from-apache-beam/57463708#57463708
>> Short version: we moved BeamRecord from Beam SQL to core Beam and renamed
>> it to Row (still @Experimental, BTW). But we never mentioned it anywhere
>> where it would be easy for users to find. Highlighting deprecations and
>> major shifts of public APIs in the release blog post (and in Javadoc) can
>> help make this traceable at the very least.
>>
>> Regards,
>> Anton
>>
>> On Wed, May 8, 2019 at 1:42 PM Kenneth Knowles  wrote:
>>
>>>
>>>
>>> On Wed, May 8, 2019 at 9:29 AM Ahmet Altay  wrote:
>>>


 *From: *Kenneth Knowles 
 *Date: *Wed, May 8, 2019 at 9:24 AM
 *To: *dev


>
> On Fri, Apr 19, 2019 at 3:09 AM Ismaël Mejía 
> wrote:
>
>> It seems we mostly agree that @Experimental is important, and that
>> API changes (removals) on experimental features should happen quickly but
>> still give some time to users so the Experimental purpose is not lost.
>>
>> Ahmet's proposal, given our current release calendar, is close to 2
>> releases. Can we settle on 2 releases as a 'minimum time' before
>> removal? (This will give maintainers the option to support it for more
>> time if they want, as discussed in the related KafkaIO thread, while still
>> being friendly to users.)
>>
>> Do we agree?
>>
>
> This sounds pretty good to me.
>

 Sounds good to me too.


> How can we manage this? Right now we tie most activities (like
> re-triaging flakes) to the release process, since it is the only thing that
> happens regularly for the community. If we don't have some forcing function
> then I expect the whole thing will just be forgotten.
>

 Can we pre-create a list of future releases in JIRA, and for each
 experimental feature require that a JIRA issue is created for resolving the
 experimental status and tag it with the release that will happen after the
 minimum time period?

>>>
>>> Great idea. I just created the 2.15.0 release so it reaches far enough
>>> ahead for right now.
>>>
>>> Kenn
>>>
>>>

> Kenn
>
>
>>
>> Note: for the other subjects (e.g. when an Experimental feature
>> should stop being experimental) I think we will hardly find an agreement,
>> so I think this should be treated on a per-case basis by the maintainers,
>> but if you want to follow up on that discussion we can open another thread
>> for this.
>>
>>
>>
>> On Sat, Apr 6, 2019 at 1:04 AM Ahmet Altay  wrote:
>>
>>> I agree that Experimental feature is still very useful. I was trying
>>> to argue that we diluted its value so +1 to reclaim that.
>>>
>>> Back to the original question, in my opinion removing existing
>>> "experimental and deprecated" features in n=1 release will confuse 
>>> users.
>>> This will likely be a surprise to them because we have been maintaining
>>> this state release after release now. I would propose in the next 
>>> release
>>> warning users of such a change happening and give them at least 3 
>>> months to
>>> upgrade to suggested newer paths. In the future we can have shorter
>>> timelines assuming that we will set the user expectations right.
>>>
>>> On Fri, Apr 5, 2019 at 3:01 PM Ismaël Mejía 
>>> wrote:
>>>
 I agree 100% with Kenneth on the multiple advantages that the
 Experimental feature gave us. I also can count multiple places where 
 this
 has been essential in other modules than core. I disagree that
 the @Experimental annotation has lost its meaning; it is simply ill defined,
 and probably by design, because its advantages come from that.

 Most of the topics in this thread are a consequence of this
 loose definition, e.g. (1) not defining how a feature becomes stable, 
 and
 (2) what to do when we want to remove an experimental feature, are 
 ideas
 that we need to decide if we define just continue to 

Re: Try to understand "Output timestamps must be no earlier than the timestamp of the current input"

2019-08-20 Thread Chengzhi Zhao
Hi Robert,

Thanks for the information; that explains the behavior I noticed. I guess
my current solution would be to somehow shift the watermark, or to start the
streaming process before any files come in, to settle the initial
watermark.

I will keep watching the JIRA you shared, thanks for the insights.

-Chengzhi

On Tue, Aug 20, 2019 at 4:53 PM Robert Bradshaw  wrote:

> The original timestamps are probably being assigned in the
> watchForNewFiles transform, which is also setting the watermark:
>
>
> https://github.com/apache/beam/blob/release-2.15.0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java#L668
>
> Until https://issues.apache.org/jira/browse/BEAM-644 is resolved, it
> probably makes sense to be able to customize the lag here.
>
> On Fri, Aug 16, 2019 at 6:44 PM Chengzhi Zhao 
> wrote:
> >
> > Hi Theodore,
> >
> > Thanks again for your insight and help. I'd like to learn more about how
> we got the timestamp from WindowedValue initially from +
> dev@beam.apache.org
> >
> > -Chengzhi
> >
> > On Fri, Aug 16, 2019 at 7:41 PM Theodore Siu  wrote:
> >>
> >> Hi Chengzhi,
> >>
> >> I'm not completely sure where/how the timestamp is set for a
> ProcessContext object. Here is the error code found within the Apache Beam
> repo.
> >>
> https://github.com/apache/beam/blob/master/runners/core-java/src/main/java/org/apache/beam/runners/core/SimpleDoFnRunner.java
> >> which makes reference to `elem.getTimestamp()` where elem is a
> WindowedValue.
> >>
> >> I am thinking +dev@beam.apache.org can offer some insight. Would be
> interested to find out more myself.
> >>
> >> -Theo
> >>
> >> On Fri, Aug 16, 2019 at 3:04 PM Chengzhi Zhao 
> wrote:
> >>>
> >>> Hi Theodore,
> >>>
> >>> Thanks for your reply. This is just a simple example that I tried to
> understand how event time works in Beam. I could have more fields and I
> would have an event time for each record, so I tried to let Beam know
> which field is the event time to use for later windowing and computation.
> >>>
> >>> I think the probable reason you mentioned sounds reasonable. I am
> still trying to figure out where the "current input
> (2019-08-16T12:39:06.887Z)" in the error message is coming from, if you have any insight on it.
> >>>
> >>> Thanks a lot for your help.
> >>>
> >>> -- Chengzhi
> >>>
> >>> On Fri, Aug 16, 2019 at 9:57 AM Theodore Siu 
> wrote:
> 
>  Hi Chengzhi,
> 
>  Are you simply trying to emit the timestamp onward? Why not just use
> `out.output` with a PCollection<Instant>?
> 
>  static class ReadWithEventTime extends DoFn<String, Instant> {
>  @DoFn.ProcessElement
>  public void processElement(@Element String line,
> OutputReceiver<Instant> out){
>  out.output(new Instant(Long.parseLong(line)));
>  }
>  }
> 
>  You can also output the line itself as a PCollection<String>. If your
> line has additional information to parse, consider a KeyValue pair
> https://beam.apache.org/releases/javadoc/2.2.0/index.html?org/apache/beam/sdk/values/KV.html
> where you can emit both some parsed context of the string and the timestamp.
> 
>  The probable reason why outputWithTimestamp doesn't work with older
> times is that the timestamp emitted is used specifically for windowing, and
> for streaming-type data pipelines it determines which window each record
> belongs to for aggregations.
> 
>  -Theo
> 
> 
>  On Fri, Aug 16, 2019 at 8:52 AM Chengzhi Zhao <
> w.zhaocheng...@gmail.com> wrote:
> >
> > Hi folks,
> >
> > I am new to Beam and trying to play with some examples. I am running
> Beam 2.14 with the Direct runner to read some files (continuously generated).
> >
> > I am facing this error: Cannot output with timestamp
> 2019-08-16T12:30:15.120Z. Output timestamps must be no earlier than the
> timestamp of the current input (2019-08-16T12:39:06.887Z) minus the allowed
> skew (0 milliseconds). I searched online but still don't quite understand
> it so I am asking here for some help.
> >
> > A file has some past timestamp in it:
> > 1565958615120
> > 1565958615120
> > 1565958615121
> >
> > My code looks something like this:
> >
> >    static class ReadWithEventTime extends DoFn<String, String> {
> > @ProcessElement
> > public void processElement(@Element String line,
> OutputReceiver<String> out){
> > out.outputWithTimestamp(line, new
> Instant(Long.parseLong(line)));
> > }
> > }
> >
> > public static void main(String[] args) {
> > PipelineOptions options = PipelineOptionsFactory.create();
> > Pipeline pipeline = Pipeline.create(options);
> >
> > String sourcePath = new File("files/").getPath();
> >
> > PCollection<String> data = pipeline.apply("ReadData",
> > TextIO.read().from(sourcePath + "/test*")
> >
>  .watchForNewFiles(Duration.standardSeconds(5),
> Watch.Growth.never()));
> >
> > data.apply("ReadWithEventTime", Pa

Re: Java 11 compatibility question

2019-08-20 Thread Elliotte Rusty Harold
If somebody is using JPMS and they attempt to import beam, they get a
compile time error. Some other projects I work on have been getting
user reports about this.

See 
https://github.com/GoogleCloudPlatform/cloud-opensource-java/blob/master/library-best-practices/JLBP-19.md
for more details.
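
To make the failure mode concrete, here is roughly what a JPMS consumer looks
like (the module names below are illustrative only; Beam jars would typically
appear as automatic modules whose names are derived from the jar file names):

// module-info.java of a hypothetical application that depends on Beam.
module com.example.pipelines {
  requires beam.sdks.java.core;   // illustrative automatic-module name
  requires beam.vendored.grpc;    // illustrative automatic-module name
}

If any two jars on the module path contain the same package (a split package),
javac rejects the configuration with an error along the lines of "module X
reads package p from both A and B", which is why avoiding split packages in
Beam and its transitive dependencies matters for JPMS users.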

On Tue, Aug 20, 2019 at 5:30 PM Ahmet Altay  wrote:
>
>
>
> On Tue, Aug 20, 2019 at 8:37 AM Elliotte Rusty Harold  
> wrote:
>>
>>
>>
>> On Tue, Aug 20, 2019 at 7:51 AM Ismaël Mejía  wrote:
>>>
>>> a per case approach (the exception could be portable runners not based on 
>>> Java).
>>>
>>> Of course other definitions of being Java 11 compatible are interesting but 
>>> probably not part of our current scope. Actions like change the codebase to 
>>> use Java 11 specific APIs / idioms, publish Java 11 specific artifacts or 
>>> use Java Platform Modules (JPM). All of these may be nice to have but are 
>>> probably less important for end users who may just want to be able to use 
>>> Beam in its current form in Java 11 VMs.
>>>
>>> What do others think? Is this enough to announce Java 11 compatibility and 
>>> add the documentation to the webpage?
>>
>>
>> No, it isn't, I fear. We don't have to use JPMS in Beam, but Beam really 
>> does need to be compatible with JPMS-using apps. The bare minimum here is 
>> avoiding split packages, and that needs to include all transitive 
>> dependencies, not just Beam itself. I don't think we meet that bar now.
>
>
> For my understanding, what would be the limitations of Beam's dependencies 
> having split packages? Would it limit Beam users from using 3rd party
> libraries that require JPMS support? Would it be in scope for Beam to get
> its dependencies to meet a certain bar?
>
> Ismaël's definition of being able to run Beam published dependencies in a Java
> 11 VM sounds enough to me "to announce Java 11 compatibility _for Beam_".
>
>>
>>
>> --
>> Elliotte Rusty Harold
>> elh...@ibiblio.org



-- 
Elliotte Rusty Harold
elh...@ibiblio.org


Re: Java 11 compatibility question

2019-08-20 Thread Ahmet Altay
On Tue, Aug 20, 2019 at 8:37 AM Elliotte Rusty Harold 
wrote:

>
>
> On Tue, Aug 20, 2019 at 7:51 AM Ismaël Mejía  wrote:
>
>> a per case approach (the exception could be portable runners not based on
>> Java).
>>
>> Of course other definitions of being Java 11 compatible are interesting
>> but probably not part of our current scope. Actions like change the
>> codebase to use Java 11 specific APIs / idioms, publish Java 11 specific
>> artifacts or use Java Platform Modules (JPM). All of these may be nice to
>> have but are probably less important for end users who may just want to be
>> able to use Beam in its current form in Java 11 VMs.
>>
>> What do others think? Is this enough to announce Java 11 compatibility
>> and add the documentation to the webpage?
>>
>
> No, it isn't, I fear. We don't have to use JPMS in Beam, but Beam really
> does need to be compatible with JPMS-using apps. The bare minimum here is
> avoiding split packages, and that needs to include all transitive
> dependencies, not just Beam itself. I don't think we meet that bar now.
>

For my understanding, what would be the limitations of Beam's dependencies
having split packages? Would it limit Beam users from using 3rd party
libraries that require JPMS support? Would it be in scope for Beam to get
its dependencies to meet a certain bar?

Ismaël's definition of being able to run Beam published dependencies in
a Java 11 VM sounds enough to me "to announce Java 11 compatibility _for
Beam_".


>
> --
> Elliotte Rusty Harold
> elh...@ibiblio.org
>


Re: [VOTE] Release 2.15.0, release candidate #2

2019-08-20 Thread Pablo Estrada
+1

I've installed from the source in apache/dist.
I've run unit tests in Python 3.6, and wordcount in Python 3.6 in Direct
and Dataflow runners.

Thanks!
-P.

On Tue, Aug 20, 2019 at 11:41 AM Hannah Jiang 
wrote:

> Yes, I agree this is a separate topic and shouldn't block 2.15 release.
> There is already a JIRA ticket, I will update it with more details.
>
> On Tue, Aug 20, 2019 at 11:32 AM Ahmet Altay  wrote:
>
>>
>>
>> On Tue, Aug 20, 2019 at 10:18 AM Yifan Zou  wrote:
>>
>>> Hi all,
>>>
>>> This is a friendly reminder. Please help to review, verify and vote on
>>> the release candidate #2 for the version 2.15.0.
>>> [ ] +1, Approve the release
>>> [ ] -1, Do not approve the release (please provide specific comments)
>>>
>>> I've verified Java quickstart & mobile games, and Python (both tar and
>>> wheel) quickstart with Py27, 35, 36, 37. They worked well.
>>>
>>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit?pli=1#gid=1036192804
>>>
>>> Thanks.
>>> Yifan
>>>
>>>
>>>
>>> On Tue, Aug 20, 2019 at 9:33 AM Hannah Jiang 
>>> wrote:
>>>
 A side note about this test:
 Now we only have py2 and py35, so it only fails with py35. I am
 introducing minor versions, which will add py36 and py37, and all py3 are
 flaky.
 It's really difficult to pass Portable Precommit with minor versions,
 the chance of passing the test is around 15%.

>>>
>> Hannah, let's separate this from the release thread. Is there a JIRA for
>> this, could you update it? And perhaps we need different precommits for
>> different versions so that flakes do not stack up. Even if a suite is >90%
>> reliable, if we stack up with 4 versions, the reliability will get much
>> lower.
>>
>>
>>>
 On Mon, Aug 19, 2019 at 5:27 PM Ahmet Altay  wrote:

> Thank you. Unless there are any objections, let's continue with
> validating RC2.
>
> On Mon, Aug 19, 2019 at 5:21 PM Kyle Weaver 
> wrote:
>
>> I'm not sure if it's worth blocking the release, since I can't
>> reproduce the issue on my machine and a fix would be hard to verify.
>>
>> Kyle Weaver | Software Engineer | github.com/ibzib |
>> kcwea...@google.com
>>
>>
>> On Mon, Aug 19, 2019 at 4:56 PM Ahmet Altay  wrote:
>>
>>> Kyle, are you currently working on this to decide whether it is the
>>> blocking case or not? Also is this affecting both release branch and 
>>> master
>>> branch?
>>>
>>> On Mon, Aug 19, 2019 at 4:49 PM Kyle Weaver 
>>> wrote:
>>>
 Re BEAM-7993: For some context, there are two possible causes here.
 The pessimistic take is that Dockerized SDK workers are taking forever 
 to
 start. The optimistic take is that the Docker containers are just taking
 longer
 than normal (but not forever) to start on Jenkins, in which case this 
 issue
 is nothing new to this release. (The halting problem!) If it's the 
 second,
 it's safe to wait to fix it in the next release.

 Kyle Weaver | Software Engineer | github.com/ibzib |
 kcwea...@google.com


 On Mon, Aug 19, 2019 at 4:42 PM Yifan Zou 
 wrote:

> Mark and Kyle found a py35 portable test which is flaky:
> https://issues.apache.org/jira/browse/BEAM-7993.
> I plan to finalize the release this week. Would that be a blocker?
> Could we include the fix in 2.16?
>
> Thanks.
>
>
> On Mon, Aug 19, 2019 at 4:38 PM Yifan Zou 
> wrote:
>
>> I've run most of validations and they're all good.
>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit?pli=1#gid=1036192804
>>
>> On Mon, Aug 19, 2019 at 10:59 AM Hannah Jiang <
>> hannahji...@google.com> wrote:
>>
>>> (resending it to dev@)
>>> +1, I tested some test cases as well as customized test cases
>>> and all looks good. I updated validation sheet.
>>>
>>> On Mon, Aug 19, 2019 at 10:40 AM Hannah Jiang <
>>> hannahji...@google.com> wrote:
>>>
 +1, I tested some test cases as well as customized test cases
 and all looks good. I updated validation sheet.

 On Mon, Aug 19, 2019 at 10:28 AM Ahmet Altay 
 wrote:

> Hi all,
>
> Please help with validation and voting on the RC2. Let's help
> Yifan to finalize this release.
>
> Ahmet
>
> -- Forwarded message -
> From: Ahmet Altay 
> Date: Mon, Aug 19, 2019 at 10:27 AM
> Subject: Re: [VOTE] Release 2.15.0, release candidate #2
> To: dev 
>
>
> +1, verified a few python workloads

Re: [PROPOSAL] An initial Schema API in Python

2019-08-20 Thread Ahmet Altay
On Tue, Aug 20, 2019 at 3:48 PM Brian Hulette  wrote:

>
>
> On Tue, Aug 20, 2019 at 1:41 PM Robert Bradshaw 
> wrote:
>
>> On Mon, Aug 19, 2019 at 5:44 PM Ahmet Altay  wrote:
>> >
>> >
>> >
>> > On Mon, Aug 19, 2019 at 9:56 AM Brian Hulette 
>> wrote:
>> >>
>> >>
>> >>
>> >> On Fri, Aug 16, 2019 at 5:17 PM Chad Dombrova 
>> wrote:
>> 
>>  >> Agreed on float since it seems to trivially map to a double, but
>> I’m torn on int still. While I do want int type hints to work, it doesn’t
>> seem appropriate to map it to AtomicType.INT64, since it has a completely
>> different range of values.
>>  >>
>>  >> Let’s say we used native int for the runtime field type, not just
>> as a schema declaration for numpy.int64. What is the real world fallout
>> from this? Would there be data loss?
>>  >
>>  > I'm not sure I follow the question exactly, what is the interplay
>> between int and numpy.int64 in this scenario? Are you saying that np.int64
>> is used in the schema declaration, but we just use native int at runtime,
>> and check the bit width when encoding?
>>  >
>>  > In any case, I don't think the real world fallout of using int is
>> nearly that dire. I suppose data loss is possible if a poorly designed
>> pipeline overflows an int64 and crashes,
>> 
>>  The primary risk is that it *won't* crash when overflowing an int64,
>>  it'll just silently give the wrong answer. That's much less safe than
>>  using a native int and then actually crashing in the case it's too
>>  large at the point one tries to encode it.
>> >>>
>> >>>
>> >>> If the behavior of numpy.int64 is less safe than int, and both
>> support 64-bit integers, and int is the more intuitive type to use, then
>> that seems to make a strong case for using int rather than numpy.int64.
>> >>>
>> >>
>> >> I'm not sure we established numpy.int64 is less safe, just that a
>> silent overflow is a risk.
>>
>> Silent overflows are inherently less safe, especially for a language
>> where users in general never have to deal with this.
>>
>
> Absolutely agree that silent overflows are unsafe! I was just trying to
> point out that numpy isn't strictly silent. But as you point out below it's
> irrelevant because the runtime type is still int.
>
>
>> >> By default numpy will just log a warning when an overflow occurs, so
>> it's not totally silent, but definitely risky. numpy can however be made to
>> throw an exception when an overflow occurs with `np.seterr(over='raise')`.
>>
>> Warning logs on remote machines are unlikely to ever be seen. Even if
>> one knew about the numpy setting (keep in mind the user may not ever
>> directly use or import numpy), it doesn't seem to work (and one would
>> have to set it on the remote workers, or propagate this setting if set
>> in the main program).
>>
>> In [1]: import numpy as np
>> In [2]: np.seterr(over='raise')  # returns previous value
>> Out[2]: {'divide': 'warn', 'invalid': 'warn', 'over': 'warn', 'under':
>> 'ignore'}
>> In [3]: np.int64(2**36) * np.int64(2**36)
>> Out[3]: 0
>>
>>
> That's odd.. I ran the same test (Python 2.7, numpy 1.16) and it worked
> for me:
>
> In [4]: import numpy as np
>
> In [5]: np.int64(2**36) * np.int64(2**36)
> /usr/local/google/home/bhulette/working_dir/beam/sdks/python/venv/bin/ipython:1:
> RuntimeWarning: overflow encountered in long_scalars
>
>
>
> #!/usr/local/google/home/bhulette/working_dir/beam/sdks/python/venv/bin/python
> Out[5]: 0
>
> In [6]: np.seterr(over='raise')
> Out[6]: {'divide': 'warn', 'invalid': 'warn', 'over': 'warn', 'under':
> 'ignore'}
>
> In [7]: np.int64(2**36) * np.int64(2**36)
> ---
> FloatingPointErrorTraceback (most recent call last)
>  in ()
> > 1 np.int64(2**36) * np.int64(2**36)
>
> FloatingPointError: overflow encountered in long_scalars
>
>
>
>> >> Regardless of what type is used in the typing representation of a
>> schema, we've established that RowCoder.encode should accept anything
>> convertible to an int for integer fields. So it will need to check its
>> width and raise an error if it's too large.
>> >> I added some tests last week to ensure that RowCoder does this [1].
>> However they're currently skipped because I'm unsure of the proper place to
>> raise the error. I wrote up the details in a comment [2] (sorry I did a
>> force push so the comment doesn't show up in the appropriate place).
>> >>
>> >> Note that when decoding an INT32/64 field RowCoder still produces
>> plain old ints (since it relies on VarIntCoder), so int really is the
>> runtime type, and the numpy types are just for the typing representation of
>> a schema.
>> >>
>> >> I also updated my PR to accept int, float, and str in the typing
>> representation of a schema, and added the following summary of type
>> mappings to typehints.schema [1], since it's not readily apparent from the
>> code itself:
>> >
>> >
>> > Cool!
>> >
>> >>
>

Re: [PROPOSAL] An initial Schema API in Python

2019-08-20 Thread Brian Hulette
On Tue, Aug 20, 2019 at 1:41 PM Robert Bradshaw  wrote:

> On Mon, Aug 19, 2019 at 5:44 PM Ahmet Altay  wrote:
> >
> >
> >
> > On Mon, Aug 19, 2019 at 9:56 AM Brian Hulette 
> wrote:
> >>
> >>
> >>
> >> On Fri, Aug 16, 2019 at 5:17 PM Chad Dombrova 
> wrote:
> 
>  >> Agreed on float since it seems to trivially map to a double, but
> I’m torn on int still. While I do want int type hints to work, it doesn’t
> seem appropriate to map it to AtomicType.INT64, since it has a completely
> different range of values.
>  >>
>  >> Let’s say we used native int for the runtime field type, not just
> as a schema declaration for numpy.int64. What is the real world fallout
> from this? Would there be data loss?
>  >
>  > I'm not sure I follow the question exactly, what is the interplay
> between int and numpy.int64 in this scenario? Are you saying that np.int64
> is used in the schema declaration, but we just use native int at runtime,
> and check the bit width when encoding?
>  >
>  > In any case, I don't think the real world fallout of using int is
> nearly that dire. I suppose data loss is possible if a poorly designed
> pipeline overflows an int64 and crashes,
> 
>  The primary risk is that it *won't* crash when overflowing an int64,
>  it'll just silently give the wrong answer. That's much less safe than
>  using a native int and then actually crashing in the case it's too
>  large at the point one tries to encode it.
> >>>
> >>>
> >>> If the behavior of numpy.int64 is less safe than int, and both support
> 64-bit integers, and int is the more intuitive type to use, then that seems
> to make a strong case for using int rather than numpy.int64.
> >>>
> >>
> >> I'm not sure we established numpy.int64 is less safe, just that a
> silent overflow is a risk.
>
> Silent overflows are inherently less safe, especially for a language
> where users in general never have to deal with this.
>

Absolutely agree that silent overflows are unsafe! I was just trying to
point out that numpy isn't strictly silent. But as you point out below it's
irrelevant because the runtime type is still int.


> >> By default numpy will just log a warning when an overflow occurs, so
> it's not totally silent, but definitely risky. numpy can however be made to
> throw an exception when an overflow occurs with `np.seterr(over='raise')`.
>
> Warning logs on remote machines are unlikely to ever be seen. Even if
> one knew about the numpy setting (keep in mind the user may not ever
> directly use or import numpy), it doesn't seem to work (and one would
> have to set it on the remote workers, or propagate this setting if set
> in the main program).
>
> In [1]: import numpy as np
> In [2]: np.seterr(over='raise')  # returns previous value
> Out[2]: {'divide': 'warn', 'invalid': 'warn', 'over': 'warn', 'under':
> 'ignore'}
> In [3]: np.int64(2**36) * np.int64(2**36)
> Out[3]: 0
>
>
That's odd.. I ran the same test (Python 2.7, numpy 1.16) and it worked for
me:

In [4]: import numpy as np

In [5]: np.int64(2**36) * np.int64(2**36)
/usr/local/google/home/bhulette/working_dir/beam/sdks/python/venv/bin/ipython:1:
RuntimeWarning: overflow encountered in long_scalars



#!/usr/local/google/home/bhulette/working_dir/beam/sdks/python/venv/bin/python
Out[5]: 0

In [6]: np.seterr(over='raise')
Out[6]: {'divide': 'warn', 'invalid': 'warn', 'over': 'warn', 'under':
'ignore'}

In [7]: np.int64(2**36) * np.int64(2**36)
---
FloatingPointErrorTraceback (most recent call last)
 in ()
> 1 np.int64(2**36) * np.int64(2**36)

FloatingPointError: overflow encountered in long_scalars



> >> Regardless of what type is used in the typing representation of a
> schema, we've established that RowCoder.encode should accept anything
> convertible to an int for integer fields. So it will need to check its
> width and raise an error if it's too large.
> >> I added some tests last week to ensure that RowCoder does this [1].
> However they're currently skipped because I'm unsure of the proper place to
> raise the error. I wrote up the details in a comment [2] (sorry I did a
> force push so the comment doesn't show up in the appropriate place).
> >>
> >> Note that when decoding an INT32/64 field RowCoder still produces plain
> old ints (since it relies on VarIntCoder), so int really is the runtime
> type, and the numpy types are just for the typing representation of a
> schema.
> >>
> >> I also updated my PR to accept int, float, and str in the typing
> representation of a schema, and added the following summary of type
> mappings to typehints.schema [1], since it's not readily apparent from the
> code itself:
> >
> >
> > Cool!
> >
> >>
> >>
> >> Python  Schema
> >> np.int8 <-> BYTE
> >> np.int16<-> INT16
> >> np.int32<-> INT32
> >> np.int64<-> INT64
> >> int ---/
> >> np.float32  <-> 

Re: How to test a new precommit in PR?

2019-08-20 Thread Rui Wang
I have run the seed job. Fortunately the new precommit is only triggered by
SQL related PRs. I can follow up on those PRs if there is any negative
feedback.

But thanks for the heads up.


-Rui

On Tue, Aug 20, 2019 at 2:58 PM Boyuan Zhang  wrote:

> And after you run the seed job, other PRs' commits may also trigger your
> precommit test. You can set your precommit task to be triggered only by a
> phrase for testing purposes.
>


Re: How to test a new precommit in PR?

2019-08-20 Thread Boyuan Zhang
And after you run the seed job, other PRs' commits may also trigger your
precommit test. You can set your precommit task to be triggered only by a
phrase for testing purposes.


Re: How to test a new precommit in PR?

2019-08-20 Thread Rui Wang
Thank you Yifan!


-Rui

On Tue, Aug 20, 2019 at 2:52 PM Yifan Zou  wrote:

> Run the seed job then trigger your tests by using phrase.
>
> On Tue, Aug 20, 2019 at 2:39 PM Rui Wang  wrote:
>
>> Hi Community,
>>
>> I am trying to add a new precommit task (see [1] and [2]), and the PR is
>> pending. Does anyone know how to test the added precommit directly in the
>> PR before merging it?
>>
>>
>>
>> [1]:
>> https://github.com/apache/beam/pull/9210/files#diff-d6dfd4f4d675cfe2d6f52ae6fea472d0
>> [2]:
>> https://github.com/apache/beam/pull/9210/files#diff-c197962302397baf3a4cc36463dce5ea
>>
>>
>>
>> -Rui
>>
>


Re: How to test a new precommit in PR?

2019-08-20 Thread Yifan Zou
Run the seed job then trigger your tests by using phrase.

On Tue, Aug 20, 2019 at 2:39 PM Rui Wang  wrote:

> Hi Community,
>
> I am trying to add a new precommit task (see [1] and [2]), and the PR is
> pending. Does anyone know how to test the added precommit directly in the
> PR before merging it?
>
>
>
> [1]:
> https://github.com/apache/beam/pull/9210/files#diff-d6dfd4f4d675cfe2d6f52ae6fea472d0
> [2]:
> https://github.com/apache/beam/pull/9210/files#diff-c197962302397baf3a4cc36463dce5ea
>
>
>
> -Rui
>


How to test a new precommit in PR?

2019-08-20 Thread Rui Wang
Hi Community,

I am trying to add a new precommit task (see [1] and [2]), and the PR is
pending. Does anyone know how to test the added precommit directly in the
PR before merging it?



[1]:
https://github.com/apache/beam/pull/9210/files#diff-d6dfd4f4d675cfe2d6f52ae6fea472d0
[2]:
https://github.com/apache/beam/pull/9210/files#diff-c197962302397baf3a4cc36463dce5ea



-Rui


Re: Try to understand "Output timestamps must be no earlier than the timestamp of the current input"

2019-08-20 Thread Robert Bradshaw
The original timestamps are probably being assigned in the
watchForNewFiles transform, which is also setting the watermark:

https://github.com/apache/beam/blob/release-2.15.0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java#L668

Until https://issues.apache.org/jira/browse/BEAM-644 is resolved, it
probably makes sense to be able to customize the lag here.

On Fri, Aug 16, 2019 at 6:44 PM Chengzhi Zhao  wrote:
>
> Hi Theodore,
>
> Thanks again for your insight and help. I'd like to learn more about how we 
> got the timestamp from WindowedValue initially from +dev@beam.apache.org
>
> -Chengzhi
>
> On Fri, Aug 16, 2019 at 7:41 PM Theodore Siu  wrote:
>>
>> Hi Chengzhi,
>>
>> I'm not completely sure where/how the timestamp is set for a ProcessContext 
>> object. Here is the error code found within the Apache Beam repo.
>> https://github.com/apache/beam/blob/master/runners/core-java/src/main/java/org/apache/beam/runners/core/SimpleDoFnRunner.java
>> which makes reference to `elem.getTimestamp()` where elem is a WindowedValue.
>>
>> I am thinking +dev@beam.apache.org can offer some insight. Would be 
>> interested to find out more myself.
>>
>> -Theo
>>
>> On Fri, Aug 16, 2019 at 3:04 PM Chengzhi Zhao  
>> wrote:
>>>
>>> Hi Theodore,
>>>
>>> Thanks for your reply. This is just a simple example that I tried to 
>>> understand how event time works in Beam. I could have more fields and I 
>>> would have an event time for each record, so I tried to let Beam know 
>>> which field is the event time to use for later windowing and computation.
>>>
>>> I think the probable reason you mentioned sounds reasonable. I am still 
>>> trying to figure out where the "current input 
>>> (2019-08-16T12:39:06.887Z)" in the error message is coming from, if you have any insight on it.
>>>
>>> Thanks a lot for your help.
>>>
>>> -- Chengzhi
>>>
>>> On Fri, Aug 16, 2019 at 9:57 AM Theodore Siu  wrote:

 Hi Chengzhi,

 Are you simply trying to emit the timestamp onward? Why not just use 
 `out.output` with a PCollection<Instant>?

 static class ReadWithEventTime extends DoFn<String, Instant> {
 @DoFn.ProcessElement
 public void processElement(@Element String line, 
 OutputReceiver<Instant> out){
 out.output(new Instant(Long.parseLong(line)));
 }
 }

 You can also output the line itself as a PCollection<String>. If your line 
 has additional information to parse, consider a KeyValue pair 
 https://beam.apache.org/releases/javadoc/2.2.0/index.html?org/apache/beam/sdk/values/KV.html
  where you can emit both some parsed context of the string and the 
 timestamp.

 The probable reason why outputWithTimestamp doesn't work with older times 
 is that the timestamp emitted is used specifically for windowing, and for 
 streaming-type data pipelines it determines which window each record 
 belongs to for aggregations.

 -Theo


 On Fri, Aug 16, 2019 at 8:52 AM Chengzhi Zhao  
 wrote:
>
> Hi folks,
>
> I am new to Beam and trying to play with some examples. I am running Beam 
> 2.14 with the Direct runner to read some files (continuously generated).
>
> I am facing this error: Cannot output with timestamp 
> 2019-08-16T12:30:15.120Z. Output timestamps must be no earlier than the 
> timestamp of the current input (2019-08-16T12:39:06.887Z) minus the 
> allowed skew (0 milliseconds). I searched online but still don't quite 
> understand it so I am asking here for some help.
>
> A file has some past timestamp in it:
> 1565958615120
> 1565958615120
> 1565958615121
>
> My code looks something like this:
>
>    static class ReadWithEventTime extends DoFn<String, String> {
> @ProcessElement
> public void processElement(@Element String line, 
> OutputReceiver<String> out){
> out.outputWithTimestamp(line, new Instant(Long.parseLong(line)));
> }
> }
>
> public static void main(String[] args) {
> PipelineOptions options = PipelineOptionsFactory.create();
> Pipeline pipeline = Pipeline.create(options);
>
> String sourcePath = new File("files/").getPath();
>
> PCollection<String> data = pipeline.apply("ReadData",
> TextIO.read().from(sourcePath + "/test*")
> .watchForNewFiles(Duration.standardSeconds(5), 
> Watch.Growth.never()));
>
> data.apply("ReadWithEventTime", ParDo.of(new 
> ReadWithEventTime()));
>
> pipeline.run().waitUntilFinish();
>
> }
>
>
> I am trying to understand where the "current input 
> (2019-08-16T12:39:06.887Z)" in the error message is coming from. Is it the lowest watermark 
> when I start my application? If that's the case, is there a way that I 
> can change the initial watermark?
>
> Also, I can setup `withAllowedTimestampSkew` but it looks like it has 
>
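
For reference, a minimal sketch of raising the allowed skew in the Java SDK.
This assumes overriding DoFn.getAllowedTimestampSkew(), which is deprecated
because elements emitted behind the watermark may be dropped as late data
downstream; the class below mirrors the snippet from the thread:

import org.apache.beam.sdk.transforms.DoFn;
import org.joda.time.Duration;
import org.joda.time.Instant;

class ReadWithEventTime extends DoFn<String, String> {
    @DoFn.ProcessElement
    public void processElement(@DoFn.Element String line, DoFn.OutputReceiver<String> out) {
        // Emit the element with the timestamp parsed from the line itself.
        out.outputWithTimestamp(line, new Instant(Long.parseLong(line)));
    }

    // Allow output timestamps arbitrarily far behind the input element's timestamp.
    @Override
    public Duration getAllowedTimestampSkew() {
        return Duration.millis(Long.MAX_VALUE);
    }
}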

Re: [PROPOSAL] An initial Schema API in Python

2019-08-20 Thread Robert Bradshaw
On Mon, Aug 19, 2019 at 5:44 PM Ahmet Altay  wrote:
>
>
>
> On Mon, Aug 19, 2019 at 9:56 AM Brian Hulette  wrote:
>>
>>
>>
>> On Fri, Aug 16, 2019 at 5:17 PM Chad Dombrova  wrote:

 >> Agreed on float since it seems to trivially map to a double, but I’m 
 >> torn on int still. While I do want int type hints to work, it doesn’t 
 >> seem appropriate to map it to AtomicType.INT64, since it has a 
 >> completely different range of values.
 >>
 >> Let’s say we used native int for the runtime field type, not just as a 
 >> schema declaration for numpy.int64. What is the real world fallout from 
 >> this? Would there be data loss?
 >
 > I'm not sure I follow the question exactly, what is the interplay 
 > between int and numpy.int64 in this scenario? Are you saying that 
 > np.int64 is used in the schema declaration, but we just use native int 
 > at runtime, and check the bit width when encoding?
 >
 > In any case, I don't think the real world fallout of using int is nearly 
 > that dire. I suppose data loss is possible if a poorly designed pipeline 
 > overflows an int64 and crashes,

 The primary risk is that it *won't* crash when overflowing an int64,
 it'll just silently give the wrong answer. That's much less safe than
 using a native int and then actually crashing in the case it's too
 large at the point one tries to encode it.
>>>
>>>
>>> If the behavior of numpy.int64 is less safe than int, and both support 
>>> 64-bit integers, and int is the more intuitive type to use, then that seems 
>>> to make a strong case for using int rather than numpy.int64.
>>>
>>
>> I'm not sure we established numpy.int64 is less safe, just that a silent 
>> overflow is a risk.

Silent overflows are inherently less safe, especially for a language
where users in general never have to deal with this.

>> By default numpy will just log a warning when an overflow occurs, so it's 
>> not totally silent, but definitely risky. numpy can however be made to throw 
>> an exception when an overflow occurs with `np.seterr(over='raise')`.

Warning logs on remote machines are unlikely to ever be seen. Even if
one knew about the numpy setting (keep in mind the user may not ever
directly use or import numpy), it doesn't seem to work (and one would
have to set it on the remote workers, or propagate this setting if set
in the main program).

In [1]: import numpy as np
In [2]: np.seterr(over='raise')  # returns previous value
Out[2]: {'divide': 'warn', 'invalid': 'warn', 'over': 'warn', 'under': 'ignore'}
In [3]: np.int64(2**36) * np.int64(2**36)
Out[3]: 0

>> Regardless of what type is used in the typing representation of a schema, 
>> we've established that RowCoder.encode should accept anything convertible to 
>> an int for integer fields. So it will need to check its width and raise an 
>> error if it's too large.
>> I added some tests last week to ensure that RowCoder does this [1]. However 
>> they're currently skipped because I'm unsure of the proper place to raise 
>> the error. I wrote up the details in a comment [2] (sorry I did a force push 
>> so the comment doesn't show up in the appropriate place).
>>
>> Note that when decoding an INT32/64 field RowCoder still produces plain old 
>> ints (since it relies on VarIntCoder), so int really is the runtime type, 
>> and the numpy types are just for the typing representation of a schema.
>>
>> I also updated my PR to accept int, float, and str in the typing 
>> representation of a schema, and added the following summary of type mappings 
>> to typehints.schema [1], since it's not readily apparent from the code 
>> itself:
>
>
> Cool!
>
>>
>>
>> Python  Schema
>> np.int8 <-> BYTE
>> np.int16<-> INT16
>> np.int32<-> INT32
>> np.int64<-> INT64
>> int ---/
>> np.float32  <-> FLOAT
>> np.float64  <-> DOUBLE
>> float   ---/
>> bool<-> BOOLEAN
>> The mappings for STRING and BYTES are different between python 2 and python 
>> 3,
>> because of the changes to str:
>> py3:
>> str/unicode <-> STRING
>> bytes   <-> BYTES
>> ByteString  ---/
>> py2:
>> unicode <-> STRING
>> str/bytes   ---/
>> ByteString  <-> BYTES
>>
>> As you can see, int and float typings can now be used to create a schema 
>> with an INT64 or DOUBLE attribute, but when creating an anonymous NamedTuple 
>> sub-class from a schema, the numpy types are preferred. I prefer that 
>> approach, if only for symmetry with the other integer and floating point 
>> types, but I can change it to prefer int/float if I'm the only one that 
>> feels that way.

Just to be clear, this is just talking about the schema itself (as at
that level, due to the many-to-one mapping above, no distinction is
made between int vs. int64). The runtime types are still int/float,
right?

> Just an opinion: As a user I would expect anonymous types created for me to 
> have n

Re: (mini-doc) Beam (Flink) portable job templates

2019-08-20 Thread Robert Bradshaw
The point of expansion services is to run at pipeline construction
time so that the caller can build on top of the outputs. E.g. we're
hoping to expose Beam's SQL transforms to other languages via an
expansion service and *not* duplicate the logic of parsing the SQL
statements to determine the type(s) of the outputs. Even for simpler
IOs, we would like to take advantage of schema information (e.g.
looked up at construction time) to produce results and validate (or
even inform) subsequent construction.

I think we're also making a mistake in talking about "the" expansion
service here, as if there were only one well-defined service that all
pipelines used. If we go the route of deferring some expansion to the
runner, we need a way of naming expansion services. It seems like this
proposal is simply isomorphic to defining new primitive transforms
which some (all?) runners are just expected to understand.

On Tue, Aug 20, 2019 at 10:11 AM Thomas Weise  wrote:
>
>
>
> On Tue, Aug 20, 2019 at 8:56 AM Lukasz Cwik  wrote:
>>
>>
>>
>> On Mon, Aug 19, 2019 at 5:52 PM Ahmet Altay  wrote:
>>>
>>>
>>>
>>> On Sun, Aug 18, 2019 at 12:34 PM Thomas Weise  wrote:

 There is a PR open for this: https://github.com/apache/beam/pull/9331

 (it wasn't tagged with the JIRA and therefore not linked)

 I think it is worthwhile to explore how we could further detangle the 
 client side Python and Java dependencies.

 The expansion service is one more dependency to consider in a build 
 environment. Is it really necessary to expand external transforms prior to 
 submission to the job service?
>>>
>>>
>>> +1, this will make it easier to use external transforms from the already 
>>> familiar client environments.
>>>
>>
>>
>> The intent is to make it so that you CAN (not MUST) run an expansion service 
>> separate from a Runner. Creating a single endpoint that hosts both the Job 
>> and Expansion service is something that gRPC does very easily since you can 
>> host multiple service definitions on a single port.
>
>
> Yes, that's fine. The point here is when the expansion occurs. I believe the 
> runner can also invoke the expansion service, thereby eliminating the 
> expansion service interaction from the client side.
>
>
>>
>>


 Can we come up with a partially constructed proto that can be produced by 
 just running the Python entry point? Note this would also require pushing 
 the pipeline options parsing into the job service.
>>>
>>>
>>> Why would this require pushing the pipeline options parsing to the job
>>> service? Assuming that Python will have enough of an idea about the external
>>> transform and what options it will need, the necessary bits could be converted
>>> to arguments and be part of that partially constructed proto.
>>>


 On Sun, Aug 18, 2019 at 12:01 PM enrico canzonieri  
 wrote:
>
> I found the tracking ticket at BEAM-7966
>
> On Sun, Aug 18, 2019 at 11:59 AM enrico canzonieri 
>  wrote:
>>
>> Is this alternative still being considered? Creating a portable jar 
>> sounds like a good solution to re-use the existing runner specific 
>> deployment mechanism (e.g. Flink k8s operator) and in general simplify 
>> the deployment story.
>>
>> On Fri, Aug 9, 2019 at 12:46 AM Robert Bradshaw  
>> wrote:
>>>
>>> The expansion service is a separate service. (The flink jar happens to
>>> bring both up.) However, there is negotiation to receive/validate the
>>> pipeline options.
>>>
>>> On Fri, Aug 9, 2019 at 1:54 AM Thomas Weise  wrote:
>>> >
>>> > We would also need to consider cross-language pipelines that 
>>> > (currently) assume the interaction with an expansion service at 
>>> > construction time.
>>> >
>>> > On Thu, Aug 8, 2019, 4:38 PM Kyle Weaver  wrote:
>>> >>
>>> >> > It might also be useful to have the option to just output the 
>>> >> > proto and artifacts, as alternative to the jar file.
>>> >>
>>> >> Sure, that wouldn't be too big a change if we were to decide to go 
>>> >> the SDK route.
>>> >>
>>> >> > For the Flink entry point we would need to allow for the job 
>>> >> > server to be used as a library.
>>> >>
>>> >> We don't need the whole job server, we only need to add a main 
>>> >> method to FlinkPipelineRunner [1] as the entry point, which would 
>>> >> basically just do the setup described in the doc then call 
>>> >> FlinkPipelineRunner::run.
>>> >>
>>> >> [1] 
>>> >> https://github.com/apache/beam/blob/master/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkPipelineRunner.java#L53
>>> >>
>>> >> Kyle Weaver | Software Engineer | github.com/ibzib | 
>>> >> kcwea...@google.com
>>> >>
>>> >>
>>> >> On Thu, Aug 8, 2019 at 4:21 PM Thomas Weise  wrote:
>>> >>>
>>> >>> Hi Kyle,
>>> >>>
>>> >>> It might also be useful to h

Re: Need advice: PubsubIO external transform PR

2019-08-20 Thread Chad Dombrova
> The issue is also tracked here:
>
> https://jira.apache.org/jira/browse/BEAM-7870 There are some suggestions
> in the issue. I think the best solution is to allow execution of the
> source API parts of KafkaIO/PubSubIO (on the Runner) and the following
> UDFs (in the environment). Since those do not cross the language
> boundary, it is simply a matter of settings this up correctly.
>

Thanks, Max.  I added some feedback to the issue.


Re: [VOTE] Release 2.15.0, release candidate #2

2019-08-20 Thread Hannah Jiang
Yes, I agree this is a separate topic and shouldn't block 2.15 release.
There is already a JIRA ticket, I will update it with more details.

On Tue, Aug 20, 2019 at 11:32 AM Ahmet Altay  wrote:

>
>
> On Tue, Aug 20, 2019 at 10:18 AM Yifan Zou  wrote:
>
>> Hi all,
>>
>> This is a friendly reminder. Please help to review, verify and vote on
>> the release candidate #2 for the version 2.15.0.
>> [ ] +1, Approve the release
>> [ ] -1, Do not approve the release (please provide specific comments)
>>
>> I've verified Java quickstart & mobile games, and Python (both tar and
>> wheel) quickstart with Py27, 35, 36, 37. They worked well.
>>
>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit?pli=1#gid=1036192804
>>
>> Thanks.
>> Yifan
>>
>>
>>
>> On Tue, Aug 20, 2019 at 9:33 AM Hannah Jiang 
>> wrote:
>>
>>> A side note about this test:
>>> Now we only have py2 and py35, so it only fails with py35. I am
>>> introducing minor versions, which will add py36 and py37, and all py3 are
>>> flaky.
>>> It's really difficult to pass Portable Precommit with minor versions,
>>> the chance of passing the test is around 15%.
>>>
>>
> Hannah, let's separate this from the release thread. Is there a JIRA for
> this, could you update it? And perhaps we need different precommits for
> different versions so that flakes do not stack up. Even if a suite is >90%
> reliable, if we stack up with 4 versions, the reliability will get much
> lower.
>
>
>>
>>> On Mon, Aug 19, 2019 at 5:27 PM Ahmet Altay  wrote:
>>>
 Thank you. Unless there are any objections, let's continue with validating
 RC2.

 On Mon, Aug 19, 2019 at 5:21 PM Kyle Weaver 
 wrote:

> I'm not sure if it's worth blocking the release, since I can't
> reproduce the issue on my machine and a fix would be hard to verify.
>
> Kyle Weaver | Software Engineer | github.com/ibzib |
> kcwea...@google.com
>
>
> On Mon, Aug 19, 2019 at 4:56 PM Ahmet Altay  wrote:
>
>> Kyle, are you currently working on this to decide whether it is the
>> blocking case or not? Also is this affecting both release branch and 
>> master
>> branch?
>>
>> On Mon, Aug 19, 2019 at 4:49 PM Kyle Weaver 
>> wrote:
>>
>>> Re BEAM-7993: For some context, there are two possible causes here.
>>> The pessimistic take is that Dockerized SDK workers are taking forever 
>>> to
>>> start. The optimistic take is that the Docker containers are just taking longer
>>> than normal (but not forever) to start on Jenkins, in which case this 
>>> issue
>>> is nothing new to this release. (The halting problem!) If it's the 
>>> second,
>>> it's safe to wait to fix it in the next release.
>>>
>>> Kyle Weaver | Software Engineer | github.com/ibzib |
>>> kcwea...@google.com
>>>
>>>
>>> On Mon, Aug 19, 2019 at 4:42 PM Yifan Zou 
>>> wrote:
>>>
 Mark and Kyle found a py35 portable test which is flaky:
 https://issues.apache.org/jira/browse/BEAM-7993.
 I plan to finalize the release this week. Would that be a blocker?
 Could we include the fix in 2.16?

 Thanks.


 On Mon, Aug 19, 2019 at 4:38 PM Yifan Zou 
 wrote:

> I've run most of validations and they're all good.
> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit?pli=1#gid=1036192804
>
> On Mon, Aug 19, 2019 at 10:59 AM Hannah Jiang <
> hannahji...@google.com> wrote:
>
>> (resending it to dev@)
>> +1, I tested some test cases as well as customized test cases and
>> all looks good. I updated validation sheet.
>>
>> On Mon, Aug 19, 2019 at 10:40 AM Hannah Jiang <
>> hannahji...@google.com> wrote:
>>
>>> +1, I tested some test cases as well as customized test cases
>>> and all looks good. I updated validation sheet.
>>>
>>> On Mon, Aug 19, 2019 at 10:28 AM Ahmet Altay 
>>> wrote:
>>>
 Hi all,

 Please help with validation and voting on the RC2. Let's help
 Yifan to finalize this release.

 Ahmet

 -- Forwarded message -
 From: Ahmet Altay 
 Date: Mon, Aug 19, 2019 at 10:27 AM
 Subject: Re: [VOTE] Release 2.15.0, release candidate #2
 To: dev 


 +1, verified a few python workloads and updated the validation
 sheet.

 Thank you Yifan!

 On Thu, Aug 15, 2019 at 12:49 AM Yifan Zou 
 wrote:

> Hi everyone,
> Please review and vote on the release candidate #2 for the
> version 2.15.0, as follows:
> [ ] +1, App

Re: [VOTE] Release 2.15.0, release candidate #2

2019-08-20 Thread Ahmet Altay
On Tue, Aug 20, 2019 at 10:18 AM Yifan Zou  wrote:

> Hi all,
>
> This is a friendly reminder. Please help to review, verify and vote on
> the release candidate #2 for the version 2.15.0.
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
>
> I've verified Java quickstart & mobile games, and Python (both tar and
> wheel) quickstart with Py27, 35, 36, 37. They worked well.
>
> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit?pli=1#gid=1036192804
>
> Thanks.
> Yifan
>
>
>
> On Tue, Aug 20, 2019 at 9:33 AM Hannah Jiang 
> wrote:
>
>> A side note about this test:
>> Now we only have py2 and py35, so it only fails with py35. I am
>> introducing minor versions, which will add py36 and py37, and all py3 are
>> flaky.
>> It's really difficult to pass Portable Precommit with minor versions, the
>> chance of passing the test is around 15%.
>>
>
Hannah, let's separate this from the release thread. Is there a JIRA for
this, could you update it? And perhaps we need different precommits for
different versions so that flakes do not stack up. Even if a suite is >90%
reliable, if we stack up with 4 versions, the reliability will get much
lower.


>
>> On Mon, Aug 19, 2019 at 5:27 PM Ahmet Altay  wrote:
>>
>>> Thank you. Unless there are any objections, let's continue with validating
>>> RC2.
>>>
>>> On Mon, Aug 19, 2019 at 5:21 PM Kyle Weaver  wrote:
>>>
 I'm not sure if it's worth blocking the release, since I can't
 reproduce the issue on my machine and a fix would be hard to verify.

 Kyle Weaver | Software Engineer | github.com/ibzib |
 kcwea...@google.com


 On Mon, Aug 19, 2019 at 4:56 PM Ahmet Altay  wrote:

> Kyle, are you currently working on this to decide whether it is the
> blocking case or not? Also is this affecting both release branch and 
> master
> branch?
>
> On Mon, Aug 19, 2019 at 4:49 PM Kyle Weaver 
> wrote:
>
>> Re BEAM-7993: For some context, there are two possible causes here.
>> The pessimistic take is that Dockerized SDK workers are taking forever to
>> start. The optimistic take is that the Docker containers are just taking longer
>> than normal (but not forever) to start on Jenkins, in which case this 
>> issue
>> is nothing new to this release. (The halting problem!) If it's the 
>> second,
>> it's safe to wait to fix it in the next release.
>>
>> Kyle Weaver | Software Engineer | github.com/ibzib |
>> kcwea...@google.com
>>
>>
>> On Mon, Aug 19, 2019 at 4:42 PM Yifan Zou 
>> wrote:
>>
>>> Mark and Kyle found a py35 portable test which is flaky:
>>> https://issues.apache.org/jira/browse/BEAM-7993.
>>> I plan to finalize the release this week. Would that be a blocker?
>>> Could we include the fix in 2.16?
>>>
>>> Thanks.
>>>
>>>
>>> On Mon, Aug 19, 2019 at 4:38 PM Yifan Zou 
>>> wrote:
>>>
 I've run most of validations and they're all good.
 https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit?pli=1#gid=1036192804

 On Mon, Aug 19, 2019 at 10:59 AM Hannah Jiang <
 hannahji...@google.com> wrote:

> (resending it to dev@)
> +1, I tested some test cases as well as customized test cases and
> all looks good. I updated validation sheet.
>
> On Mon, Aug 19, 2019 at 10:40 AM Hannah Jiang <
> hannahji...@google.com> wrote:
>
>> +1, I tested some test cases as well as customized test cases and
>> all looks good. I updated validation sheet.
>>
>> On Mon, Aug 19, 2019 at 10:28 AM Ahmet Altay 
>> wrote:
>>
>>> Hi all,
>>>
>>> Please help with validation and voting on the RC2. Let's help
>>> Yifan to finalize this release.
>>>
>>> Ahmet
>>>
>>> -- Forwarded message -
>>> From: Ahmet Altay 
>>> Date: Mon, Aug 19, 2019 at 10:27 AM
>>> Subject: Re: [VOTE] Release 2.15.0, release candidate #2
>>> To: dev 
>>>
>>>
>>> +1, verified a few python workloads and updated the validation
>>> sheet.
>>>
>>> Thank you Yifan!
>>>
>>> On Thu, Aug 15, 2019 at 12:49 AM Yifan Zou 
>>> wrote:
>>>
 Hi everyone,
 Please review and vote on the release candidate #2 for the
 version 2.15.0, as follows:
 [ ] +1, Approve the release
 [ ] -1, Do not approve the release (please provide specific
 comments)


 The complete staging area is available for your review, which
 includes:
 * JIRA release notes [1].
 * The official Apache source release to be deployed to
>

Re: contributor permission in jira and hello

2019-08-20 Thread Pablo Estrada
Ah this is a great feature. Thanks for looking into it!

On Tue, Aug 20, 2019 at 12:44 AM Ismaël Mejía  wrote:

> Hello Canaan,
>
> Welcome! You were added to the contributors role and the ticket was
> assigned to you too. Now you can also self assign JIRAs if you want to
> contribute in other areas.
>
>
> On Mon, Aug 19, 2019 at 10:01 PM Canaan Silberberg 
> wrote:
> >
> > HI all
> >
> > I'm working with beam's BigQueryIO over at Etsy Inc. and we're
> interested in this feature: https://issues.apache.org/jira/browse/BEAM-876
> >
> > ...which I see is unassigned. I'd like to implement and contribute it.
> Could I have permission to self-assign it in Jira? (my jira user name is
> ziel)
>


Re: Need advice: PubsubIO external transform PR

2019-08-20 Thread Chamikara Jayalath
On Tue, Aug 20, 2019 at 4:29 AM Maximilian Michels  wrote:

> Hi Chad!
>
> Thank you so much for your feedback. You are 100% on the right track.
> What you are seeing is a core issue that also needs to be solved for
> KafkaIO to be fully usable in other SDKs. I haven't had much time to
> work on this in the past weeks but now is the time :)
>
> The cross-language implementation recycles source connectors (like
> KafkaIO.Read) which use the legacy source interface. This interface does
> not exist anymore in portability. The current portability architecture
> assumes that UDFs (this includes KafkaIO.Read) run in an environment, and
> not directly on the Runner. This causes issues when there are multiple
> transforms associated with a connector which run with and without an
> environment, e.g. KafkaIO.Read or PubSubIO.Read.


> The issue is also tracked here:
> https://jira.apache.org/jira/browse/BEAM-7870 There are some suggestions
> in the issue. I think the best solution is to allow execution of the
> source API parts of KafkaIO/PubSubIO (on the Runner) and the following
> UDFs (in the environment). Since those do not cross the language
> boundary, it is simply a matter of setting this up correctly.
>

I think another way to fix this would be to introduce an UnboundedSource-to-SDF
converter once we have SDF support for unbounded sources in the Java SDK. I
think +Lukasz Cwik already added something like this for the Python SDK. This
would allow KafkaIO.Read to execute in the Java SDK environment instead of on
the runner.
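
To make the idea a bit more concrete, here is a deliberately naive sketch of
reading from an UnboundedSource inside a plain DoFn. It is not the actual
SDF-based converter (which needs restriction tracking, checkpointing and
watermark handling); the class name and the per-bundle cap are made up for
illustration, and only the UnboundedSource/UnboundedReader calls are real Beam
APIs:

import java.io.IOException;
import org.apache.beam.sdk.io.UnboundedSource;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.transforms.DoFn;

class NaiveUnboundedReadFn<T, C extends UnboundedSource.CheckpointMark>
    extends DoFn<UnboundedSource<T, C>, T> {

  private static final int MAX_RECORDS_PER_BUNDLE = 1000; // arbitrary cap for the sketch

  @ProcessElement
  public void process(ProcessContext ctx) throws IOException {
    UnboundedSource<T, C> source = ctx.element();
    PipelineOptions options = ctx.getPipelineOptions();
    // Start reading without a checkpoint; a real converter would resume from
    // the checkpoint mark / restriction handed to it by the runner.
    try (UnboundedSource.UnboundedReader<T> reader = source.createReader(options, null)) {
      boolean available = reader.start();
      int emitted = 0;
      while (available && emitted < MAX_RECORDS_PER_BUNDLE) {
        ctx.outputWithTimestamp(reader.getCurrent(), reader.getCurrentTimestamp());
        emitted++;
        available = reader.advance();
      }
      // A real SDF would persist reader.getCheckpointMark() and resume later
      // instead of simply dropping the reader here.
    }
  }
}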

Thanks,
Cham


>
> -Max
>
> On 20.08.19 03:45, Chad Dombrova wrote:
> >
> > I don't understand why this replacement is necessary, since
> > the next transform in the chain is a java ParDo that seems
> > like it should be fully capable of using
> > PubsubMessageWithAttributesCoder.
> >
> >
> > Not too familiar with Flink, but have you tried using PubSub source
> > from a pure Java Flink pipeline ? I expect this to work but haven't
> > tested it. If that works agree that the coder replacement sounds
> > strange.
> >
> >
> > Just spoke with my co-worker and he confirmed that Pubsub Read works in
> > pure Java on Flink.  We're going to test sending the same Java pipeline
> > through the portable runner next.  Our hypothesis is that it will /not/
> > work.  If it /does/, then that means it's something specific to how the
> > transforms are created when using the expansion service.
> >
> > -chad
> >
> >
>


Re: [VOTE] Release 2.15.0, release candidate #2

2019-08-20 Thread Yifan Zou
Hi all,

This is a friendly reminder. Please help to review, verify and vote on the
release candidate #2 for the version 2.15.0.
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)

I've verified Java quickstart & mobile games, and Python (both tar and
wheel) quickstart with Py27, 35, 36, 37. They worked well.
https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit?pli=1#gid=1036192804

Thanks.
Yifan



On Tue, Aug 20, 2019 at 9:33 AM Hannah Jiang  wrote:

> A side note about this test:
> Now we only have py2 and py35, so it only fails with py35. I am
> introducing minor versions, which will add py36 and py37, and all py3 are
> flaky.
> It's really difficult to pass Portable Precommit with minor versions, the
> chance of passing the test is around 15%.
>
> On Mon, Aug 19, 2019 at 5:27 PM Ahmet Altay  wrote:
>
>> Thank you. Unless there are any objections, let's continue with validating
>> RC2.
>>
>> On Mon, Aug 19, 2019 at 5:21 PM Kyle Weaver  wrote:
>>
>>> I'm not sure if it's worth blocking the release, since I can't reproduce
>>> the issue on my machine and a fix would be hard to verify.
>>>
>>> Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com
>>>
>>>
>>> On Mon, Aug 19, 2019 at 4:56 PM Ahmet Altay  wrote:
>>>
 Kyle, are you currently working on this to decide whether it is the
 blocking case or not? Also is this affecting both release branch and master
 branch?

 On Mon, Aug 19, 2019 at 4:49 PM Kyle Weaver 
 wrote:

> Re BEAM-7993: For some context, there are two possible causes here.
> The pessimistic take is that Dockerized SDK workers are taking forever to
> start. The optimistic take is that the Docker containers are just taking
> longer than normal (but not forever) to start on Jenkins, in which case this
> issue is nothing new to this release. (The halting problem!) If it's the
> second, it's safe to wait to fix it in the next release.
>
> Kyle Weaver | Software Engineer | github.com/ibzib |
> kcwea...@google.com
>
>
> On Mon, Aug 19, 2019 at 4:42 PM Yifan Zou  wrote:
>
>> Mark and Kyle found a py35 portable test which is flaky:
>> https://issues.apache.org/jira/browse/BEAM-7993.
>> I plan to finalize the release this week. Would that be a blocker?
>> Could we include the fix in 2.16?
>>
>> Thanks.
>>
>>
>> On Mon, Aug 19, 2019 at 4:38 PM Yifan Zou 
>> wrote:
>>
>>> I've run most of validations and they're all good.
>>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit?pli=1#gid=1036192804
>>>
>>> On Mon, Aug 19, 2019 at 10:59 AM Hannah Jiang <
>>> hannahji...@google.com> wrote:
>>>
 (resending it to dev@)
 +1, I tested some test cases as well as customized test cases and
 all looks good. I updated validation sheet.

 On Mon, Aug 19, 2019 at 10:40 AM Hannah Jiang <
 hannahji...@google.com> wrote:

> +1, I tested some test cases as well as customized test cases and
> all looks good. I updated validation sheet.
>
> On Mon, Aug 19, 2019 at 10:28 AM Ahmet Altay 
> wrote:
>
>> Hi all,
>>
>> Please help with validation and voting on the RC2. Let's help
>> Yifan to finalize this release.
>>
>> Ahmet
>>
>> -- Forwarded message -
>> From: Ahmet Altay 
>> Date: Mon, Aug 19, 2019 at 10:27 AM
>> Subject: Re: [VOTE] Release 2.15.0, release candidate #2
>> To: dev 
>>
>>
>> +1, verified a few python workloads and updated the validation
>> sheet.
>>
>> Thank you Yifan!
>>
>> On Thu, Aug 15, 2019 at 12:49 AM Yifan Zou 
>> wrote:
>>
>>> Hi everyone,
>>> Please review and vote on the release candidate #2 for the
>>> version 2.15.0, as follows:
>>> [ ] +1, Approve the release
>>> [ ] -1, Do not approve the release (please provide specific
>>> comments)
>>>
>>>
>>> The complete staging area is available for your review, which
>>> includes:
>>> * JIRA release notes [1].
>>> * The official Apache source release to be deployed to
>>> dist.apache.org [2], which is signed with the key with
>>> fingerprint AC9DB4F14CC90F37080F2C5B6D18F9A7F8DA25E1 [3].
>>> * All artifacts to be deployed to the Maven Central Repository
>>> [4].
>>> * Source code tag "v2.15.0-RC2" [5].
>>> * Website pull request listing the release [6], publishing the
>>> API reference manual [7], and the blog post [8].
>>> * Python artifacts are deployed along with the source release to
>>> the dist.apache.org [2].
>

Re: (mini-doc) Beam (Flink) portable job templates

2019-08-20 Thread Thomas Weise
On Tue, Aug 20, 2019 at 8:56 AM Lukasz Cwik  wrote:

>
>
> On Mon, Aug 19, 2019 at 5:52 PM Ahmet Altay  wrote:
>
>>
>>
>> On Sun, Aug 18, 2019 at 12:34 PM Thomas Weise  wrote:
>>
>>> There is a PR open for this: https://github.com/apache/beam/pull/9331
>>>
>>> (it wasn't tagged with the JIRA and therefore not linked)
>>>
>>> I think it is worthwhile to explore how we could further detangle the
>>> client side Python and Java dependencies.
>>>
>>> The expansion service is one more dependency to consider in a build
>>> environment. Is it really necessary to expand external transforms prior to
>>> submission to the job service?
>>>
>>
>> +1, this will make it easier to use external transforms from the already
>> familiar client environments.
>>
>>
>
> The intent is to make it so that you CAN (not MUST) run an expansion
> service separate from a Runner. Creating a single endpoint that hosts both
> the Job and Expansion service is something that gRPC does very easily since
> you can host multiple service definitions on a single port.
>

Yes, that's fine. The point here is when the expansion occurs. I believe
the runner can also invoke the expansion service, thereby eliminating the
expansion service interaction from the client side.



>
>
>>
>>> Can we come up with a partially constructed proto that can be produced
>>> by just running the Python entry point? Note this would also require
>>> pushing the pipeline options parsing into the job service.
>>>
>>
>> Why would this require pushing the pipeline options parsing to the job
>> service? Assuming that Python has enough of an idea about what options the
>> external transform will need, the necessary bit could be converted to
>> arguments and be part of that partially constructed proto.
>>
>>
>>>
>>> On Sun, Aug 18, 2019 at 12:01 PM enrico canzonieri <
>>> ecanzoni...@gmail.com> wrote:
>>>
 I found the tracking ticket at BEAM-7966
 

 On Sun, Aug 18, 2019 at 11:59 AM enrico canzonieri <
 ecanzoni...@gmail.com> wrote:

> Is this alternative still being considered? Creating a portable jar
> sounds like a good solution to re-use the existing runner specific
> deployment mechanism (e.g. Flink k8s operator) and in general simplify the
> deployment story.
>
> On Fri, Aug 9, 2019 at 12:46 AM Robert Bradshaw 
> wrote:
>
>> The expansion service is a separate service. (The flink jar happens to
>> bring both up.) However, there is negotiation to receive/validate the
>> pipeline options.
>>
>> On Fri, Aug 9, 2019 at 1:54 AM Thomas Weise  wrote:
>> >
>> > We would also need to consider cross-language pipelines that
>> (currently) assume the interaction with an expansion service at
>> construction time.
>> >
>> > On Thu, Aug 8, 2019, 4:38 PM Kyle Weaver 
>> wrote:
>> >>
>> >> > It might also be useful to have the option to just output the
>> proto and artifacts, as alternative to the jar file.
>> >>
>> >> Sure, that wouldn't be too big a change if we were to decide to go
>> the SDK route.
>> >>
>> >> > For the Flink entry point we would need to allow for the job
>> server to be used as a library.
>> >>
>> >> We don't need the whole job server, we only need to add a main
>> method to FlinkPipelineRunner [1] as the entry point, which would basically
>> just do the setup described in the doc then call FlinkPipelineRunner::run.
>> >>
>> >> [1]
>> https://github.com/apache/beam/blob/master/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkPipelineRunner.java#L53
>> >>
>> >> Kyle Weaver | Software Engineer | github.com/ibzib |
>> kcwea...@google.com
>> >>
>> >>
>> >> On Thu, Aug 8, 2019 at 4:21 PM Thomas Weise 
>> wrote:
>> >>>
>> >>> Hi Kyle,
>> >>>
>> >>> It might also be useful to have the option to just output the
>> proto and artifacts, as alternative to the jar file.
>> >>>
>> >>> For the Flink entry point we would need to allow for the job
>> server to be used as a library. It would probably not be too hard to have
>> the Flink job constructed via the context execution environment, which
>> would require no changes on the Flink side.
>> >>>
>> >>> Thanks,
>> >>> Thomas
>> >>>
>> >>>
>> >>> On Thu, Aug 8, 2019 at 9:52 AM Kyle Weaver 
>> wrote:
>> 
>>  Re Javaless/serverless solution:
>>  I take it this would probably mean that we would construct the
>> jar directly from the SDK. There are advantages to this: full separation of
>> Python and Java environments, no need for a job server, and likely a
>> simpler implementation, since we'd no longer have to work within the
>> constraints of the existing job server infrastructure. The only downside I
>> can think of is the additional cost of implementing/maintaining jar creation
>> code in each SDK, but that cost may be acceptable if it's simple enough.

Re: [VOTE] Release 2.15.0, release candidate #2

2019-08-20 Thread Hannah Jiang
A side note about this test:
Now we only have py2 and py35, so it only fails with py35. I am introducing
minor versions, which will add py36 and py37, and all py3 are flaky.
It's really difficult to pass Portable Precommit with minor versions, the
chance of passing the test is around 15%.

On Mon, Aug 19, 2019 at 5:27 PM Ahmet Altay  wrote:

> Thank you. Unless there are any objections, let's continue with validating
> RC2.
>
> On Mon, Aug 19, 2019 at 5:21 PM Kyle Weaver  wrote:
>
>> I'm not sure if it's worth blocking the release, since I can't reproduce
>> the issue on my machine and a fix would be hard to verify.
>>
>> Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com
>>
>>
>> On Mon, Aug 19, 2019 at 4:56 PM Ahmet Altay  wrote:
>>
>>> Kyle, are you currently working on this to decide whether it is the
>>> blocking case or not? Also is this affecting both release branch and master
>>> branch?
>>>
>>> On Mon, Aug 19, 2019 at 4:49 PM Kyle Weaver  wrote:
>>>
 Re BEAM-7993: For some context, there are two possible causes here. The
 pessimistic take is that Dockerized SDK workers are taking forever to
 start. The optimistic take is that the Docker containers are just taking longer
 than normal (but not forever) to start on Jenkins, in which case this issue
 is nothing new to this release. (The halting problem!) If it's the second,
 it's safe to wait to fix it in the next release.

 Kyle Weaver | Software Engineer | github.com/ibzib |
 kcwea...@google.com


 On Mon, Aug 19, 2019 at 4:42 PM Yifan Zou  wrote:

> Mark and Kyle found a py35 portable test which is flaky:
> https://issues.apache.org/jira/browse/BEAM-7993.
> I plan to finalize the release this week. Would that be a blocker?
> Could we include the fix in 2.16?
>
> Thanks.
>
>
> On Mon, Aug 19, 2019 at 4:38 PM Yifan Zou  wrote:
>
>> I've run most of validations and they're all good.
>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit?pli=1#gid=1036192804
>>
>> On Mon, Aug 19, 2019 at 10:59 AM Hannah Jiang 
>> wrote:
>>
>>> (resending it to dev@)
>>> +1, I tested some test cases as well as customized test cases and
>>> all looks good. I updated validation sheet.
>>>
>>> On Mon, Aug 19, 2019 at 10:40 AM Hannah Jiang <
>>> hannahji...@google.com> wrote:
>>>
 +1, I tested some test cases as well as customized test cases and
 all looks good. I updated validation sheet.

 On Mon, Aug 19, 2019 at 10:28 AM Ahmet Altay 
 wrote:

> Hi all,
>
> Please help with validation and voting on the RC2. Let's help
> Yifan to finalize this release.
>
> Ahmet
>
> -- Forwarded message -
> From: Ahmet Altay 
> Date: Mon, Aug 19, 2019 at 10:27 AM
> Subject: Re: [VOTE] Release 2.15.0, release candidate #2
> To: dev 
>
>
> +1, verified a few python workloads and updated the validation
> sheet.
>
> Thank you Yifan!
>
> On Thu, Aug 15, 2019 at 12:49 AM Yifan Zou 
> wrote:
>
>> Hi everyone,
>> Please review and vote on the release candidate #2 for the
>> version 2.15.0, as follows:
>> [ ] +1, Approve the release
>> [ ] -1, Do not approve the release (please provide specific
>> comments)
>>
>>
>> The complete staging area is available for your review, which
>> includes:
>> * JIRA release notes [1].
>> * The official Apache source release to be deployed to
>> dist.apache.org [2], which is signed with the key with
>> fingerprint AC9DB4F14CC90F37080F2C5B6D18F9A7F8DA25E1 [3].
>> * All artifacts to be deployed to the Maven Central Repository
>> [4].
>> * Source code tag "v2.15.0-RC2" [5].
>> * Website pull request listing the release [6], publishing the
>> API reference manual [7], and the blog post [8].
>> * Python artifacts are deployed along with the source release to
>> the dist.apache.org [2].
>> * Validation sheet with a tab for 2.15.0 release to help with
>> validation [9].
>>
>> The vote will be open for at least 72 hours. It is adopted by
>> majority approval, with at least 3 PMC affirmative votes.
>>
>> Thanks,
>> Yifan Zou
>> [1]
>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12345489
>> [2] https://dist.apache.org/repos/dist/dev/beam/2.15.0/
>> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>> [4]
>> https://repository.apache.org/content/repositories/orgapachebeam-1082/
>> 

Re: (mini-doc) Beam (Flink) portable job templates

2019-08-20 Thread Lukasz Cwik
On Mon, Aug 19, 2019 at 5:52 PM Ahmet Altay  wrote:

>
>
> On Sun, Aug 18, 2019 at 12:34 PM Thomas Weise  wrote:
>
>> There is a PR open for this: https://github.com/apache/beam/pull/9331
>>
>> (it wasn't tagged with the JIRA and therefore not linked)
>>
>> I think it is worthwhile to explore how we could further detangle the
>> client side Python and Java dependencies.
>>
>> The expansion service is one more dependency to consider in a build
>> environment. Is it really necessary to expand external transforms prior to
>> submission to the job service?
>>
>
> +1, this will make it easier to use external transforms from the already
> familiar client environments.
>
>

The intent is to make it so that you CAN (not MUST) run an expansion
service separate from a Runner. Creating a single endpoint that hosts both
the Job and Expansion service is something that gRPC does very easily since
you can host multiple service definitions on a single port.
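
For illustration, a minimal sketch of that single-endpoint setup (this is not
Beam's actual job-server code; the two BindableService arguments stand in for
whatever JobService and ExpansionService implementations a runner provides):

import io.grpc.BindableService;
import io.grpc.Server;
import io.grpc.ServerBuilder;

public final class SinglePortServices {
  // Registers both gRPC service definitions on the same port and blocks
  // until the server is shut down.
  public static void serve(
      int port, BindableService jobService, BindableService expansionService)
      throws Exception {
    Server server =
        ServerBuilder.forPort(port)
            .addService(jobService)        // e.g. the runner's JobService implementation
            .addService(expansionService)  // e.g. an ExpansionService implementation
            .build()
            .start();
    server.awaitTermination();
  }
}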


>
>> Can we come up with a partially constructed proto that can be produced by
>> just running the Python entry point? Note this would also require pushing
>> the pipeline options parsing into the job service.
>>
>
> Why would this require pushing the pipeline options parsing to the job
> service? Assuming that Python has enough of an idea about what options the
> external transform will need, the necessary bit could be converted to
> arguments and be part of that partially constructed proto.
>
>
>>
>> On Sun, Aug 18, 2019 at 12:01 PM enrico canzonieri 
>> wrote:
>>
>>> I found the tracking ticket at BEAM-7966
>>> 
>>>
>>> On Sun, Aug 18, 2019 at 11:59 AM enrico canzonieri <
>>> ecanzoni...@gmail.com> wrote:
>>>
 Is this alternative still being considered? Creating a portable jar
 sounds like a good solution to re-use the existing runner specific
 deployment mechanism (e.g. Flink k8s operator) and in general simplify the
 deployment story.

 On Fri, Aug 9, 2019 at 12:46 AM Robert Bradshaw 
 wrote:

> The expansion service is a separate service. (The flink jar happens to
> bring both up.) However, there is negotiation to receive/validate the
> pipeline options.
>
> On Fri, Aug 9, 2019 at 1:54 AM Thomas Weise  wrote:
> >
> > We would also need to consider cross-language pipelines that
> (currently) assume the interaction with an expansion service at
> construction time.
> >
> > On Thu, Aug 8, 2019, 4:38 PM Kyle Weaver 
> wrote:
> >>
> >> > It might also be useful to have the option to just output the
> proto and artifacts, as alternative to the jar file.
> >>
> >> Sure, that wouldn't be too big a change if we were to decide to go
> the SDK route.
> >>
> >> > For the Flink entry point we would need to allow for the job
> server to be used as a library.
> >>
> >> We don't need the whole job server, we only need to add a main
> method to FlinkPipelineRunner [1] as the entry point, which would basically
> just do the setup described in the doc then call FlinkPipelineRunner::run.
> >>
> >> [1]
> https://github.com/apache/beam/blob/master/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkPipelineRunner.java#L53
> >>
> >> Kyle Weaver | Software Engineer | github.com/ibzib |
> kcwea...@google.com
> >>
> >>
> >> On Thu, Aug 8, 2019 at 4:21 PM Thomas Weise  wrote:
> >>>
> >>> Hi Kyle,
> >>>
> >>> It might also be useful to have the option to just output the
> proto and artifacts, as alternative to the jar file.
> >>>
> >>> For the Flink entry point we would need to allow for the job
> server to be used as a library. It would probably not be too hard to have
> the Flink job constructed via the context execution environment, which
> would require no changes on the Flink side.
> >>>
> >>> Thanks,
> >>> Thomas
> >>>
> >>>
> >>> On Thu, Aug 8, 2019 at 9:52 AM Kyle Weaver 
> wrote:
> 
>  Re Javaless/serverless solution:
>  I take it this would probably mean that we would construct the
> jar directly from the SDK. There are advantages to this: full separation of
> Python and Java environments, no need for a job server, and likely a
> simpler implementation, since we'd no longer have to work within the
> constraints of the existing job server infrastructure. The only downside I
> can think of is the additional cost of implementing/maintaining jar
> creation code in each SDK, but that cost may be acceptable if it's simple
> enough.
> 
>  Kyle Weaver | Software Engineer | github.com/ibzib |
> kcwea...@google.com
> 
> 
>  On Thu, Aug 8, 2019 at 9:31 AM Thomas Weise 
> wrote:
> >
> >
> >
> > On Thu, Aug 8, 2019 at 8:29 AM Robert Bradshaw <
> rober...

Re: Java 11 compatibility question

2019-08-20 Thread Elliotte Rusty Harold
On Tue, Aug 20, 2019 at 7:51 AM Ismaël Mejía  wrote:

> a per-case approach (the exception could be portable runners not based on
> Java).
>
> Of course, other definitions of being Java 11 compatible are interesting
> but probably not part of our current scope: actions like changing the
> codebase to use Java 11 specific APIs / idioms, publishing Java 11 specific
> artifacts, or using the Java Platform Module System (JPMS). All of these may
> be nice to have but are probably less important for end users, who may just
> want to be able to use Beam in its current form in Java 11 VMs.
>
> What do others think? Is this enough to announce Java 11 compatibility and
> add the documentation to the webpage?
>

No, it isn't, I fear. We don't have to use JPMS in Beam, but Beam really
does need to be compatible with JPMS-using apps. The bare minimum here is
avoiding split packages, and that needs to include all transitive
dependencies, not just Beam itself. I don't think we meet that bar now.
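
To make that concrete, here is a hypothetical module descriptor for an
application that consumes Beam jars as automatic modules. The module names are
illustrative guesses derived from jar file names (i.e. assuming no
Automatic-Module-Name manifest entry), not names Beam actually publishes:

module com.example.pipelines {
  // Without Automatic-Module-Name, these names are derived from the jar file
  // names and can change whenever an artifact is renamed.
  requires beam.sdks.java.core;
  requires beam.runners.direct.java;
}

If any two jars on the module path contain classes in the same package (a
split package), the module system refuses to resolve the graph at compile or
launch time, which is why the transitive dependency tree matters just as much
as Beam's own jars.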

-- 
Elliotte Rusty Harold
elh...@ibiblio.org


Re: Java 11 compatibility question

2019-08-20 Thread Ismaël Mejía
Many different people understand different things by Java 11 compatibility,
so probably the easiest path for us is to define exactly what we (Beam) mean
by being Java 11 compatible.

The definition that Michał gave seems aligned with the current scope: Beam's
published artifacts, compiled with Java 8, work correctly on machines that
run a Java 11 VM (and we validate this with the ValidatesRunner tests +
examples).

By this definition, we may consider that both the Direct runner and Dataflow
are Java 11 compatible, but we cannot argue this for other runners like
Spark or Flink because neither of those natively supports Java 11 yet [1]
[2]. We should probably specify in the CI which runners we validate Java 11
compatibility for, since this will evolve once Flink 1.10 and Spark
3.0 are released.
We could argue that, with portability, supporting Java 11 and newer versions
will be easier, but the runner part is still tightly coupled to the execution
system, so for the moment we will take a per-case approach (the exception
could be portable runners not based on Java).

Of course, other definitions of being Java 11 compatible are interesting but
probably not part of our current scope: actions like changing the codebase to
use Java 11 specific APIs / idioms, publishing Java 11 specific artifacts, or
using the Java Platform Module System (JPMS). All of these may be nice to have
but are probably less important for end users, who may just want to be able to
use Beam in its current form in Java 11 VMs.

What do others think? Is this enough to announce Java 11 compatibility and
add the documentation to the webpage?

Ismaël

[1] https://issues.apache.org/jira/browse/FLINK-10725
[2] https://issues.apache.org/jira/browse/SPARK-24417

ps. Thanks Elliotte for the pointer to the best practices repo, lots of
interesting info there.


On Fri, Aug 9, 2019 at 1:02 PM Elliotte Rusty Harold 
wrote:

> Another useful thing to improve Java 11 support is to add
> Automatic-Module-Name headers to the various JARs Beam publishes:
>
>
> https://github.com/GoogleCloudPlatform/cloud-opensource-java/blob/master/library-best-practices/JLBP-20.md
>
> If a JAR doesn't include this, Java synthesizes one from the name of the
> jar file, and things get wonky fast. This is a low risk change that has no
> effect on non-modular and pre-Java-9 apps.
>
>
> On Wed, Aug 7, 2019 at 9:41 AM Michał Walenia 
> wrote:
>
>> Hi everyone,
>>
>> I want to gather the collective knowledge here about Java 11
>> compatibility and ask about the tests needed to deem Beam compatible with
>> JDK 11.
>>
>> Right now, concerning JDK 11 compatibility testing, I have implemented:
>>
>>- Jenkins jobs for running ValidatesRunner test sets in both Direct
>>and Dataflow runners, [1], [2]
>>- ValidatesRunner portability API tests for Dataflow [3],
>>- examples in normal and portable mode for the Dataflow runner [4],
>>[5].
>>
>>
>> Are these tests sufficient to say that we’re java 11 compatible? What
>> other aspects do we need to test to be able to say that?
>>
>>
>> Regards,
>>
>>
>> Michał
>>
>> [1]
>> https://builds.apache.org/job/beam_PostCommit_Java11_ValidatesRunner_Direct/
>> [2]
>> https://builds.apache.org/job/beam_PostCommit_Java11_ValidatesRunner_Dataflow/
>> [3]
>> https://builds.apache.org/job/beam_PostCommit_Java11_ValidatesRunner_PortabilityApi_Dataflow/
>> [4]
>> https://builds.apache.org/job/beam_PostCommit_Java11_Examples_Dataflow/
>> [5]
>> https://builds.apache.org/job/beam_PostCommit_Java11_Examples_Dataflow_Portability/
>>
>> --
>>
>> Michał Walenia
>> Polidea  | Software Engineer
>>
>> M: +48 791 432 002 <+48791432002>
>> E: michal.wale...@polidea.com
>>
>> Unique Tech
>> Check out our projects! 
>>
>
>
> --
> Elliotte Rusty Harold
> elh...@ibiblio.org
>


Re: Need advice: PubsubIO external transform PR

2019-08-20 Thread Maximilian Michels
Hi Chad!

Thank you so much for your feedback. You are 100% on the right track.
What you are seeing is a core issue that also needs to be solved for
KafkaIO to be fully usable in other SDKs. I haven't had much time to
work on this in the past weeks but now is the time :)

The cross-language implementation recycles source connectors (like
KafkaIO.Read) which use the legacy source interface. This interface does
not exist anymore in portability. The current portability architecture
assumes that UDFs (this includes KafkaIO.Read) run in an environment, and
not directly on the Runner. This causes issues when there are multiple
transforms associated with a connector which run with and without an
environment, e.g. KafkaIO.Read or PubSubIO.Read.

The issue is also tracked here:
https://jira.apache.org/jira/browse/BEAM-7870 There are some suggestions
in the issue. I think the best solution is to allow execution of the
source API parts of KafkaIO/PubSubIO (on the Runner) and the following
UDFs (in the environment). Since those do not cross the language
boundary, it is simply a matter of setting this up correctly.

-Max

On 20.08.19 03:45, Chad Dombrova wrote:
> 
> I don't understand why this replacement is necessary, since
> the next transform in the chain is a java ParDo that seems
> like it should be fully capable of using
> PubsubMessageWithAttributesCoder.
> 
> 
> Not too familiar with Flink, but have you tried using PubSub source
> from a pure Java Flink pipeline ? I expect this to work but haven't
> tested it. If that works agree that the coder replacement sounds
> strange.
> 
> 
> Just spoke with my co-worker and he confirmed that Pubsub Read works in
> pure Java on Flink.  We're going to test sending the same Java pipeline
> through the portable runner next.  Our hypothesis is that it will /not/
> work.  If it /does/, then that means it's something specific to how the
> transforms are created when using the expansion service.
> 
> -chad
> 
> 


Re: contributor permission in jira and hello

2019-08-20 Thread Ismaël Mejía
Hello Canaan,

Welcome! You were added to the contributors role and the ticket was
assigned to you too. Now you can also self assign JIRAs if you want to
contribute in other areas.


On Mon, Aug 19, 2019 at 10:01 PM Canaan Silberberg  wrote:
>
> HI all
>
> I'm working with beam's BigQueryIO over at Etsy Inc. and we're interested in 
> this feature: https://issues.apache.org/jira/browse/BEAM-876
>
> ...which I see is unassigned. I'd like to implement and contribute it. Could 
> I have permission to self-assign it in Jira? (my jira user name is ziel)