Re: [VOTE] Mark 2.7.0 branch as a long term support (LTS) branch

2018-11-09 Thread Kenneth Knowles
+1

On Fri, Nov 9, 2018, 12:04 Scott Wegner wrote:

> +1 Thanks for driving this Ahmet!
>
> On Fri, Nov 9, 2018 at 9:33 AM Chamikara Jayalath 
> wrote:
>
>> +1
>>
>> Thanks,
>> Cham
>>
>> On Fri, Nov 9, 2018 at 9:31 AM Udi Meiri  wrote:
>>
>>> +1
>>>
>>> On Fri, Nov 9, 2018 at 8:31 AM Maximilian Michels 
>>> wrote:
>>>
 +1

 On 09.11.18 09:38, Robert Bradshaw wrote:
 > +1 approve.
 > On Fri, Nov 9, 2018 at 2:47 AM Ahmet Altay  wrote:
 >>
 >> Hi all,
 >>
 >> Please review the following statement:
 >>
 >> "2.7.0 branch will be marked as the long-term-support (LTS) release
 branch. This branch will be supported for a window of 6 months starting
 from the day it is marked as an LTS branch. Beam community will decide on
 which issues will be backported and when patch releases on the branch will
 be made on a case by case basis.
 >>
 >> [ ] +1, Approve
 >> [ ] -1, Do not approve (please provide specific comments)
 >>
 >> Context:
 >> - Discussion on the dev@ list [1].
 >> - Beam's release policies [2].
 >>
 >> Thank you,
 >> Ahmet
 >>
 >> [1]
 https://lists.apache.org/thread.html/f604b2d688ad467ddccd4cf56b664b7309dae78f1bd8849e4bb9aae2@%3Cdev.beam.apache.org%3E
 >> [2] https://beam.apache.org/community/policies/
 >>
 >>
 >>

>>>
>
> --
>
> Got feedback? tinyurl.com/swegner-feedback
>


Re: Roadmap section on IO related features

2018-11-09 Thread Chamikara Jayalath
I created https://github.com/apache/beam/pull/7003, which adds a first
version.

I included in this section the planned connectors mentioned in [1] that
currently have an owner assigned. I did not include connectors that have
already been completed or connectors whose corresponding JIRAs do not have an
owner assigned.

Thanks,
Cham

[1] https://beam.apache.org/documentation/io/built-in/

On Fri, Oct 26, 2018 at 7:52 AM Chamikara Jayalath 
wrote:

> +1 for using the term connectors.
>
> JB, thanks for agreeing to add content to this section.
>
> - Cham
>
> On Thu, Oct 25, 2018 at 8:48 PM Ahmet Altay  wrote:
>
>> I like this idea and also the first option. I agree with making it clear
>> that things about IO are language specific.
>>
>> And +1 to calling it connectors or anything else that will resonate with
>> end users.
>>
>> On Thu, Oct 25, 2018 at 7:59 PM, Jean-Baptiste Onofré 
>> wrote:
>>
>>> Agree, I think connector is a more meaningful name for users.
>>>
>>> IO is more the Beam "internal" wording.
>>>
>>> I will update this section as I have new connectors ( :) ) on the fly.
>>>
>>> Regards
>>> JB
>>>
>>> On 26/10/2018 04:49, Kenneth Knowles wrote:
>>> > My $0.02
>>> >
>>> > "IO" has an established meaning in Beam dev argot but I think on the
>>> web
>>> > page I would use the word "connector" or something more universal.
>>> >
>>> > On Thu, Oct 25, 2018 at 7:39 PM Chamikara Jayalath <
>>> chamik...@google.com
>>> > > wrote:
>>> >
>>> >
>>> > (1) Add a top level IO roadmap.
>>> >
>>> >
>>> > I like this, but it is important on the roadmap to be very clear about
>>> > language / SDK.
>>> >
>>> >
>>> > (2) Add IO roadmap sub-sections under each SDK instead of at top
>>> level.
>>> >
>>> >
>>> > This seems OK to me too. If you are a user and you have some data, you
>>> > just want to see if it is going to be accessible to you soon. You have
>>> > probably already committed to a language.
>>> >
>>> >
>>> > (3) We don't need a IO roadmap since we already have
>>> > https://beam.apache.org/documentation/io/built-in/
>>> >
>>> >
>>> > I think the roadmap / in-progress part should move to the new Roadmap
>>> > and/or the wiki.
>>> >
>>> > Kenn
>>> >
>>> >
>>> >
>>> > WDYT ?
>>> >
>>> > Thanks,
>>> > Cham
>>> >
>>> >
>>> > -- Forwarded message -
>>> > From: *Chamikara Jayalath*
>>> > Date: Thu, Oct 25, 2018 at 7:07 PM
>>> > Subject: Re: [apache/beam] [Website] Add roadmap at top level
>>> (#6718)
>>> > To: apache/beam
>>> > Cc: Chamikara Jayalath, Your activity
>>> >
>>> >
>>> > Ok. Makes sense. Kenn and others, WDYT ? We can start writing down
>>> a
>>> > roadmap for IO if there's no objection. It can include more details
>>> > about some of the proposed IO mentioned in
>>> > https://beam.apache.org/documentation/io/built-in/ as well as
>>> > information on upcoming major IO related efforts such as
>>> > cross-language IO support and SDF (in addition to what will be
>>> > available in portability roadmap for these features).
>>> >
>>>
>>
>>


Re: Design review for supporting AutoValue Coders and conversions to Row

2018-11-09 Thread Reuven Lax
Hi Jeff,

I would suggest a slightly different approach. Instead of generating a
coder, writing a SchemaProvider that generates a schema for AutoValue. Once
a PCollection has a schema, a coder is not needed (as Beam knows how to
encode any type with a schema), and it will work seamlessly with Beam SQL
(in fact you don't need to write a transform to turn it into a Row if a
schema is registered).
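
For illustration, a rough sketch of what that could look like for users,
assuming an AutoValue-aware provider (the AutoValueSchema name below is an
assumption, by analogy with the existing JavaBeanSchema/JavaFieldSchema
providers; it is not an existing class):

import com.google.auto.value.AutoValue;
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.schemas.annotations.DefaultSchema;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

// Hypothetical provider registration; everything else is standard AutoValue.
@DefaultSchema(AutoValueSchema.class)
@AutoValue
public abstract class Transaction {
  public abstract String getBank();
  public abstract double getPurchaseAmount();

  public static Transaction create(String bank, double purchaseAmount) {
    return new AutoValue_Transaction(bank, purchaseAmount);
  }
}

// With the schema registered, no explicit Coder and no hand-written ToRow
// transform are needed; Beam SQL can consume the PCollection directly:
// PCollection<Transaction> txns = ...;
// PCollection<Row> totals = txns.apply(SqlTransform.query(
//     "SELECT bank, SUM(purchaseAmount) AS total FROM PCOLLECTION GROUP BY bank"));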

We already do this for POJOs and basic JavaBeans. I'm happy to help do this
for AutoValue.

Reuven

On Sat, Nov 10, 2018 at 5:50 AM Jeff Klukas  wrote:

> Hi all - I'm looking for some review and commentary on a proposed design
> for providing built-in Coders for AutoValue classes. There's existing
> discussion in BEAM-1891 [0] about using AvroCoder, but that's blocked on an
> incompatibility between AutoValue and Avro's reflection machinery that
> doesn't look resolvable.
>
> I wrote up a design document [1] that instead proposes using AutoValue's
> extension API to automatically generate a Coder for each AutoValue class
> that users generate. A similar technique could be used to generate
> conversions to and from Row for use with BeamSql.
>
> I'd appreciate review of the design and thoughts on whether this seems
> feasible to support within the Beam codebase. I may be missing a simpler
> approach.
>
> [0] https://issues.apache.org/jira/browse/BEAM-1891
> [1]
> https://docs.google.com/document/d/1ucoik4WzUDfilqIz3I1AuMHc1J8DE6iv7gaUCDI42BI/edit?usp=sharing
>


Re: Design review for supporting AutoValue Coders and conversions to Row

2018-11-09 Thread Jeff Klukas
Anton - Thanks for reading and commenting. I've gone as far as creating a
skeleton AutoValue extension to better understand how that API works, but I
don't yet have a working prototype for either of these proposed additions.

I'll move on to prototyping the Coder generation for AutoValue classes if I
get some clear signal from this list that maintaining an AutoValue
extension for generating this code seems like a reasonable path forward.
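
For reference, the entry points of such an extension look roughly like the
sketch below (based on AutoValue's extension API; the class name, the opt-in
check, and the generated source are placeholders, and the actual Coder
generation is omitted):

import com.google.auto.service.AutoService;
import com.google.auto.value.extension.AutoValueExtension;

@AutoService(AutoValueExtension.class)
public class AutoValueCoderExtension extends AutoValueExtension {

  @Override
  public boolean applicable(Context context) {
    // Could be narrowed, e.g. to classes carrying a marker annotation,
    // so the extension doesn't fire for every AutoValue class.
    return true;
  }

  @Override
  public String generateClass(
      Context context, String className, String classToExtend, boolean isFinal) {
    // Return Java source for `className` extending `classToExtend`, e.g. a
    // subclass exposing a static Coder built from context.properties().
    return "...generated source...";
  }
}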

On Fri, Nov 9, 2018 at 7:42 PM Anton Kedin  wrote:

> Hi Jeff,
>
> I think this is a great idea! Thank you for working on the proposal. I
> left a couple of comments in the doc.
>
> Have you tried prototyping this?
>
> Regards,
> Anton
>


Re: Design review for supporting AutoValue Coders and conversions to Row

2018-11-09 Thread Anton Kedin
Hi Jeff,

I think this is a great idea! Thank you for working on the proposal. I left
a couple of comments in the doc.

Have you tried prototyping this?

Regards,
Anton

On Fri, Nov 9, 2018 at 1:50 PM Jeff Klukas  wrote:

> Hi all - I'm looking for some review and commentary on a proposed design
> for providing built-in Coders for AutoValue classes. There's existing
> discussion in BEAM-1891 [0] about using AvroCoder, but that's blocked on an
> incompatibility between AutoValue and Avro's reflection machinery that
> doesn't look resolvable.
>
> I wrote up a design document [1] that instead proposes using AutoValue's
> extension API to automatically generate a Coder for each AutoValue class
> that users generate. A similar technique could be used to generate
> conversions to and from Row for use with BeamSql.
>
> I'd appreciate review of the design and thoughts on whether this seems
> feasible to support within the Beam codebase. I may be missing a simpler
> approach.
>
> [0] https://issues.apache.org/jira/browse/BEAM-1891
> [1]
> https://docs.google.com/document/d/1ucoik4WzUDfilqIz3I1AuMHc1J8DE6iv7gaUCDI42BI/edit?usp=sharing
>


Re: Beam Custom I/O Read Transform

2018-11-09 Thread Chamikara Jayalath
Currently, the Python SDK doesn't have a transform for reading XML files.
Probably your best bet will be to use the Python SDK's file system [1]
abstraction to read XML files from a custom ParDo. Adding a reshuffle
transform [2] after this will also allow Dataflow to better rebalance the
steps that come after reading.
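
A rough sketch of that shape (the file pattern, the XML element name, and the
output format are placeholders, and the XML parsing itself is up to you):

import xml.etree.ElementTree as ET

import apache_beam as beam
from apache_beam.io.filesystems import FileSystems
from apache_beam.transforms.util import Reshuffle


class ReadXmlRecords(beam.DoFn):
  """Opens each matched file via FileSystems and yields one dict per
  <record> element."""

  def process(self, file_metadata):
    with FileSystems.open(file_metadata.path) as f:
      for record in ET.parse(f).getroot().iter('record'):
        yield dict(record.attrib)


with beam.Pipeline() as p:
  _ = (p
       | 'MatchFiles' >> beam.Create(
           FileSystems.match(['gs://my-bucket/*.xml'])[0].metadata_list)
       | 'ReadXml' >> beam.ParDo(ReadXmlRecords())
       # Prevents fusion so later steps can be rebalanced independently.
       | 'Reshuffle' >> Reshuffle())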

Thanks,
Cham

[1]
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/filesystems.py
[2]
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/util.py#L614


On Fri, Nov 9, 2018 at 3:53 PM Sean Schwartz  wrote:

> Hello,
>
> My company, SwiftIQ, uses Google Dataflow for our large-scale data
> processing pipeline. We are currently using Java as our codebase. We are
> looking at Python, but I'm having trouble determining whether our dataflow
> can be supported using Python.
>
> The first step of our pipeline should be an I/O Read Transform of an XML
> file. I see that this package exists in Java; however, I'm not finding it as
> a module in Python.
>
> Is there a Python module that does this? If not, is there a way to write
> our own custom Read Transform that reads an XML file into a PCollection?
>
> A quick response would be greatly appreciated.
>
> Thanks!
>
> Sean Schwartz
>
> --
>
> Sean Schwartz
> Data Engineer
> Cell: 847.772.0240
>


Beam Custom I/O Read Transform

2018-11-09 Thread Sean Schwartz
Hello,

My company, SwiftIQ, uses Google Dataflow for our large-scale data
processing pipeline. We are currently using Java as our codebase. We are
looking at Python, but I'm having trouble determining whether our dataflow
can be supported using Python.

The first step of our pipeline should be an I/O Read Transform of an XML
file. I see that this package exists in Java; however, I'm not finding it as
a module in Python.

Is there a Python module that does this? If not, is there a way to write our
own custom Read Transform that reads an XML file into a PCollection?

A quick response would be greatly appreciated.

Thanks!

Sean Schwartz

-- 

Sean Schwartz
Data Engineer
Cell: 847.772.0240


Design review for supporting AutoValue Coders and conversions to Row

2018-11-09 Thread Jeff Klukas
Hi all - I'm looking for some review and commentary on a proposed design
for providing built-in Coders for AutoValue classes. There's existing
discussion in BEAM-1891 [0] about using AvroCoder, but that's blocked on an
incompatibility between AutoValue and Avro's reflection machinery that
doesn't look resolvable.

I wrote up a design document [1] that instead proposes using AutoValue's
extension API to automatically generate a Coder for each AutoValue class
that users generate. A similar technique could be used to generate
conversions to and from Row for use with BeamSql.

I'd appreciate review of the design and thoughts on whether this seems
feasible to support within the Beam codebase. I may be missing a simpler
approach.

[0] https://issues.apache.org/jira/browse/BEAM-1891
[1]
https://docs.google.com/document/d/1ucoik4WzUDfilqIz3I1AuMHc1J8DE6iv7gaUCDI42BI/edit?usp=sharing


Re: BEAM-6018: memory leak in thread pool instantiation

2018-11-09 Thread Lukasz Cwik
Thanks for the context Dan, that was helpful.
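
For what it's worth, a sketch of the "cache it once" direction mentioned
earlier in the thread, keeping daemon threads but dropping the
getExitingExecutorService wrapper (names are illustrative, not the actual
GcsUtil code):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;

class GcsExecutorHolder {
  // Created once per process instead of once per GcsUtil instance, so there is
  // no per-instantiation leak and no Runtime shutdown hook registered by
  // MoreExecutors.getExitingExecutorService().
  private static final ExecutorService EXECUTOR =
      Executors.newCachedThreadPool(
          new ThreadFactory() {
            @Override
            public Thread newThread(Runnable r) {
              Thread t = new Thread(r, "gcs-util");
              // Daemon threads keep an un-shutdown executor from blocking exit.
              t.setDaemon(true);
              return t;
            }
          });

  static ExecutorService executor() {
    return EXECUTOR;
  }
}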

On Fri, Nov 9, 2018 at 10:09 AM Udi Meiri  wrote:

> The reasoning for the unbounded thread pool is explained as:
> /* The SDK requires an unbounded thread pool because a step may create X
> writers
> * each requiring their own thread to perform the writes otherwise a writer
> may
> * block causing deadlock for the step because the writers buffer is full.
> * Also, the MapTaskExecutor launches the steps in reverse order and
> completes
> * them in forward order thus requiring enough threads so that each step's
> writers
> * can be active.
>
> */
>
>
> https://github.com/apache/beam/blob/17c2da6d981cae9f233aea1e2d6d64259362dd73/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/options/GcsOptions.java#L133-L138
>
> On Thu, Nov 8, 2018 at 11:41 PM Dan Halperin  wrote:
>
>>
>>> On Thu, Nov 8, 2018 at 2:12 PM Udi Meiri  wrote:
>>>
 Both options risk delaying worker shutdown if the executor's shutdown()
 hasn't been called, which is I guess why the executor in GcsOptions.java
 creates daemon threads.

>>>
>> My guess (and it really is a guess at this point) is that this was a fix
>> for DirectRunner issues - want that to exit quickly!
>>
>>
>>>
 On Thu, Nov 8, 2018 at 1:02 PM Lukasz Cwik  wrote:

> Not certain, it looks like we should have been caching the executor
> within the GcsUtil as a static instance instead of creating one each time.
> Could have been missed during code review / slow code changes over time.
> GcsUtil is not well "loved".
>
> On Thu, Nov 8, 2018 at 11:00 AM Udi Meiri  wrote:
>
>> HI,
>> I've identified a memory leak when GcsUtil.java instantiates a
>> ThreadPoolExecutor (https://issues.apache.org/jira/browse/BEAM-6018).
>> The code uses the getExitingExecutorService
>> 
>>  wrapper,
>> which leaks memory. The question is, why is that wrapper necessary
>> if executor.shutdown(); is later unconditionally called?
>>
>


Re: [VOTE] Mark 2.7.0 branch as a long term support (LTS) branch

2018-11-09 Thread Scott Wegner
+1 Thanks for driving this Ahmet!

On Fri, Nov 9, 2018 at 9:33 AM Chamikara Jayalath 
wrote:

> +1
>
> Thanks,
> Cham
>
> On Fri, Nov 9, 2018 at 9:31 AM Udi Meiri  wrote:
>
>> +1
>>
>> On Fri, Nov 9, 2018 at 8:31 AM Maximilian Michels  wrote:
>>
>>> +1
>>>
>>> On 09.11.18 09:38, Robert Bradshaw wrote:
>>> > +1 approve.
>>> > On Fri, Nov 9, 2018 at 2:47 AM Ahmet Altay  wrote:
>>> >>
>>> >> Hi all,
>>> >>
>>> >> Please review the following statement:
>>> >>
>>> >> "2.7.0 branch will be marked as the long-term-support (LTS) release
>>> branch. This branch will be supported for a window of 6 months starting
>>> from the day it is marked as an LTS branch. Beam community will decide on
>>> which issues will be backported and when patch releases on the branch will
>>> be made on a case by case basis.
>>> >>
>>> >> [ ] +1, Approve
>>> >> [ ] -1, Do not approve (please provide specific comments)
>>> >>
>>> >> Context:
>>> >> - Discussion on the dev@ list [1].
>>> >> - Beam's release policies [2].
>>> >>
>>> >> Thank you,
>>> >> Ahmet
>>> >>
>>> >> [1]
>>> https://lists.apache.org/thread.html/f604b2d688ad467ddccd4cf56b664b7309dae78f1bd8849e4bb9aae2@%3Cdev.beam.apache.org%3E
>>> >> [2] https://beam.apache.org/community/policies/
>>> >>
>>> >>
>>> >>
>>>
>>

-- 

Got feedback? tinyurl.com/swegner-feedback


Re: [DISCUSS] More precision supported by DATETIME field in Schema

2018-11-09 Thread Lukasz Cwik
Added a few folks for visibility.

On Fri, Nov 9, 2018 at 12:43 AM Robert Bradshaw  wrote:

> We *might* have a few bits left in the WindowedValue representation to
> make this backwards compatible if we really wanted.
>
> The use of java.time.Instant means that we won't be able to upgrade
> (even in v3) our internal timestamps to match without either
> internally supporting >64 bits of precision or limiting the date
> range. But using the standard Java time does make a lot of sense.
> On Fri, Nov 9, 2018 at 12:33 AM Rui Wang  wrote:
> >
> > https://github.com/apache/beam/pull/6991
> >
> > I am using java.time.Instant as the internal representation to replace
> Joda time for DateTime field in the PR. java.time.Instant uses a long
> to save seconds-after-epoch and an int to save nanoseconds-of-second.
> Therefore 64 bits are fully used for seconds-after-epoch, which loses
> nothing.
> >
> > Comments are very welcome to this PR.
> >
> > -Rui
> >
> > On Wed, Nov 7, 2018 at 1:15 AM Reuven Lax  wrote:
> >>
> >> As you said, this would be update incompatible across all streaming
> pipelines. At the very least this would be a big problem for Dataflow
> users, and I believe many Flink users as well. I'm not sure the benefit
> here justifies causing problems for so many users.
> >>
> >> Reuven
> >>
> >> On Wed, Nov 7, 2018 at 4:56 PM Robert Bradshaw 
> wrote:
> >>>
> >>> Yes, microseconds is a good compromise for covering a long enough
> >>> timespan that there's little reason it could be hit (even for
> >>> processing historical data).
> >>>
> >>> Regarding backwards compatibility, could we just change the internal
> >>> representation of Beam's element timestamps, possibly with new APIs to
> >>> access the finer granularity? (True, it may not be upgrade
> >>> compatible.)
> >>> On Tue, Nov 6, 2018 at 8:46 PM Reuven Lax  wrote:
> >>> >
> >>> > The main difference (though possibly theoretical) is when time runs
> out. With 64 bits and nanosecond precision, we can only represent times
> about 244 years in the future (or the past).
> >>> >
> >>> > On Tue, Nov 6, 2018 at 11:30 AM Kenneth Knowles 
> wrote:
> >>> >>
> >>> >> I like nanoseconds as extremely future-proof. What about specing
> this out in stages (1) domain of values (2) portable encoding that can
> represent those values (3) language-specific types to embed the values in.
> >>> >>
> >>> >> 1. If it is a nanosecond-precision absolute time, and we eventually
> want to migrate event time timestamps to match, then we need values for
> "end of global window" and "end of time". TBH I am not sure we need both of
> these any more. We can either define a max on the nanosecond range or
> create distinguished values.
> >>> >>
> >>> >> 2. For portability, presumably an order-preserving integer encoding
> of nanoseconds since epoch with whatever tweaks to allow for representing
> the end of time. It might be useful to find a way to allow multiple. Not
> super useful at a particular version, but might have given us a migration
> path. It would also allow experiments for performance.
> >>> >>
> >>> >> 3. We could probably find a way to keep user-facing API
> compatibility here while increasing underlying precision at 1 and 2, but it's
> probably not worth it. A new Java type IMO addresses the lossiness issue
> because a user would have to explicitly request truncation to assign to a
> millis event time timestamp.
> >>> >>
> >>> >> Kenn
> >>> >>
> >>> >> On Tue, Nov 6, 2018 at 12:55 AM Charles Chen 
> wrote:
> >>> >>>
> >>> >>> Is the proposal to do this for both Beam Schema DATETIME fields as
> well as for Beam timestamps in general?  The latter likely has a bunch of
> downstream consequences for all runners.
> >>> >>>
> >>> >>> On Tue, Nov 6, 2018 at 12:38 AM Ismaël Mejía 
> wrote:
> >>> 
> >>>  +1 to more precision even to the nano level, probably via Reuven's
> >>>  proposal of a different internal representation.
> >>>  On Tue, Nov 6, 2018 at 9:19 AM Robert Bradshaw <
> rober...@google.com> wrote:
> >>>  >
> >>>  > +1 to offering more granular timestamps in general. I think it
> will be
> >>>  > odd if setting the element timestamp from a row DATETIME field
> is
> >>>  > lossy, so we should seriously consider upgrading that as well.
> >>>  > On Tue, Nov 6, 2018 at 6:42 AM Charles Chen 
> wrote:
> >>>  > >
> >>>  > > One related issue that came up before is that we (perhaps
> unnecessarily) restrict the precision of timestamps in the Python SDK to
> milliseconds because of legacy reasons related to the Java runner's use of
> Joda time.  Perhaps Beam portability should natively use a more granular
> timestamp unit.
> >>>  > >
> >>>  > > On Mon, Nov 5, 2018 at 9:34 PM Rui Wang 
> wrote:
> >>>  > >>
> >>>  > >> Thanks Reuven!
> >>>  > >>
> >>>  > >> I think Reuven gives the third option:
> >>>  > >>
> >>>  > >> Change internal representation of DATETIME field in Row.
> Still keep public 

Re: BEAM-6018: memory leak in thread pool instantiation

2018-11-09 Thread Udi Meiri
The reasoning for the unbounded thread pool is explained as:
/* The SDK requires an unbounded thread pool because a step may create X
writers
* each requiring their own thread to perform the writes otherwise a writer
may
* block causing deadlock for the step because the writers buffer is full.
* Also, the MapTaskExecutor launches the steps in reverse order and
completes
* them in forward order thus requiring enough threads so that each step's
writers
* can be active.

*/

https://github.com/apache/beam/blob/17c2da6d981cae9f233aea1e2d6d64259362dd73/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/options/GcsOptions.java#L133-L138

On Thu, Nov 8, 2018 at 11:41 PM Dan Halperin  wrote:

>
>> On Thu, Nov 8, 2018 at 2:12 PM Udi Meiri  wrote:
>>
>>> Both options risk delaying worker shutdown if the executor's shutdown()
>>> hasn't been called, which is I guess why the executor in GcsOptions.java
>>> creates daemon threads.
>>>
>>
> My guess (and it really is a guess at this point) is that this was a fix
> for DirectRunner issues - want that to exit quickly!
>
>
>>
>>> On Thu, Nov 8, 2018 at 1:02 PM Lukasz Cwik  wrote:
>>>
 Not certain, it looks like we should have been caching the executor
 within the GcsUtil as a static instance instead of creating one each time.
 Could have been missed during code review / slow code changes over time.
 GcsUtil is not well "loved".

 On Thu, Nov 8, 2018 at 11:00 AM Udi Meiri  wrote:

> HI,
> I've identified a memory leak when GcsUtil.java instantiates a
> ThreadPoolExecutor (https://issues.apache.org/jira/browse/BEAM-6018).
> The code uses the getExitingExecutorService
> 
>  wrapper,
> which leaks memory. The question is, why is that wrapper necessary
> if executor.shutdown(); is later unconditionally called?
>





Re: Performance of BeamFnData between Python and Java

2018-11-09 Thread Xinyu Liu
Really appreciate the pointers! Looks like the next step is to try
increasing our bundle size. We will do some experiments on our side and
report back later.

@Robert: thanks a lot for the details on protobuf. It was pretty surprising
to us that decoding protobuf messages slows down performance so much. For
your question, we are not using the default Docker container, since here
we need to use LinkedIn's Python packaging and deployment. AFAIK it uses
Cython to compile the code.
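
In case it helps others following along, a quick sanity check of which
protobuf backend is active (whether the cpp backend is available depends on
how protobuf was installed in the worker environment, so this is only a
check, not a guarantee):

# Must be set before any protobuf-generated module is imported.
import os
os.environ.setdefault('PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION', 'cpp')

from google.protobuf.internal import api_implementation

# Prints 'cpp' when the C++ bindings are in use, 'python' for the pure-Python
# fallback (the slow path discussed above).
print(api_implementation.Type())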

Thanks,
Xinyu

On Thu, Nov 8, 2018 at 3:11 PM Robert Bradshaw  wrote:

> I'd assume you're compiling the code with Cython as well? (If you're
> using the default containers, that should be fine.)
> On Fri, Nov 9, 2018 at 12:09 AM Robert Bradshaw 
> wrote:
> >
> > Very cool to hear of this progress on Samza!
> >
> > Python protocol buffers are extraordinarily slow (lots of reflection,
> > attributes lookups, and bit fiddling for serialization/deserialization
> > that is certainly not Python's strong point). Each bundle processed
> > involves multiple protos being constructed and sent/received (notably
> > the particularly nested and branchy monitoring info one). While there
> > are still some improvements that could be made for making bundles
> > lighter-weight, amortizing this cost over many elements is essential
> > for performance. (Note that elements within a bundle are packed into a
> > single byte buffer, so avoid this overhead.)
> >
> > Also, it may be good to guarantee you're at least using the C++
> > bindings:
> https://developers.google.com/protocol-buffers/docs/reference/python-generated
> > (still slow, but not as slow).
> >
> > And, of course, due to the GIL one may want many python workers for
> > multi-core machines.
> >
> > On Thu, Nov 8, 2018 at 9:18 PM Thomas Weise  wrote:
> > >
> > > We have been doing some end to end testing with Python and Flink
> (streaming). You could take a look at the following and possibly replicate
> it for your work:
> > >
> > >
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/flink/flink_streaming_impulse.py
> > >
> > > We found that in order to get acceptable performance, we need larger
> bundles (we started with single element bundles). Default in the Flink
> runner now is to cap bundles at 1000 elements or 1 second, whatever comes
> first. With that, I have seen decent throughput for the pipeline above (~
> 5000k elements per second with single SDK worker).
> > >
> > > The Flink runner also has support to run multiple SDK workers per
> Flink task manager.
> > >
> > > Thomas
> > >
> > >
> > > On Thu, Nov 8, 2018 at 11:13 AM Xinyu Liu 
> wrote:
> > >>
> > >> 19mb/s throughput is enough for us. Seems the bottleneck is the rate
> of RPC calls. Our message size is usually 1k ~ 10k. So if we can reach
> 19mb/s, we will be able to process ~4k qps, that meets our requirements. I
> guess increasing the size of the bundles will help. Do you guys have any
> results from running python with Flink? We are curious about the
> performance there.
> > >>
> > >> Thanks,
> > >> Xinyu
> > >>
> > >> On Thu, Nov 8, 2018 at 10:13 AM Lukasz Cwik  wrote:
> > >>>
> > >>> This benchmark[1] shows that Python is getting about 19mb/s.
> > >>>
> > >>> Yes, running more python sdk_worker processes will improve
> performance since Python is limited to a single CPU core.
> > >>>
> > >>> [1]
> https://performance-dot-grpc-testing.appspot.com/explore?dashboard=5652536396611584=490377658=1286539696
> > >>>
> > >>>
> > >>>
> > >>> On Wed, Nov 7, 2018 at 5:24 PM Xinyu Liu 
> wrote:
> > 
> >  By looking at the gRPC dashboard published by the benchmark[1], it
> seems the streaming ping-pong operations per second for gRPC in python is
> around 2k ~ 3k qps. This seems quite low compared to gRPC performance in
> other languages, e.g. 600k qps for Java and Go. Is it expected to run
> multiple sdk_worker processes to improve performance?
> > 
> >  [1]
> https://performance-dot-grpc-testing.appspot.com/explore?dashboard=5652536396611584=713624174=1012810333
> > 
> >  On Wed, Nov 7, 2018 at 11:14 AM Lukasz Cwik 
> wrote:
> > >
> > > gRPC folks provide a bunch of benchmarks for different scenarios:
> https://grpc.io/docs/guides/benchmarking.html
> > > You would be most interested in the streaming throughput
> benchmarks since the Data API is written on top of the gRPC streaming APIs.
> > >
> > > 200KB/s does seem pretty small. Have you captured any Python
> profiles[1] and looked at them?
> > >
> > > 1:
> https://lists.apache.org/thread.html/f8488faede96c65906216c6b4bc521385abeddc1578c99b85937d2f2@%3Cdev.beam.apache.org%3E
> > >
> > >
> > > On Wed, Nov 7, 2018 at 10:18 AM Hai Lu  wrote:
> > >>
> > >> Hi,
> > >>
> > >> This is Hai from LinkedIn. I'm currently working on Portable API
> for Samza Runner. I was able to make Python work with Samza container
> reading from Kafka. However, I'm seeing severe performance issue with 

Re: [VOTE] Mark 2.7.0 branch as a long term support (LTS) branch

2018-11-09 Thread Chamikara Jayalath
+1

Thanks,
Cham

On Fri, Nov 9, 2018 at 9:31 AM Udi Meiri  wrote:

> +1
>
> On Fri, Nov 9, 2018 at 8:31 AM Maximilian Michels  wrote:
>
>> +1
>>
>> On 09.11.18 09:38, Robert Bradshaw wrote:
>> > +1 approve.
>> > On Fri, Nov 9, 2018 at 2:47 AM Ahmet Altay  wrote:
>> >>
>> >> Hi all,
>> >>
>> >> Please review the following statement:
>> >>
>> >> "2.7.0 branch will be marked as the long-term-support (LTS) release
>> branch. This branch will be supported for a window of 6 months starting
>> from the day it is marked as an LTS branch. Beam community will decide on
>> which issues will be backported and when patch releases on the branch will
>> be made on a case by case basis.
>> >>
>> >> [ ] +1, Approve
>> >> [ ] -1, Do not approve (please provide specific comments)
>> >>
>> >> Context:
>> >> - Discussion on the dev@ list [1].
>> >> - Beam's release policies [2].
>> >>
>> >> Thank you,
>> >> Ahmet
>> >>
>> >> [1]
>> https://lists.apache.org/thread.html/f604b2d688ad467ddccd4cf56b664b7309dae78f1bd8849e4bb9aae2@%3Cdev.beam.apache.org%3E
>> >> [2] https://beam.apache.org/community/policies/
>> >>
>> >>
>> >>
>>
>


Re: [VOTE] Mark 2.7.0 branch as a long term support (LTS) branch

2018-11-09 Thread Udi Meiri
+1

On Fri, Nov 9, 2018 at 8:31 AM Maximilian Michels  wrote:

> +1
>
> On 09.11.18 09:38, Robert Bradshaw wrote:
> > +1 approve.
> > On Fri, Nov 9, 2018 at 2:47 AM Ahmet Altay  wrote:
> >>
> >> Hi all,
> >>
> >> Please review the following statement:
> >>
> >> "2.7.0 branch will be marked as the long-term-support (LTS) release
> branch. This branch will be supported for a window of 6 months starting
> from the day it is marked as an LTS branch. Beam community will decide on
> which issues will be backported and when patch releases on the branch will
> be made on a case by case basis.
> >>
> >> [ ] +1, Approve
> >> [ ] -1, Do not approve (please provide specific comments)
> >>
> >> Context:
> >> - Discussion on the dev@ list [1].
> >> - Beam's release policies [2].
> >>
> >> Thank you,
> >> Ahmet
> >>
> >> [1]
> https://lists.apache.org/thread.html/f604b2d688ad467ddccd4cf56b664b7309dae78f1bd8849e4bb9aae2@%3Cdev.beam.apache.org%3E
> >> [2] https://beam.apache.org/community/policies/
> >>
> >>
> >>
>




Re: [VOTE] Mark 2.7.0 branch as a long term support (LTS) branch

2018-11-09 Thread Maximilian Michels

+1

On 09.11.18 09:38, Robert Bradshaw wrote:

+1 approve.
On Fri, Nov 9, 2018 at 2:47 AM Ahmet Altay  wrote:


Hi all,

Please review the following statement:

"2.7.0 branch will be marked as the long-term-support (LTS) release branch. 
This branch will be supported for a window of 6 months starting from the day it is 
marked as an LTS branch. Beam community will decide on which issues will be 
backported and when patch releases on the branch will be made on a case by case 
basis.

[ ] +1, Approve
[ ] -1, Do not approve (please provide specific comments)

Context:
- Discussion on the dev@ list [1].
- Beam's release policies [2].

Thank you,
Ahmet

[1] 
https://lists.apache.org/thread.html/f604b2d688ad467ddccd4cf56b664b7309dae78f1bd8849e4bb9aae2@%3Cdev.beam.apache.org%3E
[2] https://beam.apache.org/community/policies/





Re: [DISCUSS] More precision supported by DATETIME field in Schema

2018-11-09 Thread Robert Bradshaw
We *might* have a few bits left in the WindowedValue representation to
make this backwards compatible if we really wanted.

The use of java.time.Instant means that we won't be able to upgrade
(even in v3) our internal timestamps to match without either
internally supporting >64 bits of precision or limiting the date
range. But using the standard Java time does make a lot of sense.
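
For concreteness, the precision gap under discussion (java.time.Instant keeps
epoch-seconds plus a nanosecond-of-second field, while Joda's Instant is a
single millis-since-epoch long); the last line below is where the truncation
happens (small self-contained example, joda-time on the classpath assumed):

import java.time.Instant;

public class PrecisionDemo {
  public static void main(String[] args) {
    Instant fine = Instant.ofEpochSecond(1541721600L, 123_456_789L);
    System.out.println(fine);          // 2018-11-09T00:00:00.123456789Z

    // Rounding down to Joda-style millis-since-epoch drops sub-millisecond detail.
    long millis = fine.toEpochMilli(); // 1541721600123
    System.out.println(new org.joda.time.Instant(millis)); // 2018-11-09T00:00:00.123Z
  }
}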
On Fri, Nov 9, 2018 at 12:33 AM Rui Wang  wrote:
>
> https://github.com/apache/beam/pull/6991
>
> I am using java.time.Instant as the internal representation to replace Joda 
> time for DateTime field in the PR. java.time.Instant uses a long to save 
> seconds-after-epoch and an int to save nanoseconds-of-second. Therefore 64 
> bits are fully used for seconds-after-epoch, which loses nothing.
>
> Comments are very welcome to this PR.
>
> -Rui
>
> On Wed, Nov 7, 2018 at 1:15 AM Reuven Lax  wrote:
>>
>> As you said, this would be update incompatible across all streaming 
>> pipelines. At the very least this would be a big problem for Dataflow users, 
>> and I believe many Flink users as well. I'm not sure the benefit here 
>> justifies causing problems for so many users.
>>
>> Reuven
>>
>> On Wed, Nov 7, 2018 at 4:56 PM Robert Bradshaw  wrote:
>>>
>>> Yes, microseconds is a good compromise for covering a long enough
>>> timespan that there's little reason it could be hit (even for
>>> processing historical data).
>>>
>>> Regarding backwards compatibility, could we just change the internal
>>> representation of Beam's element timestamps, possibly with new APIs to
>>> access the finer granularity? (True, it may not be upgrade
>>> compatible.)
>>> On Tue, Nov 6, 2018 at 8:46 PM Reuven Lax  wrote:
>>> >
>>> > The main difference (though possibly theoretical) is when time runs out. 
>>> > With 64 bits and nanosecond precision, we can only represent times about 
>>> > 244 years in the future (or the past).
>>> >
>>> > On Tue, Nov 6, 2018 at 11:30 AM Kenneth Knowles  wrote:
>>> >>
>>> >> I like nanoseconds as extremely future-proof. What about specing this 
>>> >> out in stages (1) domain of values (2) portable encoding that can 
>>> >> represent those values (3) language-specific types to embed the values 
>>> >> in.
>>> >>
>>> >> 1. If it is a nanosecond-precision absolute time, and we eventually want 
>>> >> to migrate event time timestamps to match, then we need values for "end 
>>> >> of global window" and "end of time". TBH I am not sure we need both of 
>>> >> these any more. We can either define a max on the nanosecond range or 
>>> >> create distinguished values.
>>> >>
>>> >> 2. For portability, presumably an order-preserving integer encoding of 
>>> >> nanoseconds since epoch with whatever tweaks to allow for representing 
>>> >> the end of time. It might be useful to find a way to allow multiple. Not 
>>> >> super useful at a particular version, but might have given us a 
>>> >> migration path. It would also allow experiments for performance.
>>> >>
>>> >> 3. We could probably find a way to keep user-facing API compatibility 
>>> >> here while increasing underlying precision at 1 and 2, but it's probably 
>>> >> not worth it. A new Java type IMO addresses the lossiness issue because 
>>> >> a user would have to explicitly request truncation to assign to a millis 
>>> >> event time timestamp.
>>> >>
>>> >> Kenn
>>> >>
>>> >> On Tue, Nov 6, 2018 at 12:55 AM Charles Chen  wrote:
>>> >>>
>>> >>> Is the proposal to do this for both Beam Schema DATETIME fields as well 
>>> >>> as for Beam timestamps in general?  The latter likely has a bunch of 
>>> >>> downstream consequences for all runners.
>>> >>>
>>> >>> On Tue, Nov 6, 2018 at 12:38 AM Ismaël Mejía  wrote:
>>> 
>>>  +1 to more precision even to the nano level, probably via Reuven's
>>>  proposal of a different internal representation.
>>>  On Tue, Nov 6, 2018 at 9:19 AM Robert Bradshaw  
>>>  wrote:
>>>  >
>>>  > +1 to offering more granular timestamps in general. I think it will 
>>>  > be
>>>  > odd if setting the element timestamp from a row DATETIME field is
>>>  > lossy, so we should seriously consider upgrading that as well.
>>>  > On Tue, Nov 6, 2018 at 6:42 AM Charles Chen  wrote:
>>>  > >
>>>  > > One related issue that came up before is that we (perhaps 
>>>  > > unnecessarily) restrict the precision of timestamps in the Python 
>>>  > > SDK to milliseconds because of legacy reasons related to the Java 
>>>  > > runner's use of Joda time.  Perhaps Beam portability should 
>>>  > > natively use a more granular timestamp unit.
>>>  > >
>>>  > > On Mon, Nov 5, 2018 at 9:34 PM Rui Wang  wrote:
>>>  > >>
>>>  > >> Thanks Reuven!
>>>  > >>
>>>  > >> I think Reuven gives the third option:
>>>  > >>
>>>  > >> Change internal representation of DATETIME field in Row. Still 
>>>  > >> keep public ReadableDateTime getDateTime(String fieldName) API to 
>>>  > >> be 

Re: [VOTE] Mark 2.7.0 branch as a long term support (LTS) branch

2018-11-09 Thread Robert Bradshaw
+1 approve.
On Fri, Nov 9, 2018 at 2:47 AM Ahmet Altay  wrote:
>
> Hi all,
>
> Please review the following statement:
>
> "2.7.0 branch will be marked as the long-term-support (LTS) release branch. 
> This branch will be supported for a window of 6 months starting from the day 
> it is marked as an LTS branch. Beam community will decide on which issues 
> will be backported and when patch releases on the branch will be made on a 
> case by case basis.
>
> [ ] +1, Approve
> [ ] -1, Do not approve (please provide specific comments)
>
> Context:
> - Discussion on the dev@ list [1].
> - Beam's release policies [2].
>
> Thank you,
> Ahmet
>
> [1] 
> https://lists.apache.org/thread.html/f604b2d688ad467ddccd4cf56b664b7309dae78f1bd8849e4bb9aae2@%3Cdev.beam.apache.org%3E
> [2] https://beam.apache.org/community/policies/
>
>
>