Re: [DISCUSS] Portability representation of schemas

2019-05-09 Thread Robert Bradshaw
From: Kenneth Knowles 
Date: Thu, May 9, 2019 at 10:05 AM
To: dev

> This is a huge development. Top posting because I can be more compact.
>
> I really think after the initial idea converges this needs a design doc with 
> goals and alternatives. It is an extraordinarily consequential model change. 
> So in the spirit of doing the work / bias towards action, I created a quick 
> draft at https://s.apache.org/beam-schemas and added everyone on this thread 
> as editors. I am still in the process of writing this to match the thread.

Thanks! Added some comments there.

> *Multiple timestamp resolutions*: you can use logical types to represent 
> nanos the same way Java and proto do.

As per the other discussion, I'm not sure the value of supporting
multiple timestamp resolutions is high enough to outweigh the cost.

> *Why multiple int types?* The domain of values for these types are different. 
> For a language with one "int" or "number" type, that's another domain of 
> values.

What is the value in having different domains? If your data has a
natural domain, chances are it doesn't line up exactly with one of
these. I guess it's for languages whose types have specific domains?
(There's also compactness of representation, both encoded and in-memory,
though I'm not sure that benefit is significant.)

> *Columnar/Arrow*: making sure we unlock the ability to take this path is 
> paramount. So tying it directly to a row-oriented coder seems 
> counterproductive.

I don't think Coders are necessarily row-oriented. They are, however,
bytes-oriented. (Perhaps they need not be.) There seems to be a lot of
overlap between what Coders express in terms of element typing
information and what Schemas express, and I'd rather have one concept
if possible. Or have a clear division of responsibilities.

> *Multimap*: what does it add over an array-valued map or 
> large-iterable-valued map? (honest question, not rhetorical)

Multimap has a different notion of what it means to contain a value,
can handle (unordered) unions of non-disjoint keys, etc. Maybe this
isn't worth a new primitive type.

> *URN/enum for type names*: I see the case for both. The core types are 
> fundamental enough they should never really change - after all, proto, 
> thrift, avro, arrow, have addressed this (not to mention most programming 
> languages). Maybe additions once every few years. I prefer the smallest 
> intersection of these schema languages. A oneof is more clear, while URN 
> emphasizes the similarity of built-in and logical types.

Hmm... Do we have any examples of the multi-level primitive/logical
type in any of these other systems? I have a bias towards all types
being on the same footing unless there is compelling reason to divide
things into primitive/user-defined ones.

Here it seems like the most essential value of the primitive type set
is to describe the underlying representation, for encoding elements in
a variety of ways (notably columnar, but also interfacing with other
external systems like IOs). Perhaps, rather than the previous
suggestion of making everything a logical type of bytes, this could be made
clear by still making everything a logical type, but renaming
"TypeName" to Representation. There would be URNs (typically with
empty payloads) for the various primitive types (whose mapping to
their representations would be the identity).
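
To make that concrete, here is a minimal sketch (all names are
hypothetical, not the actual Beam protos or classes) of what "everything
is a logical type over a representation" could look like:

  // Hypothetical shape only: every field type is a URN (usually with an
  // empty payload) plus a Representation describing its physical layout.
  public class FieldTypeSketch {

    // The small, closed set of physical representations
    // (what "TypeName" would become).
    enum Representation { INT64, DOUBLE, STRING, BYTES, BOOLEAN, ROW, ARRAY, MAP }

    static final class LogicalFieldType {
      final String urn;           // e.g. "beam:type:int64:v1" (made-up URN)
      final byte[] payload;       // typically empty for the built-in types
      final Representation representation;

      LogicalFieldType(String urn, byte[] payload, Representation representation) {
        this.urn = urn;
        this.payload = payload;
        this.representation = representation;
      }
    }

    // Built-in types map to their representation via the identity.
    static LogicalFieldType int64() {
      return new LogicalFieldType("beam:type:int64:v1", new byte[0], Representation.INT64);
    }

    // A timestamp logical type reuses INT64 rather than adding a new primitive.
    static LogicalFieldType timestampMillis() {
      return new LogicalFieldType(
          "beam:type:timestamp_millis:v1", new byte[0], Representation.INT64);
    }
  }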

- Robert


Re: [DISCUSS] Portability representation of schemas

2019-05-08 Thread Robert Bradshaw
From: Reuven Lax 
Date: Wed, May 8, 2019 at 10:36 PM
To: dev

> On Wed, May 8, 2019 at 1:23 PM Robert Bradshaw  wrote:
>>
>> Very excited to see this. In particular, I think this will be very
>> useful for cross-language pipelines (not just SQL, but also for
>> describing non-trivial data (e.g. for source and sink reuse)).
>>
>> The proto specification makes sense to me. The only thing that looks
>> like it's missing (other than possibly iterable, for arbitrarily-large
>> support) is multimap. Another basic type, should we want to support
>> it, is union (though this of course can get messy).
>
> multimap is an interesting suggestion. Do you have a use case in mind?
>
> union (or oneof) is also a good suggestion. There are good use cases for 
> this, but this is a more fundamental change.

No specific use case; they just seemed to round out the options.

>> I'm curious what the rationale was for going with a oneof for type_info
>> rather than a repeated components field like we do with coders.
>
> No strong reason. Do you think repeated components is better than oneof?

It's more consistent with how we currently do coders (which has pros and cons).

>> Removing DATETIME as a primitive in favor of a logical type on top of
>> INT64 may cause issues of insufficient resolution and/or timespan.
>> Similarly with DECIMAL (or would it be backed by string?)
>
> There could be multiple TIMESTAMP types for different resolutions, and they 
> don't all need the same backing field type. E.g. the backing type for 
> nanoseconds could be Row(INT64, INT64), or it could just be a byte array.

Hmm... What would the value be in supporting different types of
timestamps? Would all SDKs have to support all of them? Can one
compare, take differences, etc. across timestamp types? (As Luke
points out, the other conversation on timestamps is likely relevant
here as well.)

>> The biggest question, as far as portability is concerned at least, is
>> the notion of logical types. serialized_class is clearly not portable,
>> and I also think we'll want a way to share semantic meaning across
>> SDKs (especially if things like dates become logical types). Perhaps
>> URNs (+payloads) would be a better fit here?
>
> Yes, URN + payload is probably the better fit for portability.
>
>> Taking a step back, I think it's worth asking why we have different
>> types, rather than simply making everything a LogicalType of bytes
>> (aka coder). Other than encoding format, the answer I can come up with
>> is that the type decides the kinds of operations that can be done on
>> it, e.g. does it support comparison? Arithmetic? Containment?
>> Higher-level date operations? Perhaps this should be used to guide the
>> set of types we provide.
>
> Also even though we could make everything a LogicalType (though at least byte 
> array would have to stay primitive), I think  it's useful to have a slightly 
> larger set of primitive types.  It makes things easier to understand and 
> debug, and it makes it simpler for the various SDKs to map them to their 
> types (e.g. mapping to POJOs).

This would be the case if one didn't have LogicalType at all, but once
one introduces it, one has this more complicated two-level hierarchy of
types, which doesn't seem simpler to me.

I'm trying to understand what information Schema encodes that a
NamedTupleCoder (or RowCoder) would/could not. (Coders have the
disadvantage that there are multiple encodings of a single value, e.g.
BigEndian vs. VarInt, but if we have multiple resolutions of timestamp
that would still seem to be an issue. Possibly another advantage is
encoding into non-record-oriented formats, e.g. Parquet or Arrow, that
have a set of primitives.)


Re: [discuss] Reducing log verbosity for Python failures?

2019-05-08 Thread Robert Bradshaw
+1 to making them significantly more compact in most cases.

From: Pablo Estrada 
Date: Wed, May 8, 2019 at 11:35 PM
To: dev

> Hello all,
> Some tests in Python have the problem that when they fail, lots of internal 
> logging is dumped onto stdout, and we end up having to scroll way up to find 
> the actual stack trace for the failed test. This logging, as far as I can 
> tell, is dumping of fn api protos.
>
> Does anyone use these logs to look into the test failure? I would like to 
> find a way to make these more compact, or maybe just stop logging them 
> (people who need them can choose to log them in their local setup?).
>
> I lean towards making them more compact (by, for instance, writing functions 
> that log their information in a more compact fashion); but I would like to 
> hear thoughts from others.
>
> So thoughts? : )
> -P.


Re: Python SDK timestamp precision

2019-05-08 Thread Robert Bradshaw
From: Kenneth Knowles 
Date: Wed, May 8, 2019 at 6:50 PM
To: dev

> This got pretty long, but I don't yet want to end it, because there's not 
> quite yet a solution that will allow a user to treat timestamps from most 
> systems as Beam timestamps.

+1, it'd be really nice to find a solution to this.

> I'm cutting pieces just to make inline replies easier to read.
>
> On Tue, Apr 23, 2019 at 9:03 AM Robert Bradshaw  wrote:
>>
>> On Tue, Apr 23, 2019 at 4:20 PM Kenneth Knowles  wrote:
>> >  -  WindowFn must receive exactly the data that came from the user's data 
>> > source. So that cannot be rounded.
>> >  - The user's WindowFn assigns to a window, so it can contain arbitrary 
>> > precision as it should be grouped as bytes.
>> >  - End of window, timers, watermark holds, etc, are all treated only as 
>> > bounds, so can all be rounded based on their use as an upper or lower 
>> > bound.
>> >
>> > We already do a lot of this - Pubsub publish timestamps are microsecond 
>> > precision (you could say our current connector constitutes data loss) as 
>> > are Windmill timestamps (since these are only combines of Beam timestamps 
>> > here there is no data loss). There are undoubtedly some corner cases I've 
>> > missed, and naively this might look like duplicating timestamps so that 
>> > could be an unacceptable performance concern.
>>
>> If I understand correctly, in this scheme WindowInto assignment is
>> parameterized by a function that specifies how to parse/extract the
>> timestamp from the data element (maybe just a field specifier for
>> schema'd data) rather than store the (exact) timestamp in a standard
>> place in the WindowedValue, and the window merging always goes back to
>> the SDK rather than the possibility of it being handled runner-side.
>
> This sounds promising. You could also store the extracted approximate 
> timestamp somewhere, of course.
>
>> Even if the runner doesn't care about interpreting the window, I think
>> we'll want to have compatible window representations (and timestamp
>> representations, and windowing fns) across SDKs (especially for
>> cross-language) which favors choosing a consistent resolution.
>>
>> The end-of-window, for firing, can be approximate, but it seems it
>> should be exact for timestamp assignment of the result (and similarly
>> with the other timestamp combiners).
>
> I was thinking that the window itself should be stored as exact data, while 
> just the firing itself is approximated, since it already is, because of 
> watermarks and timers.

I think this works where we can compare encoded windows, but some
portable interpretation of windows is required for runner-side
implementation of merging windows (for example).

There may also be issues if windows (or timestamps) are assigned to a
high precision in one SDK, then inspected/acted on in another SDK, and
then passed back to the original SDK where the truncation would be
visible.

> You raise a good point that min/max timestamp combiners require actually 
> understanding the higher-precision timestamp. I can think of a couple things 
> to do. One is the old "standardize all 3 or 4 precisions we need" and the 
> other is that combiners other than EOW exist primarily to hold the watermark, 
> and that hold does not require the original precision. Still, neither of 
> these is that satisfying.

In the current model, the output timestamp is user-visible.

>> > A correction: Java *now* uses nanoseconds [1]. It uses the same breakdown 
>> > as proto (int64 seconds since epoch + int32 nanos within second). It has 
>> > legacy classes that use milliseconds, and Joda itself now encourages 
>> > moving back to Java's new Instant type. Nanoseconds should complicate the 
>> > arithmetic only for the one person authoring the date library, which they 
>> > have already done.
>>
>> The encoding and decoding need to be done in a language-consistent way
>> as well.
>
> I honestly am not sure what you mean by "language-consistent" here.

If we want to make reading and writing of timestamps and windows
cross-language, we can't rely on language-specific libraries to do the
encoding.
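
As a hedged illustration of what "language-consistent" could mean in
practice (the layout below is just an example, not a proposal): every SDK
would have to reproduce the same raw byte layout rather than deferring to
Joda, java.time, or Python's datetime.

  import java.nio.ByteBuffer;

  // Example only: a fixed big-endian int64-microseconds layout that any SDK
  // could encode and decode byte-for-byte without a date/time library.
  class TimestampEncodingSketch {
    static byte[] encodeMicros(long microsSinceEpoch) {
      return ByteBuffer.allocate(8).putLong(microsSinceEpoch).array(); // big-endian by default
    }

    static long decodeMicros(byte[] bytes) {
      return ByteBuffer.wrap(bytes).getLong();
    }
  }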

>> Also, most date libraries don't have division, etc., operators, so
>> we have to do that as well. Not that it should be *that* hard.
>
> If the libraries dedicated to time handling haven't found it needful, is 
> there a specific reason you raise this? We do some simple math to find the 
> window things fall into; is that it?

Yes. E.g.

https://github.com/apache/beam/blob
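
A sketch of the kind of arithmetic in question (in the spirit of
fixed-window assignment; not the linked file verbatim):

  // Sketch only: assigning a timestamp to a fixed window needs integer
  // division/modulo on the raw timestamp value, which typical date/time
  // libraries don't expose as first-class operations.
  class FixedWindowMathSketch {
    /** Start of the fixed window of the given size/offset containing the timestamp. */
    static long windowStartMicros(long timestampMicros, long sizeMicros, long offsetMicros) {
      long shifted = timestampMicros - offsetMicros;
      // Floor semantics so timestamps before the epoch still land in the right window.
      return shifted - Math.floorMod(shifted, sizeMicros) + offsetMicros;
    }

    public static void main(String[] args) {
      // 90s after the epoch falls in the [60s, 120s) window for 60s fixed windows.
      System.out.println(windowStartMicros(90_000_000L, 60_000_000L, 0L)); // prints 60000000
    }
  }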

Re: [DISCUSS] Portability representation of schemas

2019-05-08 Thread Robert Bradshaw
Very excited to see this. In particular, I think this will be very
useful for cross-language pipelines (not just SQL, but also for
describing non-trivial data (e.g. for source and sink reuse)).

The proto specification makes sense to me. The only thing that looks
like it's missing (other than possibly iterable, for arbitrarily-large
support) is multimap. Another basic type, should we want to support
it, is union (though this of course can get messy).

I'm curious what the rationale was for going with a oneof for type_info
rather than a repeated components field like we do with coders.

Removing DATETIME as a primitive in favor of a logical type on top of
INT64 may cause issues of insufficient resolution and/or timespan.
Similarly with DECIMAL (or would it be backed by string?)

The biggest question, as far as portability is concerned at least, is
the notion of logical types. serialized_class is clearly not portable,
and I also think we'll want a way to share semantic meaning across
SDKs (especially if things like dates become logical types). Perhaps
URNs (+payloads) would be a better fit here?


Taking a step back, I think it's worth asking why we have different
types, rather than simply making everything a LogicalType of bytes
(aka coder). Other than encoding format, the answer I can come up with
is that the type decides the kinds of operations that can be done on
it, e.g. does it support comparison? Arithmetic? Containment?
Higher-level date operations? Perhaps this should be used to guide the
set of types we provide.

(Also, +1 to optional over nullable.)


From: Reuven Lax 
Date: Wed, May 8, 2019 at 6:54 PM
To: dev

> Beam Java's support for schemas is just about done: we infer schemas from a 
> variety of types, we have a variety of utility transforms (join, aggregate, 
> etc.) for schemas, and schemas are integrated with the ParDo machinery. The 
> big remaining task I'm working on is writing documentation and examples for 
> all of this so that users are aware. If you're interested, these slides from 
> the London Beam meetup show a bit more how schemas can be used and how they 
> simplify the API.
>
> I want to start integrating schemas into portability so that they can be used 
> from other languages such as Python (in particular this will also allow 
> BeamSQL to be invoked from other languages). In order to do this, the Beam 
> portability protos must have a way of representing schemas. Since this has 
> not been discussed before, I'm starting this discussion now on the list.
>
> As a reminder: a schema represents the type of a PCollection as a collection 
> of fields. Each field has a name, an id (position), and a field type. A field 
> type can be either a primitive type (int, long, string, byte array, etc.), a 
> nested row (itself with a schema), an array, or a map.
>
> We also support logical types. A logical type is a way for the user to embed 
> their own types in schema fields. A logical type is always backed by a schema 
> type, and contains a function for mapping the user's logical type to the 
> field type. You can think of this as a generalization of a coder: while a 
> coder always maps the user type to a byte array, a logical type can map to an 
> int, or a string, or any other schema field type (in fact any coder can 
> always be used as a logical type for mapping to byte-array field types). 
> Logical types are used extensively by Beam SQL to represent SQL types that 
> have no correspondence in Beam's field types (e.g. SQL has 4 different 
> date/time types). Logical types for Beam schemas have a lot of similarities 
> to AVRO logical types.
>
> An initial proto representation for schemas is here. Before we go further 
> with this, I would like community consensus on what this representation 
> should be. I can start by suggesting a few possible changes to this 
> representation (and hopefully others will suggest others):
>
> Kenn Knowles has suggested removing DATETIME as a primitive type, and instead 
> making it a logical type backed by INT64 as this keeps our primitive types 
> closer to "classical" PL primitive types. This also allows us to create 
> multiple versions of this type - e.g. TIMESTAMP(millis), TIMESTAMP(micros), 
> TIMESTAMP(nanos).
> If we do the above, we can also consider removing DECIMAL and making that a 
> logical type as well.
> The id field is currently used for some performance optimizations only. If we 
> formalized the idea of schema types having ids, then we might be able to use 
> this to allow self-recursive schemas (self-recursive types are not currently 
> allowed).
> Beam Schemas currently have an ARRAY type. However Beam supports "large 
> iterables" (iterables that don't fit in memory that the runner can page in), 
> and this doesn't match well to arrays. I think we need to add an ITERABLE 
> type as well to support things like GroupByKey results.
>
> It would also be interesting to explore allowing well-known metadata tags on 
> fields that Beam interprets. e.g. key and 

Re: Artifact staging in cross-language pipelines

2019-05-07 Thread Robert Bradshaw
Looking forward to your writeup, Max. In the meantime, some comments below.


From: Lukasz Cwik 
Date: Thu, May 2, 2019 at 6:45 PM
To: dev

>
>
> On Thu, May 2, 2019 at 7:20 AM Robert Bradshaw  wrote:
>>
>> On Sat, Apr 27, 2019 at 1:14 AM Lukasz Cwik  wrote:
>> >
>> > We should stick with URN + payload + artifact metadata[1] where the only 
>> > mandatory one that all SDKs and expansion services understand is the 
>> > "bytes" artifact type. This allows us to add optional URNs for file://, 
>> > http://, Maven, PyPi, ... in the future. I would make the artifact staging 
>> > service use the same URN + payload mechanism to get compatibility of 
>> > artifacts across the different services and also have the artifact staging 
>> > service be able to be queried for the list of artifact types it supports.
>>
>> +1
>>
>> > Finally, we would need to have environments enumerate the artifact types 
>> > that they support.
>>
>> Meaning at runtime, or as another field statically set in the proto?
>
>
> I don't believe runners/SDKs should have to know what artifacts each 
> environment supports at runtime and instead have environments enumerate them 
> explicitly in the proto. I have been thinking about a more general 
> "capabilities" block on environments which allow them to enumerate URNs that 
> the environment understands. This would include artifact type URNs, 
> PTransform URNs, coder URNs, ... I haven't proposed anything specific down 
> this line yet because I was wondering how environment resources (CPU, min 
> memory, hardware like GPU, AWS/GCP/Azure/... machine types) should/could tie 
> into this.
>
>>
>> > Having everyone have the same "artifact" representation would be 
>> > beneficial since:
>> > a) Python environments could install dependencies from a requirements.txt 
>> > file (something that the Google Cloud Dataflow Python docker container 
>> > allows for today)
>> > b) It provides an extensible and versioned mechanism for SDKs, 
>> > environments, and artifact staging/retrieval services to support 
>> > additional artifact types
>> > c) Allow for expressing a canonical representation of an artifact like a 
>> > Maven package so a runner could merge environments that the runner deems 
>> > compatible.
>> >
>> > The flow I could see is:
>> > 1) (optional) query artifact staging service for supported artifact types
>> > 2) SDK request expansion service to expand transform passing in a list of 
>> > artifact types the SDK and artifact staging service support, the expansion 
>> > service returns a list of artifact types limited to those supported types 
>> > + any supported by the environment
>>
>> The crux of the issue seems to be how the expansion service returns
>> the artifacts themselves. Is this going with the approach that the
>> caller of the expansion service must host an artifact staging service?
>
>
> The caller would not need to host an artifact staging service (but would 
> become effectively a proxy service, see my comment below for more details) as 
> I would have expected this to be part of the expansion service response.
>
>>
>> There is also the question of how the returned artifacts get
>> attached to the various environments, or whether they get implicitly
>> applied to all returned stages (which need not have a consistent
>> environment)?
>
>
> I would suggest returning additional information that says what artifact is 
> for which environment. Applying all artifacts to all environments is likely 
> to cause issues since some environments may not understand certain artifact 
> types or may get conflicting versions of artifacts. I would see this 
> happening since an expansion service that aggregates other expansion services 
> seems likely, for example:
>                              /-> ExpansionService(Python)
> ExpansionService(Aggregator) --> ExpansionService(Java)
>                              \-> ExpansionService(Go)

All of this goes back to the idea that I think the listing of
artifacts (or more general dependencies) should be a property of the
environments themselves.

>> > 3) SDK converts any artifact types that the artifact staging service or 
>> > environment doesn't understand, e.g. pulls down Maven dependencies and 
>> > converts them to "bytes" artifacts
>>
>> Here I think we're conflating two things. The "type" of an artifact is
>> both (1) how to fetch the bytes and (2) how to interpret them (

Re: [DISCUSS] Performance of Beam compare to "Bare Runner"

2019-05-03 Thread Robert Bradshaw
On Fri, May 3, 2019 at 9:29 AM Viliam Durina  wrote:
>
> > you MUST NOT mutate your inputs
> I think it's enough to not mutate the inputs after you emit them. From this 
> follows that when you receive an input, the upstream vertex will not try to 
> mutate it in parallel. This is what Hazelcast Jet expects. We have no option 
> to automatically clone objects after each step.

There's also the case of sibling fusion. E.g. if your graph looks like

   ---> B
 /
A
 \
   ---> C

which all gets fused together, then both B and C are applied to each
output of A, which means it is not safe for B or C to mutate its input,
lest the sibling (whichever is applied second) see the mutation.
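
A hedged illustration of the hazard (the DoFns below are hypothetical and
purely illustrative; a real pipeline would also need a coder for the
mutable element type): when B and C are fused siblings on A's output, both
receive the same object, so whichever runs second observes the other's
mutation.

  import org.apache.beam.sdk.transforms.DoFn;

  class SiblingMutationExample {
    static class B extends DoFn<StringBuilder, String> {
      @ProcessElement
      public void process(@Element StringBuilder e, OutputReceiver<String> out) {
        e.append("-seen-by-B");  // MUST NOT do this: mutates A's output in place
        out.output(e.toString());
      }
    }

    static class C extends DoFn<StringBuilder, String> {
      @ProcessElement
      public void process(@Element StringBuilder e, OutputReceiver<String> out) {
        // If B ran first within the fused bundle, e already ends in "-seen-by-B".
        out.output(e.toString());
      }
    }
  }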

> On Thu, 2 May 2019 at 20:01, Maximilian Michels  wrote:
>>
>> > I am not sure what are you referring to here. What do you mean Kryo is 
>> > simply slower ... Beam Kryo or Flink Kryo or?
>>
>> Flink uses Kryo as a fallback serializer when its own type serialization
>> system can't analyze the type. I'm just guessing here that this could be
>> slower.
>>
>> On 02.05.19 16:51, Jozef Vilcek wrote:
>> >
>> >
>> > On Thu, May 2, 2019 at 3:41 PM Maximilian Michels > > <mailto:m...@apache.org>> wrote:
>> >
>> > Thanks for the JIRA issues Jozef!
>> >
>> >  > So the feature in Flink is operator chaining and Flink per
>> > default initiate copy of input elements. In case of Beam coders copy
>> > seems to be more noticeable than native Flink.
>> >
>> > Copying between chained operators can be turned off in the
>> > FlinkPipelineOptions (if you know what you're doing).
>> >
>> >
>> > Yes, I know that it can be instructed to reuse objects (if you are
>> > referring to this). I am just not sure I want to open this door in
>> > general :)
>> > But it is interesting to learn, that with portability, this will be
>> > turned On per default. Quite important finding imho.
>> >
>> > Beam coders should
>> > not be slower than Flink's. They are simply wrapped. It seems Kryo is
>> > simply slower which we could fix by providing more type hints to Flink.
>> >
>> >
>> > I am not sure what are you referring to here. What do you mean Kryo is
>> > simply slower ... Beam Kryo or Flink Kryo or?
>> >
>> > -Max
>> >
>> > On 02.05.19 13:15, Robert Bradshaw wrote:
>> >  > Thanks for filing those.
>> >  >
>> >  > As for how not doing a copy is "safe," it's not really. Beam simply
>> >  > asserts that you MUST NOT mutate your inputs (and direct runners,
>> >  > which are used during testing, do perform extra copies and checks to
>> >  > catch violations of this requirement).
>> >  >
>> >  > On Thu, May 2, 2019 at 1:02 PM Jozef Vilcek
>> > mailto:jozo.vil...@gmail.com>> wrote:
>> >  >>
>> >  >> I have created
>> >  >> https://issues.apache.org/jira/browse/BEAM-7204
>> >  >> https://issues.apache.org/jira/browse/BEAM-7206
>> >  >>
>> >  >> to track these topics further
>> >  >>
>> >  >> On Wed, May 1, 2019 at 1:24 PM Jozef Vilcek
>> > mailto:jozo.vil...@gmail.com>> wrote:
>> >  >>>
>> >  >>>
>> >  >>>
>> >  >>> On Tue, Apr 30, 2019 at 5:42 PM Kenneth Knowles
>> > mailto:k...@apache.org>> wrote:
>> >  >>>>
>> >  >>>>
>> >  >>>>
>> >  >>>> On Tue, Apr 30, 2019, 07:05 Reuven Lax > > <mailto:re...@google.com>> wrote:
>> >  >>>>>
>> >  >>>>> In that case, Robert's point is quite valid. The old Flink
>> > runner I believe had no knowledge of fusion, which was known to make
>> > it extremely slow. A lot of work went into making the portable
>> > runner fusion aware, so we don't need to round trip through coders
>> > on every ParDo.
>> >  >>>>
>> >  >>>>
>> >  >>>> The old Flink runner got fusion for free, since Flink does it.
>> > The new fusion in portability is because fusing the runner side of
>> > portability steps does not achieve real fusion
>>

Re: [Discuss] Publishing pre-release artifacts to repositories

2019-05-02 Thread Robert Bradshaw
On Thu, May 2, 2019 at 6:03 PM Michael Luckey  wrote:
>
> Yes, I understood this. But I m personally more paranoid about releasing.
>
> So formally vote (and corresponding testing) was done on rc. If we rebuild 
> and resign, wouldn't that mean we also need to revote?

Yeah, that's the sticking point. I suppose we could build the packages
with rc tags, push them to pypi, and also build them without rc tags,
and push those (and the full source tarball, which doesn't have an rc
tag either) to svn, and have the vote officially cover what's in svn
but the rc ones are just for convenience. (But, given that I can "pip
install https://svn.apache.org/path/to/tarball" it'd primarily have
value for others doing "pip install --pre".)

This is regardless of whether it is OK per Apache policy to publish such
binary blobs to a third-party place (though IMHO it follows the intent of
the release process).

> If I understand correctly, there will be some changed version string in 
> distributed sources (setup.py?). So there is some binary difference. And just 
> talking about me, doing that repackaging I would certainly mess it up and 
> package some unwanted changes.

We definitely would not want this to be a manual step--I wouldn't
trust myself :).

> On Thu, May 2, 2019 at 5:43 PM Robert Bradshaw  wrote:
>>
>> On Thu, May 2, 2019 at 5:24 PM Michael Luckey  wrote:
>> >
>> > Thanks Ahmet for calling out to the airflow folks. I believe, I am able to 
>> > follow their argument. So from my point of view I do not have an issue 
>> > with apache policy. But honestly still trying to wrap my head around 
>> > Roberts concern with rebuilding/resigning. Currently, our actual release 
>> > is only a tag on source repo and promoting artefacts. Do not yet 
>> > understand how that needs to change to get PyPi included.
>>
>> It's not a big change, but let me clarify.
>>
>> Currently our release preparation goes something like this:
>>
>> 1) Check out the repo, update the versions to 2.x, build and sign the 
>> artifacts.
>> 2) Announce these artifacts as rcN
>> 2a) Push the artifacts to SVN dev/...
>> 2b) Push artifacts to the apache maven repository.
>> 3) Depending on vote, go back to step (1) or forward to step (4).
>> 4) Copy these artifacts as the actual release.
>>
>> Now if we just try to add (2c) Push these artifacts to Pypi, it will
>> be treated (by pypi's tooling, anyone who downloads the tarball, ...)
>> as an actual release. You also can't re-push a tarball with the same
>> name and different contents (the idea being that named releases should
>> never change). So we'd need to change step (1) to update the version
>> to 2.x.rcN *and* add a step in (4) to update the version to 2.x (no rc
>> suffix), rebuild, resign before publishing.
>>
>> As mentioned, possibly the rcN suffix could be part of the building
>> step for Python.
>>
>> > On Wed, May 1, 2019 at 1:33 AM Ahmet Altay  wrote:
>> >>
>> >> Michael, Max and other folks who are concerned about the compatibility 
>> >> with the apache release policy. Does the information in this thread 
>> >> sufficiently address your concerns? Especially the part where, the rc 
>> >> artifacts will be protected by a flag (i.e. --pre) from general 
>> >> consumption.
>> >>
>> >> On Tue, Apr 30, 2019 at 3:59 PM Robert Bradshaw  
>> >> wrote:
>> >>>
>> >>> On Tue, Apr 30, 2019 at 6:11 PM Ahmet Altay  wrote:
>> >>> >
>> >>> > This conversation get quite Python centric. Is there a similar need 
>> >>> > for Java?
>> >>>
>>> I think Java is already covered. Go is a different story (but even the
>>> versioning and releasing is still being worked out).
>> >>>
>> >>> > On Tue, Apr 30, 2019 at 4:54 AM Robert Bradshaw  
>> >>> > wrote:
>> >>> >>
>>> >> If we can, by the apache guidelines, post RCs to pypi that is
>> >>> >> definitely the way to go. (Note that test.pypi is for developing
>> >>> >> against the pypi interface, not for pushing anything real.) The caveat
>> >>> >> about naming these with rcN in the version number still applies
>> >>> >> (that's how pypi guards them against non-explicit installs).
>> >>> >
>> >>> > Related to the caveat, I believe this can be easily scripted or even 
>> >>> > made

Re: [Discuss] Publishing pre-release artifacts to repositories

2019-05-02 Thread Robert Bradshaw
On Thu, May 2, 2019 at 5:24 PM Michael Luckey  wrote:
>
> Thanks Ahmet for calling out to the airflow folks. I believe, I am able to 
> follow their argument. So from my point of view I do not have an issue with 
> apache policy. But honestly still trying to wrap my head around Robert's 
> concern with rebuilding/resigning. Currently, our actual release is only a 
> tag on source repo and promoting artefacts. Do not yet understand how that 
> needs to change to get PyPi included.

It's not a big change, but let me clarify.

Currently our release preparation goes something like this:

1) Check out the repo, update the versions to 2.x, build and sign the artifacts.
2) Announce these artifacts as rcN
2a) Push the artifacts to SVN dev/...
2b) Push artifacts to the apache maven repository.
3) Depending on vote, go back to step (1) or forward to step (4).
4) Copy these artifacts as the actual release.

Now if we just try to add (2c) Push these artifacts to Pypi, it will
be treated (by pypi's tooling, anyone who downloads the tarball, ...)
as an actual release. You also can't re-push a tarball with the same
name and different contents (the idea being that named releases should
never change). So we'd need to change step (1) to update the version
to 2.x.rcN *and* add a step in (4) to update the version to 2.x (no rc
suffix), rebuild, resign before publishing.

As mentioned, possibly the rcN suffix could be part of the building
step for Python.

> On Wed, May 1, 2019 at 1:33 AM Ahmet Altay  wrote:
>>
>> Michael, Max and other folks who are concerned about the compatibility with 
>> the apache release policy. Does the information in this thread sufficiently 
>> address your concerns? Especially the part where, the rc artifacts will be 
>> protected by a flag (i.e. --pre) from general consumption.
>>
>> On Tue, Apr 30, 2019 at 3:59 PM Robert Bradshaw  wrote:
>>>
>>> On Tue, Apr 30, 2019 at 6:11 PM Ahmet Altay  wrote:
>>> >
>>> > This conversation get quite Python centric. Is there a similar need for 
>>> > Java?
>>>
>>> I think Java is already covered. Go is a different story (but even the
>>> versioning and releasing is still being worked out).
>>>
>>> > On Tue, Apr 30, 2019 at 4:54 AM Robert Bradshaw  
>>> > wrote:
>>> >>
>>> >> If we can, by the apache guidelines, post RCs to pypi that is
>>> >> definitely the way to go. (Note that test.pypi is for developing
>>> >> against the pypi interface, not for pushing anything real.) The caveat
>>> >> about naming these with rcN in the version number still applies
>>> >> (that's how pypi guards them against non-explicit installs).
>>> >
>>> > Related to the caveat, I believe this can be easily scripted or even made 
>>> > part of the travis/wheels pipeline to take the release branch, edit the 
>>> > version string in place to add rc, and build the necessary files.
>>>
>>> Yes. But the resulting artifacts would have to be rebuilt (and
>>> re-signed) without the version edit for the actual release. (Well, we
>>> could possibly edit the artifacts rather than rebuild them.) And
>>> pushing un-edited ones early would be really bad. (It's the classic
>>> tension of whether a pre-release should be marked internally or
>>> externally, re-publishing a new set of bits for the actual release or
>>> re-using version numbers for different sets of bits. Pypi does one,
>>> apache does another...)
>>>
>>> >> The advantage is that a user can do "pip install --pre apache-beam" to
>>> >> get the latest rc rather than "pip install
>>> >> https://dist.apache.org/repos/dist/dev/beam/changing/and/ephemeral/path";
>>> >>
>>> >> On Mon, Apr 29, 2019 at 11:34 PM Pablo Estrada  
>>> >> wrote:
>>> >> >
>>> >> > Aw that's interesting!
>>> >> >
>>> >> > I think, with these considerations, I am only marginally more inclined 
>>> >> > towards publishing to test.pypi. That would make me a +0.9 on 
>>> >> > publishing RCs to the main pip repo then.
>>> >> >
>>> >> > Thanks for doing the research Ahmet. :)
>>> >> > Best
>>> >> > -P
>>> >> >
>>> >> > On Mon, Apr 29, 2019 at 1:53 PM Ahmet Altay  wrote:
>>> >> >>
>>> >> >> I asked to Airflow folks about this. See [1] for the full response 
>>> >

Re: Artifact staging in cross-language pipelines

2019-05-02 Thread Robert Bradshaw
On Sat, Apr 27, 2019 at 1:14 AM Lukasz Cwik  wrote:
>
> We should stick with URN + payload + artifact metadata[1] where the only 
> mandatory one that all SDKs and expansion services understand is the "bytes" 
> artifact type. This allows us to add optional URNs for file://, http://, 
> Maven, PyPi, ... in the future. I would make the artifact staging service use 
> the same URN + payload mechanism to get compatibility of artifacts across the 
> different services and also have the artifact staging service be able to be 
> queried for the list of artifact types it supports.

+1

> Finally, we would need to have environments enumerate the artifact types that 
> they support.

Meaning at runtime, or as another field statically set in the proto?

> Having everyone have the same "artifact" representation would be beneficial 
> since:
> a) Python environments could install dependencies from a requirements.txt 
> file (something that the Google Cloud Dataflow Python docker container allows 
> for today)
> b) It provides an extensible and versioned mechanism for SDKs, environments, 
> and artifact staging/retrieval services to support additional artifact types
> c) Allow for expressing a canonical representation of an artifact like a 
> Maven package so a runner could merge environments that the runner deems 
> compatible.
>
> The flow I could see is:
> 1) (optional) query artifact staging service for supported artifact types
> 2) SDK request expansion service to expand transform passing in a list of 
> artifact types the SDK and artifact staging service support, the expansion 
> service returns a list of artifact types limited to those supported types + 
> any supported by the environment

The crux of the issue seems to be how the expansion service returns
the artifacts themselves. Is this going with the approach that the
caller of the expansion service must host an artifact staging service?
There is also the question of how the returned artifacts get
attached to the various environments, or whether they get implicitly
applied to all returned stages (which need not have a consistent
environment)?

> 3) SDK converts any artifact types that the artifact staging service or 
> environment doesn't understand, e.g. pulls down Maven dependencies and 
> converts them to "bytes" artifacts

Here I think we're conflating two things. The "type" of an artifact is
both (1) how to fetch the bytes and (2) how to interpret them (e.g. is
this a jar file, or a pip tarball, or just some data needed by a DoFn,
or ...). Only (1) can be freely transmuted.
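
A sketch of that separation (field and URN names are made up, not an
existing Beam proto or class): one URN/payload says how to fetch the
bytes, an independent one says how the environment should interpret them
once fetched. Only the fetch half can be freely transmuted, e.g. a Maven
artifact downgraded to "bytes"; the interpretation must survive the
conversion.

  // Hypothetical shape only.
  final class ArtifactSketch {
    final String fetchUrn;      // e.g. "example:artifact:fetch:url:v1" or ...:bytes:v1
    final byte[] fetchPayload;  // a URL, Maven coordinates, or the raw bytes themselves
    final String roleUrn;       // e.g. "example:artifact:role:jar:v1" or ...:pip_requirements:v1
    final byte[] rolePayload;   // extra interpretation details, if any

    ArtifactSketch(String fetchUrn, byte[] fetchPayload, String roleUrn, byte[] rolePayload) {
      this.fetchUrn = fetchUrn;
      this.fetchPayload = fetchPayload;
      this.roleUrn = roleUrn;
      this.rolePayload = rolePayload;
    }
  }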

> 4) SDK sends artifacts to artifact staging service
> 5) Artifact staging service converts any artifacts to types that the 
> environment understands
> 6) Environment is started and gets artifacts from the artifact retrieval 
> service.
>
> On Wed, Apr 24, 2019 at 4:44 AM Robert Bradshaw  wrote:
>>
>> On Wed, Apr 24, 2019 at 12:21 PM Maximilian Michels  wrote:
>> >
>> > Good idea to let the client expose an artifact staging service that the
>> > ExpansionService could use to stage artifacts. This solves two problems:
>> >
>> > (1) The Expansion Service not being able to access the Job Server
>> > artifact staging service
>> > (2) The client not having access to the dependencies returned by the
>> > Expansion Server
>> >
>> > The downside is that it adds an additional indirection. The alternative
>> > to let the client handle staging the artifacts returned by the Expansion
>> > Server is more transparent and easier to implement.
>>
>> The other downside is that it may not always be possible for the
>> expansion service to connect to the artifact staging service (e.g.
>> when constructing a pipeline locally against a remote expansion
>> service).
>
> Just to make sure, you're saying the expansion service would return all the 
> artifacts (bytes, urls, ...) as part of the response since the expansion 
> service wouldn't be able to connect to the SDK that is running locally either.

Yes. Well, what I'm really asking is how the expansion service would
return any artifacts.

What we have is

Runner <--- SDK ---> Expansion service.

Where the unidirectional arrow means "instantiates a connection with"
and the other direction (and missing arrows) may not be possible.

>> > Ideally, the Expansion Service won't return any dependencies because the
>> > environment already contains the required dependencies. We could make it
>> > a requirement for the expansion to be performed inside an environment.
>> > Then we would already ensure during expansion time that the runtime
>> > dependencies are available.

Re: [DISCUSS] Performance of Beam compare to "Bare Runner"

2019-05-02 Thread Robert Bradshaw
Thanks for filing those.

As for how not doing a copy is "safe," it's not really. Beam simply
asserts that you MUST NOT mutate your inputs (and direct runners,
which are used during testing, do perform extra copies and checks to
catch violations of this requirement).

On Thu, May 2, 2019 at 1:02 PM Jozef Vilcek  wrote:
>
> I have created
> https://issues.apache.org/jira/browse/BEAM-7204
> https://issues.apache.org/jira/browse/BEAM-7206
>
> to track these topics further
>
> On Wed, May 1, 2019 at 1:24 PM Jozef Vilcek  wrote:
>>
>>
>>
>> On Tue, Apr 30, 2019 at 5:42 PM Kenneth Knowles  wrote:
>>>
>>>
>>>
>>> On Tue, Apr 30, 2019, 07:05 Reuven Lax  wrote:
>>>>
>>>> In that case, Robert's point is quite valid. The old Flink runner I 
>>>> believe had no knowledge of fusion, which was known to make it extremely 
>>>> slow. A lot of work went into making the portable runner fusion aware, so 
>>>> we don't need to round trip through coders on every ParDo.
>>>
>>>
>>> The old Flink runner got fusion for free, since Flink does it. The new 
>>> fusion in portability is because fusing the runner side of portability 
>>> steps does not achieve real fusion
>>
>>
>> Aha, I see. So the feature in Flink is operator chaining and Flink per 
>> default initiate copy of input elements. In case of Beam coders copy seems 
>> to be more noticeable than native Flink.
>> So do I get it right that in portable runner scenario, you do similar 
>> chaining via this "fusion of stages"? Curious here... how is it different 
>> from chaining so runner can be sure that not doing copy is "safe" with 
>> respect to user defined functions and their behaviour over inputs?
>>
>>>>
>>>>
>>>> Reuven
>>>>
>>>> On Tue, Apr 30, 2019 at 6:58 AM Jozef Vilcek  wrote:
>>>>>
>>>>> It was not a portable Flink runner.
>>>>>
>>>>> Thanks all for the thoughts, I will create JIRAs, as suggested, with my 
>>>>> findings and send them out
>>>>>
>>>>> On Tue, Apr 30, 2019 at 11:34 AM Reuven Lax  wrote:
>>>>>>
>>>>>> Jozef did you use the portable Flink runner or the old one?
>>>>>>
>>>>>> Reuven
>>>>>>
>>>>>> On Tue, Apr 30, 2019 at 1:03 AM Robert Bradshaw  
>>>>>> wrote:
>>>>>>>
>>>>>>> Thanks for starting this investigation. As mentioned, most of the work
>>>>>>> to date has been on feature parity, not performance parity, but we're
>>>>>>> at the point that the latter should be tackled as well. Even if there
>>>>>>> is a slight overhead (and there's talk about integrating more deeply
>>>>>>> with the Flume DAG that could elide even that) I'd expect it should be
>>>>>>> nowhere near the 3x that you're seeing. Aside from the timer issue,
>>>>>>> sounds like the cloning via coders is a huge drag that needs to be
>>>>>>> addressed. I wonder if this is one of those cases where using the
>>>>>>> portability framework could be a performance win (specifically, no
>>>>>>> cloning would happen between operators of fused stages, and the
>>>>>>> cloning between operators could be on the raw bytes[] (if needed at
>>>>>>> all, because we know they wouldn't be mutated)).
>>>>>>>
>>>>>>> On Tue, Apr 30, 2019 at 12:31 AM Kenneth Knowles  
>>>>>>> wrote:
>>>>>>> >
>>>>>>> > Specifically, a lot of shared code assumes that repeatedly setting a 
>>>>>>> > timer is nearly free / the same cost as determining whether or not to 
>>>>>>> > set the timer. ReduceFnRunner has been refactored in a way so it 
>>>>>>> > would be very easy to set the GC timer once per window that occurs in 
>>>>>>> > a bundle, but there's probably some underlying inefficiency around 
>>>>>>> > why this isn't cheap that would be a bigger win.
>>>>>>> >
>>>>>>> > Kenn
>>>>>>> >
>>>>>>> > On Mon, Apr 29, 2019 at 10:05 AM Reuven Lax  wrote:
>>>>>>> >>
>>>>>>> >

Re: [DISCUSS] Reconciling ValueState in Java and Python (was: [docs] Python State & Timers)

2019-05-02 Thread Robert Bradshaw
On Wed, May 1, 2019 at 8:09 PM Kenneth Knowles  wrote:
>
> On Wed, May 1, 2019 at 8:51 AM Reuven Lax  wrote:
>>
>> ValueState is not  necessarily racy if you're doing a read-modify-write. 
>> It's only racy if you're doing something like writing last element seen.
>
> Race conditions are not inherently a problem. They are neither necessary nor 
> sufficient for correctness. In this case, it is not the classic sense of race 
> condition anyhow, it is simply a nondeterministic result, which may often be 
> perfectly fine.

One can write correct code with ValueState, but it's harder to do.
This is exacerbated by the fact that at first glance it looks easier
to use.
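
A hedged Java example of the distinction (the DoFn is hypothetical): a
read-modify-write counter is deterministic, whereas blindly writing the
"last element seen" silently depends on arrival order.

  import org.apache.beam.sdk.state.StateSpec;
  import org.apache.beam.sdk.state.StateSpecs;
  import org.apache.beam.sdk.state.ValueState;
  import org.apache.beam.sdk.transforms.DoFn;
  import org.apache.beam.sdk.values.KV;

  // Read-modify-write: the result does not depend on element order.
  class PerKeyCountFn extends DoFn<KV<String, Long>, Long> {
    @StateId("count")
    private final StateSpec<ValueState<Long>> countSpec = StateSpecs.value();

    @ProcessElement
    public void process(
        @Element KV<String, Long> element,
        @StateId("count") ValueState<Long> count,
        OutputReceiver<Long> out) {
      long current = count.read() == null ? 0L : count.read();  // read
      count.write(current + 1);                                 // modify + write
      out.output(current + 1);
      // By contrast, count.write(element.getValue()) ("last value seen") is
      // nondeterministic: elements may arrive out of event-time order, so
      // "last" means "last to arrive", not "latest in event time".
    }
  }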

>>>>> On Wed, May 1, 2019 at 8:30 AM Lukasz Cwik  wrote:
>>>>>>
>>>>>> Isn't a value state just a bag state with at most one element and the 
>>>>>> usage pattern would be?
>>>>>> 1) value_state.get == bag_state.read.next() (both have to handle the 
>>>>>> case when neither have been set)
>>>>>> 2) user logic on what to do with current state + additional information 
>>>>>> to produce new state
>>>>>> 3) value_state.set == bag_state.clear + bag_state.append? (note that 
>>>>>> Runners should optimize clear + append to become a single 
>>>>>> transaction/write)
>
> Your unpacking is accurate, but "X is just a Y" is not accurate. In this case 
> you've demonstrated that value state *can be implemented using* bag state / 
> has a workaround. But it is not subsumed by bag state. One important feature 
> of ValueState is that it is statically determined that the transform cannot 
> be used with merging windows.

The flip side is that it makes it easy to write code that cannot be
used with merging windows. Which hurts composition (especially if
these operations are used as part of larger composite operations).

> Another feature is that it is impossible to accidentally write more than one 
> value. And a third important feature is that it declares what it is so that 
> the code is more readable.

+1. Which is why I think CombiningState is a better substitute than
BagState where it makes sense (and often does, and often can even be
an improvement over ValueState for performance and readability).

Perhaps instead we could call it ReadModifyWrite state. In addition to
read() and write() operations, it could even make sense to offer a
modify(I, (I, S) -> S) operation.
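
Roughly that shape, as a minimal sketch (a hypothetical interface, not an
existing Beam API):

  import java.util.function.BiFunction;

  // Hypothetical API shape only.
  interface ReadModifyWriteState<S> {
    S read();
    void write(S value);

    // modify(input, (input, currentState) -> newState)
    default <I> void modify(I input, BiFunction<I, S, S> update) {
      write(update.apply(input, read()));
    }
  }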

(Also, yes, when I said Latest I too meant a hypothetical "throw away
everything else when a new element is written" one, not the specific
one in the code. Sorry for the confusion.)

>>>>>> For example, the blog post with the counter example would be:
>>>>>>   @StateId("buffer")
>>>>>>   private final StateSpec<BagState<Event>> bufferedEvents = StateSpecs.bag();
>>>>>>
>>>>>>   @StateId("count")
>>>>>>   private final StateSpec<BagState<Integer>> countState = StateSpecs.bag();
>>>>>>
>>>>>>   @ProcessElement
>>>>>>   public void process(
>>>>>>       ProcessContext context,
>>>>>>       @StateId("buffer") BagState<Event> bufferState,
>>>>>>       @StateId("count") BagState<Integer> countState) {
>>>>>>
>>>>>> int count = Iterables.getFirst(countState.read(), 0);
>>>>>> count = count + 1;
>>>>>> countState.clear();
>>>>>> countState.add(count);
>>>>>> bufferState.add(context.element());
>>>>>>
>>>>>> if (count > MAX_BUFFER_SIZE) {
>>>>>>   for (EnrichedEvent enrichedEvent : 
>>>>>> enrichEvents(bufferState.read())) {
>>>>>> context.output(enrichedEvent);
>>>>>>   }
>>>>>>   bufferState.clear();
>>>>>>   countState.clear();
>>>>>> }
>>>>>>   }
>>>>>>
>>>>>> On Tue, Apr 30, 2019 at 5:39 PM Kenneth Knowles  wrote:
>>>>>>>
>>>>>>> Anything where the state evolves serially but arbitrarily - the toy 
>>>>>>> example is the integer counter in my blog post - needs ValueState. You 
>>>>>>> can't do it with AnyCombineFn. And I think LatestCombineFn is 
>>>>>>> dangerous, especially when it comes to CombiningState. ValueState is more 
>>>

Re: Congrats to Beam's first 6 Google Open Source Peer Bonus recipients!

2019-05-02 Thread Robert Bradshaw
Congratulations, and thanks for all the great contributions each one of you
has made to Beam!

On Thu, May 2, 2019 at 5:51 AM Ruoyun Huang  wrote:

> Congratulations everyone!  Well deserved!
>
> On Wed, May 1, 2019 at 8:38 PM Kenneth Knowles  wrote:
>
>> Congrats! All well deserved!
>>
>> Kenn
>>
>> On Wed, May 1, 2019 at 8:09 PM Reza Rokni  wrote:
>>
>>> Congratulations!
>>>
>>> On Thu, 2 May 2019 at 10:53, Connell O'Callaghan 
>>> wrote:
>>>
 Well done - congratulations to you all!!! Rose thank you for sharing
 this news!!!

 On Wed, May 1, 2019 at 19:45 Rose Nguyen  wrote:

> Matthias Baetens, Lukazs Gajowy, Suneel Marthi, Maximilian Michels,
> Alex Van Boxel, and Thomas Weise:
>
> Thank you for your exceptional contributions to Apache Beam.👏 I'm
> looking forward to seeing this project grow and for more folks to
> contribute and be recognized! Everyone can read more about this award on
> the Google Open Source blog:
> https://opensource.googleblog.com/2019/04/google-open-source-peer-bonus-winners.html
>
> Cheers,
> --
> Rose Thị Nguyễn
>

>>>
>>>
>>
>
> --
> 
> Ruoyun  Huang
>
>


Re: [DISCUSS] Reconciling ValueState in Java and Python (was: [docs] Python State & Timers)

2019-04-30 Thread Robert Bradshaw
On Wed, May 1, 2019 at 1:55 AM Brian Hulette  wrote:
>
> Reza - you're definitely not derailing, that's exactly what I was looking for!
>
> I've actually recently encountered an additional use case where I'd like to 
> use ValueState in the Python SDK. I'm experimenting with an ArrowBatchingDoFn 
> that uses state and timers to batch up python dictionaries into arrow record 
> batches (actually my entire purpose for jumping down this python state rabbit 
> hole).
>
> At first blush it seems like the best way to do this would be to just 
> replicate the batching approach in the timely processing post [1], but when 
> the bag is full combine the elements into an arrow record batch, rather than 
> enriching all of the elements and writing them out separately. However, if 
> possible I'd like to pre-allocate buffers for each column and populate them 
> as elements arrive (at least for columns with a fixed size type), so a bag 
> state wouldn't be ideal.

It seems it'd be preferable to do the conversion from a bag of
elements to a single arrow frame all at once, when emitting, rather
than repeatedly reading and writing the partial batch to and from
state with every element that comes in. (Bag state has blind append.)

> Also, a CombiningValueState is not ideal because I'd need to implement a 
> merge_accumulators function that combines several in-progress batches. I 
> could certainly implement that, but I'd prefer that it never be called unless 
> absolutely necessary, which doesn't seem to be the case for 
> CombiningValueState. (As an aside, maybe there's some room there for a middle 
> ground between ValueState and CombiningValueState

This does actually feel natural (to me), because you're repeatedly
adding elements to build something up. merge_accumulators would
probably be pretty easy (concatenation), but unless your windows are
merging it could just throw a not-implemented error to really guard
against it being used.

> I suppose you could argue that this is a pretty low-level optimization we 
> should be able to shield our users from, but right now I just wish I had 
> ValueState in python so I didn't have to hack it up with a BagState :)
>
> Anyway, in light of this and all the other use-cases mentioned here, I think 
> the resolution is to just implement ValueState in python, and document the 
> danger with ValueState in both Python and Java. Just to be clear, the danger 
> I'm referring to is that users might easily forget that data can be out of 
> order, and use ValueState in a way that assumes it's been populated with data 
> from the most recent element in event time, then in practice out of order 
> data clobbers their state. I'm happy to write up a PR for this - are there 
> any objections to that?

I still haven't seen a good case for it (though I haven't looked at
Reza's BiTemporalStream yet). Much harder to remove things once
they're in. Can we just add an Any and/or LatestCombineFn and use (and
point to) that instead? With the comment that if you're doing
read-modify-write, an add_input may be better.
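
For illustration, the hypothetical "Latest"-style CombineFn alluded to
here could be as small as the sketch below (again hypothetical, not Beam's
existing Latest transform); a CombiningState backed by it behaves like a
write-wins ValueState, with the caveat that the result is nondeterministic
across accumulator merges.

  import org.apache.beam.sdk.transforms.Combine;

  // Hypothetical sketch: keep only the most recently added value.
  class LatestValueFn<T> extends Combine.BinaryCombineFn<T> {
    @Override
    public T apply(T left, T right) {
      // For addInput the newly arrived element is the right argument; across
      // accumulator merges the choice is arbitrary, which is exactly the
      // caveat about ValueState-like semantics discussed in this thread.
      return right;
    }
  }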

> [1] https://beam.apache.org/blog/2017/08/28/timely-processing.html
>
> On Mon, Apr 29, 2019 at 12:23 AM Robert Bradshaw  wrote:
>>
>> On Mon, Apr 29, 2019 at 3:43 AM Reza Rokni  wrote:
>> >
>> > @Robert Bradshaw Some examples, mostly built out from cases around 
>> > Timeseries data, don't want to derail this thread so at a high level:
>>
>> Thanks. Perfectly on-topic for the thread.
>>
>> > Looping timers, a timer which allows for creation of a value within a 
>> > window when no external input has been seen. Requires metadata like "is 
>> > timer set".
>> >
>> > BiTemporalStream join, where we need to match leftCol.timestamp to a value 
>> > ==  (max(rightCol.timestamp) where rightCol.timestamp <= 
>> > leftCol.timestamp)) , this if for a application matching trades to quotes.
>>
>> I'd be interested in seeing the code here. The fact that you have a
>> max here makes me wonder if combining would be applicable.
>>
>> (FWIW, I've long thought it would be useful to do this kind of thing
>> with Windows. Basically, it'd be like session windows with one side
>> being the window from the timestamp forward into the future, and the
>> other side being from the timestamp back a certain amount in the past.
>> This seems a common join pattern.)
>>
>> > Metadata is used for
>> >
>> > Taking the Key from the KV  for use within the OnTimer call.
>> > Knowing where we are in watermarks for GC of objects in state.
>> > More timer metad

Re: Scope of windows?

2019-04-30 Thread Robert Bradshaw
In the original version of the dataflow model, windowing was not
annotated on each PCollection; rather, it was inferred based on tracing
up the graph to the latest WindowInto operation. This tracing logic
was put in the SDK for simplicity.

I agree that there is room for a variety of SDK/DSL choices, but would
strongly argue that for SDKs that implicitly specify triggering, the
rules should be consistent and defined by the model. This is
consistent with the principle of least surprise, as well as fact that
the "beam:transform:group_by_key:v1" transform (should such an
operation be provided), when applied to a PCollection with specific
windowing strategy, should produce a PCollection with a well specified
windowing strategy (and similarly for other well-known transforms).

Likewise, I see sink triggers, once we figure them out, as semantic
definitions belonging to the model (with likely some flexibility in
implementation), not a choice each SDK should make on its own (though
some may be able to declare/support them sooner than others).

On Tue, Apr 30, 2019 at 6:24 PM Kenneth Knowles  wrote:
>
> +dev@ since this has taken a turn in that direction
>
> SDK/DSL consistency is nice. But each SDK/DSL being the best thing it can be 
> is more important IMO. I'm including DSLs to be clear that this is a 
> construction issue having little/nothing to do with SDK in the sense of the 
> per-run-time coprocessor we call the SDK harness, because that part of the 
> construction-time decision is not executed by the harness.
>
> So, for example, I am supportive of all of these:
>
>  - SDK/DSL where every aggregation has an explicit trigger configuration
>  - SDK/DSL where the default trigger is "always" and explicit triggers are 
> used for throttling
>  - SDK/DSL that implements sink triggers and assigns triggering in the 
> pipeline graph as an implementation detail of that
>
> Each of these will have technical challenges to overcome (most notably 
> retractions) and won't look like Beam's original Java SDK and that is fine 
> with me. Python and Go already look very different, and it sounds like their 
> behavior has diverged as well, to say nothing of Scio, Euphoria, SQL. FWIW I 
> think this is somewhat comparable to how SDKs handle coders - they do the 
> best thing in their context and the proto/model makes many things possible.
>
> To go in the direction of consistency amongst the core SDKs, we could make 
> all triggers downstream of an initial GBK use the "repeat(always)" trigger. I 
> think we've discussed and this is simpler and more reliable than today's 
> continuation trigger, while keeping its intent.

Well, the default after-watermark trigger probably shouldn't become
repeat(always).

> On Tue, Apr 30, 2019 at 2:41 AM Maximilian Michels  wrote:
>>
>> While it might be debatable whether "continuation triggers" are part of
>> the model, the goal should be to provide a consistent experience across
>> SDKs. I don't see a reason why the Java SDK would use continuation
>> triggers while the Python SDK doesn't.
>>
>> This makes me think that trigger behavior across transforms should
>> actually be part of the model. Or at least be standardized for SDK
>> authors. This would also imply that it is documented for end users.
>>
>> In the end, users do not care about whether it's part of the model or
>> not, but they like having no surprises :)
>>
>> On 29.04.19 09:20, Robert Bradshaw wrote:
>> > I would say that the triggering done in stacked GBKs, with windowings
>> > in between, is part of the model (at least in the sense that it's not
>> > something that we'd want different SDKs to do separately.)
>> >
>> > OTOH, I'm not sure the continuation trigger should be part of the
>> > model. Much easier to let WindowInto with no trigger specified
>> > either keep the existing one or reset it to the default. A runner can
>> > mutate this to a continuation trigger under the hood, which should be
>> > strictly looser (triggers are a promise about the earliest possible
>> > firing, they don't force firings to happen).
>> >
>> > On Mon, Apr 29, 2019 at 4:34 AM Kenneth Knowles  wrote:
>> >>
>> >> It is accurate to say that the "continuation trigger" is not documented 
>> >> in the general programming guide. It shows up in the javadoc only, as far 
>> >> as I can tell [1]. Technically, this is accurate. It is not part of the 
>> >> core of Beam - each language SDK is required to explicitly specify a 
>> >> trigger for every GroupByKey when they submit a pipe

Re: [Discuss] Publishing pre-release artifacts to repositories

2019-04-30 Thread Robert Bradshaw
On Tue, Apr 30, 2019 at 6:11 PM Ahmet Altay  wrote:
>
> This conversation got quite Python-centric. Is there a similar need for Java?

I think Java is already covered. Go is a different story (but there even
the versioning and release process is still being worked out).

> On Tue, Apr 30, 2019 at 4:54 AM Robert Bradshaw  wrote:
>>
>> If we can, by the Apache guidelines, post RCs to pypi, that is
>> definitely the way to go. (Note that test.pypi is for developing
>> against the pypi interface, not for pushing anything real.) The caveat
>> about naming these with rcN in the version number still applies
>> (that's how pypi guards them against non-explicit installs).
>
> Related to the caveat, I believe this can be easily scripted or even made 
> part of the travis/wheels pipeline to take the release branch, edit the 
> version string in place to add rc, and build the necessary files.

Yes. But the resulting artifacts would have to be rebuilt (and
re-signed) without the version edit for the actual release. (Well, we
could possibly edit the artifacts rather than rebuild them.) And
pushing un-edited ones early would be really bad. (It's the classic
tension of whether a pre-release should be marked internally or
externally, re-publishing a new set of bits for the actual release or
re-using version numbers for different sets of bits. PyPI does one,
Apache does the other...)
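
(For reference, a small sketch of how PEP 440 treats an rcN suffix, which is
what makes the in-version marking safe for plain pip installs; this uses the
third-party packaging library and an illustrative version string:)

from packaging.version import Version

rc = Version("2.12.0rc1")   # illustrative pre-release version string
final = Version("2.12.0")

assert rc.is_prerelease     # plain "pip install" skips it without --pre or an exact pin
assert rc < final           # the final release always sorts after its RCs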

>> The advantage is that a user can do "pip install --pre apache-beam" to
>> get the latest rc rather than "pip install
>> https://dist.apache.org/repos/dist/dev/beam/changing/and/ephemeral/path"
>>
>> On Mon, Apr 29, 2019 at 11:34 PM Pablo Estrada  wrote:
>> >
>> > Aw that's interesting!
>> >
>> > I think, with these considerations, I am only marginally more inclined 
>> > towards publishing to test.pypi. That would make me a +0.9 on publishing 
>> > RCs to the main pip repo then.
>> >
>> > Thanks for doing the research Ahmet. :)
>> > Best
>> > -P
>> >
>> > On Mon, Apr 29, 2019 at 1:53 PM Ahmet Altay  wrote:
>> >>
>> >> I asked the Airflow folks about this. See [1] for the full response and a 
>> >> link to one of their RC emails. To summarize, their position (specifically 
>> >> for pypi) is: Unless a user does something explicit (such as using a 
>> >> flag, or explicitly requesting an rc release), pip install will not serve 
>> >> RC binaries. And that is compatible with RC section of 
>> >> http://www.apache.org/legal/release-policy.html#release-types
>> >>
>> >> Ahmet
>> >>
>> >> [1] 
>> >> https://lists.apache.org/thread.html/f1f342332c1e180f57d60285bebe614ffa77bb53c4f74c4cbc049096@%3Cdev.airflow.apache.org%3E
>> >>
>> >> On Fri, Apr 26, 2019 at 3:38 PM Ahmet Altay  wrote:
>> >>>
>> >>> The incremental value of publishing python artifacts to a separate place 
>> >>> but not to actual pypi listing will be low. Users can already download 
>> >>> RC artifacts, or even pip install from an http location directly. I think 
>> >>> the incremental value will be low, because for a user or a downstream 
>> >>> library to test with Beam RCs using their usual ways will still require 
>> >>> them to get other dependencies from the regular pypi listing. That would 
>> >>> mean they need to change their setup to test with beam rcs, which is the 
>> >>> same state as today. There will be some incremental value of putting 
>> >>> them in more obvious places (e.g. pypi test repository). I would rather 
>> >>> not complicate the release process for doing this.
>> >>>
>> >>>
>> >>>
>> >>> On Thu, Apr 25, 2019 at 2:25 PM Kenneth Knowles  wrote:
>> >>>>
>> >>>> Pip is also able to be pointed at any raw hosted directory for the 
>> >>>> install, right? So we could publish RCs or snapshots somewhere with 
>> >>>> more obvious caveats and not interfere with the pypi list of actual 
>> >>>> releases. Much like the Java snapshots are stored in a separate opt-in 
>> >>>> repository.
>> >>>>
>> >>>> Kenn
>> >>>>
>> >>>> On Thu, Apr 25, 2019 at 5:39 AM Maximilian Michels  
>> >>>> wrote:
>> >>>>>
>> >>>>> > wouldn't that be in conflict with Apache release policy [1] ?
>> >>>>> > [1] http://www.apache.org/legal/release-policy.html

Re: [Discuss] Publishing pre-release artifacts to repositories

2019-04-30 Thread Robert Bradshaw
If we can, by the Apache guidelines, post RCs to pypi, that is
definitely the way to go. (Note that test.pypi is for developing
against the pypi interface, not for pushing anything real.) The caveat
about naming these with rcN in the version number still applies
(that's how pypi guards them against non-explicit installs).

The advantage is that a user can do "pip install --pre apache-beam" to
get the latest rc rather than "pip install
https://dist.apache.org/repos/dist/dev/beam/changing/and/ephemeral/path"

On Mon, Apr 29, 2019 at 11:34 PM Pablo Estrada  wrote:
>
> Aw that's interesting!
>
> I think, with these considerations, I am only marginally more inclined 
> towards publishing to test.pypi. That would make me a +0.9 on publishing RCs 
> to the main pip repo then.
>
> Thanks for doing the research Ahmet. :)
> Best
> -P
>
> On Mon, Apr 29, 2019 at 1:53 PM Ahmet Altay  wrote:
>>
>> I asked the Airflow folks about this. See [1] for the full response and a 
>> link to one of their RC emails. To summarize, their position (specifically 
>> for pypi) is: Unless a user does something explicit (such as using a flag, 
>> or explicitly requesting an rc release), pip install will not serve RC 
>> binaries. And that is compatible with RC section of 
>> http://www.apache.org/legal/release-policy.html#release-types
>>
>> Ahmet
>>
>> [1] 
>> https://lists.apache.org/thread.html/f1f342332c1e180f57d60285bebe614ffa77bb53c4f74c4cbc049096@%3Cdev.airflow.apache.org%3E
>>
>> On Fri, Apr 26, 2019 at 3:38 PM Ahmet Altay  wrote:
>>>
>>> The incremental value of publishing python artifacts to a separate place 
>>> but not to actual pypi listing will be low. Users can already download RC 
>>> artifacts, or even pip install from an http location directly. I think the 
>>> incremental value will be low, because for a user or a downstream library 
>>> to test with Beam RCs using their usual ways will still require them to get 
>>> other dependencies from the regular pypi listing. That would mean they need 
>>> to change their setup to test with beam rcs, which is the same state as 
>>> today. There will be some incremental value of putting them in more obvious 
>>> places (e.g. pypi test repository). I would rather not complicate the 
>>> release process for doing this.
>>>
>>>
>>>
>>> On Thu, Apr 25, 2019 at 2:25 PM Kenneth Knowles  wrote:
>>>>
>>>> Pip is also able to be pointed at any raw hosted directory for the 
>>>> install, right? So we could publish RCs or snapshots somewhere with more 
>>>> obvious caveats and not interfere with the pypi list of actual releases. 
>>>> Much like the Java snapshots are stored in a separate opt-in repository.
>>>>
>>>> Kenn
>>>>
>>>> On Thu, Apr 25, 2019 at 5:39 AM Maximilian Michels  wrote:
>>>>>
>>>>> > wouldn't that be in conflict with Apache release policy [1] ?
>>>>> > [1] http://www.apache.org/legal/release-policy.html
>>>>>
>>>>> Indeed, advertising pre-release artifacts is against ASF rules. For
>>>>> example, Flink was asked to remove a link to the Maven snapshot
>>>>> repository from their download page.
>>>>>
>>>>> However, that does not mean we cannot publish Python artifacts. We just
>>>>> have to clearly mark them for developers only and not advertise them
>>>>> alongside with the official releases.
>>>>>
>>>>> -Max
>>>>>
>>>>> On 25.04.19 10:23, Robert Bradshaw wrote:
>>>>> > Don't we push java artifacts to maven repositories as part of the RC
>>>>> > process? And completely unvetted snapshots? (Or is this OK because
>>>>> > they are special opt-in apache-only ones?)
>>>>> >
>>>>> > I am generally in favor of the idea, but would like to avoid increased
>>>>> > toil on the release manager.
>>>>> >
>>>>> > One potential hitch I see is that current release process updates the
>>>>> > versions to x.y.z (no RC or other pre-release indicator in the version
>>>>> > number) whereas pypi (and other systems) typically expect distinct
>>>>> > (recognizable) version numbers for each attempt, and only the actual
>>>>> > final result has the actual final release version.
>>>>> >
>>>>> > On Thu, Apr 25, 

Re: [DISCUSS] Performance of Beam compare to "Bare Runner"

2019-04-30 Thread Robert Bradshaw
Thanks for starting this investigation. As mentioned, most of the work
to date has been on feature parity, not performance parity, but we're
at the point that the latter should be tackled as well. Even if there
is a slight overhead (and there's talk about integrating more deeply
with the Flume DAG that could elide even that) I'd expect it should be
nowhere near the 3x that you're seeing. Aside from the timer issue, it
sounds like the cloning via coders is a huge drag that needs to be
addressed. I wonder if this is one of those cases where using the
portability framework could be a performance win (specifically, no
cloning would happen between operators of fused stages, and the
cloning between operators could be done on the raw bytes[], if needed at
all, because we know they wouldn't be mutated).
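
(To make the cost concrete, a tiny Python analogue of what cloning via a
coder amounts to; the Java path through CoderUtils.clone() does the same
object -> bytes -> object round trip per element:)

from apache_beam.coders import VarIntCoder

coder = VarIntCoder()

def clone_via_coder(value):
  # Encode to bytes and decode back, even if the value is only handed to
  # the next operator in the same process.
  return coder.decode(coder.encode(value))

assert clone_via_coder(42) == 42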

On Tue, Apr 30, 2019 at 12:31 AM Kenneth Knowles  wrote:
>
> Specifically, a lot of shared code assumes that repeatedly setting a timer is 
> nearly free / the same cost as determining whether or not to set the timer. 
> ReduceFnRunner has been refactored in a way so it would be very easy to set 
> the GC timer once per window that occurs in a bundle, but there's probably 
> some underlying inefficiency around why this isn't cheap that would be a 
> bigger win.
>
> Kenn
>
> On Mon, Apr 29, 2019 at 10:05 AM Reuven Lax  wrote:
>>
>> I think the short answer is that folks working on the Beam Flink runner have 
>> mostly been focused on getting everything working, and so have not dug into 
>> this performance too deeply. I suspect that there is low-hanging fruit to 
>> optimize as a result.
>>
>> You're right that ReduceFnRunner schedules a timer for each element. I think 
>> this code dates back to before Beam; on Dataflow timers are identified by 
>> tag, so this simply overwrites the existing timer which is very cheap in 
>> Dataflow. If it is not cheap on Flink, this might be something to optimize.
>>
>> Reuven
>>
>> On Mon, Apr 29, 2019 at 3:48 AM Jozef Vilcek  wrote:
>>>
>>> Hello,
>>>
>>> I am interested in any knowledge or thoughts on what the overhead of 
>>> running Beam pipelines, compared to pipelines written on a "bare 
>>> runner", should be or actually is. Is this something which is being tested 
>>> or investigated by the community? Is there a consensus on what bounds the 
>>> overhead should typically fall within? I realise this is very 
>>> runner-specific, but certain things are also imposed by the SDK model itself.
>>>
>>> I tested a simple streaming pipeline on Flink vs Beam-on-Flink and found very 
>>> noticeable differences. I want to stress that it was not a performance 
>>> test. The job does the following:
>>>
>>> Read Kafka -> Deserialize to Proto -> Filter deserialisation errors -> 
>>> Reshuffle -> Report counter.inc() to metrics for throughput
>>>
>>> Both jobs had the same configuration and the same state backend with the same 
>>> checkpointing strategy. What I noticed from a few simple test runs:
>>>
>>> * In a first run on Flink 1.5.0, CPU profiles on one worker showed that 
>>> ~50% of the time was spent either on removing timers from 
>>> HeapInternalTimerService or in java.io.ByteArrayOutputStream from 
>>> CoderUtils.clone()
>>>
>>> * The problem with timer deletion was addressed by FLINK-9423. I retested on 
>>> Flink 1.7.2 and not much time is spent in timer deletion now, but the 
>>> root cause was not removed. It remains that timers are frequently 
>>> registered and removed (I believe from 
>>> ReduceFnRunner.scheduleGarbageCollectionTimer(), in which case it is called 
>>> per processed element?), which is noticeable in GC activity, heap and 
>>> state ...
>>>
>>> * In Flink I use the FileSystem state backend, which keeps state in an 
>>> in-memory CopyOnWriteStateTable that after some time is full of PaneInfo 
>>> objects. Maybe they come from PaneInfoTracker activity
>>>
>>> * Coder clone is painful. A pure Flink job does copy between operators too, 
>>> in my case via Kryo.copy(), but this is not noticeable in the CPU profile. 
>>> Kryo.copy() copies at the object level, not object -> bytes -> object, which 
>>> is cheaper
>>>
>>> Overall, my observation is that pure Flink can be roughly 3x faster.
>>>
>>> I do not know what I am trying to achieve here :) Probably just start a 
>>> discussion and collect thoughts and other experiences on the cost of 
>>> running some data processing on Beam and a particular runner.
>>>


Re: [discuss] A tweak to the Python API for SDF?

2019-04-29 Thread Robert Bradshaw
+1 to introducing this Param for consistency (and making the
substitution more obvious), and I think SDF is still new/experimental
enough we can do this. I don't know if we need Spec in addition to
Param and Provider.
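
(For concreteness, a sketch of what Param plus Provider, without a separate
Spec, might look like; MyOwnLittleRestrictionProvider is borrowed from Pablo's
example below, and the exact name beam.DoFn.RestrictionParam is illustrative:)

import apache_beam as beam

class MyOwnLittleSDF(beam.DoFn):
  def process(
      self, element,
      restriction_tracker=beam.DoFn.RestrictionParam(
          MyOwnLittleRestrictionProvider())):
    # At runtime the placeholder is replaced by the tracker created from
    # the provider, mirroring how StateParam/TimerParam behave.
    ...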

On Sat, Apr 27, 2019 at 1:07 AM Chamikara Jayalath  wrote:
>
>
>
> On Fri, Apr 26, 2019 at 3:43 PM Pablo Estrada  wrote:
>>
>> Hi all,
>> Sorry about the wall of text.
>> So, first of all, I thought about this while reviewing a PR by Boyuan with 
>> an example of an SDF[1]. This is very exciting btw : ).
>>
>> Anyway... I certainly have a limited view of the whole SDF effort, but I 
>> think it's worth discussing this particular point about the API before 
>> finalizing SDF and making it widely available. So here I go:
>>
>> The Python API for SDF asks users to provide a restriction provider in their 
>> process function signature. More or less the following:
>>
>> class MyOwnLittleSDF(beam.DoFn):
>>   def process(self, element,
>>   restriction_tracker=MyOwnLittleRestrictionProvider()):
>> # My DoFn logic...
>>
>> This is all fine, but something that I found a little odd is that the 
>> restriction provider gets replaced at runtime with a restriction tracker:
>>
>> class MyOwnLittleSDF(beam.DoFn):
>>   def process(self, element,
>>   restriction_tracker=MyOwnLittleRestrictionProvider()):
>> # This assert succeeds : )
>> assert not isinstance(restriction_tracker,
>>   MyOwnLittleRestrictionProvider)
>>
>> After thinking a little bit about it, I realized that the default argument 
>> simply allows us to inform the runner where to find the restriction 
>> provider; but that the thing that we need at runtime is NOT the restriction 
>> provider - but rather, the restriction tracker.
>>
>> A similar pattern occurs with state and timers, where the runner needs to 
>> know the sort of state, the coder for the values in that state (or the time 
>> domain for timers); but the runtime parameter is different[2]. For state and 
>> timers (and window, timestamp, pane, etc.) we provide a pattern where users 
>> give a default value that is clearly a placeholder: beam.DoFn.TimerParam, or 
>> beam.DoFn.StateParam.
>
>
> This is the way (new) DoFns work in the Python SDK. The SDK (harness) identifies 
> the meanings of different (potential) arguments to a DoFn using pre-defined 
> default values.
>
>>
>>
>> In this case, the API is fairly similar, but (at least in my imagination), 
>> it is much more clear about how the DoFnParam will be replaced with 
>> something else at runtime. A similar change could be done for SDF:
>>
>> class MyOwnLittleSDF(beam.DoFn):
>>   MY_RESTRICTION = \
>>   RestrictionSpec(provider=MyOwnLittleRestrictionProvider())
>>
>>   def process(
>>   self, element,
>>   restriction_tracker=beam.DoFn.RestrictionParam(MY_RESTRICTION)):
>> # My DoFn logic..
>
>
>
> If I understood correctly, what you propose is similar to the existing 
> solution but we add an XXXParam parameter for consistency?
> I think this is fine and should be a relatively small change. The main point is, 
> the SDK should be able to find the RestrictionProvider class to utilize its 
> methods before passing elements to the DoFn.process() method: 
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/common.py#L241
>
>
>>
>>
>> Perhaps it is a good opportunity to consider this, since SDF is still in 
>> progress.
>>
>> Some pros:
>> - Consistent with other parameters that we pass to DoFn methods
>> - A bit more clear about what will happen at runtime
>>
>> Some cons:
>> - SDF developers are "power users", and will have gone through the SDF 
>> documentation. This point will be clear to them.
>> - This may create unnecessary work, and perhaps unintended consequences.
>> - I bet there's more
>>
>> Thoughts?
>>
>> -P.
>>
>> [1] https://github.com/apache/beam/pull/8338
>> [2] 
>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/userstate_test.py#L560-L586
>>  .
>>
>>
>>


Re: Removing Java Reference Runner code

2019-04-29 Thread Robert Bradshaw
I'd imagine that most users will continue to debug their pipelines
using a direct runner, and even if the portable runner is used it can
be run in "loopback" mode where the pipeline-submitting process also
acts as the worker(s), so one can output print statements, set
breakpoints, etc. as if it were all in-process (unless there's
actually something strange with the runner <-> SDK API itself).
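
(For example, something along these lines; the flags are the Python portable
pipeline options, and the job endpoint is a placeholder for whatever portable
job server is running:)

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    '--runner=PortableRunner',
    '--job_endpoint=localhost:8099',  # placeholder job server address
    '--environment_type=LOOPBACK',    # SDK harness runs in this process
])

with beam.Pipeline(options=options) as p:
  # print statements and breakpoints inside the lambda run locally.
  _ = p | beam.Create([1, 2, 3]) | beam.Map(lambda x: print(x) or x)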

Similarly, for development, many (most) features (IO, SQL, schemas)
are runner-agnostic, though of course this is not always the case,
especially if there are fundamental changes to the model (e.g. one
that comes to mind is retractions).

That's not to say there isn't also value in testing your code on a
portable runner that will more faithfully represent production
environments, but at this level of integration test (e.g. using docker
and all) I don't think having Python is that high of a barrier.

As for a gradle command to run JVR tests on the Python ULR, I don't
think that's currently available, but it should be.



On Sat, Apr 27, 2019 at 4:53 AM Daniel Oliveira  wrote:
>
> Hey Boyuan,
>
> I think that's a good question. Mikhail's mostly right, that the user 
> shouldn't need to know how the Python ULR works for their debugging. This is 
> actually more of an issue with portability itself anyway. Even when I was 
> coding Java pipelines on the Java ULR, if something went wrong in the runner 
> it was still really difficult to debug. Hopefully the only people that will 
> need to do that painful exercise are Beam devs doing development work on the 
> runners. If an average user is having a problem, the runner's logs and error 
> messages should be effective enough that the user shouldn't care what 
> language the runner is using or how it's implemented.
>
> On Fri, Apr 26, 2019 at 12:36 PM Boyuan Zhang  wrote:
>>
>> Another concern from me is, will it be difficult for a Java person (who is 
>> developing the Java SDK) to figure out what's going on in the Python ULR when 
>> debugging?
>>
>> On Fri, Apr 26, 2019 at 12:05 PM Kenneth Knowles  wrote:
>>>
>>> Good points. Distilling one single item: can I, today, run the Java SDK's 
>>> suite of ValidatesRunner tests against the Python ULR + Java SDK Harness, 
>>> in a single Gradle command?
>>>
>>> Kenn
>>>
>>> On Fri, Apr 26, 2019 at 9:54 AM Anton Kedin  wrote:

 If there are no plans to invest in the ULR, then it makes sense to remove it.

 Going forward, however, I think we should try to document the higher level 
 approach we're taking with runners (and portability) now that we have 
 something working and can reflect on it. For example, a couple of things 
 that are not 100% clear to me:
  - if the focus is on python runner for portability efforts, how does java 
 SDK (and other languages) tie into this? E.g. how do we run, test, 
 measure, and develop things (pipelines, aspects of the SDK, runner);
  - what's our approach to developing new features, should we make sure 
 python runner supports them as early as possible (e.g. schemas and SQL)?
  - java DirectRunner is still there:
 - it is still the primary tool for java SDK development purposes, and 
 as Kenn mentioned in the linked threads it adds value by making sure users 
 don't rely on implementation details of specific runners. Do we have a 
 similar story for portable scenarios?
 - I assume that extra validations in the DirectRunner have impact on 
 performance in various ways (potentially non-deterministic). While this 
 doesn't matter in some cases, it might do in others. Having a local runner 
 that is (better) optimized for execution would probably make more sense 
 for perf measurements, integration tests, and maybe even local production 
 jobs. Is this something potentially worth looking into?

 Regards,
 Anton


 On Fri, Apr 26, 2019 at 4:41 AM Maximilian Michels  wrote:
>
> Thanks for following up with this. I have mixed feelings about seeing the
> portable Java DirectRunner go, but I'm in favor of this change because
> it removes a lot of code that we do not really make use of.
>
> -Max
>
> On 26.04.19 02:58, Kenneth Knowles wrote:
> > Thanks for providing all this background on the PR. It is very easy to
> > see where it came from. Definitely nice to have less code and fewer
> > things that can break. Perhaps lazy consensus is enough.
> >
> > Kenn
> >
> > On Thu, Apr 25, 2019 at 4:01 PM Daniel Oliveira wrote:
> >
> > Hey everyone,
> >
> > I made a preliminary PR for removing all the Java Reference Runner
> > code (PR-8380 ) since I
> > wanted to see if it could be done easily. It seems to be working
> > fine, so I wanted to open up this discussion to make sure people are
> > still in agreement on getti

Re: [DISCUSS] Reconciling ValueState in Java and Python (was: [docs] Python State & Timers)

2019-04-29 Thread Robert Bradshaw
On Mon, Apr 29, 2019 at 3:43 AM Reza Rokni  wrote:
>
> @Robert Bradshaw Some examples, mostly built out from cases around time-series 
> data; I don't want to derail this thread, so at a high level:

Thanks. Perfectly on-topic for the thread.

> Looping timers, a timer which allows for creation of a value within a window 
> when no external input has been seen. Requires metadata like "is timer set".
>
> BiTemporalStream join, where we need to match leftCol.timestamp to a value == 
> (max(rightCol.timestamp) where rightCol.timestamp <= leftCol.timestamp);
> this is for an application matching trades to quotes.

I'd be interested in seeing the code here. The fact that you have a
max here makes me wonder if combining would be applicable.

(FWIW, I've long thought it would be useful to do this kind of thing
with Windows. Basically, it'd be like session windows with one side
being the window from the timestamp forward into the future, and the
other side being from the timestamp back a certain amount in the past.
This seems a common join pattern.)

> Metadata is used for
>
> Taking the Key from the KV  for use within the OnTimer call.
> Knowing where we are in watermarks for GC of objects in state.
> More timer metadata (min timer ..)
>
> It could be argued that what we are using state for is mostly workarounds for 
> things that could eventually end up in the API itself. For example:
>
> There is a Jira for OnTimer Context to have Key.
>  The GC needs are mostly due to not having a Map State object in all runners 
> yet.

Yeah. GC could probably be done with a max combine. The Key (which
should be in the API) could be an AnyCombine for now (safe to
overwrite because it's always the same).
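
(A rough Python sketch of the max-combine idea; the state name, coder, and
CombineFn are illustrative, and a GC timer would then be set from
max_ts.read() plus the allowed GC horizon:)

import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms.userstate import CombiningValueStateSpec

class MaxTimestampCombineFn(beam.CombineFn):
  def create_accumulator(self):
    return 0
  def add_input(self, acc, value):
    return max(acc, value)
  def merge_accumulators(self, accumulators):
    return max(accumulators)
  def extract_output(self, acc):
    return acc

class TrackMaxTimestampDoFn(beam.DoFn):
  # Out-of-order arrival and window merging are both handled by the max.
  MAX_TS = CombiningValueStateSpec('max_ts', VarIntCoder(), MaxTimestampCombineFn())

  def process(self, element, max_ts=beam.DoFn.StateParam(MAX_TS)):
    key, ts = element  # keyed input, timestamps as ints
    max_ts.add(ts)
    yield key, max_ts.read()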

> However, I think as folks explore Beam there will always be little things that 
> require metadata, and so having access to something which gives us fine-grained 
> control (as Kenneth mentioned) is useful.

Likely. I guess in line with making easy things easy, I'd like to make
dangerous things hard(er). As Kenn says, we'll probably need some kind
of lower-level thing, especially if we introduce OnMerge.

> Cheers
>
> Reza
>
> On Sat, 27 Apr 2019 at 02:59, Kenneth Knowles  wrote:
>>
>> To be clear, the intent was always that ValueState would not be usable in 
>> merging pipelines. So no danger of clobbering, but also limited 
>> functionality. Is there a runner that accepts it and clobbers? The whole 
>> idea of the new DoFn is that it is easy to do the construction-time analysis 
>> and reject the invalid pipeline. It is actually runner-independent and I 
>> think already implemented in ParDo's validation, no?
>>
>> Kenn
>>
>> On Fri, Apr 26, 2019 at 10:14 AM Lukasz Cwik  wrote:
>>>
>>> I am in the camp where we should only support merging state (either 
>>> naturally via things like bags or via combiners). I believe that having the 
>>> wrapper that Brian suggests is useful for users. As for the @OnMerge 
>>> method, I believe combiners should have the ability to look at the window 
>>> information and we should treat @OnMerge as syntactic sugar over a combiner 
>>> if the combiner API is too cumbersome.
>>>
>>> I believe using combiners can also extend to side inputs and help us deal 
>>> with singleton and map like side inputs when multiple firings occur. I also 
>>> like treating everything like a combiner because it will give us a lot of 
>>> reuse of combiner implementations across all the places they could be used 
>>> and will be especially useful when we start exposing APIs related to 
>>> retractions on combiners.
>>>
>>> On Fri, Apr 26, 2019 at 9:43 AM Brian Hulette  wrote:
>>>>
>>>> Yeah the danger with out of order processing concerns me more than the 
>>>> merging as well. As a new Beam user, I immediately gravitated towards 
>>>> ValueState since it was easy to think about and I just assumed there 
>>>> wasn't anything to be concerned about. So it was shocking to learn that 
>>>> there is this dangerous edge-case.
>>>>
>>>> What if ValueState were just implemented as a wrapper of CombiningState 
>>>> with a LatestCombineFn and documented as such (and perhaps we encourage 
>>>> users to consider using a CombiningState explicitly if at all possible)?
>>>>
>>>> Brian
>>>>
>>>>
>>>>
>>>> On Fri, Apr 26, 2019 at 2:29 AM Robert Bradshaw  
>>>> wrote:
>>>>>
>>>>> On Fri, Apr 26, 2019 at 6:40 AM Kenneth Knowles  wrote:
>>>>> >

Re: [PROPOSAL] Prepare for LTS bugfix release 2.7.1

2019-04-26 Thread Robert Bradshaw
IIRC, there was some talk of making 2.12 the next LTS, but the
consensus is to decide on an LTS after having had some experience with
it, not at or before the release itself.


On Fri, Apr 26, 2019 at 3:04 PM Alexey Romanenko
 wrote:
>
> Thanks for working on this, Kenn.
>
> Perhaps I missed this, but has it already been discussed/decided what will be 
> the next LTS release?
>
> On 26 Apr 2019, at 08:02, Kenneth Knowles  wrote:
>
> Since it is all trivially reversible if there is some other feeling about 
> this thread, I have gone ahead and started the work:
>
>  - I made release-2.7.1 branch point to the same commit as release-2.7.0 so 
> there is something to target PRs
>  - I have opened the first PR, cherry-picking the set_version script and 
> using it to set the version on the branch to 2.7.1: 
> https://github.com/apache/beam/pull/8407 (found a bug in the new script right 
> away :-)
>
> Here is the release with list of issues: 
> https://issues.apache.org/jira/projects/BEAM/versions/12344458. So anyone can 
> grab a ticket and volunteer to open a backport PR to the release-2.7.1 branch.
>
> I don't have a strong opinion about how long we should support the 2.7.x 
> line. I am curious about different perspectives on user / vendor needs. I 
> have two very basic thoughts: (1) we surely need to keep it going until some 
> time after we have another LTS designated, to make sure there is a clear path 
> for anyone only using LTS releases and (2) if we decide to end support of 
> 2.7.x but then someone volunteers to backport and release, of course I would 
> not expect anyone to block them, so it has no maximum lifetime, but we just 
> need consensus on a minimum. And of course that consensus cannot force anyone 
> to do the work, but is just a resolution of the community.
>
> Kenn
>
> On Thu, Apr 25, 2019 at 10:29 PM Jean-Baptiste Onofré  
> wrote:
>>
>> +1 it sounds good to me.
>>
>> Thanks !
>>
>> Regards
>> JB
>>
>> On 26/04/2019 02:42, Kenneth Knowles wrote:
>> > Hi all,
>> >
>> > Since the release of 2.7.0 we have identified some serious bugs:
>> >
>> >  - There are 8 (non-dupe) issues* tagged with Fix Version 2.7.1
>> >  - 2 are rated "Blocker" (aka P0) but I think the others may be underrated
>> >  - If you know of a critical bug that is not on that list, please file
>> > an LTS backport ticket for it
>> >
>> > If a user is on an old version and wants to move to the LTS, there are
>> > some real blockers. I propose that we perform a 2.7.1 release starting now.
>> >
>> > I volunteer to manage the release. What do you think?
>> >
>> > Kenn
>> >
>> > *Some are "resolved" but this is not accurate as the LTS 2.7.1 branch is
>> > not created yet. I suggest filing a ticket to track just the LTS
>> > backport when you hit a bug that merits it.
>> >
>
>


Re: [DISCUSS] Reconciling ValueState in Java and Python (was: [docs] Python State & Timers)

2019-04-26 Thread Robert Bradshaw
On Fri, Apr 26, 2019 at 6:40 AM Kenneth Knowles  wrote:
>
> You could use a CombiningState with a CombineFn that returns the minimum for 
> this case.

We've also wanted to be able to set data when setting a timer that
would be returned when the timer fires. (It's in the FnAPI, but not
the SDKs yet.)

The metadata is an interesting use case; do you have some more specific
examples? It might boil down to not having a rich enough (single) state
type.

> But I've come to feel there is a mismatch. On the one hand, ParDo(DoFn) is a way to drop to a lower level and write logic that does not fit a 
> more general computational pattern, really taking fine control. On the other 
> hand, automatically merging state via CombiningState or BagState is more of a 
> no-knobs higher level of programming. To me there seems to be a bit of a 
> philosophical conflict.
>
> These days, I feel like an @OnMerge method would be more natural. If you are 
> using state and timers, you probably often want more direct control over how 
> state from windows gets merged. An of course we don't even have a design for 
> timers - you would need some kind of timestamp CombineFn but I think 
> setting/unsetting timers manually makes more sense. Especially considering 
> the trickiness around merging windows in the absence of retractions, you 
> really need this callback, so you can issue retractions manually for any 
> output your stateful DoFn emitted in windows that no longer exist.

I agree we'll probably need an @OnMerge. On the other hand, I like
being able to have good defaults. The high/low level thing is a
continuum (the indexing example falling towards the high end).

Actually, the merging questions bother me less than how easy it is to
accidentally clobber previous values. It looks so easy (like the
easiest state to use) but is actually the most dangerous. If one wants
this behavior, I would rather have an explicit AnyCombineFn or
LatestCombineFn, which makes you think about the semantics.
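
(For illustration, a minimal sketch of what such an explicit combiner could
look like; the name and semantics here are illustrative, not an existing SDK
combiner, and the arbitrary choice in merge_accumulators is exactly the
ordering hazard in question:)

import apache_beam as beam

class LatestCombineFn(beam.CombineFn):
  def create_accumulator(self):
    return None
  def add_input(self, accumulator, value):
    # Later writes explicitly clobber earlier ones.
    return value
  def merge_accumulators(self, accumulators):
    # There is no ordering across merged accumulators, so "latest" is
    # necessarily arbitrary here.
    non_empty = [a for a in accumulators if a is not None]
    return non_empty[-1] if non_empty else None
  def extract_output(self, accumulator):
    return accumulator

Wrapped in a CombiningValueStateSpec, something like this would give
ValueState-like behavior while making the clobbering semantics explicit.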

- Robert

> On Thu, Apr 25, 2019 at 5:49 PM Reza Rokni  wrote:
>>
>> +1 on the metadata use case.
>>
>> For performance reasons the Timer API does not support a read() operation, 
>> which for the  vast majority of use cases is not a required feature. In the 
>> small set of use cases where it is needed, for example when you need to set 
>> a Timer in EventTime based on the smallest timestamp seen in the elements 
>> within a DoFn, we can make use of a ValueState object to keep track of the 
>> value.
>>
>> On Fri, 26 Apr 2019 at 00:38, Reuven Lax  wrote:
>>>
>>> I see examples of people using ValueState that I think are not captured by 
>>> CombiningState. For example, one common one is users who set a timer and 
>>> then record the timestamp of that timer in a ValueState. In general when 
>>> you store state that is metadata about other state you store, then 
>>> ValueState will usually make more sense than CombiningState.
>>>
>>> On Thu, Apr 25, 2019 at 9:32 AM Brian Hulette  wrote:
>>>>
>>>> Currently the Python SDK does not make ValueState available to users. My 
>>>> initial inclination was to go ahead and implement it there to be 
>>>> consistent with Java, but Robert brings up a great point here that 
>>>> ValueState has an inherent race condition for out of order data, and a lot 
>>>> of it's use cases can actually be implemented with a CombiningState 
>>>> instead.
>>>>
>>>> It seems to me that at the very least we should discourage the use of 
>>>> ValueState by noting the danger in the documentation and preferring 
>>>> CombiningState in examples, and perhaps we should go further and deprecate 
>>>> it in Java and not implement it in python. Either way I think we should be 
>>>> consistent between Java and Python.
>>>>
>>>> I'm curious what people think about this, are there use cases that we 
>>>> really need to keep ValueState around for?
>>>>
>>>> Brian
>>>>
>>>> -- Forwarded message -
>>>> From: Robert Bradshaw 
>>>> Date: Thu, Apr 25, 2019, 08:31
>>>> Subject: Re: [docs] Python State & Timers
>>>> To: dev 
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Apr 25, 2019, 5:26 PM Maximilian Michels  wrote:
>>>>>
>>>>> Completely agree that CombiningState is nicer in this example. Users may
>>>>> still want to use ValueState when there is nothing to combine.
>>>>
>>>>
>>>> I've always had trouble coming up with any good examples of this.

Re: [VOTE] Release 2.12.0, release candidate #4

2019-04-26 Thread Robert Bradshaw
Thanks for all the hard work!

https://dist.apache.org/repos/dist/dev/beam/2.12.0/ seems empty; were
the artifacts already moved?

On Fri, Apr 26, 2019 at 10:31 AM Etienne Chauchot  wrote:
>
> Hi,
> Thanks for all your work and patience Andrew !
>
> PS: as a side note, there were 5 binding votes (I voted +1)
>
> Etienne
>
> On Thursday, April 25, 2019 at 11:16 -0700, Andrew Pilloud wrote:
>
> I reran the Nexmark tests, each runner passed. I compared the numbers
> on the direct runner to the dashboard and they are where they should
> be.
>
> With that, I'm happy to announce that we have unanimously approved this
> release.
>
> There are 8 approving votes, 4 of which are binding:
> * Jean-Baptiste Onofré
> * Lukasz Cwik
> * Maximilian Michels
> * Ahmet Altay
>
> There are no disapproving votes.
>
> Thanks everyone!
>
>


Re: [docs] Python State & Timers

2019-04-25 Thread Robert Bradshaw
On Thu, Apr 25, 2019, 5:26 PM Maximilian Michels  wrote:

> Completely agree that CombiningState is nicer in this example. Users may
> still want to use ValueState when there is nothing to combine.


I've always had trouble coming up with any good examples of this.

Also,
> users already know ValueState from the Java SDK.
>

Maybe we should deprecate that :)


On 25.04.19 17:12, Robert Bradshaw wrote:
> > On Thu, Apr 25, 2019 at 4:58 PM Maximilian Michels 
> wrote:
> >>
> >> I forgot to give an example, just to clarify for others:
> >>
> >>> What was the specific example that was less natural?
> >>
> >> Basically every time we use ListState to express ValueState, e.g.
> >>
> >> next_index, = list(state.read()) or [0]
> >>
> >> Taken from:
> >>
> https://github.com/apache/beam/pull/8363/files#diff-ba1a2aed98079ccce869cd660ca9d97dR301
> >
> > Yes, ListState is much less natural here. I think generally
> > CombiningValue is often a better replacement. E.g. the Java example
> > reads
> >
> >
> > public void processElement(
> >    ProcessContext context, @StateId("index") ValueState<Integer> index) {
> >  int current = firstNonNull(index.read(), 0);
> >  context.output(KV.of(current, context.element()));
> >  index.write(current+1);
> > }
> >
> >
> > which is replaced with bag state
> >
> >
> > def process(self, element, state=DoFn.StateParam(INDEX_STATE)):
> >  next_index, = list(state.read()) or [0]
> >  yield (element, next_index)
> >  state.clear()
> >  state.add(next_index + 1)
> >
> >
> > whereas CombiningState would be more natural (than ListState, and
> > arguably than even ValueState), giving
> >
> >
> > def process(self, element, index=DoFn.StateParam(INDEX_STATE)):
> >  yield element, index.read()
> >  index.add(1)
> >
> >
> >
> >
> >>
> >> -Max
> >>
> >> On 25.04.19 16:40, Robert Bradshaw wrote:
> >>> https://github.com/apache/beam/pull/8402
> >>>
> >>> On Thu, Apr 25, 2019 at 4:26 PM Robert Bradshaw 
> wrote:
> >>>>
> >>>> Oh, this is for the indexing example.
> >>>>
> >>>> I actually think using CombiningState is cleaner than ValueState.
> >>>>
> >>>>
> https://github.com/apache/beam/blob/release-2.12.0/sdks/python/apache_beam/runners/portability/fn_api_runner_test.py#L262
> >>>>
> >>>> (The fact that one must specify the accumulator coder is, however,
> >>>> unfortunate. We should probably infer that if we can.)
> >>>>
> >>>> On Thu, Apr 25, 2019 at 4:19 PM Robert Bradshaw 
> wrote:
> >>>>>
> >>>>> The desire was to avoid the implicit disallowed combination wart in
> >>>>> Python (until we could make sense of it), and also ValueState could
> be
> >>>>> surprising with respect to older values overwriting newer ones. What
> >>>>> was the specific example that was less natural?
> >>>>>
> >>>>> On Thu, Apr 25, 2019 at 3:01 PM Maximilian Michels 
> wrote:
> >>>>>>
> >>>>>> @Pablo: Thanks for following up with the PR! :)
> >>>>>>
> >>>>>> @Brian: I was wondering about this as well. It makes the Python
> state
> >>>>>> code a bit unnatural. I'd suggest to add a ValueState wrapper around
> >>>>>> ListState/CombiningState.
> >>>>>>
> >>>>>> @Robert: Like Reuven pointed out, we can disallow ValueState for
> merging
> >>>>>> windows with state.
> >>>>>>
> >>>>>> @Reza: Great. Let's make sure it has Python examples out of the box.
> >>>>>> Either Pablo or me could help there.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Max
> >>>>>>
> >>>>>> On 25.04.19 04:14, Reza Ardeshir Rokni wrote:
> >>>>>>> Pablo, Kenneth and I have a new blog ready for publication which
> covers
> >>>>>>> how to create a "looping timer" it allows for default values to be
> >>>>>>> created in a window when no incoming elements exists. We just need
> to
> >>>>>>> clear a few bits before publication, but would

Re: [docs] Python State & Timers

2019-04-25 Thread Robert Bradshaw
On Thu, Apr 25, 2019 at 4:58 PM Maximilian Michels  wrote:
>
> I forgot to give an example, just to clarify for others:
>
> > What was the specific example that was less natural?
>
> Basically every time we use ListState to express ValueState, e.g.
>
>next_index, = list(state.read()) or [0]
>
> Taken from:
> https://github.com/apache/beam/pull/8363/files#diff-ba1a2aed98079ccce869cd660ca9d97dR301

Yes, ListState is much less natural here. I think generally
CombiningValue is often a better replacement. E.g. the Java example
reads


public void processElement(
   ProcessContext context, @StateId("index") ValueState<Integer> index) {
int current = firstNonNull(index.read(), 0);
context.output(KV.of(current, context.element()));
index.write(current+1);
}


which is replaced with bag state


def process(self, element, state=DoFn.StateParam(INDEX_STATE)):
next_index, = list(state.read()) or [0]
yield (element, next_index)
state.clear()
state.add(next_index + 1)


whereas CombiningState would be more natural (than ListState, and
arguably than even ValueState), giving


def process(self, element, index=DoFn.StateParam(INDEX_STATE)):
yield element, index.read()
index.add(1)
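
(For completeness, one plausible declaration of INDEX_STATE for the version
above; the explicit accumulator coder and the choice of CountCombineFn are
illustrative:)

from apache_beam.coders import VarIntCoder
from apache_beam.transforms.combiners import CountCombineFn
from apache_beam.transforms.userstate import CombiningValueStateSpec

INDEX_STATE = CombiningValueStateSpec('index', VarIntCoder(), CountCombineFn())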




>
> -Max
>
> On 25.04.19 16:40, Robert Bradshaw wrote:
> > https://github.com/apache/beam/pull/8402
> >
> > On Thu, Apr 25, 2019 at 4:26 PM Robert Bradshaw  wrote:
> >>
> >> Oh, this is for the indexing example.
> >>
> >> I actually think using CombiningState is cleaner than ValueState.
> >>
> >> https://github.com/apache/beam/blob/release-2.12.0/sdks/python/apache_beam/runners/portability/fn_api_runner_test.py#L262
> >>
> >> (The fact that one must specify the accumulator coder is, however,
> >> unfortunate. We should probably infer that if we can.)
> >>
> >> On Thu, Apr 25, 2019 at 4:19 PM Robert Bradshaw  
> >> wrote:
> >>>
> >>> The desire was to avoid the implicit disallowed combination wart in
> >>> Python (until we could make sense of it), and also ValueState could be
> >>> surprising with respect to older values overwriting newer ones. What
> >>> was the specific example that was less natural?
> >>>
> >>> On Thu, Apr 25, 2019 at 3:01 PM Maximilian Michels  
> >>> wrote:
> >>>>
> >>>> @Pablo: Thanks for following up with the PR! :)
> >>>>
> >>>> @Brian: I was wondering about this as well. It makes the Python state
> >>>> code a bit unnatural. I'd suggest to add a ValueState wrapper around
> >>>> ListState/CombiningState.
> >>>>
> >>>> @Robert: Like Reuven pointed out, we can disallow ValueState for merging
> >>>> windows with state.
> >>>>
> >>>> @Reza: Great. Let's make sure it has Python examples out of the box.
> >>>> Either Pablo or me could help there.
> >>>>
> >>>> Thanks,
> >>>> Max
> >>>>
> >>>> On 25.04.19 04:14, Reza Ardeshir Rokni wrote:
> >>>>> Pablo, Kenneth and I have a new blog ready for publication which covers
> >>>>> how to create a "looping timer" it allows for default values to be
> >>>>> created in a window when no incoming elements exists. We just need to
> >>>>> clear a few bits before publication, but would be great to have that
> >>>>> also include a python example, I wrote it in java...
> >>>>>
> >>>>> Cheers
> >>>>>
> >>>>> Reza
> >>>>>
> >>>>> On Thu, 25 Apr 2019 at 04:34, Reuven Lax  >>>>> <mailto:re...@google.com>> wrote:
> >>>>>
> >>>>>  Well state is still not implemented for merging windows even for
> >>>>>  Java (though I believe the idea was to disallow ValueState there).
> >>>>>
> >>>>>  On Wed, Apr 24, 2019 at 1:11 PM Robert Bradshaw 
> >>>>>  >>>>>  <mailto:rober...@google.com>> wrote:
> >>>>>
> >>>>>  It was unclear what the semantics were for ValueState for 
> >>>>> merging
> >>>>>  windows. (It's also a bit weird as it's inherently a race 
> >>>>> condition
> >>>>>  wrt element ordering, unlike Bag and CombineState, though you 
> >>>>> can
> >>>>>  

Re: [docs] Python State & Timers

2019-04-25 Thread Robert Bradshaw
https://github.com/apache/beam/pull/8402

On Thu, Apr 25, 2019 at 4:26 PM Robert Bradshaw  wrote:
>
> Oh, this is for the indexing example.
>
> I actually think using CombiningState is cleaner than ValueState.
>
> https://github.com/apache/beam/blob/release-2.12.0/sdks/python/apache_beam/runners/portability/fn_api_runner_test.py#L262
>
> (The fact that one must specify the accumulator coder is, however,
> unfortunate. We should probably infer that if we can.)
>
> On Thu, Apr 25, 2019 at 4:19 PM Robert Bradshaw  wrote:
> >
> > The desire was to avoid the implicit disallowed combination wart in
> > Python (until we could make sense of it), and also ValueState could be
> > surprising with respect to older values overwriting newer ones. What
> > was the specific example that was less natural?
> >
> > On Thu, Apr 25, 2019 at 3:01 PM Maximilian Michels  wrote:
> > >
> > > @Pablo: Thanks for following up with the PR! :)
> > >
> > > @Brian: I was wondering about this as well. It makes the Python state
> > > code a bit unnatural. I'd suggest to add a ValueState wrapper around
> > > ListState/CombiningState.
> > >
> > > @Robert: Like Reuven pointed out, we can disallow ValueState for merging
> > > windows with state.
> > >
> > > @Reza: Great. Let's make sure it has Python examples out of the box.
> > > Either Pablo or me could help there.
> > >
> > > Thanks,
> > > Max
> > >
> > > On 25.04.19 04:14, Reza Ardeshir Rokni wrote:
> > > > Pablo, Kenneth and I have a new blog ready for publication which covers
> > > > how to create a "looping timer" it allows for default values to be
> > > > created in a window when no incoming elements exists. We just need to
> > > > clear a few bits before publication, but would be great to have that
> > > > also include a python example, I wrote it in java...
> > > >
> > > > Cheers
> > > >
> > > > Reza
> > > >
> > > > On Thu, 25 Apr 2019 at 04:34, Reuven Lax wrote:
> > > >
> > > > Well state is still not implemented for merging windows even for
> > > > Java (though I believe the idea was to disallow ValueState there).
> > > >
> > > > On Wed, Apr 24, 2019 at 1:11 PM Robert Bradshaw wrote:
> > > >
> > > > It was unclear what the semantics were for ValueState for 
> > > > merging
> > > > windows. (It's also a bit weird as it's inherently a race 
> > > > condition
> > > > wrt element ordering, unlike Bag and CombineState, though you 
> > > > can
> > > > always implement it as a CombineState that always returns the 
> > > > latest
> > > > value which is a bit more explicit about the dangers here.)
> > > >
> > > > On Wed, Apr 24, 2019 at 10:08 PM Brian Hulette wrote:
> > > >  >
> > > >  > That's a great idea! I thought about this too after those
> > > > posts came up on the list recently. I started to look into it,
> > > > but I noticed that there's actually no implementation of
> > > > ValueState in userstate. Is there a reason for that? I started
> > > > to work on a patch to add it but I was just curious if there was
> > > > some reason it was omitted that I should be aware of.
> > > >  >
> > > >  > We could certainly replicate the example without ValueState
> > > > by using BagState and clearing it before each write, but it
> > > > would be nice if we could draw a direct parallel.
> > > >  >
> > > >  > Brian
> > > >  >
> > > >  > On Fri, Apr 12, 2019 at 7:05 AM Maximilian Michels wrote:
> > > >  >>
> > > >  >> > It would probably be pretty easy to add the corresponding
> > > > code snippets to the docs as well.
> > > >  >>
> > > >  >> It's probably a bit more work because there is no section
> > > > dedicated to
> > > >

Re: [docs] Python State & Timers

2019-04-25 Thread Robert Bradshaw
Oh, this is for the indexing example.

I actually think using CombiningState is cleaner than ValueState.

https://github.com/apache/beam/blob/release-2.12.0/sdks/python/apache_beam/runners/portability/fn_api_runner_test.py#L262

(The fact that one must specify the accumulator coder is, however,
unfortunate. We should probably infer that if we can.)

On Thu, Apr 25, 2019 at 4:19 PM Robert Bradshaw  wrote:
>
> The desire was to avoid the implicit disallowed combination wart in
> Python (until we could make sense of it), and also ValueState could be
> surprising with respect to older values overwriting newer ones. What
> was the specific example that was less natural?
>
> On Thu, Apr 25, 2019 at 3:01 PM Maximilian Michels  wrote:
> >
> > @Pablo: Thanks for following up with the PR! :)
> >
> > @Brian: I was wondering about this as well. It makes the Python state
> > code a bit unnatural. I'd suggest to add a ValueState wrapper around
> > ListState/CombiningState.
> >
> > @Robert: Like Reuven pointed out, we can disallow ValueState for merging
> > windows with state.
> >
> > @Reza: Great. Let's make sure it has Python examples out of the box.
> > Either Pablo or me could help there.
> >
> > Thanks,
> > Max
> >
> > On 25.04.19 04:14, Reza Ardeshir Rokni wrote:
> > > Pablo, Kenneth and I have a new blog ready for publication which covers
> > > how to create a "looping timer" it allows for default values to be
> > > created in a window when no incoming elements exists. We just need to
> > > clear a few bits before publication, but would be great to have that
> > > also include a python example, I wrote it in java...
> > >
> > > Cheers
> > >
> > > Reza
> > >
> > > On Thu, 25 Apr 2019 at 04:34, Reuven Lax wrote:
> > >
> > > Well state is still not implemented for merging windows even for
> > > Java (though I believe the idea was to disallow ValueState there).
> > >
> > > On Wed, Apr 24, 2019 at 1:11 PM Robert Bradshaw wrote:
> > >
> > > It was unclear what the semantics were for ValueState for merging
> > > windows. (It's also a bit weird as it's inherently a race 
> > > condition
> > > wrt element ordering, unlike Bag and CombineState, though you can
> > > always implement it as a CombineState that always returns the 
> > > latest
> > > value which is a bit more explicit about the dangers here.)
> > >
> > > On Wed, Apr 24, 2019 at 10:08 PM Brian Hulette wrote:
> > >  >
> > >  > That's a great idea! I thought about this too after those
> > > posts came up on the list recently. I started to look into it,
> > > but I noticed that there's actually no implementation of
> > > ValueState in userstate. Is there a reason for that? I started
> > > to work on a patch to add it but I was just curious if there was
> > > some reason it was omitted that I should be aware of.
> > >  >
> > >  > We could certainly replicate the example without ValueState
> > > by using BagState and clearing it before each write, but it
> > > would be nice if we could draw a direct parallel.
> > >  >
> > >  > Brian
> > >  >
> > >  >> On Fri, Apr 12, 2019 at 7:05 AM Maximilian Michels wrote:
> > >  >>
> > >  >> > It would probably be pretty easy to add the corresponding
> > > code snippets to the docs as well.
> > >  >>
> > >  >> It's probably a bit more work because there is no section
> > > dedicated to
> > >  >> state/timer yet in the documentation. Tracked here:
> > >  >> https://jira.apache.org/jira/browse/BEAM-2472
> > >  >>
> > >  >> > I've been going over this topic a bit. I'll add the
> > > snippets next week, if that's fine by y'all.
> > >  >>
> > >  >> That would be great. The blog posts are a great way to get
> > > started with
> > >  >> state/timers.

Re: [docs] Python State & Timers

2019-04-25 Thread Robert Bradshaw
The desire was to avoid the implicit disallowed combination wart in
Python (until we could make sense of it), and also ValueState could be
surprising with respect to older values overwriting newer ones. What
was the specific example that was less natural?

On Thu, Apr 25, 2019 at 3:01 PM Maximilian Michels  wrote:
>
> @Pablo: Thanks for following up with the PR! :)
>
> @Brian: I was wondering about this as well. It makes the Python state
> code a bit unnatural. I'd suggest to add a ValueState wrapper around
> ListState/CombiningState.
>
> @Robert: Like Reuven pointed out, we can disallow ValueState for merging
> windows with state.
>
> @Reza: Great. Let's make sure it has Python examples out of the box.
> Either Pablo or me could help there.
>
> Thanks,
> Max
>
> On 25.04.19 04:14, Reza Ardeshir Rokni wrote:
> > Pablo, Kenneth and I have a new blog ready for publication which covers
> > how to create a "looping timer" it allows for default values to be
> > created in a window when no incoming elements exists. We just need to
> > clear a few bits before publication, but would be great to have that
> > also include a python example, I wrote it in java...
> >
> > Cheers
> >
> > Reza
> >
> > On Thu, 25 Apr 2019 at 04:34, Reuven Lax wrote:
> >
> > Well state is still not implemented for merging windows even for
> > Java (though I believe the idea was to disallow ValueState there).
> >
> > On Wed, Apr 24, 2019 at 1:11 PM Robert Bradshaw wrote:
> >
> > It was unclear what the semantics were for ValueState for merging
> > windows. (It's also a bit weird as it's inherently a race condition
> > wrt element ordering, unlike Bag and CombineState, though you can
> > always implement it as a CombineState that always returns the latest
> > value which is a bit more explicit about the dangers here.)
> >
> > On Wed, Apr 24, 2019 at 10:08 PM Brian Hulette wrote:
> >  >
> >  > That's a great idea! I thought about this too after those
> > posts came up on the list recently. I started to look into it,
> > but I noticed that there's actually no implementation of
> > ValueState in userstate. Is there a reason for that? I started
> > to work on a patch to add it but I was just curious if there was
> > some reason it was omitted that I should be aware of.
> >  >
> >  > We could certainly replicate the example without ValueState
> > by using BagState and clearing it before each write, but it
> > would be nice if we could draw a direct parallel.
> >  >
> >  > Brian
> >  >
> >  >> On Fri, Apr 12, 2019 at 7:05 AM Maximilian Michels wrote:
> >  >>
> >  >> > It would probably be pretty easy to add the corresponding
> > code snippets to the docs as well.
> >  >>
> >  >> It's probably a bit more work because there is no section
> > dedicated to
> >  >> state/timer yet in the documentation. Tracked here:
> >  >> https://jira.apache.org/jira/browse/BEAM-2472
> >  >>
> >  >> > I've been going over this topic a bit. I'll add the
> > snippets next week, if that's fine by y'all.
> >  >>
> >  >> That would be great. The blog posts are a great way to get
> > started with
> >  >> state/timers.
> >  >>
> >  >> Thanks,
> >  >> Max
> >  >>
> >  >> On 11.04.19 20:21, Pablo Estrada wrote:
> >  >> > I've been going over this topic a bit. I'll add the
> > snippets next week,
> >  >> > if that's fine by y'all.
> >  >> > Best
> >  >> > -P.
> >  >> >
>  >> > On Thu, Apr 11, 2019 at 5:27 AM Robert Bradshaw wrote:
> >  >> >
> >  >> > That's a great idea

Re: [Discuss] Publishing pre-release artifacts to repositories

2019-04-25 Thread Robert Bradshaw
Don't we push java artifacts to maven repositories as part of the RC
process? And completely unvetted snapshots? (Or is this OK because
they are special opt-in apache-only ones?)

I am generally in favor of the idea, but would like to avoid increased
toil on the release manager.

One potential hitch I see is that current release process updates the
versions to x.y.z (no RC or other pre-release indicator in the version
number) whereas pypi (and other systems) typically expect distinct
(recognizable) version numbers for each attempt, and only the actual
final result has the actual final release version.

On Thu, Apr 25, 2019 at 6:38 AM Ahmet Altay  wrote:
>
> I do not know the answer. I believe this will be similar to sharing the RC 
> artifacts for validation purposes and would not be a formal release by 
> itself. But I am not an expert and I hope others will share their opinions.
>
> I quickly searched pypi for apache projects and found at least airflow [1] 
> and libcloud [2] are publishing rc artifacts to pypi. We can reach out to 
> those communities and learn about their processes.
>
> Ahmet
>
> [1] https://pypi.org/project/apache-airflow/#history
> [2] https://pypi.org/project/apache-libcloud/#history
>
> On Wed, Apr 24, 2019 at 6:15 PM Michael Luckey  wrote:
>>
>> Hi,
>>
>> wouldn't that be in conflict with Apache release policy [1] ?
>>
>> [1] http://www.apache.org/legal/release-policy.html
>>
>> On Thu, Apr 25, 2019 at 1:35 AM Alan Myrvold  wrote:
>>>
>>> Great idea. I like the RC candidates to follow as much as the release 
>>> artifact process as possible.
>>>
>>> On Wed, Apr 24, 2019 at 3:27 PM Ahmet Altay  wrote:

 To clarify my proposal, I am proposing publishing to the production pypi 
 repository with an rc tag in the version. And in turn allow users to 
 depend on beam's rc version + all the other regular dependencies users 
 would have directly from pypi.

 Publishing to test pypi repo would also be helpful if test pypi repo also 
 mirrors other packages that exist in the production pypi repository.

 On Wed, Apr 24, 2019 at 3:12 PM Pablo Estrada  wrote:
>
> I think this is a great idea. A way of doing it for python would be by 
> using the test repository for PyPi[1], and that way we would not have to 
> do an official PyPi release, but still would be able to install it with 
> pip (by passing an extra flag), and test.
>
> In fact, there are some Beam artifacts already in there[2]. At some point 
> I looked into this, but couldn't figure out who has access/the password 
> for it.


 I also don't know who owns beam package in test pypi repo. Does anybody 
 know?

>
>
> In short: +1, and I would suggest using the test PyPi repo to avoid 
> publishing to the main PyPi repo.
> Best
> -P.
>
> [1] https://test.pypi.org/
> [2] https://test.pypi.org/project/apache-beam/
>
> On Wed, Apr 24, 2019 at 3:04 PM Ahmet Altay  wrote:
>>
>> Hi all,
>>
>> What do you think about the idea of publishing pre-release artifacts as 
>> part of the RC emails?
>>
>> For Python this would translate into publishing the same artifacts from 
>> RC email with a version like "2.X.0rcY" to pypi. I do not know, but I am 
>> guessing we can do a similar thing with Maven central for Java artifacts 
>> as well.
>>
>> Advantages would be:
>> - Allow end users to validate RCs for their own purposes using the same 
>> exact process they will normally use.
>>  - Enable early adopters to start using RC releases early on in the 
>> release cycle if that is what they would like to do. This will in turn 
>> reduce time pressure on some releases, especially for cases where someone 
>> needs a release to be finalized for an upcoming event.
>>
>> There will also be disadvantages, some I could think of:
>> - Users could request support for RC artifacts. Hopefully this comes in the 
>> form of feedback for us to improve the release, but it could also take the 
>> form of folks using RC artifacts in production for a long time.
>> - It will add toil to the current release process, there will be one 
>> more step for each RC. I think for python this will be a small step but 
>> nevertheless it will be additional work.
>>
>> For an example of this, you can take a look at tensorflow releases. For 
>> 1.13 there were 3 pre-releases [1].
>>
>> Ahmet
>>
>> [1] https://pypi.org/project/tensorflow/#history


Re: [DISCUSS] FLIP-38 Support python language in flink TableAPI

2019-04-25 Thread Robert Bradshaw
On Thu, Apr 25, 2019 at 6:04 AM jincheng sun  wrote:
>
> Hi Robert,
>
> In addition to the questions described by Dian, I also want to know what 
> difficult problems Py4j's solution will encounter in adding UDF support, which 
> you mentioned as follows:
>
>> Using something like Py4j is an easy way to get up and running, especially 
>> for a very faithful API, but the instant one wants to add UDFs one hits a 
>> cliff of sorts (which is surmountable, but likely a lot harder than having 
>> gone the above approach).
>
> I would appreciate it if you could share more specific cases.

The orchestration involved in supporting UDFs is non-trivial. I think
it is true that a lot of effort can be saved by re-using significant
portions of the design, concepts, and even implementation we already
have for Beam, but still re-building it out of the individual pieces
(likely necessitated by Py4j having hooked in at a lower level than the
DAG) is likely harder (initially and ongoing) than simply
leveraging the complete, working package.

> On Thu, Apr 25, 2019 at 11:53 AM Dian Fu  wrote:
>>
>> Thanks everyone for the discussion here.
>>
>> Regarding the Java/Scala UDFs and the built-in UDFs executing in the 
>> current Flink way (directly in the JVM, not via RPC), I share the same thoughts 
>> as Max and Robert and I think it will not be a big problem. From the 
>> design doc, I guess the main reason to take the Py4J way instead of the DAG 
>> way at present is that the DAG has some limitations in some scenarios, such as 
>> interactive programming, which may be a strong requirement for data scientists.
>>
>> > In addition (and I'll admit this is rather subjective) it seems to me one 
>> > of the primary values of a table-like API in a given language (vs. just 
>> > using (say) plain old SQL itself via a console) is the ability to embed it 
>> > in a larger pipeline, or at least drop in operations that are not (as) 
>> > naturally expressed in the "table way," including existing libraries. In 
>> > other words, a full SDK. The Py4j wrapping doesn't lend itself to such 
>> > integration nearly as easily.
>>
>>
>> Hi Robert, regarding "a larger pipeline", do you mean translating a 
>> table-like API job from/to another kind of API job, or embedding third-party 
>> libraries into a table-like API job via UDF? Could you kindly explain why 
>> this would be a problem for Py4J and would not be a problem if expressing the 
>> job with a DAG?
>>
>> Thanks,
>> Dian
>>
>>
>> > On Apr 25, 2019, at 12:16 AM, Robert Bradshaw  wrote:
>> >
>> > Thanks for the meeting summary, Stephan. Sounds like you covered a lot of 
>> > ground. Some more comments below, adding onto what Max has said.
>> >
>> > On Wed, Apr 24, 2019 at 3:20 PM Maximilian Michels  wrote:
>> > >
>> > > Hi Stephan,
>> > >
>> > > This is exciting! Thanks for sharing. The inter-process communication
>> > > code looks like the most natural choice as a common ground. To go
>> > > further, there are indeed some challenges to solve.
>> >
>> > It certainly does make sense to share this work, though it does to me seem 
>> > like a rather low level to integrate at.
>> >
>> > > > => Biggest question is whether the language-independent DAG is 
>> > > > expressive enough to capture all the expressions that we want to map 
>> > > > directly to Table API expressions. Currently much is hidden in opaque 
>> > > > UDFs. Kenn mentioned the structure should be flexible enough to 
>> > > > capture more expressions transparently.
>> > >
>> > > Just to add some context how this could be done, there is the concept of
>> > > a FunctionSpec which is part of a transform in the DAG. FunctionSpec
>> > > contains a URN and with a payload. FunctionSpec can be either (1)
>> > > translated by the Runner directly, e.g. map to table API concepts or (2)
>> > > run a user-defined function with an Environment. It could be feasible
>> > > for Flink to choose the direct path, whereas Beam Runners would leverage
>> > > the more generic approach using UDFs. Granted, compatibility across
>> > > Flink and Beam would only work if both of the translation paths yielded
>> > > the same semantics.
>> >
>> > To elaborate a bit on this, Beam DAGs are built up by applying Transforms 
>> > (basically operations) to PCollections (the equivalent of 
>> > d

Re: [DISCUSS] FLIP-38 Support python language in flink TableAPI

2019-04-25 Thread Robert Bradshaw
On Thu, Apr 25, 2019 at 5:59 AM Dian Fu  wrote:
>
> Thanks everyone for the discussion here.
>
> Regarding the Java/Scala UDFs and the built-in UDFs executing in the 
> current Flink way (directly in the JVM, not via RPC), I share the same thoughts 
> with Max and Robert and I think it will not be a big problem. From the design 
> doc, I guess the main reason to take the Py4J way instead of the DAG way at 
> present is that the DAG has some limitations in some scenarios, such as 
> interactive programming, which may be a strong requirement for data scientists.

I definitely agree that interactive is a strong requirement for the
data scientist (and others). I don't think this is incompatible with
the DAG model, and something I want to see more of. (For one
exploration, see BeamPython's (still WIP) InteractiveRunner). There
are lots of interesting challenges here (e.g. sampling, partial
results, optimal caching of results vs. re-execution, especially in
the face of fusion) that would be worth working out together.

> In addition (and I'll admit this is rather subjective) it seems to me one of 
> the primary values of a table-like API in a given language (vs. just using 
> (say) plain old SQL itself via a console) is the ability to embed it in a 
> larger pipeline, or at least drop in operations that are not (as) naturally 
> expressed in the "table way," including existing libraries. In other words, a 
> full SDK. The Py4j wrapping doesn't lend itself to such integration nearly 
> as easily.
>
> Hi Robert, regarding "a larger pipeline", do you mean translating a 
> table-like API job from/to another kind of API job, or embedding third-party 
> libraries into a table-like API job via UDF? Could you kindly explain why 
> this would be a problem for Py4J and would not be a problem if expressing the 
> job with a DAG?

I'm talking about anything one would want to do after
tableEnv.toDataSet() or before tableEnv.registerTable(...). Unless you
plan on also wrapping the DataSet/DataStream APIs too, which is a much
taller task. Let alone wrapping all the libraries one might want to
use that are built on these APIs.

If this is instead integrated at a higher level, you could swap back
and forth between the new Tables API and the existing Python SDK
(including libraries such as TFX, and cross-language capabilities)
almost for free.

> On Apr 25, 2019, at 12:16 AM, Robert Bradshaw  wrote:
>
> Thanks for the meeting summary, Stephan. Sounds like you covered a lot of 
> ground. Some more comments below, adding onto what Max has said.
>
> On Wed, Apr 24, 2019 at 3:20 PM Maximilian Michels  wrote:
> >
> > Hi Stephan,
> >
> > This is exciting! Thanks for sharing. The inter-process communication
> > code looks like the most natural choice as a common ground. To go
> > further, there are indeed some challenges to solve.
>
> It certainly does make sense to share this work, though it does to me seem 
> like a rather low level to integrate at.
>
> > > => Biggest question is whether the language-independent DAG is expressive 
> > > enough to capture all the expressions that we want to map directly to 
> > > Table API expressions. Currently much is hidden in opaque UDFs. Kenn 
> > > mentioned the structure should be flexible enough to capture more 
> > > expressions transparently.
> >
> > Just to add some context how this could be done, there is the concept of
> > a FunctionSpec which is part of a transform in the DAG. FunctionSpec
> > contains a URN and with a payload. FunctionSpec can be either (1)
> > translated by the Runner directly, e.g. map to table API concepts or (2)
> > run a user-defined function with an Environment. It could be feasible
> > for Flink to choose the direct path, whereas Beam Runners would leverage
> > the more generic approach using UDFs. Granted, compatibility across
> > Flink and Beam would only work if both of the translation paths yielded
> > the same semantics.
>
> To elaborate a bit on this, Beam DAGs are built up by applying Transforms 
> (basically operations) to PCollections (the equivalent of dataset/datastream), 
> but the key point here is that these transforms are often composite 
> operations that expand out into smaller subtransforms. This expansion happens 
> during pipeline construction, but with the recent work on cross language 
> pipelines can happen out of process. This is one point of extendability. 
> Secondly, and importantly, this composite structure is preserved in the DAG, 
> and so a runner is free to ignore the provided expansion and supply its own 
> (so long as semantically it produces exactly the same output). These 
> composite operations can be identified by arbitrary URNs + 

Re: [docs] Python State & Timers

2019-04-24 Thread Robert Bradshaw
It was unclear what the semantics were for ValueState for merging
windows. (It's also a bit weird as it's inherently a race condition
wrt element ordering, unlike Bag and CombineState, though you can
always implement it as a CombineState that always returns the latest
value which is a bit more explicit about the dangers here.)
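
To make the parallel concrete, here is a minimal sketch (against the
Python userstate API, with made-up names) of emulating a value-like
state with BagState by clearing it before each write, roughly what
Brian describes below:

    import apache_beam as beam
    from apache_beam.coders import VarIntCoder
    from apache_beam.transforms.userstate import BagStateSpec

    class LastSeenDoFn(beam.DoFn):
      # A bag used as a single-value cell: clear() then add() on each element.
      LAST_SEEN = BagStateSpec('last_seen', VarIntCoder())

      def process(self, element, last_seen=beam.DoFn.StateParam(LAST_SEEN)):
        key, value = element  # stateful DoFns require keyed input
        previous = next(iter(last_seen.read()), None)
        last_seen.clear()
        last_seen.add(value)
        yield key, (previous, value)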

On Wed, Apr 24, 2019 at 10:08 PM Brian Hulette  wrote:
>
> That's a great idea! I thought about this too after those posts came up on 
> the list recently. I started to look into it, but I noticed that there's 
> actually no implementation of ValueState in userstate. Is there a reason for 
> that? I started to work on a patch to add it but I was just curious if there 
> was some reason it was omitted that I should be aware of.
>
> We could certainly replicate the example without ValueState by using BagState 
> and clearing it before each write, but it would be nice if we could draw a 
> direct parallel.
>
> Brian
>
> On Fri, Apr 12, 2019 at 7:05 AM Maximilian Michels  wrote:
>>
>> > It would probably be pretty easy to add the corresponding code snippets to 
>> > the docs as well.
>>
>> It's probably a bit more work because there is no section dedicated to
>> state/timer yet in the documentation. Tracked here:
>> https://jira.apache.org/jira/browse/BEAM-2472
>>
>> > I've been going over this topic a bit. I'll add the snippets next week, if 
>> > that's fine by y'all.
>>
>> That would be great. The blog posts are a great way to get started with
>> state/timers.
>>
>> Thanks,
>> Max
>>
>> On 11.04.19 20:21, Pablo Estrada wrote:
>> > I've been going over this topic a bit. I'll add the snippets next week,
>> > if that's fine by y'all.
>> > Best
>> > -P.
>> >
>> > On Thu, Apr 11, 2019 at 5:27 AM Robert Bradshaw  wrote:
>> >
>> > That's a great idea! It would probably be pretty easy to add the
>> > corresponding code snippets to the docs as well.
>> >
>> > On Thu, Apr 11, 2019 at 2:00 PM Maximilian Michels  wrote:
>> >  >
>> >  > Hi everyone,
>> >  >
>> >  > The Python SDK still lacks documentation on state and timers.
>> >  >
>> >  > As a first step, what do you think about updating these two blog
>> > posts
>> >  > with the corresponding Python code?
>> >  >
>> >  > https://beam.apache.org/blog/2017/02/13/stateful-processing.html
>> >  > https://beam.apache.org/blog/2017/08/28/timely-processing.html
>> >  >
>> >  > Thanks,
>> >  > Max
>> >


Re: JDK 11 compatibility testing

2019-04-24 Thread Robert Bradshaw
It seems to me that we can assume that if Beam is running in a Java 11
runtime, any Java 11 features used in the body of a DoFn should just work.
The interesting part will be whether there is anything on the boundary that
changes (e.g. are there changes to type inference rules that make them
stricter and/or smarter, or places where we reach into implementation
details like bytecode generation (with the full permutation of signature
options we support)).

Tests of this, of course, are critical.

On Wed, Apr 24, 2019 at 1:39 PM MichaƂ Walenia 
wrote:

> Hi all,
>
> I’m currently working on enhancing a Beam test suite to check
> compatibility with Java 11 UDFs. As JDK11 introduces several useful
> features, I wanted to turn to the Devlist to gather your opinions on which
> features should be included in the DoFn.
>
> To give you an idea of how the test will be designed, I’m planning to
> create a test pipeline with a custom DoFn which will use JDK11-specific
> features. This test will be compiled with JDK11 and run using a binary of
> Beam built with JDK8 in order to simulate a situation in which the user
> downloads Beam from the Maven repository and uses it in a project built
> with Java 11.
>
> The features I believe are worth checking are:
>
>    - String manipulation methods:
>       - repeat
>       - stripTrailing, stripLeading and strip()
>       - isBlank
>       - lines()
>    - RegEx asMatchPredicate
>    - Local parameter type inference in lambda expressions
>    - Optional::isEmpty
>    - Collection::toArray
>    - Path API change - Path::of
>
> I don’t think that checking other new Java 11 features, such as flight
> recording or the new HTTP client, which we probably won’t use in Beam, is
> justified, but I’m open to suggestions and discussion.
>
> Which of those new features should be included in the DoFn applied in the
> test?
>
> I will be grateful for any input.
>
> Have a good day
>
> Michal
>
>
> --
>
> MichaƂ Walenia
> Polidea  | Software Engineer
>
> M: +48 791 432 002 <+48791432002>
> E: michal.wale...@polidea.com
>
> We create human & business stories through technology.
> Check out our projects! 
>


Re: [DISCUSS] FLIP-38 Support python language in flink TableAPI

2019-04-24 Thread Robert Bradshaw
Thanks for the meeting summary, Stephan. Sounds like you covered a lot of
ground. Some more comments below, adding onto what Max has said.

On Wed, Apr 24, 2019 at 3:20 PM Maximilian Michels  wrote:
>
> Hi Stephan,
>
> This is exciting! Thanks for sharing. The inter-process communication
> code looks like the most natural choice as a common ground. To go
> further, there are indeed some challenges to solve.

It certainly does make sense to share this work, though it does to me seem
like a rather low level to integrate at.

> > => Biggest question is whether the language-independent DAG is
expressive enough to capture all the expressions that we want to map
directly to Table API expressions. Currently much is hidden in opaque UDFs.
Kenn mentioned the structure should be flexible enough to capture more
expressions transparently.
>
> Just to add some context how this could be done, there is the concept of
> a FunctionSpec which is part of a transform in the DAG. FunctionSpec
> contains a URN and with a payload. FunctionSpec can be either (1)
> translated by the Runner directly, e.g. map to table API concepts or (2)
> run a user-defined function with an Environment. It could be feasible
> for Flink to choose the direct path, whereas Beam Runners would leverage
> the more generic approach using UDFs. Granted, compatibility across
> Flink and Beam would only work if both of the translation paths yielded
> the same semantics.

To elaborate a bit on this, Beam DAGs are built up by applying Transforms
(basically operations) to PCollections (the equivalent of
dataset/datastream), but the key point here is that these transforms are
often composite operations that expand out into smaller subtransforms. This
expansion happens during pipeline construction, but with the recent work on
cross-language pipelines it can happen out of process. This is one point of
extendability. Secondly, and importantly, this composite structure is
preserved in the DAG, and so a runner is free to ignore the provided
expansion and supply its own (so long as semantically it produces exactly
the same output). These composite operations can be identified by arbitrary
URNs + payloads, and any runner that does not understand them simply uses
the pre-provided expansion.
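
To make "composite" concrete, a trivial sketch in the Python SDK (the
transform below is made up): a composite is just a PTransform whose
expand() applies smaller transforms; that expansion is recorded in the
pipeline proto, and a runner that recognizes the composite's URN can
substitute its own implementation for the recorded expansion.

    import apache_beam as beam

    class CountPerKey(beam.PTransform):
      """A composite whose default expansion is two smaller transforms."""
      def expand(self, pcoll):
        return (pcoll
                | 'PairWithOne' >> beam.Map(lambda kv: (kv[0], 1))
                | 'SumPerKey' >> beam.CombinePerKey(sum))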

The existing Flink runner operates on exactly this principle, translating
URNs for the leaf operations (Map, Flatten, ...) as well as some composites
it can do better (e.g. Reshard). It is intentionally easy to define and add
new ones. This actually seems the easier approach (to me at least, but
that's probably heavily influenced by what I'm familiar with vs. what I'm
not).

As for how well this maps onto the Flink Tables API, part of that depends
on how much of the API is the operations themselves, and how much is
concerning configuration/environment/etc. which is harder to talk about in
an agnostic way.

Using something like Py4j is an easy way to get up and running, especially
for a very faithful API, but the instant one wants to add UDFs one hits a
cliff of sorts (which is surmountable, but likely a lot harder than having
gone the above approach). In addition (and I'll admit this is rather
subjective) it seems to me one of the primary values of a table-like API in
a given language (vs. just using (say) plain old SQL itself via a console)
is the ability to embed it in a larger pipeline, or at least drop in
operations that are not (as) naturally expressed in the "table way,"
including existing libraries. In other words, a full SDK. The Py4j wrapping
doesn't lend itself to such integration nearly as easily.

But I really do understand the desire to not block immediate work (and
value) for a longer term solution.

> >  If the DAG is generic enough to capture the additional information, we
probably still need some standardization, so that all the different
language APIs represent their expressions the same way
>
> I wonder whether that's necessary as a first step. I think it would be
> fine for Flink to have its own way to represent API concepts in the Beam
> DAG which Beam Runners may not be able to understand. We could then
> successively add the capability for these transforms to run with Beam.
>
> >  Similarly, it makes sense to standardize the type system (and type
inference) as far as built-in expressions and their interaction with UDFs
are concerned. The Flink Table API and Blink teams found this to be
essential for a consistent API behavior. This would not prevent all-UDF
programs from still using purely binary/opaque types.
>
> Beam has a set of standard coders which can be used across languages. We
> will have to expand those to play well with Flink's:
>
https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/tableApi.html#data-types
>
> I think we will need to exchange more ideas to work out a model that
> will work for both Flink and Beam. A regular meeting could be helpful.

+1, I think this would be really good for both this effort and general
collaboration between

Re: Artifact staging in cross-language pipelines

2019-04-24 Thread Robert Bradshaw
On Wed, Apr 24, 2019 at 12:21 PM Maximilian Michels  wrote:
>
> Good idea to let the client expose an artifact staging service that the
> ExpansionService could use to stage artifacts. This solves two problems:
>
> (1) The Expansion Service not being able to access the Job Server
> artifact staging service
> (2) The client not having access to the dependencies returned by the
> Expansion Server
>
> The downside is that it adds an additional indirection. The alternative
> to let the client handle staging the artifacts returned by the Expansion
> Server is more transparent and easier to implement.

The other downside is that it may not always be possible for the
expansion service to connect to the artifact staging service (e.g.
when constructing a pipeline locally against a remote expansion
service).

> Ideally, the Expansion Service won't return any dependencies because the
> environment already contains the required dependencies. We could make it
> a requirement for the expansion to be performed inside an environment.
> Then we would already ensure during expansion time that the runtime
> dependencies are available.

Yes, it's cleanest if the expansion service provides a self-contained
environment, without any additional dependencies to provide. It's an
interesting idea to make this a property of the expansion service itself.

> > In this case, the runner would (as
> > requested by its configuration) be free to merge environments it
> > deemed compatible, including swapping out beam-java-X for
> > beam-java-embedded if it considers itself compatible with the
> > dependency list.
>
> Could you explain how that would work in practice?

Say one has a pipeline with environments

A: beam-java-sdk-2.12-docker
B: beam-java-sdk-2.12-docker + dep1
C: beam-java-sdk-2.12-docker + dep2
D: beam-java-sdk-2.12-docker + dep3

A runner could (conceivably) be intelligent enough to know that dep1
and dep2 are indeed compatible, and run A, B, and C in a single
beam-java-sdk-2.12-docker + dep1 + dep2 environment (with the
corresponding fusion and lower overhead benefits). If a certain
pipeline option is set, it might further note that dep1 and dep2 are
compatible with its own workers, which are built against sdk-2.12, and
choose to run these in embedded + dep1 + dep2 environment.
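
(A purely hypothetical sketch of that merge decision; nothing like this
exists as an API today, and the environment representation below is
made up:

    def try_merge(env_a, env_b, deps_compatible):
      # Collapse two environments that share a base image and whose combined
      # dependency set the runner deems mutually compatible.
      if env_a['base_image'] != env_b['base_image']:
        return None
      merged = env_a['deps'] | env_b['deps']  # 'deps' are sets of dependencies
      return ({'base_image': env_a['base_image'], 'deps': merged}
              if deps_compatible(merged) else None)

where deps_compatible encapsulates whatever compatibility knowledge the
runner has.)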


Re: Python SDK timestamp precision

2019-04-23 Thread Robert Bradshaw
On Tue, Apr 23, 2019 at 4:20 PM Kenneth Knowles  wrote:
>
> On Tue, Apr 23, 2019 at 5:48 AM Robert Bradshaw  wrote:
>>
>> On Thu, Apr 18, 2019 at 12:23 AM Kenneth Knowles  wrote:
>> >
>> > For Robert's benefit, I want to point out that my proposal is to support 
>> > femtosecond data, with femtosecond-scale windows, even if watermarks/event 
>> > timestamps/holds are only millisecond precision.
>> >
>> > So the workaround once I have time, for SQL and schema-based transforms, 
>> > will be to have a logical type that matches the Java and protobuf 
>> > definition of nanos (seconds-since-epoch + nanos-in-second) to preserve 
>> > the user's data. And then when doing windowing inserting the necessary 
>> > rounding somewhere in the SQL or schema layers.
>>
>> It seems to me that the underlying granularity of element timestamps
>> and window boundaries, as seen and operated on by the runner (and
>> transmitted over the FnAPI boundary), is not something we can make
>> invisible to the user (and consequently we cannot just insert rounding
>> on higher precision data and get the right results). However, I would
>> be very interested in seeing proposals that could get around this.
>> Watermarks, of course, can be as approximate (in one direction) as one
>> likes.
>
>
> I outlined a way... or perhaps I retracted it to ponder and sent the rest of 
> my email. Sorry! Something like this, TL;DR store the original data but do 
> runner ops on rounded data.
>
>  -  WindowFn must receive exactly the data that came from the user's data 
> source. So that cannot be rounded.
>  - The user's WindowFn assigns to a window, so it can contain arbitrary 
> precision as it should be grouped as bytes.
>  - End of window, timers, watermark holds, etc, are all treated only as 
> bounds, so can all be rounded based on their use as an upper or lower bound.
>
> We already do a lot of this - Pubsub publish timestamps are microsecond 
> precision (you could say our current connector constitutes data loss) as are 
> Windmill timestamps (since these are only combines of Beam timestamps here 
> there is no data loss). There are undoubtedly some corner cases I've missed, 
> and naively this might look like duplicating timestamps so that could be an 
> unacceptable performance concern.

If I understand correctly, in this scheme WindowInto assignment is
parameterized by a function that specifies how to parse/extract the
timestamp from the data element (maybe just a field specifier for
schema'd data) rather than store the (exact) timestamp in a standard
place in the WindowedValue, and the window merging always goes back to
the SDK rather than the possibility of it being handled runner-side.
Even if the runner doesn't care about interpreting the window, I think
we'll want to have compatible window representations (and timestamp
representations, and windowing fns) across SDKs (especially for
cross-language) which favors choosing a consistent resolution.
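
For contrast, a rough sketch of the current model in the Python SDK,
where the timestamp is simply attached to the element before windowing
(the 'ts' field and the values are just placeholders):

    import apache_beam as beam
    from apache_beam.transforms.window import FixedWindows, TimestampedValue

    with beam.Pipeline() as p:
      windowed = (p
          | beam.Create([{'user': 'a', 'ts': 1556000000}])
          | beam.Map(lambda e: TimestampedValue(e, e['ts']))
          | beam.WindowInto(FixedWindows(60)))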

The end-of-window, for firing, can be approximate, but it seems it
should be exact for timestamp assignment of the result (and similarly
with the other timestamp combiners).

>> As for choice of granularity, it would be ideal if any time-like field
>> could be used as the timestamp (for subsequent windowing). On the
>> other hand, nanoseconds (or smaller) complicates the arithmetic and
>> encoding as a 64-bit int has a time range of only a couple hundred
>> years without overflow (which is an argument for microseconds, as they
>> are a nice balance between sub-second granularity and multi-millennia
>> span). Standardizing on milliseconds is more restrictive but has the
>> advantage that it's what Java and Joda Time use now (though it's
>> always easier to pad precision than round it away).
>
> A correction: Java *now* uses nanoseconds [1]. It uses the same breakdown as 
> proto (int64 seconds since epoch + int32 nanos within second). It has legacy 
> classes that use milliseconds, and Joda itself now encourages moving back to 
> Java's new Instant type. Nanoseconds should complicate the arithmetic only 
> for the one person authoring the date library, which they have already done.

The encoding and decoding need to be done in a language-consistent way
as well. Also, most date libraries don't have division, etc. operators, so
we have to implement that ourselves. Not that it should be *that* hard.

>> It would also be really nice to clean up the infinite-future being the
>> somewhat arbitrary max micros rounded to millis, and
>> end-of-global-window being infinite-future minus 1 hour (IIRC), etc.
>> as well as the ugly logic in Python to cop

Re: Python SDK timestamp precision

2019-04-23 Thread Robert Bradshaw
On Thu, Apr 18, 2019 at 12:23 AM Kenneth Knowles  wrote:
>
> For Robert's benefit, I want to point out that my proposal is to support 
> femtosecond data, with femtosecond-scale windows, even if watermarks/event 
> timestamps/holds are only millisecond precision.
>
> So the workaround once I have time, for SQL and schema-based transforms, will 
> be to have a logical type that matches the Java and protobuf definition of 
> nanos (seconds-since-epoch + nanos-in-second) to preserve the user's data. 
> And then when doing windowing inserting the necessary rounding somewhere in 
> the SQL or schema layers.

It seems to me that the underlying granularity of element timestamps
and window boundaries, as seen and operated on by the runner (and
transmitted over the FnAPI boundary), is not something we can make
invisible to the user (and consequently we cannot just insert rounding
on higher precision data and get the right results). However, I would
be very interested in seeing proposals that could get around this.
Watermarks, of course, can be as approximate (in one direction) as one
likes.

As for choice of granularity, it would be ideal if any time-like field
could be used as the timestamp (for subsequent windowing). On the
other hand, nanoseconds (or smaller) complicates the arithmetic and
encoding as a 64-bit int has a time range of only a couple hundred
years without overflow (which is an argument for microseconds, as they
are a nice balance between sub-second granularity and multi-millennia
span). Standardizing on milliseconds is more restrictive but has the
advantage that it's what Java and Joda Time use now (though it's
always easier to pad precision than round it away).
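
As a quick back-of-the-envelope check on that range:

    # Approximate span of a signed 64-bit timestamp, in years either side
    # of the epoch, at various resolutions.
    MAX_INT64 = 2**63 - 1
    SECONDS_PER_YEAR = 365.25 * 24 * 3600
    for name, per_second in [('millis', 10**3), ('micros', 10**6), ('nanos', 10**9)]:
        print(name, int(MAX_INT64 / per_second / SECONDS_PER_YEAR))
    # millis: ~292 million years, micros: ~292 thousand years, nanos: ~292 years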

It would also be really nice to clean up the infinite-future being the
somewhat arbitrary max micros rounded to millis, and
end-of-global-window being infinite-future minus 1 hour (IIRC), etc.
as well as the ugly logic in Python to cope with millis-micros
conversion.

> On Wed, Apr 17, 2019 at 3:13 PM Robert Burke  wrote:
>>
>> +1 for plan B. Nano second precision on windowing seems... a little much for 
>> a system that's aggregating data over time. Even for processing say particle 
>> super collider data, they'd get away with artificially increasing the 
>> granularity in batch settings.
>>
>> Now if they were streaming... they'd probably want femtoseconds anyway.
>> The point is, we should see if users demand it before adding in the 
>> necessary work.
>>
>> On Wed, 17 Apr 2019 at 14:26, Chamikara Jayalath  
>> wrote:
>>>
>>> +1 for plan B as well. I think it's important to make timestamp precision 
>>> consistent now without introducing surprising behaviors for existing users. 
>>> But we should move towards a higher granularity timestamp precision in the 
>>> long run to support use-cases that Beam users otherwise might miss out (on 
>>> a runner that supports such precision).
>>>
>>> - Cham
>>>
>>> On Wed, Apr 17, 2019 at 1:35 PM Lukasz Cwik  wrote:

 I also like Plan B because in the cross language case, the pipeline would 
 not work since every party (Runners & SDKs) would have to be aware of the 
 new beam:coder:windowed_value:v2 coder. Plan A has the property where if 
 the SDK/Runner wasn't updated then it may start truncating the timestamps 
 unexpectedly.

 On Wed, Apr 17, 2019 at 1:24 PM Lukasz Cwik  wrote:
>
> Kenn, this discussion is about the precision of the timestamp in the user 
> data. As you had mentioned, Runners need not have the same granularity of 
> user data as long as they correctly round the timestamp to guarantee that 
> triggers are executed correctly but the user data should have the same 
> precision across SDKs otherwise user data timestamps will be truncated in 
> cross language scenarios.
>
> Based on the systems that were listed, either microsecond or nanosecond 
> would make sense. The issue with changing the precision is that all Beam 
> runners except for possibly Beam Python on Dataflow are using millisecond 
> precision since they are all using the same Java Runner windowing/trigger 
> logic.
>
> Plan A: Swap precision to nanosecond
> 1) Change the Python SDK to only expose millisecond precision timestamps 
> (do now)
> 2) Change the user data encoding to support nanosecond precision (do now)
> 3) Swap runner libraries to be nanosecond precision aware updating all 
> window/triggering logic (do later)
> 4) Swap SDKs to expose nanosecond precision (do later)
>
> Plan B:
> 1) Change the Python SDK to only expose millisecond precision timestamps 
> and keep the data encoding as is (do now)
> (We could add greater precision later to plan B by creating a new version 
> beam:coder:windowed_value:v2 which would be nanosecond and would require 
> runners to correctly perform an internal conversions for 
> windowing/triggering.)
>
> I think we should go 

Re: Artifact staging in cross-language pipelines

2019-04-23 Thread Robert Bradshaw
I've been out, so I'm coming a bit late to the discussion, but here are my thoughts.

The expansion service absolutely needs to be able to provide the
dependencies for the transform(s) it expands. It seems the default,
foolproof way of doing this is via the environment, which can be a
docker image with all the required dependencies. Anything more than this is an
(arguably important, but possibly messy) optimization.

The standard way to provide artifacts outside of the environment is
via the artifact staging service. Of course, the expansion service may
not have access to the (final) artifact staging service (due to
permissions, locality, or it may not even be started up yet) but the
SDK invoking the expansion service could offer an artifact staging
environment for the SDK to publish artifacts to. However, there are
some difficulties here, in particular avoiding name collisions with
staged artifacts, assigning semantic meaning to the artifacts (e.g.
should jar files get automatically placed in the classpath, or Python
packages recognized and installed at startup). The alternative is
going with a (type, pointer) scheme for naming dependencies; if we go
this route I think we should consider migrating all artifact staging
to this style. I am concerned that the "file" version will be less
than useful for what will become the most convenient expansion
services (namely, hosted and docker image). I am still at a loss,
however, as to how to solve the diamond dependency problem among
dependencies--perhaps the information is there if one walks
maven/pypi/go modules/... but do we expect every runner to know about
every packaging platform? This also wouldn't solve the issue if fat
jars are used as dependencies. The only safe thing to do here is to
force distinct dependency sets to live in different environments,
which could be too conservative.
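
To make the "(type, pointer)" idea concrete, a purely hypothetical
illustration (these field names and values are made up, not an existing
proto):

    # Hypothetical dependency descriptors under a (type, pointer) scheme.
    deps = [
        {'type': 'maven', 'pointer': 'org.apache.kafka:kafka-clients:2.0.0'},
        {'type': 'pypi', 'pointer': 'dill==0.2.9'},
        {'type': 'file', 'pointer': 'gs://my-bucket/staging/udfs.jar'},
    ]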

This all leads me to think that perhaps the environment itself should
be docker image (often one of "vanilla" beam-java-x.y ones) +
dependency list, rather than have the dependency/artifact list as some
kind of data off to the side. In this case, the runner would (as
requested by its configuration) be free to merge environments it
deemed compatible, including swapping out beam-java-X for
beam-java-embedded if it considers itself compatible with the
dependency list.

I agree with Thomas that we'll want to make expansion services, and
the transforms they offer, more discoverable. The whole lifecycle
of expansion services is something that has yet to be fully fleshed
out, and may influence some of these decisions.

As for adding --jar_package to the Python SDK, this seems really
specific to calling java-from-python (would we have O(n^2) such
options?) as well as out-of-place for a Python user to specify. I
would really hope we can figure out a more generic solution. If we
need this option in the meantime, let's at least make it clear
(probably in the name) that it's temporary.

On Tue, Apr 23, 2019 at 1:08 AM Thomas Weise  wrote:
>
> One more suggestion:
>
> It would be nice to be able to select the environment for the external 
> transforms. For example, I would like to be able to use EMBEDDED for Flink. 
> That's implicit for sources which are runner native unbounded read 
> translations, but it should also be possible for writes. That would then be 
> similar to how pipelines are packaged and run with the "legacy" runner.
>
> Thomas
>
>
> On Mon, Apr 22, 2019 at 1:18 PM Ankur Goenka  wrote:
>>
>> Great discussion!
>> I have a few points around the structure of proto but that is less important 
>> as it can evolve.
>> However, I think that artifact compatibility is another important aspect to 
>> look at.
>> Example: TransformA uses Guava 1.6>< 1.7, TransformB uses 1.8><1.9 and 
>> TransformC uses 1.6><1.8. As sdk provide the environment for each transform, 
>> it can not simply say EnvironmentJava for both TransformA and TransformB as 
>> the dependencies are not compatible.
>> We should have separate environment associated with TransformA and 
>> TransformB in this case.
>>
>> To support this case, we need 2 things.
>> 1: Granular metadata about the dependency including type.
>> 2: Complete list of the transforms to be expanded.
>>
>> Elaboration:
>> The compatibility check can be done in a crude way if we provide all the 
>> metadata about the dependency to expansion service.
>> Also, the expansion service should expand all the applicable transforms in a 
>> single call so that it knows about incompatibility and create separate 
>> environments for these transforms. So in the above example, expansion 
>> service will associate EnvA to TransformA and EnvB to TransformB and EnvA to 
>> TransformC. This will ofcource require changes to Expansion service proto 
>> but giving all the information to expansion service will make it support 
>> more case and make it a bit more future proof.
>>
>>
>> On Mon, Apr 22, 2019 at 10:16 AM Maximilian Michels  wrote:
>>>
>>> Thanks for the summary Cham. A

Re: [docs] Python State & Timers

2019-04-11 Thread Robert Bradshaw
That's a great idea! It would probably be pretty easy to add the
corresponding code snippets to the docs as well.

On Thu, Apr 11, 2019 at 2:00 PM Maximilian Michels  wrote:
>
> Hi everyone,
>
> The Python SDK still lacks documentation on state and timers.
>
> As a first step, what do you think about updating these two blog posts
> with the corresponding Python code?
>
> https://beam.apache.org/blog/2017/02/13/stateful-processing.html
> https://beam.apache.org/blog/2017/08/28/timely-processing.html
>
> Thanks,
> Max


Re: [ANNOUNCE] New committer announcement: Boyuan Zhang

2019-04-11 Thread Robert Bradshaw
Congratulations!

On Thu, Apr 11, 2019 at 12:29 PM Michael Luckey  wrote:
>
> Congrats and welcome, Boyuan
>
> On Thu, Apr 11, 2019 at 12:27 PM Tim Robertson  
> wrote:
>>
>> Many congratulations Boyuan!
>>
>> On Thu, Apr 11, 2019 at 10:50 AM Ɓukasz Gajowy  wrote:
>>>
>>> Congrats Boyuan! :)
>>>
>>> On Wed, Apr 10, 2019 at 11:49 PM Ɓukasz Gajowy... Chamikara Jayalath  wrote:

 Congrats Boyuan!

 On Wed, Apr 10, 2019 at 11:14 AM Yifan Zou  wrote:
>
> Congratulations Boyuan!
>
> On Wed, Apr 10, 2019 at 10:49 AM Daniel Oliveira  
> wrote:
>>
>> Congrats Boyuan!
>>
>> On Wed, Apr 10, 2019 at 10:20 AM Rui Wang  wrote:
>>>
>>> So well deserved!
>>>
>>> -Rui
>>>
>>> On Wed, Apr 10, 2019 at 10:12 AM Pablo Estrada  
>>> wrote:

 Well deserved : ) congrats Boyuan!

 On Wed, Apr 10, 2019 at 10:08 AM Aizhamal Nurmamat kyzy 
  wrote:
>
> Congratulations Boyuan!
>
> On Wed, Apr 10, 2019 at 9:52 AM Ruoyun Huang  
> wrote:
>>
>> Thanks for your contributions and congratulations Boyuan!
>>
>> On Wed, Apr 10, 2019 at 9:00 AM Kenneth Knowles  
>> wrote:
>>>
>>> Hi all,
>>>
>>> Please join me and the rest of the Beam PMC in welcoming a new 
>>> committer: Boyuan Zhang.
>>>
>>> Boyuan has been contributing to Beam since early 2018. She has 
>>> proposed 100+ pull requests across a wide range of topics: bug 
>>> fixes, to integration tests, build improvements, metrics features, 
>>> release automation. Two big picture things to highlight are 
>>> building/releasing Beam Python wheels and managing the donation of 
>>> the Beam Dataflow Java Worker, including help with I.P. clearance.
>>>
>>> In consideration of Boyuan's contributions, the Beam PMC trusts 
>>> Boyuan with the responsibilities of a Beam committer [1].
>>>
>>> Thank you, Boyuan, for your contributions.
>>>
>>> Kenn
>>>
>>> [1] 
>>> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>>
>>
>>
>> --
>> 
>> Ruoyun  Huang
>>


Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

2019-04-08 Thread Robert Bradshaw
On Mon, Apr 8, 2019 at 8:04 PM Kenneth Knowles  wrote:
>
> On Mon, Apr 8, 2019 at 1:57 AM Robert Bradshaw  wrote:
>>
>> On Sat, Apr 6, 2019 at 12:08 AM Kenneth Knowles  wrote:
>> >
>> > On Fri, Apr 5, 2019 at 2:24 PM Robert Bradshaw  wrote:
>> >>
>> >> On Fri, Apr 5, 2019 at 6:24 PM Kenneth Knowles  wrote:
>> >> >
>> >> > Nested and unnested contexts are two different encodings. Can we just 
>> >> > give them different URNs? We can even just express the length-prefixed 
>> >> > UTF-8 as a composition of the length-prefix URN and the UTF-8 URN.
>> >>
>> >> It's not that simple, especially when it comes to composite encodings.
>> >> E.g. for some coders, nested(C) == unnested(C), for some coders
>> >> nested(C) == length_prefix(unnested(C)), and for other coders it's
>> >> something else altogether (e.g. when creating a kv coder, the first
>> >> component must use nested context, and the second inherits the nested
>> >> vs. unnested context). When creating TupleCoder(A, B) one doesn't want
>> >> to forcibly use LengthPrefixCoder(A) and LengthPrefixCoder(B), nor does
>> >> one want to force LengthPrefixCoder(TupleCoder(A, B)) because A and B
>> >> may themselves be large and incrementally written (e.g.
>> >> IterableCoder). Using distinct URNs doesn't work well if the runner is
>> >> free to compose and decompose tuple, iterable, etc. coders that it
>> >> doesn't understand.
>> >>
>> >> Until we stop using Coders for IO (a worthy but probably lofty goal)
>> >> we will continue to need the unnested context (lest we expect and
>> >> produce length-prefixed coders in text files, as bigtable keys, etc.).
>> >> On the other hand, almost all internal use is nested (due to sending
>> >> elements around as part of element streams). The other place we cross
>> >> over is LengthPrefixCoder that encodes its values using the unnested
>> >> context prefixed by the unnested encoding length.
>> >>
>> >> Perhaps a step in the right direction would be to consistently use the
>> >> unnested context everywhere but IOs (meaning when we talked about
>> >> coders from the FnAPI perspective, they're *always* in the nested
>> >> context, and hence always have the one and only encoding defined by
>> >> that URN, including when wrapped by a length prefix coder (which would
>> >> sometimes result in double length prefixing, but I think that's a
>> >> price worth paying (or we could do something more clever like make
>> >> length-prefix an (explicit) modifier on a coder rather than a new
>> >> coder itself that would default to length prefixing (or some of the
>> >> other delimiting schemes we've come up with) but allow the component
>> >> coder to offer alternative length-prefix-compatible encodings))). IOs
>> >> could be updated to take Parsers and Formatters (or whatever we call
>> >> them) with the Coders in the constructors left as syntactic sugar
>> >> until we could remove them in 3.0. As Luke says, we have a chance to
>> >> fix our coders for portable pipelines now.
>> >>
>> >> In the (very) short term, we're stuck with a nested and unnested
>> >> version of StringUtf8, just as we have for bytes, lest we change the
>> >> meaning of (or disallow some of) TupleCoder[StrUtf8Coder, ...],
>> >> LengthPrefixCoder[StrUtf8Coder], and using StringUtf8Coder for IO.
>> >
>> >
>> > First, let's note that "nested" and "outer" are a misnomer. The 
>> > distinction is whether it is the last thing encoded in the stream. In a 
>> > KvCoder<KeyCoder, ValueCoder> the ValueCoder is actually encoded in the 
>> > "outer" context though the value is nested. No doubt a good amount of 
>> > confusion comes from the initial and continued use of this terminology.
>>
>> +1. I think of these as "self-delimiting" vs. "externally-delimited."
>>
>> > So, all that said, it is a simple fact that UTF-8 and length-prefixed 
>> > UTF-8 are two different encodings. Encodings are the fundamental concept 
>> > here and coders encapsulate two encodings, with some subtle and 
>> > inconsistently-applied rules about when to use which encoding. I think we 
>> > should still give them distinct URNs unless impossible. You've outlined 
>> > some

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

2019-04-08 Thread Robert Bradshaw
On Sat, Apr 6, 2019 at 12:08 AM Kenneth Knowles  wrote:
>
>
>
> On Fri, Apr 5, 2019 at 2:24 PM Robert Bradshaw  wrote:
>>
>> On Fri, Apr 5, 2019 at 6:24 PM Kenneth Knowles  wrote:
>> >
>> > Nested and unnested contexts are two different encodings. Can we just give 
>> > them different URNs? We can even just express the length-prefixed UTF-8 as 
>> > a composition of the length-prefix URN and the UTF-8 URN.
>>
>> It's not that simple, especially when it comes to composite encodings.
>> E.g. for some coders, nested(C) == unnested(C), for some coders
>> nested(C) == length_prefix(unnested(C)), and for other coders it's
>> something else altogether (e.g. when creating a kv coder, the first
>> component must use nested context, and the second inherits the nested
>> vs. unnested context). When creating TupleCoder(A, B) one doesn't want
>> to forcibly use LengthPrefixCoder(A) and LengthPrefixCoder(B), nor does
>> one want to force LengthPrefixCoder(TupleCoder(A, B)) because A and B
>> may themselves be large and incrementally written (e.g.
>> IterableCoder). Using distinct URNs doesn't work well if the runner is
>> free to compose and decompose tuple, iterable, etc. coders that it
>> doesn't understand.
>>
>> Until we stop using Coders for IO (a worthy but probably lofty goal)
>> we will continue to need the unnested context (lest we expect and
>> produce length-prefixed coders in text files, as bigtable keys, etc.).
>> On the other hand, almost all internal use is nested (due to sending
>> elements around as part of element streams). The other place we cross
>> over is LengthPrefixCoder that encodes its values using the unnested
>> context prefixed by the unnested encoding length.
>>
>> Perhaps a step in the right direction would be to consistently use the
>> unnested context everywhere but IOs (meaning when we talked about
>> coders from the FnAPI perspective, they're *always* in the nested
>> context, and hence always have the one and only encoding defined by
>> that URN, including when wrapped by a length prefix coder (which would
>> sometimes result in double length prefixing, but I think that's a
>> price worth paying (or we could do something more clever like make
>> length-prefix an (explicit) modifier on a coder rather than a new
>> coder itself that would default to length prefixing (or some of the
>> other delimiting schemes we've come up with) but allow the component
>> coder to offer alternative length-prefix-compatible encodings))). IOs
>> could be updated to take Parsers and Formatters (or whatever we call
>> them) with the Coders in the constructors left as syntactic sugar
>> until we could remove them in 3.0. As Luke says, we have a chance to
>> fix our coders for portable pipelines now.
>>
>> In the (very) short term, we're stuck with a nested and unnested
>> version of StringUtf8, just as we have for bytes, lest we change the
>> meaning of (or disallow some of) TupleCoder[StrUtf8Coder, ...],
>> LengthPrefixCoder[StrUtf8Coder], and using StringUtf8Coder for IO.
>
>
> First, let's note that "nested" and "outer" are a misnomer. The distinction 
> is whether it is the last thing encoded in the stream. In a KvCoder<KeyCoder, ValueCoder> the ValueCoder is actually encoded in the "outer" context though 
> the value is nested. No doubt a good amount of confusion comes from the 
> initial and continued use of this terminology.

+1. I think of these as "self-delimiting" vs. "externally-delimited."

> So, all that said, it is a simple fact that UTF-8 and length-prefixed UTF-8 
> are two different encodings. Encodings are the fundamental concept here and 
> coders encapsulate two encodings, with some subtle and inconsistently-applied 
> rules about when to use which encoding. I think we should still give them 
> distinct URNs unless impossible. You've outlined some steps to clarify the 
> situation.

Currently, we have a one-to-two relationship between Coders and
encodings, and a one-to-one relationship between URNs and encoders.

To make the first one-to-one, we would either have to make
StringUtf8Coder unsuitable for TextIO (letting it always prefix its
contents with a length) or unsuitable for the key part of a KV, the
element of an iterable, etc. (where the length is required).
Alternatively we could give Coders the ability to return the
nested/unnested version of themselves, but this also gets messy
because it depends on the ultimate outer context which we don't always
have at hand (and leads to surprises, e.g. asking for the key coder of
a KV coder may no

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

2019-04-05 Thread Robert Bradshaw
On Fri, Apr 5, 2019 at 6:24 PM Kenneth Knowles  wrote:
>
> Nested and unnested contexts are two different encodings. Can we just give 
> them different URNs? We can even just express the length-prefixed UTF-8 as a 
> composition of the length-prefix URN and the UTF-8 URN.

It's not that simple, especially when it comes to composite encodings.
E.g. for some coders, nested(C) == unnested(C), for some coders
nested(C) == length_prefix(unnested(C)), and for other coders it's
something else altogether (e.g. when creating a kv coder, the first
component must use nested context, and the second inherits the nested
vs. unnested context). When creating TupleCoder(A, B) one doesn't want
to forcibly use LengthPrefixCoder(A) and LengthPrefixCoder(B), nor does
one want to force LengthPrefixCoder(TupleCoder(A, B)) because A and B
may themselves be large and incrementally written (e.g.
IterableCoder). Using distinct URNs doesn't work well if the runner is
free to compose and decompose tuple, iterable, etc. coders that it
doesn't understand.

Until we stop using Coders for IO (a worthy but probably lofty goal)
we will continue to need the unnested context (lest we expect and
produce length-prefixed coders in text files, as bigtable keys, etc.).
On the other hand, almost all internal use is nested (due to sending
elements around as part of element streams). The other place we cross
over is LengthPrefixCoder that encodes its values using the unnested
context prefixed by the unnested encoding length.

Perhaps a step in the right direction would be to consistently use the
unnested context everywhere but IOs (meaning when we talked about
coders from the FnAPI perspective, they're *always* in the nested
context, and hence always have the one and only encoding defined by
that URN, including when wrapped by a length prefix coder (which would
sometimes result in double length prefixing, but I think that's a
price worth paying (or we could do something more clever like make
length-prefix an (explicit) modifier on a coder rather than a new
coder itself that would default to length prefixing (or some of the
other delimiting schemes we've come up with) but allow the component
coder to offer alternative length-prefix-compatible encodings))). IOs
could be updated to take Parsers and Formatters (or whatever we call
them) with the Coders in the constructors left as syntactic sugar
until we could remove them in 3.0. As Luke says, we have a chance to
fix our coders for portable pipelines now.

In the (very) short term, we're stuck with a nested and unnested
version of StringUtf8, just as we have for bytes, lest we change the
meaning of (or disallow some of) TupleCoder[StrUtf8Coder, ...],
LengthPrefixCoder[StrUtf8Coder], and using StringUtf8Coder for IO.
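
As a purely conceptual illustration of the two encodings (plain Python,
not Beam's actual coder code): for UTF-8 strings, nested(C) is just
length_prefix(unnested(C)).

    def encode_varint(n):
        out = b''
        while True:
            bits = n & 0x7F
            n >>= 7
            if n:
                out += bytes([bits | 0x80])
            else:
                return out + bytes([bits])

    def encode_utf8_unnested(s):
        return s.encode('utf-8')  # raw bytes; the reader must know where they end

    def encode_utf8_nested(s):
        payload = encode_utf8_unnested(s)
        return encode_varint(len(payload)) + payload  # self-delimiting

    assert encode_utf8_unnested('abc') == b'abc'
    assert encode_utf8_nested('abc') == b'\x03abc'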


>
> On Fri, Apr 5, 2019 at 12:38 AM Robert Bradshaw  wrote:
>>
>> On Fri, Apr 5, 2019 at 12:50 AM Heejong Lee  wrote:
>> >
>> > Robert, does nested/unnested context work properly for Java?
>>
>> I believe so. It is similar to the bytes coder, that prefixes vs. not
>> based on the context.
>>
>> > I can see that the Context is fixed to NESTED[1] and the encode method 
>> > with the Context parameter is marked as deprecated[2].
>> >
>> > [1]: 
>> > https://github.com/apache/beam/blob/0868e7544fd1e96db67ff5b9e70a67802c0f0c8e/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L68
>> > [2]: 
>> > https://github.com/apache/beam/blob/0868e7544fd1e96db67ff5b9e70a67802c0f0c8e/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/Coder.java#L132
>>
>> That doesn't mean it's unused, e.g.
>>
>> https://github.com/apache/beam/blob/release-2.12.0/sdks/java/core/src/main/java/org/apache/beam/sdk/util/CoderUtils.java#L160
>> https://github.com/apache/beam/blob/release-2.12.0/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/LengthPrefixCoder.java#L64
>>
>> (and I'm sure there's others).
>>
>> > On Thu, Apr 4, 2019 at 3:25 PM Robert Bradshaw  wrote:
>> >>
>> >> I don't know why there are two separate copies of
>> >> standard_coders.yaml--originally there was just one (though it did
>> >> live in the Python directory). I'm guessing a copy was made rather
>> >> than just pointing both to the new location, but that completely
>> >> defeats the point. I can't seem to access JIRA right now; could
>> >> someone file an issue to resolve this?
>> >>
>> >> I also think the spec should be next to the definition of the URN,
>> >> that's one of the reasons the URNs were originally in a markdown file
>> >> (to encourage good documentation, literate programming style). Many

Re: [DISCUSS] Backwards compatibility of @Experimental features

2019-04-05 Thread Robert Bradshaw
If it's technically feasible, I am also in favor of requiring experimental
features to be opt-in only, per tag (Python should be updated to support
tags). We should probably audit the set of experimental features we ship
regularly (I'd say as part of the release, but that process is laborious
enough that perhaps we should do it on a half-release cycle?). I think
imposing hard deadlines (chosen when a feature is introduced) is too
extreme, but it might be valuable if opt-in plus regular audit proves
insufficient.
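
For the Python side, opting in is already expressible via --experiments;
a minimal sketch (the experiment name is just a placeholder):

    from apache_beam.options.pipeline_options import DebugOptions, PipelineOptions

    options = PipelineOptions(['--experiments', 'my_experimental_feature'])
    assert 'my_experimental_feature' in options.view_as(DebugOptions).experiments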

On Thu, Apr 4, 2019 at 5:28 AM Kenneth Knowles  wrote:

> This all makes me think that we should rethink how we ship experimental
> features. My experience is also that (1) users don't know if something is
> experimental or don't think hard about it and (2) we don't use experimental
> time period to gather feedback and make changes.
>
> How can we change both of these? Perhaps we could require experimental
> features to be opt-in. Flags work and also clearly marked experimental
> dependencies that a user has to add. Changing the core is sometimes tricky
> to put behind a flag but rarely impossible. This way a contributor is also
> motivated to gather feedback to mature their feature to become default
> instead of opt-in.
>
> The need that @Experimental was trying to address is real. We *do* need a
> way to try things and get feedback prior to committing to forever support.
> We have discovered real problems far too late, or not had the will to fix
> the issue we did find:
>  - many trigger combinators should probably be deleted
>  - many triggers cannot meet a good spec with merging windows
>  - the continuation trigger idea doesn't work well
>  - CombineFn had to have its spec changed in order to be both correct and
> efficient
>  - OutputTimeFn as a UDF is convenient for Java but it turns out an enum
> is better for portability
>  - Coder contexts turned out to be a major usability problem
>  - The built-in data types for schemas are evolving (luckily these are
> really being worked on!)
>
> That's just what I can think of off the top of my head. I expect the
> examples from IOs are more numerous; in that case it is pretty easy to fork
> and make a new and better IO.
>
> And as an extreme view, I would prefer if we add a deadline for
> experimental features, then our default action is to remove them, not
> declare them stable. If no one is trying to mature it and get it out of
> opt-in status, then it probably has not matured. And perhaps if no one cares
> enough to do that work, it also isn't that important.
>
> Kenn
>
> On Wed, Apr 3, 2019 at 5:57 PM Ahmet Altay  wrote:
>
>> I agree with Reuven that our experimental annotation is not useful any
>> more. For example, Datastore IO in the Python SDK has been experimental for 2 years
>> now. Even though it is marked as experimental, an upgrade is carefully
>> planned [1] as if it were not experimental. Given that, I do not think we can
>> remove features within a small number of minor releases. (An exception to this
>> would be if we have clear knowledge of very low usage of a certain IO.)
>>
>> I am worried that tagging experimental features with release versions
>> will add toil to the release process as mentioned and will also add to
>> user confusion. What would be the signal to a user if they see an
>> experimental feature's target release bumped between releases? How about
>> tagging experimental features with JIRAs (similar to TODOs) with an action
>> to either promote them to supported features or remove them? These JIRAs
>> could have fix version targets like any other release-blocking JIRAs. It will
>> also clarify who is responsible for a given experimental feature.
>>
>> [1]
>> https://lists.apache.org/thread.html/5ec88967aa4a382db07a60e0101c4eb36165909076867155ab3546a6@%3Cdev.beam.apache.org%3E
>>
>> On Wed, Apr 3, 2019 at 5:24 PM Reuven Lax  wrote:
>>
>>> Experiments are already tagged with a Kind enum
>>> (e.g. @Experimental(Kind.Schemas)).
>>>
>>
>> This is not the case for Python's annotations. It would be a good idea to add
>> it there as well.
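>>
>> For illustration, a rough sketch (hypothetical, not the existing apache_beam
>> annotations API) of what a tagged experimental decorator could look like on the
>> Python side, mirroring Java's @Experimental(Kind.Schemas):
>>
>> import functools
>> import warnings
>>
>> def experimental(kind, since=None):
>>     # Hypothetical decorator: warn on use and record the tag so tooling could
>>     # later list every experimental feature a pipeline touches.
>>     def decorator(fn):
>>         @functools.wraps(fn)
>>         def wrapper(*args, **kwargs):
>>             warnings.warn(
>>                 '%s is experimental (kind=%s%s); no backwards-compatibility '
>>                 'guarantees.' % (fn.__name__, kind,
>>                                  ', since %s' % since if since else ''),
>>                 FutureWarning, stacklevel=2)
>>             return fn(*args, **kwargs)
>>         wrapper._experimental = (kind, since)
>>         return wrapper
>>     return decorator
>>
>> @experimental(kind='schemas', since='2.12.0')
>> def my_new_feature():
>>     pass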
>>
>>
>>>
>>> On Wed, Apr 3, 2019 at 4:56 PM Ankur Goenka  wrote:
>>>
 I think a release version with Experimental flag makes sense.
 In addition, I think many of our users start to rely on experimental
 features because they are not even aware that these features are
 experimental, and it's really hard to find the experimental features used
 without giving a good look at the Beam code and having some knowledge about
 it.

 It will be good if we can have a step at pipeline submission time
 which can print all the experiments used in verbose mode. This might also
 require adding a meaningful group name for the experiment, for example

 @Experimental("SDF", 2.15.0)

 This will of course add additional effort and require additional
 context while tagging experiments.
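
 For illustration, a rough sketch of that submission-time step (the --experiments
 pipeline option and DebugOptions are real; the experiment names and the place
 where this hook would live are made up):

 from apache_beam.options.pipeline_options import DebugOptions, PipelineOptions

 # Example experiment names only.
 options = PipelineOptions(['--experiments=use_fastavro', '--experiments=beam_fn_api'])
 experiments = options.view_as(DebugOptions).experiments or []
 if experiments:
     print('This pipeline opts into experiments: %s' % ', '.join(experiments))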

 On Wed, Apr 3, 2019 at 4:43 PM Reuven Lax  wrote:

> Our Experimental annotation has become almost useless.

Re: [PROPOSAL] commit granularity in master

2019-04-05 Thread Robert Bradshaw
On Thu, Apr 4, 2019 at 3:18 PM Etienne Chauchot  wrote:
>
> Brian,
> It is good that you automated commit quality checks, thanks.
>
> But I don't agree with reducing the commit history of a PR to only one
> commit. I was just referring to meaningless commits such as fixup,
> checkstyle, spotless ... I prefer not to squash everything and only squash
> meaningless commits because:
> - sometimes small related fixes to different parts (with different jiras) are
> done in the same PR, and they should stay separate commits because they deal
> with different problems
> - more importantly, keeping commits at a relatively small but still isolable
> size makes it easier to track bugs/regressions (among other things during
> bisect sessions) than if the commit is big.

Agreed, we should not enforce one commit per PR; there are many good
reasons to break a PR into multiple commits.

> On Friday, 22 March 2019 at 09:38 -0700, Brian Hulette wrote:
>
> It sounds like maybe we've already reached a consensus that committers just 
> need to be more vigilant about squashing fixup commits, and hopefully 
> automate it as much as possible. But I just thought I'd also point out how 
> the arrow project handles this as a point of reference, since it's kind of 
> interesting.
>
> They've written a merge_arrow_pr.py script [1], which committers run to merge 
> a PR. It enforces that the PR has an associated JIRA in the title, squashes 
> the entire PR into a single commit, and closes the associated JIRA with the 
> appropriate fix version.
>
> As a result, the commit granularity is equal to the granularity of PRs, JIRAs 
> are always linked to PRs, and the commit history on master is completely 
> linear (they also have to force push master after releases in order to 
> maintain this, which is the subject of much consternation and debate).
>
> The simplicity of 1 PR = 1 commit is a nice way to avoid the manual 
> intervention required to squash fixup commits and enforce that every commit 
> has passed CI, but it does have down-sides as Etienne already pointed out.
>
> Brian
>
> [1] https://github.com/apache/arrow/blob/master/dev/merge_arrow_pr.py
>
>
> On Fri, Mar 22, 2019 at 7:46 AM Mikhail Gryzykhin 
>  wrote:
>
> I agree with keeping history clean.
>
> Although, small commits like "address PR comments" are useful during the review
> process. They allow the reviewer to see only new changes, not review the whole diff
> again. Best to squash them before/on merge though.
>
> On Fri, Mar 22, 2019, 07:34 Ismaël Mejía  wrote:
>
> > I like the extra delimitation the brackets give, worth the two extra
> > characters to me. More importantly, it's nice to have consistency, and
> > the only way to be consistent with the past is to leave them there.
>
> My point with the brackets is that we are 'getting close' to 10K issues,
> so we will then have 3 fewer chars; probably it does not change much,
> but still.
>
> On Fri, Mar 22, 2019 at 3:19 PM Robert Bradshaw  wrote:
> >
> > On Fri, Mar 22, 2019 at 3:02 PM Ismaël Mejía  wrote:
> > >
> > > It is good to remind committers of their responsibility for the
> > > 'cleanliness' of the merged code. Github sadly does not have an easy
> > > interface to do this and it should be done manually in many cases;
> > > sadly I have seen many committers just merging code with multiple
> > > 'fixup' style commits by clicking Github's merge button. Maybe it is
> > > time to find a way to automatically detect these cases and disallow
> > > the merge, or maybe we should reconsider the policy altogether if there
> > > are people who don't see the value of this.
> >
> > I agree about keeping our history clean and useful, and think those
> > four points summarize things well (but a clarification on fixup
> > commits would be good).
> >
> > +1 to an automated check that there are many extraneous commits.
> > Anything the person hitting the merge button would easily see before
> > doing the merge.
> >
> > > I would like to propose a small modification to the commit title style
> > > on that guide. We use two brackets to enclose the issue id, but that
> > > really does not improve the readability much and uses 2 extra spaces
> > > of the already short title space, so what about getting rid of them?
> > >
> > > Current style: "[BEAM-] Commit title"
> > > Proposed style: "BEAM- Commit title"
> > >
> > > Any ideas or opinions pro/con?
> >
> > I like the extra delimitation the brackets give, worth the two extra
>

Re: Hazelcast Jet Runner - validation tests

2019-04-05 Thread Robert Bradshaw
On Thu, Apr 4, 2019 at 6:38 PM Lukasz Cwik  wrote:
>
> The issue with unbounded tests that rely on triggers/late data/early
> firings/processing time is that these are all sources of non-determinism.
> The sources make non-deterministic decisions around when to produce data,
> checkpoint, and resume, and runners make non-deterministic decisions around
> when to output elements, in which order, and when to evaluate triggers.
> UsesTestStream is the best set of tests we currently have for making
> non-deterministic processing decisions deterministic, but they are more difficult
> to write than the other ValidatesRunner tests and also not well supported
> because of the special nature of UsesTestStream needing special hooks within
> the runner to control when to output and when to advance time.
>
> I'm not aware of any tests that we currently have that run a
> non-deterministic pipeline and evaluate it against all possible outcomes that
> could have been produced and check that the output was valid. We would 
> welcome ideas in how to improve this space to get more runners being tested 
> for non-deterministic pipelines.

Python has some tests of this nature, e.g.

https://github.com/apache/beam/blob/release-2.12.0/sdks/python/apache_beam/runners/portability/fn_api_runner_test.py#L308

I'd imagine we could do similar for Java.
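
For reference, a toy sketch of that assertion style (not the linked test itself):
enumerate the acceptable outcomes and accept whichever one the runner produced.

import apache_beam as beam
from apache_beam.testing.util import assert_that

def is_one_of(*acceptable_outputs):
    # Matcher that passes if the actual PCollection contents equal any one of
    # the acceptable outcomes (compared as multisets).
    def check(actual):
        if sorted(actual) not in [sorted(a) for a in acceptable_outputs]:
            raise AssertionError('%r is not one of %r' % (actual, acceptable_outputs))
    return check

with beam.Pipeline() as p:
    result = p | beam.Create([1, 2, 3]) | beam.CombineGlobally(sum)
    # This toy pipeline is deterministic; for a triggered/streaming pipeline the
    # acceptable outcomes would enumerate each legitimate firing pattern.
    assert_that(result, is_one_of([6]))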

> On Thu, Apr 4, 2019 at 3:36 AM Jozsef Bartok  wrote:
>>
>> Hi.
>>
>> My name is Jozsef, I've been working on Runners based on Hazelcast Jet. 
>> Plural because we have both an "old-style" and a "portable" Runner in 
>> development (https://github.com/hazelcast/hazelcast-jet-beam-runner).
>>
>> While our portable one isn't even functional yet, the "old-style" type of 
>> Runner is a bit more mature. It handles only bounded data, but for that case 
>> it passes all Beam tests of the ValidatesRunner category and runs the Nexmark
>> suite successfully too (I'm referring only to correctness, because
>> performance is not yet where it can be; we aren't doing any Pipeline surgery
>> yet and no other optimizations either).
>>
>> A few days ago we started extending it for unbounded data, so we have
>> started adding support for things like triggers, watermarks and such, and we
>> are wondering how come we can't find ValidatesRunner tests specific to
>> unbounded data. Tests from the UsesTestStream category seem to be kind of a
>> candidate for this, but they have nowhere near the coverage and completeness
>> provided by the ValidatesRunner ones.
>>
>> I think we are missing something and I don't know what... Could you pls. 
>> advise?
>>
>> Rgds,
>> Jozsef


Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

2019-04-05 Thread Robert Bradshaw
On Fri, Apr 5, 2019 at 12:50 AM Heejong Lee  wrote:
>
> Robert, does nested/unnested context work properly for Java?

I believe so. It is similar to the bytes coder, which prefixes or not
based on the context.

> I can see that the Context is fixed to NESTED[1] and the encode method with 
> the Context parameter is marked as deprecated[2].
>
> [1]: 
> https://github.com/apache/beam/blob/0868e7544fd1e96db67ff5b9e70a67802c0f0c8e/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L68
> [2]: 
> https://github.com/apache/beam/blob/0868e7544fd1e96db67ff5b9e70a67802c0f0c8e/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/Coder.java#L132

That doesn't mean it's unused, e.g.

https://github.com/apache/beam/blob/release-2.12.0/sdks/java/core/src/main/java/org/apache/beam/sdk/util/CoderUtils.java#L160
https://github.com/apache/beam/blob/release-2.12.0/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/LengthPrefixCoder.java#L64

(and I'm sure there's others).

> On Thu, Apr 4, 2019 at 3:25 PM Robert Bradshaw  wrote:
>>
>> I don't know why there are two separate copies of
>> standard_coders.yaml--originally there was just one (though it did
>> live in the Python directory). I'm guessing a copy was made rather
>> than just pointing both to the new location, but that completely
>> defeats the point. I can't seem to access JIRA right now; could
>> someone file an issue to resolve this?
>>
>> I also think the spec should be next to the definition of the URN,
>> that's one of the reason the URNs were originally in a markdown file
>> (to encourage good documentation, literate programming style). Many
>> coders already have their specs there.
>>
>> Regarding backwards compatibility, we can't change existing coders,
>> and making new coders won't help with inference ('cause changing that
>> would also be backwards incompatible). Fortunately, I think we're
>> already doing the consistent thing here: In both Python and Java the
>> raw UTF-8 encoded bytes are encoded when used in an *unnested* context
>> and the length-prefixed UTF-8 encoded bytes are used when the coder is
>> used in a *nested* context.
>>
>> I'd really like to see the whole nested/unnested context go away, but
>> that'll probably require Beam 3.0; it causes way more confusion than
>> the couple of bytes it saves in a couple of places.
>>
>> - Robert
>>
>> On Thu, Apr 4, 2019 at 10:55 PM Robert Burke  wrote:
>> >
>> > My 2 cents is that the "Textual description" should be part of the
>> > documentation of the URNs on the Proto messages, since that's the common
>> > place. I've added a short description for the varints for example, and we
>> > already have lengthier format & protocol descriptions there for iterables
>> > and similar.
>> >
>> > The proto [1] *can be* the spec if we want it to be.
>> >
>> > [1]: 
>> > https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L557
>> >
>> > On Thu, 4 Apr 2019 at 13:51, Kenneth Knowles  wrote:
>> >>
>> >>
>> >>
>> >> On Thu, Apr 4, 2019 at 1:49 PM Robert Burke  wrote:
>> >>>
>> >>> We should probably move the "java" version of the yaml file [1] to a 
>> >>> common location, rather than leaving it deep in the java hierarchy or copying it
>> >>> for Go and Python, but that can be a separate task. It's probably
>> >>> non-trivial since it looks like it's part of a java resources structure.
>> >>
>> >>
>> >> Seems like /model is a good place for this if we don't want to invent a 
>> >> new language-independent hierarchy.
>> >>
>> >> Kenn
>> >>
>> >>
>> >>>
>> >>> Luke, the Go SDK doesn't currently do this validation, but it shouldn't 
>> >>> be difficult, given pointers to the Java and Python variants of the 
>> >>> tests to crib from [2]. Care would need to be taken so that Beam Go SDK 
>> >>> users (such as they are) aren't forced to run them, and not have the 
>> >>> yaml file to read. I'd suggest putting it with the integration tests [3].
>> >>>
>> >>> I've filed a JIRA (BEAM-7009) for tracking this Go SDK side work. [4]
>> >>>
>> >>> 1: 
>> >>> https://github.com/apache/beam/blob/m

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

2019-04-04 Thread Robert Bradshaw
 PR[3] that adds the "beam:coder:double:v1" as tests 
>>>>>> to the Java and Python SDKs to ensure interoperability.
>>>>>>
>>>>>> Robert Burke, does the Go SDK have a test where it uses 
>>>>>> standard_coders.yaml and runs compatibility tests?
>>>>>>
>>>>>> Chamikara, creating new coder classes is a pain since the type -> coder 
>>>>>> mapping per SDK language would select the non-well known type if we 
>>>>>> added a new one to a language. If we swapped the default type->coder 
>>>>>> mapping, this would still break update for pipelines forcing users to 
>>>>>> update their code to select the non-well known type. If we don't change 
>>>>>> the default type->coder mapping, the well known coder will gain little 
>>>>>> usage. I think we should fix the Python coder to use the same encoding 
>>>>>> as Java for UTF-8 strings before there are too many Python SDK users.
>>>>>
>>>>>
>>>>> I was thinking that maybe we should just change the default UTF-8 coder
>>>>> for the Fn API path, which is experimental. Updating Python to do what's done
>>>>> for Java is fine if we agree that the encoding used for Java should be the
>>>>> standard.
>>>>>
>>>>
>>>> That is a good idea to use the Fn API experiment to control which gets 
>>>> selected.
>>>>
>>>>>>
>>>>>>
>>>>>> 1: 
>>>>>> https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
>>>>>> 2: 
>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/testing/data/standard_coders.yaml
>>>>>> 3: https://github.com/apache/beam/pull/8205
>>>>>>
>>>>>> On Thu, Apr 4, 2019 at 11:50 AM Chamikara Jayalath 
>>>>>>  wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Apr 4, 2019 at 11:29 AM Robert Bradshaw  
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> A URN defines the encoding.
>>>>>>>>
>>>>>>>> There are (unfortunately) *two* encodings defined for a Coder (defined
>>>>>>>> by a URN), the nested and the unnested one. IIRC, in both Java and
>>>>>>>> Python, the nested one prefixes with a var-int length, and the
>>>>>>>> unnested one does not.
>>>>>>>
>>>>>>>
>>>>>>> Could you clarify where we define the exact encoding? I only see a URN
>>>>>>> for UTF-8 [1], while if you look at the implementations, Java includes the
>>>>>>> length in the encoding [2] while Python [3] does not.
>>>>>>>
>>>>>>> [1] 
>>>>>>> https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L563
>>>>>>> [2] 
>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L50
>>>>>>> [3] 
>>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/coders/coders.py#L321
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> We should define the spec clearly and have cross-language tests.
>>>>>>>
>>>>>>>
>>>>>>> +1
>>>>>>>
>>>>>>> Regarding backwards compatibility, I agree that we should probably not
>>>>>>> update existing coder classes. Probably we should just standardize the
>>>>>>> correct encoding (maybe as a comment near the corresponding URN in
>>>>>>> beam_runner_api.proto?) and create new coder classes as needed.
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Apr 4, 2019 at 8:13 PM Pablo Estrada  
>>>>>>>> wrote:
>>>>>>>> >
>>>>>>>> > Could this be a backwards-incompatible change that would break 
>>>>>>>> > pipelines from upgrading? If they have data in-flight in between 
>>>>>>>> > operators, and we change the coder, they would break?
>>>>>>>> > I know very little about coders, but since nobody has mentioned it, 
>>>>>>>> > I wanted to make sure we have it in mind.
>>>>>>>> > -P.
>>>>>>>> >
>>>>>>>> > On Wed, Apr 3, 2019 at 8:33 PM Kenneth Knowles  
>>>>>>>> > wrote:
>>>>>>>> >>
>>>>>>>> >> Agree that a coder URN defines the encoding. I see that string 
>>>>>>>> >> UTF-8 was added to the proto enum, but it needs a written spec of 
>>>>>>>> >> the encoding. Ideally some test data that different languages can 
>>>>>>>> >> use to drive compliance testing.
>>>>>>>> >>
>>>>>>>> >> Kenn
>>>>>>>> >>
>>>>>>>> >> On Wed, Apr 3, 2019 at 6:21 PM Robert Burke  
>>>>>>>> >> wrote:
>>>>>>>> >>>
>>>>>>>> >>> String UTF8 was recently added as a "standard coder" URN in the
>>>>>>>> >>> protos, but I don't think that developed beyond Java, so adding it
>>>>>>>> >>> to Python would be reasonable in my opinion.
>>>>>>>> >>>
>>>>>>>> >>> The Go SDK handles Strings as "custom coders" presently, which for
>>>>>>>> >>> Go are always length prefixed (and reported to the Runner as
>>>>>>>> >>> LP+CustomCoder). It would be straightforward to add the correct
>>>>>>>> >>> handling for strings, as Go natively treats strings as UTF8.
>>>>>>>> >>>
>>>>>>>> >>>
>>>>>>>> >>> On Wed, Apr 3, 2019, 5:03 PM Heejong Lee  
>>>>>>>> >>> wrote:
>>>>>>>> >>>>
>>>>>>>> >>>> Hi all,
>>>>>>>> >>>>
>>>>>>>> >>>> It looks like the UTF-8 string coders in the Java and Python SDKs use
>>>>>>>> >>>> different encoding schemes. StringUtf8Coder in the Java SDK puts the
>>>>>>>> >>>> varint length of the input string before the actual data bytes;
>>>>>>>> >>>> however, StrUtf8Coder in the Python SDK directly encodes the input
>>>>>>>> >>>> string to a bytes value. For the last few weeks, I've been testing
>>>>>>>> >>>> and fixing cross-language IO transforms and this discrepancy is a
>>>>>>>> >>>> major blocker for me. IMO, we should unify the encoding schemes
>>>>>>>> >>>> of UTF-8 strings across the different SDKs and make it a standard
>>>>>>>> >>>> coder. Any thoughts?
>>>>>>>> >>>>
>>>>>>>> >>>> Thanks,


Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

2019-04-04 Thread Robert Bradshaw
A URN defines the encoding.

There are (unfortunately) *two* encodings defined for a Coder (defined
by a URN), the nested and the unnested one. IIRC, in both Java and
Python, the nested one prefixes with a var-int length, and the
unnested one does not.

We should define the spec clearly and have cross-language tests.
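
For concreteness, a minimal Python sketch of the two encodings in question (a
hand-rolled illustration, not Beam's actual coder code):

def _varint(n):
    # Unsigned base-128 varint, as used for the nested-context length prefix.
    out = bytearray()
    while True:
        bits = n & 0x7F
        n >>= 7
        out.append(bits | (0x80 if n else 0))
        if not n:
            return bytes(out)

def encode_utf8(value, nested):
    data = value.encode('utf-8')
    # Nested: varint length prefix + UTF-8 bytes; unnested: just the raw bytes.
    return _varint(len(data)) + data if nested else data

assert encode_utf8(u'abc', nested=False) == b'abc'
assert encode_utf8(u'abc', nested=True) == b'\x03abc'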

On Thu, Apr 4, 2019 at 8:13 PM Pablo Estrada  wrote:
>
> Could this be a backwards-incompatible change that would break pipelines from 
> upgrading? If they have data in-flight in between operators, and we change 
> the coder, they would break?
> I know very little about coders, but since nobody has mentioned it, I wanted 
> to make sure we have it in mind.
> -P.
>
> On Wed, Apr 3, 2019 at 8:33 PM Kenneth Knowles  wrote:
>>
>> Agree that a coder URN defines the encoding. I see that string UTF-8 was 
>> added to the proto enum, but it needs a written spec of the encoding. 
>> Ideally some test data that different languages can use to drive compliance 
>> testing.
>>
>> Kenn
>>
>> On Wed, Apr 3, 2019 at 6:21 PM Robert Burke  wrote:
>>>
>>> String UTF8 was recently added as a "standard coder" URN in the protos,
>>> but I don't think that developed beyond Java, so adding it to Python would
>>> be reasonable in my opinion.
>>>
>>> The Go SDK handles Strings as "custom coders" presently, which for Go are
>>> always length prefixed (and reported to the Runner as LP+CustomCoder). It
>>> would be straightforward to add the correct handling for strings, as Go
>>> natively treats strings as UTF8.
>>>
>>>
>>> On Wed, Apr 3, 2019, 5:03 PM Heejong Lee  wrote:

 Hi all,

 It looks like the UTF-8 string coders in the Java and Python SDKs use different
 encoding schemes. StringUtf8Coder in the Java SDK puts the varint length of
 the input string before the actual data bytes; however, StrUtf8Coder in the
 Python SDK directly encodes the input string to a bytes value. For the last few
 weeks, I've been testing and fixing cross-language IO transforms and this
 discrepancy is a major blocker for me. IMO, we should unify the encoding
 schemes of UTF-8 strings across the different SDKs and make it a standard
 coder. Any thoughts?

 Thanks,


Re: Deprecating Avro for fastavro on Python 3

2019-04-02 Thread Robert Bradshaw
I agree with Ahmet.

Fastavro seems to be well maintained and has good, tested
compatibility. Unless we expect significant performance improvements
in the standard Avro Python package (a significant undertaking, likely
not one we have the bandwidth to take on, and my impression is that
it's historically not been a priority) it's hard to justify using it
instead. Python 3 issues are just the trigger to consider finally
moving over, as I think that was the long-term intent back when
fastavro was added as an option. (Possibly if there are features
missing from fastavro, that could be a reason as well, at least to
keep the option around even if it's not the default.)
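
As a small concrete check of that compatibility point (the file name below is
hypothetical; both readers are the libraries' standard entry points):

from avro.datafile import DataFileReader
from avro.io import DatumReader
import fastavro

# Read the same Avro container file with both implementations.
with open('records.avro', 'rb') as f:
    with_avro = list(DataFileReader(f, DatumReader()))
with open('records.avro', 'rb') as f:
    with_fastavro = list(fastavro.reader(f))

assert with_avro == with_fastavro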

That being said, we should definitely not change the default and
remove the old version in the same release.

- Robert

On Tue, Apr 2, 2019 at 2:12 PM Robbe Sneyders  wrote:
>
> Hi all,
>
> Thank you for the feedback. Looking at the responses, it seems like there is 
> a consensus to move forward with fastavro as the default implementation on 
> Python 3.
>
> There are 2 questions left however:
> - Should fastavro also become the default implementation on Python 2?
> This is a trade-off between having a consistent API across Python versions
> and keeping the current behavior on Python 2.
>
> - Should we keep the avro-python3 dependency?
> With the proposed solution, we could remove the avro-python3 dependency, but 
> it might have to be re-added if we want to support Avro again on Python 3 in 
> a future version.
>
> Kind regards,
> Robbe
>
>
>
>
> Robbe Sneyders
>
> ML6 Gent
>
> M: +32 474 71 31 08
>
>
> On Thu, 28 Mar 2019 at 18:28, Ahmet Altay  wrote:
>>
>> Hi Ismaël,
>>
>> It is great to hear that Avro is planning to make a release soon.
>>
>> To answer your concerns, fastavro has a set of tests using regular avro 
>> files[1] and it also has a large set of users (with 675470 package 
>> downloads). This is in addition to it being a py2 & py3 compatible package 
>> and offering ~7x performance improvements [2]. Another data point, we were 
>> testing fastavro for a while behind an experimental flag and have not seen 
>> issues related to compatibility.
>>
>> pyavro-rs sounds promising; however, I could not find a released version of it
>> on PyPI. The source code does not look like it is being maintained either, with the
>> last commit on Jul 2, 2018 (for comparison, the last change on fastavro was on
>> Mar 19, 2019).
>>
>> I think given the state of things, it makes sense to switch to fastavro as 
>> the default implementation to unblock python 3 changes. When avro offers a 
>> similar level of performance we could switch back without any visible user 
>> impact.
>>
>> Ahmet
>>
>> [1] https://github.com/fastavro/fastavro/tree/master/tests
>> [2] https://pypi.org/project/fastavro/
>>
>> On Thu, Mar 28, 2019 at 7:53 AM Ismaël Mejía  wrote:
>>>
>>> Hello,
>>>
>>> The problem of switching implementations is the risk of losing
>>> interoperability, and this is more important than performance. Does
>>> fastavro have tests that guarantee that it is fully compatible with
>>> Avro’s Java version? (given that it is the de-facto implementation
>>> used everywhere).
>>>
>>> If performance is a more important criterion, maybe it is worth checking
>>> out pyavro-rs [1]; you can take a look at its performance in the great
>>> talk from last year [2].
>>>
>>> I have been involved actively in the Avro community in the last months
>>> and I am now a committer there. Also Dan Kulp, who has made multiple
>>> contributions to Beam, is now a PMC member too. We are at this point
>>> working hard to get the next release of Avro out; actually, the branch
>>> cut of Avro 1.9.0 is happening this week, and we plan to improve the
>>> release cadence. Please understand that the issue with Avro is that it
>>> is a really specific and ‘old’ project (~10 years), so part of the
>>> active community moved to other areas because it is stable, but we are still
>>> there working on it and we are eager to improve it for everyone’s
>>> needs (and of course Beam's needs).
>>>
>>> I know that Python 3’s Avro implementation is still lacking and could
>>> be improved (views expressed here are clearly valid), but maybe this
>>> is a chance to contribute there too. Remember, Apache projects are a
>>> family and we have a history of cross-collaboration with other
>>> communities, e.g. Flink and Calcite, so why not give Avro a chance
>>> too.
>>>
>>> Regards,
>>> Ismaël
>>>
>>> [1] https://github.com/flavray/pyavro-rs
>>> [2] 
>>> https://ep2018.europython.eu/media/conference/slides/how-to-write-rust-instead-of-c-and-get-away-with-it-yes-its-a-python-talk.pdf
>>>
>>> On Wed, Mar 27, 2019 at 11:42 PM Chamikara Jayalath
>>>  wrote:
>>> >
>>> > +1 for making use_fastavro the default for Python3. I don't see any 
>>> > significant drawbacks in doing this from Beam's point of view. One 
>>> > concern is whether avro and fastavro can safely co-exist in the same 
>>> > environment so that Beam continues to work for users who alrea

Re: [PROPOSAL] Standardize Gradle structure in Python SDK

2019-03-29 Thread Robert Bradshaw
On Fri, Mar 29, 2019 at 12:54 PM Michael Luckey  wrote:
>
> Really like the idea of improving here.
>
> Unfortunately, I haven't worked with Python on that scale yet, so bear with
> my naive understanding in this regard. If I understand correctly, the
> suggestion will result in a couple of projects consisting only of a
> build.gradle file to kind of work around Gradle's decision not to
> parallelize within projects, right? In consequence, this also kind of
> decouples projects from their content - the stuff which constitutes the
> project - and forces the build file to 'somehow reach out' to the content of
> other (only python root?) projects, e.g. it couples projects. This somehow
> 'feels non-natural' to me. But, of course, it might be the path to go. As I
> said before, I've never worked on Python on that scale.

It feels a bit odd to me as well. Is it possible to have multiple
projects per directory (e.g. a suite of testing ones) rather than
having to break things up like this, especially if the goal is
primarily to get parallel running of tests? Especially if we could
automatically create the cross-product rather than manually? There
also seems to be some redundancy with what tox is doing here.

> But I believe I remember Robert talking about using in-project
> parallelisation for his development. Is this something which could also work
> on CI? Of course, that will not help with different Python versions, but
> maybe that could also be solved by Gradle's variants which are introduced in
> 5.3 - I definitely need some time to investigate the possibilities here. At
> first sight it feels like lots of duplication to create 'builds' for every
> Python version. Or wouldn't that be the case?
>
> And another naive thought on my side: isn't that non-parallelizability also
> caused by the monolithic setup of the Python code base? E.g. if I understand
> correctly, the Java SDK is split into core/runners/IOs etc., each encapsulated
> into full-blown projects, i.e. buckets of sources, tests and a build file. Would
> it be technically possible to do something similar with Python? I assume that
> has been discussed before and torn apart, but I couldn't find it on the mailing
> list.

Neither the culture nor the tooling of Python supports lots of
interdependent "sub-packages" for a single project--at least not
something smaller than one would want to deploy to Pypi. So while one
could do this, it'd be going against the grain. There are also much
lower-hanging opportunities for parallelization (e.g. running the test
suites for separate python versions in parallel).

It's not very natural (as I understand it) with Go either. If we're
talking directory re-organization, I think it would make sense to
consider having top-level java, python, go, ... next to model,
website, etc.

> And as a last thought, will usage of pygradle help with better Python/Gradle
> integration? Currently, we mainly use Gradle to call into shell scripts,
> which doesn't help Gradle nor probably Python's tooling to do the job very
> well. But deeper integration might cause problems on the Python dev side, dunno :(

Possibly.

Are there any Python developers that primarily use the gradle
commands? Personally, I only use them if I'm using Java (or sometimes
work that is a mix of Java and Python, e.g. the Python-on-Flink
tests). Otherwise I use tox, or "python setup.py test [-s ...]"
directly. Gradle primarily has value as a top-level orchestration (in
particular for CI) and easy way for those who only touch Python
occasionally to run all the tests. If that's the case, optimizing our
gradle scripts for CI seems best.

> On Thu, Mar 28, 2019 at 6:37 PM Mark Liu  wrote:
>>
>> Thank you Ahmet. Answer your questions below:
>>
>>>
>>> - Could you comment on what kind of parallelization we will gain by this? 
>>> In terms of real numbers, how would this affect build and test times?
>>
>>
>> The proposal is based on Gradle parallel execution: "you can force Gradle to 
>> execute tasks in parallel as long as those tasks are in different projects". 
>> In Beam, a project is declared per build.gradle file and registered in
>> settings.gradle. Tasks that are included in a single Gradle execution will run
>> in parallel only if they are declared in separate build.gradle files.
>>
>> An example of applying parallelism is the beam_PreCommit_Python job, which runs
>> the :pythonPreCommit task that contains tasks distributed across 4 build.gradle
>> files. The execution graph looks like https://scans.gradle.com/s/4frpmto6o7hto/timeline:
>>
>> Without this proposal, all tasks will run sequentially, which can be ~2x
>> longer. If more py36 and py37 tests are added in the future, things will be even
>> worse.
>>
>>> - I am guessing this will reduce complexity. Is it possible to quantify the 
>>> improvement related to this?
>>
>>
>> The general code complexity of function/method/property may not change here 
>> since we basically group tasks in a different way without changing inside 
>> logic. I don't know if th

Re: Python SDK Arrow Integrations

2019-03-29 Thread Robert Bradshaw
First off, huge +1 to a good integration with Arrow and Beam. I think
to fully realize the benefits we need to have deeper integration than
arrow-frame-batches as elements, i.e. SDKs should be augmented to
understand arrow frames as batches of individual elements, each with
(possibly) their own timestamps and windows, correctly updating
element counts, and allowing user operations to operate on batches
rather than just elementwise. (IMHO this, along with a(n optional)
more pandas-like API for manipulating PCollections, is actually one of
the critical missing pieces in Python.) Some thought needs to go into
how to automatically handle batching and progress and liquid sharding
in this case.
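
As a very rough sketch of where we are today (assuming pyarrow RecordBatches
pickle cleanly with your pyarrow/Beam versions): a whole RecordBatch can only be
passed around as a single opaque element, and any column-wise work happens inside
a DoFn, invisible to the SDK's element counting, timestamps and windows.

import apache_beam as beam
import pyarrow as pa

class SumColumn(beam.DoFn):
    def process(self, batch):
        # Operate on the whole Arrow batch column-wise instead of row by row.
        yield sum(batch.column(0).to_pylist())

batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], ['value'])
with beam.Pipeline() as p:
    _ = (p
         | beam.Create([batch])  # today: one RecordBatch == one opaque element
         | beam.ParDo(SumColumn())
         | beam.Map(print))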

For the most part, I don't think this impacts the model, though of
course we'd want to support using the arrow format to send batches of
elements across the FnAPI barriers.

There don't seem to be any Java libraries (yet) that have the
widespread use or maturity of Pandas, but that may come in the future.
Certainly it makes sense for the Beam primitives used in SQL (such as
projection, filtering, possibly even simple expression-based
computations) to have language-agnostic representations which could be
implemented in any SDK (and possibly even a runner) to maximize fusion
and minimize data transfer.

Also, I agree that support for large iterables makes a separate Beam
schema desirable. That being said, we shouldn't unnecessarily diverge,
and could possibly share implementations as well (for increased
interoperability with the larger ecosystems).

On Fri, Mar 29, 2019 at 5:48 AM Kenneth Knowles  wrote:
>
> On Thu, Mar 28, 2019 at 12:24 PM Brian Hulette  wrote:
>>
>> > - Presumably there is a pandas counterpart in Java. Is there? Do you know?
>> I think there are some dataframe libraries in Java we could look into. I'm 
>> not aware of anything that has the same popularity and arrow integration as 
>> pandas though. Within the arrow project there is Gandiva [1], which has Java 
>> bindings. It generates optimized LLVM code for processing arrow data based 
>> on an expression tree. I think that could be a valuable tool for SQL.
>
>
> Gandiva looks to be similar to what we get from Calcite today, but I wonder 
> if it is higher performance due to being lower level or more flexible (for 
> example Calcite's codegen is pretty hardcoded to millisecond precision 
> datetimes). Worth learning about. Since another big benefit of Calcite's 
> expression compiler is implementation of "all" builtin functions for free, 
> I'd look closely at how to provide a builtin function catalog to Gandiva.
>
>> > - Is it valuable for Beam to invent its own schemas? I'd love for Beam to 
>> > have identical schema affordances to either protobuf or arrow or avro, 
>> > with everything layered on that as logical types (including SQL types). 
>> > What would it look like if Beam schemas were more-or-less Arrow schemas?
>> As it stands right now there is a very clear mapping from Beam schemas to 
>> Arrow schemas. Both define similar primitive types, as well as nested types 
>> like row (beam) -> struct (arrow), array (beam) -> list (arrow). In addition 
>> Arrow schemas have a binary representation and implementations in many 
>> languages.
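>>
>> For concreteness, a small pyarrow-only sketch of that mapping (the field names
>> are invented):
>>
>> import pyarrow as pa
>>
>> # A Beam row schema with nested row/array fields maps directly onto an Arrow
>> # struct/list schema.
>> arrow_schema = pa.schema([
>>     ('id', pa.int64()),               # Beam INT64
>>     ('name', pa.string()),            # Beam STRING
>>     ('tags', pa.list_(pa.string())),  # Beam ARRAY of STRING
>>     ('address', pa.struct([           # Beam ROW
>>         ('city', pa.string()),
>>         ('zip', pa.string()),
>>     ])),
>> ])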
>>
>> I had some offline discussion with Reuven about this - and he pointed out 
>> that eventually we'd like Beam schemas to have a type for large iterables as 
>> well, so that even a PCollection<KV<K, Iterable<V>>> can have a schema, and
>> that's certainly a concept that wouldn't make sense for Arrow. So I think 
>> the answer is yes it is valuable for Beam to have its own schemas - that way 
>> we can represent Beam-only concepts, but still be able to map to other 
>> schemas when it makes sense (for example, in the KV<K, Iterable<V>> case we
>> could map V's beam schema to an arrow schema and encode it as arrow record 
>> batches).
>
> This convinces me that Beam should have its own schema definition. There are 
> things in Beam - and could be novelties created in Beam - that might not fit 
> Arrow. And we don't want to have such a tight coupling. If the mapping is 
> straightforward enough then there's not that much work to just convert 
> to/from. But the piece I would think about is that any change to Beam or
> Arrow could introduce something that doesn't translate well, so we just need 
> to be cognizant of that.
>
> Kenn
>
>>
>>
>> Brian
>>
>> [1] http://arrow.apache.org/blog/2018/12/05/gandiva-donation/
>>
>> On Wed, Mar 27, 2019 at 9:19 PM Kenneth Knowles  wrote:
>>>
>>> Thinking about Arrow + Beam SQL + schemas:
>>>
>>>  - Obviously many SQL operations could be usefully accelerated by arrow / 
>>> columnar. Especially in the analytical realm this is the new normal. For 
>>> ETL, perhaps less so.
>>>
>>>  - Beam SQL planner (pipeline construction) is implemented in Java, and so 
>>> the various DoFns/CombineFns that implement projection, filter, etc, are 
>>> also in Java.
>>> - Arrow is of course available in Java.
>>>

Re: [spark runner dataset POC] workCount works !

2019-03-22 Thread Robert Bradshaw
Nice!

Between this and the portability work
(https://github.com/apache/beam/pull/8115), hopefully we'll have a
modern Spark runner soon. Any idea on how hard (or easy?) it will be
to merge those two?


On Fri, Mar 22, 2019 at 9:29 AM Łukasz Gajowy  wrote:
>
> Cool. :) Congrats and thank you for your work!
>
> Łukasz
>
> On Thu, 21 Mar 2019 at 18:51, Kenneth Knowles  wrote:
>>
>> Nice milestone!
>>
>> On Thu, Mar 21, 2019 at 10:49 AM Pablo Estrada  wrote:
>>>
>>> This is pretty cool. Thanks for working on this and for sharing:)
>>> Best
>>> -P.
>>>
>>> On Thu, Mar 21, 2019, 8:18 AM Alexey Romanenko  
>>> wrote:

 Good job! =)
 Congrats to all who were involved in moving this forward!

 Btw, for all who are interested in the progress of work on this runner, I
 wanted to remind you that we have a #beam-spark channel on Slack where we
 discuss all ongoing questions. Feel free to join!

 Alexey

 > On 21 Mar 2019, at 15:51, Jean-Baptiste Onofré  wrote:
 >
 > Congrats and huge thanks !
 >
 > (I'm glad to be one of the little "launcher" to this effort ;) )
 >
 > Regards
 > JB
 >
 > On 21/03/2019 15:47, Ismaël Mejía wrote:
 >> This is excellent news. Congrats Etienne, Alexey and the others
 >> involved for the great work!
 >> On Thu, Mar 21, 2019 at 3:10 PM Etienne Chauchot  
 >> wrote:
 >>>
 >>> Hi guys,
 >>>
 >>> We are glad to announce that the spark runner POC that was re-written 
 >>> from scratch using the structured-streaming framework and the dataset 
 >>> API can now run WordCount !
 >>>
 >>> It is still embryonic. For now it only runs in batch mode and there is 
 >>> no fancy stuff like state, timer, SDF, metrics, ... but it is still a 
 >>> major step forward !
 >>>
 >>> Streaming support work has just started.
 >>>
 >>> You can find the branch here: 
 >>> https://github.com/apache/beam/tree/spark-runner_structured-streaming
 >>>
 >>> Enjoy,
 >>>
 >>> Etienne
 >>>
 >>>



Re: [PROPOSAL] commit granularity in master

2019-03-22 Thread Robert Bradshaw
On Fri, Mar 22, 2019 at 3:02 PM Ismaël Mejía  wrote:
>
> It is good to remind committers of their responsibility for the
> 'cleanliness' of the merged code. Github sadly does not have an easy
> interface to do this and it should be done manually in many cases;
> sadly I have seen many committers just merging code with multiple
> 'fixup' style commits by clicking Github's merge button. Maybe it is
> time to find a way to automatically detect these cases and disallow
> the merge, or maybe we should reconsider the policy altogether if there
> are people who don't see the value of this.

I agree about keeping our history clean and useful, and think those
four points summarize things well (but a clarification on fixup
commits would be good).

+1 to an automated check that there are many extraneous commits.
Anything the person hitting the merge button would easily see before
doing the merge.
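
For example, a rough Python sketch of such a check (the fixup-message patterns
are just guesses at common offenders, not an existing Beam script):

import re
import subprocess

FIXUP = re.compile(
    r'^(fixup|squash)!|fix (checkstyle|spotless|lint)|address (review )?comments',
    re.IGNORECASE)

def extraneous_commits(base='origin/master', head='HEAD'):
    # List "hash subject" lines for the PR's commits and flag fixup-style ones.
    log = subprocess.check_output(
        ['git', 'log', '--format=%h %s', '%s..%s' % (base, head)]).decode('utf-8')
    return [line for line in log.splitlines()
            if FIXUP.search(line.split(' ', 1)[-1])]

if __name__ == '__main__':
    commits = extraneous_commits()
    if commits:
        print('Commits that should probably be squashed before merging:')
        print('\n'.join(commits))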

> I would like to propose a small modification to the commit title style
> on that guide. We use two brackets to enclose the issue id, but that
> really does not improve the readability much and uses 2 extra spaces
> of the already short title space, so what about getting rid of them?
>
> Current style: "[BEAM-] Commit title"
> Proposed style: "BEAM- Commit title"
>
> Any ideas or opinions pro/con?

I like the extra delimitation the brackets give, worth the two extra
characters to me. More importantly, it's nice to have consistency, and
the only way to be consistent with the past is to leave them there.

> On Fri, Mar 22, 2019 at 2:32 PM Etienne Chauchot  wrote:
> >
> > Thanks Alexey for pointing this out. I did not know about these 4 points in the
> > guide. I agree with them also. I would just add "Avoid keeping formatting
> > commits such as checkstyle or spotless fixes in the history".
> > If it is ok, I'll submit a PR to add this point.
> > On Friday, 22 March 2019 at 11:33 +0100, Alexey Romanenko wrote:
> >
> > Etienne, thanks for bringing this topic.
> >
> > I think it was already discussed several times before and we finally
> > came to what we have in the current Committer guide "Granularity of
> > changes" [1].
> >
> > Personally, I completely agree with these 4 rules presented there. The main
> > concern is that all committers should follow them as well; otherwise we
> > still sometimes have a bunch of small commits with inexpressive messages (I
> > believe they were added during the review process and were not squashed before
> > merging).
> >
> > In my opinion, the most important rule is that every commit should be 
> > atomic in terms of added/fixed functionality, and rolling it back should not
> > break the master branch.
> >
> > [1] 
> > https://beam.apache.org/contribute/committer-guide/#pull-request-review-objectives
> >
> >
> > On 22 Mar 2019, at 10:16, Etienne Chauchot  wrote:
> >
> > Hi all,
> > It has already been discussed partially but I would like that we agree on 
> > the commit granularity that we want in our history.
> > Some features were squashed to only one commit which seems a bit too 
> > granular to me for a big feature.
> > On the contrary I see PRs with very small commits such as "apply spotless" 
> > or "fix checkstyle".
> >
> > IMHO I think a good commit size is an isolable portion of a feature such as 
> > for ex "implement Read part of Kudu IO" or "reduce concurrency in Test A". 
> > Such a granularity allows to isolate problems easily (git bisect for ex) 
> > and rollback only a part if necessary.
> > WDYT about:
> > - squashing non meaningful commits such as "apply review comments" (and 
> > rather state what they do and group them if needed), or "apply spotless" or 
> > "fix checkstyle"
> > - trying to stick to a commit size as described above
> >
> > => and of course update the contribution guide at the end
> > ?
> >
> > Best
> > Etienne
> >
> >


Re: What quick command to catch common issues before pushing a python PR?

2019-03-20 Thread Robert Bradshaw
I use tox as well. Actually, I use detox and retox (parallel versions
of tox, easily installable with pip) which can speed things up quite a
bit.

On Wed, Mar 20, 2019 at 1:33 AM Pablo Estrada  wrote:
>
> Correction  - the command is now: tox -e py35-gcp,py35-lint
>
> And it ran on my machine in 5min 40s. Not blazing fast, but at least 
> significantly faster than waiting for Jenkins : )
> Best
> -P.
>
> On Tue, Mar 19, 2019 at 5:22 PM Pablo Estrada  wrote:
>>
>> I use a selection of tox tasks. Here are the tox tasks that I use the most:
>> - py27-gcp
>> - py35-gcp
>> - py27-cython
>> - py35-cython
>> - py35-lint
>> - py27-lint
>>
>> Most recently, I'll run `tox -e py3-gcp,py3-lint`, which run fairly quickly. 
>> You can choose which subset works for you.
>> My insight is: Lints are pretty fast, so it's fine to add a couple different 
>> lints. Unittest runs are pretty slow, so I usually go for the one with most 
>> coverage for my change (x-gcp, or x-cython).
>> Best
>> -P.
>>
>> On Mon, Feb 25, 2019 at 4:33 PM Ruoyun Huang  wrote:
>>>
>>> nvm.  Don't take my previous non-scientific comparison (only ran it once) 
>>> too seriously. :-)
>>>
>>> I tried to repeat each one multiple times and now the difference
>>> diminishes. Likely there was a transient error in caching.
>>>
>>> On Mon, Feb 25, 2019 at 3:38 PM Kenneth Knowles  wrote:

 Ah, that is likely caused by us having ill-defined tasks that cannot be 
 cached. Or is it that the configuration time is so significant?

 Kenn

 On Mon, Feb 25, 2019 at 11:05 AM Ruoyun Huang  wrote:
>
> Out of curiosity, as a light Gradle user, I did a side-by-side comparison,
> and the readings confirm what Kenn and Michael suggest.
>
> In the same repository, do a gradle clean followed by either of the
> two commands and measure their runtime. The latter one takes
> 1/3 of the running time.
>
> time ./gradlew spotlessApply && ./gradlew checkstyleMain && ./gradlew 
> checkstyleTest && ./gradlew javadoc && ./gradlew findbugsMain && 
> ./gradlew compileTestJava && ./gradlew compileJava
> real 9m29.330s
> user 0m11.330s
> sys 0m1.239s
>
> time ./gradlew spotlessApply checkstyleMain checkstyleTest javadoc 
> findbugsMain compileJava compileTestJava
> real 3m35.573s
> user 0m2.701s
> sys 0m0.327s
>
>
>
>
>
>
>
> On Mon, Feb 25, 2019 at 10:47 AM Alex Amato  wrote:
>>
>> @Michael, no particular reason. I think Ken's suggestion makes more 
>> sense.
>>
>> On Mon, Feb 25, 2019 at 10:36 AM Udi Meiri  wrote:
>>>
>>> Talking about Python:
>>> I only know of "./gradlew lint", which include style and some py3 
>>> compliance checking.
>>> There is no auto-fix like spotlessApply AFAIK.
>>>
>>> As a side-note, I really dislike our python line continuation indent 
>>> rule, since pycharm can't be configured to adhere to it and I find 
>>> myself manually adjusting whitespace all the time.
>>>
>>>
>>> On Mon, Feb 25, 2019 at 10:22 AM Kenneth Knowles  
>>> wrote:

 FWIW gradle is a depgraph-based build system. You can gain a few 
 seconds by putting all but spotlessApply in one command.

 ./gradlew spotlessApply && ./gradlew checkstyleMain checkstyleTest 
 javadoc findbugsMain compileTestJava compileJava

 It might be clever to define a meta-task. Gradle "base plugin" has the 
 notable check (build and run tests), assemble (make artifacts), and 
 build (assemble + check, badly named!)

 I think something like "everything except running tests and building 
 artifacts" might be helpful.

 Kenn

 On Mon, Feb 25, 2019 at 10:13 AM Alex Amato  wrote:
>
> I made a thread about this a while back for java, but I don't think 
> the same commands like spotless work for python.
>
> auto fixing lint issues
> running quick checks which would fail the PR (without running the
> whole precommit?)
> Something like findbugs to detect common issues (i.e. py3 compliance)
>
> FWIW, this is what I have been using for java. It will catch pretty 
> much everything except presubmit test failures.
>
> ./gradlew spotlessApply && ./gradlew checkstyleMain && ./gradlew 
> checkstyleTest && ./gradlew javadoc && ./gradlew findbugsMain && 
> ./gradlew compileTestJava && ./gradlew compileJava
>
>
>
> --
> 
> Ruoyun  Huang
>
>>>
>>>
>>> --
>>> 
>>> Ruoyun  Huang
>>>


Re: [PROPOSAL] Preparing for Beam 2.12.0 release

2019-03-18 Thread Robert Bradshaw
I agree with Kenn on both counts. We can (and should) keep 2.7.x
alive with an imminent 2.7.1 release, and choose the next one at a
future date based on actual experience with an existing release.

On Fri, Mar 15, 2019 at 5:36 PM Ahmet Altay  wrote:
>
> +1 to extending 2.7.x LTS lifetime for a little longer and simultaneously 
> making a 2.7.1 release.
>
> On Fri, Mar 15, 2019 at 9:32 AM Kenneth Knowles  wrote:
>>
>> We actually have some issues queued up for 2.7.1, and IMO it makes sense to 
>> extend 2.7 since the 6 month period was just a pilot and like you say we 
>> haven't really exercised LTS.
>>
>> Re 2.12.0 I strongly feel LTS should be designated after a release has seen 
>> some use. If we extend 2.7 for another while then we will have some 
>> candidate by the time it expires. (2.8, 2.9, 2.10 all have major issues, 
>> while 2.11 and 2.12 are untried)
>>
>> Kenn
>>
>> On Fri, Mar 15, 2019 at 7:50 AM Thomas Weise  wrote:
>>>
>>> Given no LTS activity for 2.7.x - do we really need it?
>>>
>>>
>>> On Fri, Mar 15, 2019 at 6:54 AM Ismaël Mejía  wrote:

 After looking at the dates, it seems that 2.12 should be the next LTS
 since it will be exactly 6 months after the release of 2.7.0. Does anyone
 have comments, or would anyone prefer to instead make the next version
 (2.13) the LTS?

 On Thu, Mar 14, 2019 at 12:13 PM Michael Luckey  
 wrote:
 >
 > @mxm
 >
 > Sure we should. Unfortunately the scripts do not have any '--dry-run'
 > toggle. Implementing this seemed not too easy at first sight, as those
 > release scripts do assume committed outputs of their predecessors and 
 > are not yet in the shape to be parameterised.
 >
 > So here is what I did:
 > 1. As I did not want the scripts to do 'sudo' installs on my machine,
 > I first created a docker image with required prerequisites.
 > 2. Cloned beam to that machine (to get the release.scripts)
 > 3. Edited the places which seemed to call to the outside
 > - disabled any git push
 > - changed git url to point to some copy on local filesystem to pull 
 > required changes from there
 > - changed './gradlew' build to './gradlew assemble' as build will 
 > not work on docker anyway
 > - changed publish to publishToMavenLocal
 > - probably some more changes to ensure I do not write to remote
 > 4. run the scripts
 >
 > What I missed out:
 > 1. There is some communication with svn (signing artefacts downloaded 
 > from svn and committing). I just skipped those steps, as I was just too 
 > scared of missing some commit or doing an accidental push to some remote
 > system (where I am hopefully not authorised anyway without doing proper 
 > authentication)
 >
 > If you believe I missed something which could be tested in advance, I'd
 > happily do more testing to ensure a smooth release process.
 >
 > michel
 >
 > On Thu, Mar 14, 2019 at 11:23 AM Maximilian Michels  
 > wrote:
 >>
 >> Hi Andrew,
 >>
 >> Sounds good. Thank you for being the release manager.
 >>
 >> @Michael Shall we perform some dry-run release testing for ensuring
 >> Gradle 5 compatibility?
 >>
 >> Thanks,
 >> Max
 >>
 >> On 14.03.19 00:28, Michael Luckey wrote:
 >> > Sounds good. Thanks for volunteering.
 >> >
 >> > Just as a side note: @aaltay had trouble releasing caused by the 
 >> > switch
 >> > to gradle5. Although that should be fixed now, you will be the first
 >> > using those changes in production. So if you encounter any issues. do
 >> > not hesitate to blame and contact me. Also I am currently looking into
 >> > some improvements to the process suggested by @kenn. So your feedback 
 >> > on
 >> > the current state would be greatly appreciated. I hope to get at least
 >> > https://issues.apache.org/jira/browse/BEAM-6798 done until then.
 >> >
 >> > Thanks again,
 >> >
 >> > michel
 >> >
 >> > On Wed, Mar 13, 2019 at 8:13 PM Ahmet Altay wrote:
 >> >
 >> > Sounds great, thank you!
 >> >
 >> > On Wed, Mar 13, 2019 at 12:09 PM Andrew Pilloud 
 >> > wrote:
 >> >
 >> > Hello Beam community!
 >> >
 >> > Beam 2.12 release branch cut date is March 27th according to 
 >> > the
 >> > release calendar [1]. I would like to volunteer myself to do
 >> > this release. I intend to cut the branch as planned on March
 >> > 27th and cherrypick fixes if needed.
 >> >
 >> > If you have releasing blocking issues for 2.12 please mark 
 >> > their
 >> > "Fix Version" as 2.12.0. Kenn created a 2.13.0 release in JIRA
 >> > in case you would like to move any non-blocking issues to that
 >> >

Re: Cross-language transform API

2019-03-11 Thread Robert Bradshaw
On Mon, Mar 11, 2019 at 6:05 PM Chamikara Jayalath  wrote:
>
> On Mon, Mar 11, 2019 at 9:27 AM Robert Bradshaw  wrote:
>>
>> On Mon, Mar 11, 2019 at 4:37 PM Maximilian Michels  wrote:
>> >
>> > > Just to clarify. What's the reason for including a PROPERTIES enum here 
>> > > instead of directly making beam_urn a field of ExternalTransformPayload ?
>> >
>> > The URN is supposed to be static. We always use the same URN for this
>> > type of external transform. We probably want an additional identifier to
>> > point to the resource we want to configure.
>>
>> It does feel odd to not use the URN to specify the transform itself,
>> and embed the true identity in an inner proto. The notion of
>> "external" is just how it happens to be invoked in this pipeline, not
>> part of its intrinsic definition. As we want introspection
>> capabilities in the service, we should be able to use the URN at a top
>> level and know what kind of payload it expects. I would also like to
>> see this kind of information populated for non-extern transforms which
>> could be good for visibility (substitution, visualization, etc.) for
>> runners and other pipeline-consuming tools.
>>
>> > Like so:
>> >
>> > message ExternalTransformPayload {
>> >enum Enum {
>> >  PROPERTIES = 0
>> >  [(beam_urn) = "beam:external:transform:external_transform:v1"];
>> >}
>> >// A fully-qualified identifier, e.g. Java package + class
>> >string identifier = 1;
>>
>> I'd rather the identifier have semantic rather than
>> implementation-specific meaning. e.g. one could imagine multiple
>> implementations of a given transform that different services could
>> offer.
>>
>> >// the format may change to map if types are supported
>> >map parameters = 2;
>> > }
>> >
>> > The identifier could also be a URN.
>> >
>> > > Can we change first version to map ? Otherwise the set of 
>> > > transforms we can support/test will be very limited.
>> >
>> > How do we do that? Do we define a set of standard coders for supported
>> > types? On the Java side we can lookup the coder by extracting the field
>> > from the Pojo, but we can't do that in Python.
>
>
> I'll let Reuven comment on exact relevance and timelines on Beam Schema 
> related work here but till we have that probably we can support the standard 
> set of coders that are well defined here ?
> https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L542
>
> So on the Python side the ExternalTransform can take a list of parameters (of
> types that have standard coders) which will be converted to bytes to be sent
> over the wire. On the Java side, corresponding standard coders (which are
> determined by introspection of the transform builder's payload POJO) can be used
> to convert bytes to objects.

They also need to agree on the field types as well as names, so would
it be map>? I'm not sure about the tradeoff
between going further down this road vs. getting schemas up to par in
Python (and, next, Go), and supporting this long term in parallel to
whatever we come up with for schemas.
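
To make the coordination burden concrete, a small sketch of that suggestion from
the Python side (the field names and values are made up; StrUtf8Coder and
VarIntCoder are real coders in apache_beam.coders):

from apache_beam.coders import StrUtf8Coder, VarIntCoder

# The Python side has to pick exactly the field names and coders that the Java
# builder will use when decoding, or the payload cannot be reconstructed.
parameters = {
    'topic': StrUtf8Coder().encode('my-topic'),
    'num_partitions': VarIntCoder().encode(4),
}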

> Hopefully Beam schema work will give us a more generalized way to convert 
> objects across languages (for example, Python object -> Python Row + Schema 
> -> Java Row + Schema -> Java object). Note that we run into the same issue 
> when data tries to cross SDK boundaries when executing cross-language 
> pipelines.

+1, which is another reason I want to accelerate the language
independence of schemas.

>> > > Can we re-use some of the Beam schemas-related work/utilities here ?
>> >
>> > Yes, that was the plan.
>>
>> On this note, Reuven, what is the plan (and timeline) for a
>> language-independent representation of schemas? The crux of the
>> problem is that the user needs to specify some kind of configuration
>> (call it C) to construct the transform (call it T). This would be
>> handled by a TransformBuilder that provides (at least) a mapping
>> C -> T. (Possibly this interface could be offered on the transform
>> itself).
>>
>> The question we are trying to answer here is how to represent C, in
>> both the source and target language, and on the wire. The idea is that
>> we could leverage the schema infrastructure such that C could be a
>> POJO in Java (and perhaps a dict in Python). We would want to extend
>> Schemas and Row (or perhaps a sub/super/sibling class thereof) to
>> allow for Coder and UDF-typed fields. (Exactly how to represent UDFs
>> is still very TBD.) The payload for an external transform using this
>> format would be the tuple (schema, SchemaCoder(schema).encode(C)). The
>> goal is to not, yet again, invent a cross-language way of defining a
>> bag of named, typed parameters (aka fields) with language-idiomatic
>> mappings and some introspection capabilities, and significantly less
>> heavy-weight than users defining their own protos (plus generating
>> bindings to all languages).
>>
>> Does this seem a reasonable use of schemas?


Re: Cross-language transform API

2019-03-11 Thread Robert Bradshaw
On Mon, Mar 11, 2019 at 4:37 PM Maximilian Michels  wrote:
>
> > Just to clarify. What's the reason for including a PROPERTIES enum here 
> > instead of directly making beam_urn a field of ExternalTransformPayload ?
>
> The URN is supposed to be static. We always use the same URN for this
> type of external transform. We probably want an additional identifier to
> point to the resource we want to configure.

It does feel odd to not use the URN to specify the transform itself,
and embed the true identity in an inner proto. The notion of
"external" is just how it happens to be invoked in this pipeline, not
part of its intrinsic definition. As we want introspection
capabilities in the service, we should be able to use the URN at a top
level and know what kind of payload it expects. I would also like to
see this kind of information populated for non-extern transforms which
could be good for visibility (substitution, visualization, etc.) for
runners and other pipeline-consuming tools.

> Like so:
>
> message ExternalTransformPayload {
>enum Enum {
>  PROPERTIES = 0
>  [(beam_urn) = "beam:external:transform:external_transform:v1"];
>}
>// A fully-qualified identifier, e.g. Java package + class
>string identifier = 1;

I'd rather the identifier have semantic rather than
implementation-specific meaning. e.g. one could imagine multiple
implementations of a given transform that different services could
offer.

>// the format may change to map if types are supported
>map parameters = 2;
> }
>
> The identifier could also be a URN.
>
> > Can we change first version to map ? Otherwise the set of 
> > transforms we can support/test will be very limited.
>
> How do we do that? Do we define a set of standard coders for supported
> types? On the Java side we can lookup the coder by extracting the field
> from the Pojo, but we can't do that in Python.
>
> > Can we re-use some of the Beam schemas-related work/utilities here ?
>
> Yes, that was the plan.

On this note, Reuven, what is the plan (and timeline) for a
language-independent representation of schemas? The crux of the
problem is that the user needs to specify some kind of configuration
(call it C) to construct the transform (call it T). This would be
handled by a TransformBuilder that provides (at least) a mapping
C -> T. (Possibly this interface could be offered on the transform
itself).

The question we are trying to answer here is how to represent C, in
both the source and target language, and on the wire. The idea is that
we could leverage the schema infrastructure such that C could be a
POJO in Java (and perhaps a dict in Python). We would want to extend
Schemas and Row (or perhaps a sub/super/sibling class thereof) to
allow for Coder and UDF-typed fields. (Exactly how to represent UDFs
is still very TBD.) The payload for an external transform using this
format would be the tuple (schema, SchemaCoder(schema).encode(C)). The
goal is to not, yet again, invent a cross-language way of defining a
bag of named, typed parameters (aka fields) with language-idiomatic
mappings and some introspection capabilities, and significantly less
heavy-weight than users defining their own protos (plus generating
bindings to all languages).

Does this seem a reasonable use of schemas?
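
For illustration, a rough sketch of what the Java side of this could look like
using the existing schema inference machinery (the config class and field names
are hypothetical, and the exact payload wiring is very much TBD):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.apache.beam.sdk.coders.RowCoder;
import org.apache.beam.sdk.schemas.JavaFieldSchema;
import org.apache.beam.sdk.schemas.NoSuchSchemaException;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.schemas.SchemaRegistry;
import org.apache.beam.sdk.schemas.annotations.DefaultSchema;
import org.apache.beam.sdk.values.Row;

/** Hypothetical configuration C for some external transform T. */
@DefaultSchema(JavaFieldSchema.class)
class KafkaReadConfig {
  public String topic;
  public String bootstrapServers;
}

class SchemaConfiguredExternalTransform {
  /** Encodes C roughly as (schema, SchemaCoder(schema).encode(C)). */
  static byte[] encodePayload(KafkaReadConfig config)
      throws NoSuchSchemaException, IOException {
    SchemaRegistry registry = SchemaRegistry.createDefault();
    Schema schema = registry.getSchema(KafkaReadConfig.class);
    Row row = registry.getToRowFunction(KafkaReadConfig.class).apply(config);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    RowCoder.of(schema).encode(row, out);
    // A real payload would carry the schema itself (in its portable form)
    // alongside these bytes so the receiving SDK can decode and introspect it.
    return out.toByteArray();
  }
}

The receiving SDK would decode the Row with the same schema and map it onto its
own idiomatic representation (e.g. a dict in Python).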


Re: Python precommit duration is above 1hr

2019-03-09 Thread Robert Bradshaw
Perhaps this is the duplication of all (or at least most) previously
existing tests for running under Python 3. I agree that this is excessive;
we should probably split out Py2, Py3, and the linters into separate
 targets.

We could look into using detox or retox to parallelize the testing as well.
(The issue last time was suppression of output on timeout, but that can be
worked around by adding timeouts to the individual tox targets.)

On Fri, Mar 8, 2019 at 11:26 PM Mikhail Gryzykhin  wrote:

> Hi everyone,
>
> It seems that the duration of our Python pre-commits is growing really fast.
>
> Did anyone follow this trend, or know what the biggest changes to the Python
> setup have been lately?
>
> I don't see a single jump, but the duration of pre-commits has almost doubled
> since the new year.
>
> [image: image.png]
>
> Regards,
> --Mikhail
>
>


Re: Signing artefacts during release

2019-03-08 Thread Robert Bradshaw
On Fri, Mar 8, 2019 at 2:42 AM Ahmet Altay  wrote:
>
> This sounds good to me.
>
> On Thu, Mar 7, 2019 at 3:32 PM Michael Luckey  wrote:
>>
>> Thanks for your comments.
>>
>> So to continue here, I'll prepare a PR implementing C:
>>
>> Pass the signing key to the relevant scripts and use that for signing. There is 
>> something similar already implemented [1]
>>
>> We might discuss on that, whether it will work for us or if we need to 
>> implement something different.
>>
>> This should affect at least 'build_release_candidate.sh' and 
>> 'sign_hash_python_wheels.sh'. The release manager is responsible for 
>> selecting the proper key. Currently there is no 'state passed between the 
>> scripts', so the release manager will have to specify this repeatedly. This 
>> could probably be improved later on.
>
> This might become a problem. Is it possible for us to tackle this sooner than 
> later?

Requiring a key seems to be a good first step. (Personally, I like to
be very explicit about what I sign.) Supporting defaults (e.g. in a
~/.beam-release config file) is a nice to have.

>> @Ahmet Altay Could you elaborate which global state you are referring to? Is 
>> it only that git global configuration of the signing key? [2]
>
> I was referring to things not related to signing. I do not want to digress 
> this thread but briefly I was referring to global installations of binaries 
> with sudo and changes to bashrc file. We can work on those improvements 
> separately.

That's really bad. +1 to fixing these (as a separate bug).


Re: Signing artefacts during release

2019-03-06 Thread Robert Bradshaw
I would not be opposed to make the choice of signing key a required
argument for the relevant release script(s).

On Wed, Mar 6, 2019 at 3:44 PM Michael Luckey  wrote:
>
> Hi,
>
> After the upgrade to Gradle 5, @altay (volunteering/selected as release manager) 
> was hit by an issue [1] which prevented signing of artefacts. He was 
> unfortunately forced to roll back to Gradle 4 to do the release.
>
> After fixing a configuration issue within beam it seemingly revealed an 
> underlying regression in gradle's signing plugin itself [2].
>
> If I understand correctly, Beam's current setup works along the following 
> lines: on initial configuration any release manager will set up the key to be 
> used for git only [3], but we never did something similar on Gradle's behalf. 
> This results in the signing plugin (delegating to the gpg command line) using 
> whatever gpg considers to be the default key, whether explicitly configured 
> with gpg.conf or implicitly.
>
> 1. Am I right in assuming that these keys do not necessarily have to match? I.e. 
> that the key used for signing the release tag ('git tag -s') is not 
> necessarily the key used for signing the artifacts?
> 2. Am I right to assume that we want/require them to be the same? I.e. the 
> key which is uploaded to the Beam KEYS file?
>
> Now that Gradle 5 stopped defaulting to gpg's default key, we somehow need to 
> explicitly specify the key to use for the signing plugin. The simplest 
> solution would be to just add a note to the release guide on how to solve that 
> issue on the dev side, which will likely lead to some frustration as it is easy 
> to miss.
>
> So I would like to integrate something into the used scripting.
>
> Options:
>
> A: During the one-time setup, the developer is forced to select the proper key. This 
> key will be set in the (global) git configuration [4]. We could also add this key 
> to the gradle.properties file in the Gradle home as 'signing.gnupg.keyName', which 
> would then be used by Gradle's signing plugin.
>
> Obvious drawback here would be that this is a global configuration (ok, the 
> same problem we have already for git), which might not be appropriate for all 
> devs.
>
> B: Read 'git config user.signingkey' on script execution and pass this as the 
> '-Psigning.gnupg.keyName' parameter to the Gradle run. Of course this will 
> only work iff the git config is set. So would it be safe to assume such?
>
> The drawback here, of course, is that someone not using the release script might 
> forget to set the signing key.
>
> Of course, neither will solve any issue with source.zip releases and Python's 
> signing key, which, as far as I can tell, also rely on gpg's default key, which 
> might conflict with 1. above?
>
> Any thoughts about this?
>
> michel
>
>
>
>
> [1] https://issues.apache.org/jira/browse/BEAM-6726
> [2] https://github.com/gradle/gradle/issues/8657
> [3] 
> https://github.com/apache/beam/blob/master/website/src/contribute/release-guide.md
> [4] 
> https://github.com/apache/beam/blob/master/release/src/main/scripts/preparation_before_release.sh#L44-L48


Re: [VOTE] Release 2.11.0, release candidate #2

2019-03-04 Thread Robert Bradshaw
I see the vote has passed, but +1 (binding) from me as well.

On Mon, Mar 4, 2019 at 11:51 AM Jean-Baptiste Onofré  wrote:
>
> +1 (binding)
>
> Tested with beam-samples.
>
> Regards
> JB
>
> On 26/02/2019 10:40, Ahmet Altay wrote:
> > Hi everyone,
> >
> > Please review and vote on the release candidate #2 for the version
> > 2.11.0, as follows:
> >
> > [ ] +1, Approve the release
> > [ ] -1, Do not approve the release (please provide specific comments)
> >
> > The complete staging area is available for your review, which includes:
> > * JIRA release notes [1],
> > * the official Apache source release to be deployed to dist.apache.org
> >  [2], which is signed with the key with
> > fingerprint 64B84A5AD91F9C20F5E9D9A7D62E71416096FA00 [3],
> > * all artifacts to be deployed to the Maven Central Repository [4],
> > * source code tag "v2.11.0-RC2" [5],
> > * website pull request listing the release [6] and publishing the API
> > reference manual [7].
> > * Python artifacts are deployed along with the source release to the
> > dist.apache.org  [2].
> > * Validation sheet with a tab for 2.11.0 release to help with validation
> > [8].
> >
> > The vote will be open for at least 72 hours. It is adopted by majority
> > approval, with at least 3 PMC affirmative votes.
> >
> > Thanks,
> > Ahmet
> >
> > [1]
> > https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12344775
> > [2] https://dist.apache.org/repos/dist/dev/beam/2.11.0/
> > [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> > [4] https://repository.apache.org/content/repositories/orgapachebeam-1064/
> > [5] https://github.com/apache/beam/tree/v2.11.0-RC2
> > [6] https://github.com/apache/beam/pull/7924
> > [7] https://github.com/apache/beam-site/pull/587
> > [8]
> > https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=542393513
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com


Re: [VOTE] Release 2.11.0, release candidate #1

2019-02-22 Thread Robert Bradshaw
+1 (binding)

I verified the artifacts for correctness, as well as one of the wheels
on simple pipelines (Python 3).


On Sat, Feb 23, 2019 at 1:01 AM Kenneth Knowles  wrote:
>
> +1 (binding)
>
> Kenn
>
> On Fri, Feb 22, 2019 at 3:51 PM Ahmet Altay  wrote:
>>
>>
>>
>> On Fri, Feb 22, 2019 at 3:46 PM Kenneth Knowles  wrote:
>>>
>>> I believe you need to sign & hash the Python wheels. The instructions are 
>>> unfortunately a bit hidden in the release guide without an entry in the 
>>> table of contents:
>>
>>
>> Done, thank you for the pointer.
>>
>>>
>>>
>>> "Once all python wheels have been staged dist.apache.org, please run 
>>> ./sign_hash_python_wheels.sh to sign and hash python wheels."
>>>
>>> On Fri, Feb 22, 2019 at 8:40 AM Ahmet Altay  wrote:
>>>>
>>>>
>>>>
>>>> On Fri, Feb 22, 2019 at 1:32 AM Robert Bradshaw  
>>>> wrote:
>>>>>
>>>>> It looks like 
>>>>> https://github.com/apache/beam/blob/release-2.11.0/build.gradle
>>>>> differs from the copy in the release source tarball (line 22, and some
>>>>> whitespace below). Other than that, the artifacts and signatures look
>>>>> good.
>>>>
>>>>
>>>> Thank you. I fixed the issue (please take a look again). The difference 
>>>> was due to https://issues.apache.org/jira/browse/BEAM-6726.
>>>>
>>>>>
>>>>>
>>>>> On Fri, Feb 22, 2019 at 9:50 AM Ahmet Altay  wrote:
>>>>> >
>>>>> > Hi everyone,
>>>>> >
>>>>> > Please review and vote on the release candidate #1 for the version 
>>>>> > 2.11.0, as follows:
>>>>> >
>>>>> > [ ] +1, Approve the release
>>>>> > [ ] -1, Do not approve the release (please provide specific comments)
>>>>> >
>>>>> > The complete staging area is available for your review, which includes:
>>>>> > * JIRA release notes [1],
>>>>> > * the official Apache source release to be deployed to dist.apache.org 
>>>>> > [2], which is signed with the key with fingerprint 
>>>>> > 64B84A5AD91F9C20F5E9D9A7D62E71416096FA00 [3],
>>>>> > * all artifacts to be deployed to the Maven Central Repository [4],
>>>>> > * source code tag "v2.11.0-RC1" [5],
>>>>> > * website pull request listing the release [6] and publishing the API 
>>>>> > reference manual [7].
>>>>> > * Python artifacts are deployed along with the source release to the 
>>>>> > dist.apache.org [2].
>>>>> > * Validation sheet with a tab for 2.11.0 release to help with 
>>>>> > validation [8].
>>>>> >
>>>>> > The vote will be open for at least 72 hours. It is adopted by majority 
>>>>> > approval, with at least 3 PMC affirmative votes.
>>>>> >
>>>>> > Thanks,
>>>>> > Ahmet
>>>>> >
>>>>> > [1] 
>>>>> > https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12344775
>>>>> > [2] https://dist.apache.org/repos/dist/dev/beam/2.11.0/
>>>>> > [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>>>>> > [4] 
>>>>> > https://repository.apache.org/content/repositories/orgapachebeam-1061/
>>>>> > [5] https://github.com/apache/beam/tree/v2.11.0-RC1
>>>>> > [6] https://github.com/apache/beam/pull/7924
>>>>> > [7] https://github.com/apache/beam-site/pull/587
>>>>> > [8] 
>>>>> > https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=542393513
>>>>> >


Re: [VOTE] Release 2.11.0, release candidate #1

2019-02-22 Thread Robert Bradshaw
It looks like https://github.com/apache/beam/blob/release-2.11.0/build.gradle
differs from the copy in the release source tarball (line 22, and some
whitespace below). Other than that, the artifacts and signatures look
good.

On Fri, Feb 22, 2019 at 9:50 AM Ahmet Altay  wrote:
>
> Hi everyone,
>
> Please review and vote on the release candidate #1 for the version 2.11.0, as 
> follows:
>
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
>
> The complete staging area is available for your review, which includes:
> * JIRA release notes [1],
> * the official Apache source release to be deployed to dist.apache.org [2], 
> which is signed with the key with fingerprint 
> 64B84A5AD91F9C20F5E9D9A7D62E71416096FA00 [3],
> * all artifacts to be deployed to the Maven Central Repository [4],
> * source code tag "v2.11.0-RC1" [5],
> * website pull request listing the release [6] and publishing the API 
> reference manual [7].
> * Python artifacts are deployed along with the source release to the 
> dist.apache.org [2].
> * Validation sheet with a tab for 2.11.0 release to help with validation [8].
>
> The vote will be open for at least 72 hours. It is adopted by majority 
> approval, with at least 3 PMC affirmative votes.
>
> Thanks,
> Ahmet
>
> [1] 
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12344775
> [2] https://dist.apache.org/repos/dist/dev/beam/2.11.0/
> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> [4] https://repository.apache.org/content/repositories/orgapachebeam-1061/
> [5] https://github.com/apache/beam/tree/v2.11.0-RC1
> [6] https://github.com/apache/beam/pull/7924
> [7] https://github.com/apache/beam-site/pull/587
> [8] 
> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=542393513
>


Re: Hazelcast Jet Runner

2019-02-15 Thread Robert Bradshaw
On Fri, Feb 15, 2019 at 7:36 AM Can Gencer  wrote:
>
> We at Hazelcast are looking into writing a Beam runner for Hazelcast Jet 
> (https://github.com/hazelcast/hazelcast-jet). I wanted to introduce myself as 
> we'll likely have questions as we start development.

Welcome!

Hazelcast looks interesting, a Beam runner for it would be very cool.

> Some of the things I'm wondering about currently:
>
> * Currently there seems to be a guide available at 
> https://beam.apache.org/contribute/runner-guide/ , is this up to date? Is 
> there anything in specific to be aware of when starting with a new runner 
> that's not covered here?

That looks like a pretty good starting point. At a quick glance, I
don't see anything that looks out of date. Another resource that might
be helpful is a talk from last year on writing an SDK (but as it
mostly covers the runner-SDK interaction, it's also quite useful for
understanding the runner side):
https://docs.google.com/presentation/d/1Cso0XP9dmj77OD9Bd53C1M3W1sPJF0ZnA20gzb2BPhE/edit#slide=id.p
And please feel free to ask any questions on this list as well; we'd
be happy to help.

> * Should we be targeting the latest master which is at 2.12-SNAPSHOT or a 
> stable version?

I would target the latest master.

> * After a runner is developed, how is the maintenance typically handled, as 
> the runners seems to be part of Beam codebase?

Either is possible. Several runner adapters are part of the Beam
codebase, but, for example, the IBM Streams Beam runner is not. There
are pros and cons to each (certainly early on, when the APIs
themselves were under heavy development, it was easier to keep things
in sync in the same codebase, but things have mostly stabilized now).
A runner only becomes part of the Beam codebase if there are members
of the community committed to maintaining it (which could include
you). Both approaches are fine.

- Robert


Re: Thoughts on a reference runner to invest in?

2019-02-14 Thread Robert Bradshaw
I think it's good to distinguish between direct runners (which would
be good to have in every language, and can grow in sophistication with
the user base) and a fully universal reference runner. We should of
course continue to grow and maintain the runners/core-java shared
library, possibly driven by the various production runners, which
has been the most productive approach to date. (The point about community is a
good one. Unfortunately over the past 1.5 years the bigger Java
community has not resulted in a more complete Java ULR (in terms of
number of contributors or features/maturity), and it's unclear what
would change that in the future.)

It would be really great to have (at least) two completely separate
implementations, but (at the moment at least) I see that as lower
value than accelerating the efforts to get existing production runners
onto portability.

On Thu, Feb 14, 2019 at 2:01 PM Ismaël Mejía  wrote:
>
> This is a really interesting and important discussion. Having multiple
> reference runners can have its pros and cons. It is all about
> tradeoffs. From the end user point of view it can feel weird to deal
> with tools and packaging of a different ecosystem, e.g. python devs
> dealing with all the quirkiness of Java packaging, or the viceversa
> Java developers dealing with pip and friends. So having a reference
> runner per language would be more natural and help also valĂ­date the
> portability concept, however having multiple reference runners sounds
> harder from the maintenance point of view.
>
> Most of the software in the domain of Beam has traditionally been
> written in Java, so there is a BIG advantage of ready-to-use (and
> mature) libraries and reusable components (also, the reference runner
> may profit from the libraries that Thomas and others in the community
> have developed for multiple runners). This is a big win, but more
> importantly, we can have more eyes looking at and contributing improvements
> and fixes that will benefit the reference runner and others.
>
> Having a reference runner per language would be nice, but if we must
> choose only one language I prefer it to be Java, just because we have a
> bigger community that can contribute and improve it. We may work on
> making the distribution of such a runner easier or friendlier for
> users of different languages.
>
> On Wed, Feb 13, 2019 at 3:47 AM Robert Bradshaw  wrote:
> >
> > I agree, it's useful for runners that are used for tests (including testing 
> > SDKs) to push into the dark corners of what's allowed by the spec. I think 
> > this can be added (where they don't already exist) to existing 
> > non-production runners. (Whether a direct runner should be considered 
> > production or not depends on who you ask...)
> >
> > On Wed, Feb 13, 2019 at 2:49 AM Daniel Oliveira  
> > wrote:
> >>
> >> +1 to Kenn's point. Regardless of whether we go with a Python runner or a 
> >> Java runner, I think we should have at least one portable runner that 
> >> isn't a production runner for the reasons he outlined.
> >>
> >> As for the rest of the discussion, it sounds like people are generally 
> >> supportive of having the Python FnApiRunner as that runner, and using 
> >> Flink as a reference implementation for portability in Java.
> >>
> >> On Tue, Feb 12, 2019 at 4:37 PM Kenneth Knowles  wrote:
> >>>
> >>>
> >>> On Tue, Feb 12, 2019 at 8:59 AM Thomas Weise  wrote:
> >>>>
> >>>> The Java ULR initially provided some value for the portability effort as 
> >>>> Max mentions. It helped to develop the shared library for all Java 
> >>>> runners and the job server functionality.
> >>>>
> >>>> However, I think the same could have been accomplished by developing the 
> >>>> Flink runner instead of the Java ULR from the get go. This is also what 
> >>>> happened later last year when support for state, timers and metrics was 
> >>>> added to the portable Flink runner first and the ULR still does not 
> >>>> support those features [1].
> >>>>
> >>>> Since all (or most) Java based runners that are based on another ASF 
> >>>> project support embedded execution, I think it might make sense to 
> >>>> discontinue separate direct runners for Java and instead focus efforts 
> >>>> on making the runners that folks would also use in production better?
> >>>
> >>>
> >>> Caveat: if people only test using embedded execution of a production 
> >>> runner, they are quite likely to 

Re: Is there a reason why these are error logs? Missing required coder_id on grpc_port

2019-02-13 Thread Robert Bradshaw
We should fix the offending runner(s?). I think this is BEAM-4150.

On Wed, Feb 13, 2019 at 2:47 AM Alex Amato  wrote:

> These errors are very spammy in certain jobs, I was wondering if we could
> reduce the log level. Or put some conditions around this?
>
>
> https://github.com/apache/beam/search?q=Missing+required+coder_id+on+grpc_port&unscoped_q=Missing+required+coder_id+on+grpc_port
>
>
>


Bintray account

2019-02-13 Thread Robert Bradshaw
I've been looking at updating our release scripts to resolve
https://issues.apache.org/jira/browse/BEAM-6544 and have a setup that
pushes to bintray (and then the release script downloads and signs them
before pushing to svn). Does anyone know if we already have an Apache Beam
organization set up on Bintray, and if so, who should be contacted
for access tokens?


Re: Thoughts on a reference runner to invest in?

2019-02-13 Thread Robert Bradshaw
I agree, it's useful for runners that are used for tests (including testing
SDKs) to push into the dark corners of what's allowed by the spec. I think
this can be added (where they don't already exist) to existing
non-production runners. (Whether a direct runner should be considered
production or not depends on who you ask...)

On Wed, Feb 13, 2019 at 2:49 AM Daniel Oliveira 
wrote:

> +1 to Kenn's point. Regardless of whether we go with a Python runner or a
> Java runner, I think we should have at least one portable runner that isn't
> a production runner for the reasons he outlined.
>
> As for the rest of the discussion, it sounds like people are generally
> supportive of having the Python FnApiRunner as that runner, and using Flink
> as a reference implementation for portability in Java.
>
> On Tue, Feb 12, 2019 at 4:37 PM Kenneth Knowles  wrote:
>
>>
>> On Tue, Feb 12, 2019 at 8:59 AM Thomas Weise  wrote:
>>
>>> The Java ULR initially provided some value for the portability effort as
>>> Max mentions. It helped to develop the shared library for all Java runners
>>> and the job server functionality.
>>>
>>> However, I think the same could have been accomplished by developing the
>>> Flink runner instead of the Java ULR from the get go. This is also what
>>> happened later last year when support for state, timers and metrics was
>>> added to the portable Flink runner first and the ULR still does not support
>>> those features [1].
>>>
>>> Since all (or most) Java based runners that are based on another ASF
>>> project support embedded execution, I think it might make sense to
>>> discontinue separate direct runners for Java and instead focus efforts on
>>> making the runners that folks would also use in production better?
>>>
>>
>> Caveat: if people only test using embedded execution of a production
>> runner, they are quite likely to depend on quirks of that runner, such as
>> bundle size, fusion, whether shuffle is also checkpoint, etc. I think
>> there's a lot of value in an antagonistic testing runner, which is
>> something the Java DirectRunner tried to do with GBK random ordering,
>> checking illegal mutations, checking encodability. These were all driven by
>> real user needs and each caught a lot of user bugs. That said, I wouldn't
>> want to maintain an extra runner, but would like to put these into a
>> portable runner, whichever it is.
>>
>> Kenn
>>
>>
>>>
>>> As for Python (and hopefully soon Go), it makes a lot of sense to have a
>>> simple to use and stable runner that can be used for local development. At
>>> the moment, the Py FnApiRunner seems the best candidate to serve as
>>> reference for portability.
>>>
>>> On a related note, we should probably also consider making pure Java
>>> pipeline execution via portability framework on a Java runner simpler and
>>> more efficient. We already use embedded environment for testing. If we also
>>> inline/embed the job server and this becomes readily available and easy to
>>> use, it might improve chances of other runners migrating to portability
>>> sooner.
>>>
>>> Thomas
>>>
>>> [1] https://s.apache.org/apache-beam-portability-support-table
>>>
>>>
>>>
>>> On Tue, Feb 12, 2019 at 3:34 AM Maximilian Michels 
>>> wrote:
>>>
>>>> Do you consider job submission and artifact staging part of the
>>>> ReferenceRunner? If so, these parts have been reused or served as a
>>>> model for the portable FlinkRunner. So they had some value.
>>>>
>>>> A reference implementation helps Runner authors to understand and reuse
>>>> the code. However, I agree that the Flink implementation is more
>>>> helpful
>>>> to Runners authors than a ReferenceRunner which was designed for single
>>>> node testing.
>>>>
>>>> I think there are three parts which help to push forward portability:
>>>>
>>>> 1) Good library support for new portable Runners (Java)
>>>> 2) A reference implementation of a distributed Runner (Flink)
>>>> 3) An easy way for users to run/test portable Pipelines (Python via
>>>> FnApiRunner)
>>>>
>>>> The main motivation for the portability layer is supporting additional
>>>> language to Java. Most users will be using Python, so focusing on a
>>>> good
>>>> reference Runner in Python is key.
>>>>

Re: Thoughts on a reference runner to invest in?

2019-02-12 Thread Robert Bradshaw
This is certainly an interesting question, and I definitely have my
opinions, but am curious as to what others think as well.

One thing that I think wasn't as clear from the outset is distinguishing
between the development of runners/core-java and development of a Java
reference runner itself. With the work on moving Flink to
portability, it turned out that work on the latter was not a prerequisite
for work on the former, and runners/core-java is the artifact that other
runners want to build on. I think that it is also the case, as suggested,
that a distributed runner's use of this shared library is a better
reference point (for other distributed runners) than one using the direct
runner (e.g. there is a much more obvious delineation between the runner's
responsibility and Beam code than in the direct runner where the boundaries
between orchestration, execution, and other concerns are not as clear).

As well as serving as a reference to runner implementers, the reference
runner can also be useful for prototyping (here I think Python holds an
advantage, but we're getting into subjective areas now), documenting (or
ideally augmenting the documentation of) the spec (here I'd say a smaller
advantage to Python, but neither runner is clean, straightforward, and
documented enough to serve this purpose well yet), and serving as a
lightweight universal local runner against which to develop (and, possibly
use long term in place of a direct runner) new SDKs (here you'll get a wide
variety of answers whether Python or Java is easier to take on as a
dependency for a third language, or we could just package it up in a docker
image and take docker as a dependency).

Another, more pragmatic, note is that one thing that helped move both the
Flink runner and the FnApiRunner forward is that they were driven by actual
use cases: Lyft has actual Python (necessitating portable) pipelines they
want to run on Flink, and the FnApiRunner is the direct runner for Python.
The Java ULR (at least where it is now) sits in an awkward place where its
only role is to be a reference rather than be used, which (in a world of
limited resources) makes it harder to justify investment.

- Robert



On Tue, Feb 12, 2019 at 3:53 AM Kenneth Knowles  wrote:

> Interesting silence here. You've got it right that the reason we initially
> chose Java was because of the cross-runner sharing. The reference runner
> could be the first target runner for any new feature and then its work
> could be directly (or indirectly via copy/paste/modify if it works better)
> be used in other runners. Examples:
>
>  - The implementations of (pre-portability) state & timers in
> runners/core-java and prototyped in the Java DirectRunner made it a matter
> of a couple of days to implement on other runners, and they saw pretty
> quick adoption.
>  - Probably the same could be said for the first drafts of the runners,
> which re-used a bunch of runners/core-java and had each others' translation
> code as a reference.
>
> I'm interested if anyone would be willing to confirm if it is because the
> FlinkRunner has forged ahead and the Dataflow worker is open source. It
> makes sense that the code from a distributed runner is an even better
> reference point if you are building another distributed runner. From the
> look of it, the SamzaRunner had no trouble getting started on portability.
>
> Kenn
>
> On Mon, Feb 11, 2019 at 6:04 PM Daniel Oliveira 
> wrote:
>
>> Yeah, the FnApiRunner is what I'm leaning towards too. I wasn't sure how
>> much demand there was for an actual reference implementation in Java
>> though, so I was hoping there were runner authors that would want to chime
>> in.
>>
>> On the other hand, the Flink runner could serve as a reference
>> implementation for portable features since it's further along, so maybe
>> it's not an issue regardless.
>>
>> On Mon, Feb 11, 2019 at 1:09 PM Sam Rohde  wrote:
>>
>>> Thanks for starting this thread. If I had to guess, I would say there is
>>> more of a demand for Python as it's more widely used for data scientists/
>>> analytics. Being pragmatic, the FnApiRunner already has more feature work
>>> than the Java so we should go with that.
>>>
>>> -Sam
>>>
>>> On Fri, Feb 8, 2019 at 10:07 AM Daniel Oliveira 
>>> wrote:
>>>
 Hello Beam dev community,

 For those who don't know me, I work for Google and I've been working on
 the Java reference runner, which is a portable, local Java runner (it's
 basically the direct runner with the portability APIs implemented). Our
 goal in working on this was to have a portable runner which ran locally so
 it could be used by users for testing portable pipelines, devs for testing
 new features with portability, and for runner authors to provide a simple
 reference implementation of a portable runner.

 Due to various circumstances though, progress on the Java reference
 runner has been pretty slow, and a Python runner which does pretty much the
>

Re: pipeline steps

2019-02-11 Thread Robert Bradshaw
In terms of performance, the overhead would likely be minimal if (as is
likely) the step consuming the filename gets fused with the read. There's
still overhead in constructing this composite object, etc., but that's (again
likely) smaller than the cost of doing the read itself.

On Sun, Feb 10, 2019 at 7:03 AM Reuven Lax  wrote:

> I think we could definitely add an option to FileIO to add the filename to
> every record. It would come at a (performance) cost - often the filename is
> much larger than the actual record.
>
> On Thu, Feb 7, 2019 at 6:29 AM Kenneth Knowles  wrote:
>
>> This comes up a lot, wanting file names alongside the data that came from
>> the file. It is a historical quirk that none of our connectors used to have
>> the file names. What is the change needed for FileIO + parse Avro to be
>> really easy to use?
>>
>> Kenn
>>
>> On Thu, Feb 7, 2019 at 6:18 AM Jeff Klukas  wrote:
>>
>>> I haven't needed to do this with Beam before, but I've definitely had
>>> similar needs in the past. Spark, for example, provides an input_file_name
>>> function that can be applied to a dataframe to add the input file as an
>>> additional column. It's not clear to me how that's implemented, though.
>>>
>>> Perhaps others have suggestions, but I'm not aware of a way to do this
>>> conveniently in Beam today. To my knowledge, today you would have to use
>>> FileIO.match() and FileIO.readMatches() to get a collection of
>>> ReadableFile. You'd then have to FlatMapElements to pull out the metadata
>>> and the bytes of the file, and you'd be responsible for parsing those bytes
>>> into avro records. You'd  be able to output something like a KV
>>> that groups the file name together with the parsed avro record.
>>>
>>> Seems like something worth providing better support for in Beam itself
>>> if this indeed doesn't already exist.
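
For reference, a minimal sketch of the FileIO.match()/readMatches() approach
described above, for Avro data files (the file pattern and output types are
illustrative, and an appropriate coder, e.g. AvroCoder, would still need to be
set on the output):

import java.io.IOException;
import java.nio.channels.Channels;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

class FileNameWithRecord {
  static PCollection<KV<String, GenericRecord>> read(Pipeline p, String pattern) {
    return p.apply(FileIO.match().filepattern(pattern))
        .apply(FileIO.readMatches())
        .apply(
            ParDo.of(
                new DoFn<FileIO.ReadableFile, KV<String, GenericRecord>>() {
                  @ProcessElement
                  public void process(
                      @Element FileIO.ReadableFile file,
                      OutputReceiver<KV<String, GenericRecord>> out)
                      throws IOException {
                    String fileName = file.getMetadata().resourceId().toString();
                    // Parse the Avro container file ourselves, emitting the file
                    // name alongside each record.
                    try (DataFileStream<GenericRecord> records =
                        new DataFileStream<>(
                            Channels.newInputStream(file.open()),
                            new GenericDatumReader<>())) {
                      while (records.hasNext()) {
                        out.output(KV.of(fileName, records.next()));
                      }
                    }
                  }
                }));
  }
}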
>>>
>>> On Thu, Feb 7, 2019 at 7:29 AM Chaim Turkel  wrote:
>>>
 Hi,
   I am working on a pipeline that listens to a topic on Pub/Sub to get
 files that have changed in storage. Then I read the Avro files, and
 would like to write them to BigQuery based on the file name (to
 different tables).
   My problem is that the transform that reads the Avro does not give
 me back the file's name (as a tuple or something like that). I seem
 to run into this pattern a lot.
 Can you think of any solutions?

 Chaim


>>>


Re: [VOTE] Release 2.10.0, release candidate #3

2019-02-08 Thread Robert Bradshaw
+1 (binding)

I have verified that the artifacts and their checksums/signatures look
good, and also checked the Python wheels against simple pipelines.

On Fri, Feb 8, 2019 at 4:29 PM Etienne Chauchot 
wrote:

> Hi,
> I did the same visual checks of Nexmark that I did on RC2, for both
> functional regressions (output size) and performance regressions (execution
> time), on all the runners/modes for the RC3 cut date (02/06), and I saw no
> regression except the one that I already mentioned (the end-of-October perf
> degradation on Q7 in Spark batch mode), but that was already in the previous
> version.
>
> Though I did not have time to check the artifacts. +1 (binding) provided
> that artifacts are correct
>
> Etienne
>
> On Thursday, February 7, 2019 at 10:25 -0800, Scott Wegner wrote:
>
> +1
>
> I validated running:
> * Java Quickstart (Direct)
> * Java Quickstart (Apex local)
> * Java Quickstart (Flink local)
> * Java Quickstart (Spark local)
> * Java Quickstart (Dataflow)
> * Java Mobile Game (Dataflow)
>
> On Wed, Feb 6, 2019 at 2:28 PM Kenneth Knowles  wrote:
>
> Hi everyone,
>
> Please review and vote on the release candidate #3 for the version 2.10.0,
> as follows:
>
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
>
> The complete staging area is available for your review, which includes:
> * JIRA release notes [1],
> * the official Apache source release to be deployed to dist.apache.org [2],
> which is signed with the key with fingerprint 6ED551A8AE02461C [3],
> * all artifacts to be deployed to the Maven Central Repository [4],
> * source code tag "v2.10.0-RC3" [5],
> * website pull request listing the release [6] and publishing the API
> reference manual [7].
> * Python artifacts are deployed along with the source release to the
> dist.apache.org [2].
> * Validation sheet with a tab for 2.10.0 release to help with validation
> [8].
>
> The vote will be open for at least 72 hours. It is adopted by majority
> approval, with at least 3 PMC affirmative votes.
>
> Thanks,
> Kenn
>
> [1]
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12344540
> [2] https://dist.apache.org/repos/dist/dev/beam/2.10.0/
> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> [4] https://repository.apache.org/content/repositories/orgapachebeam-1058/
> [5] https://github.com/apache/beam/tree/v2.10.0-RC3
> [6] https://github.com/apache/beam/pull/7651/files
> [7] https://github.com/apache/beam-site/pull/586
> [8]
> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=2053422529
>
>
>
>


Re: 2.7.1 (LTS) release?

2019-02-08 Thread Robert Bradshaw
+1, I've always found it odd that our build process creates and then
reverts commits in the branch (and I had the same issue when I was doing the
release: restarting if something went wrong was painful). If I
understand correctly, a, b, and c would be tags in the github repository,
but not live on any particular branch? I think this is much nicer.

On Fri, Feb 8, 2019 at 4:03 PM Maximilian Michels  wrote:

> Looks like a good improvement.
>
> It makes sense to have the snapshot version on the release branch and
> only change it to a proper version before creating the RC.
>
> Do we still revert a, b, c after creating the RC? Otherwise the bash
> script which replaces "-SNAPSHOT" won't work correctly.
>
> -Max
>
> On 06.02.19 20:21, Kenneth Knowles wrote:
> > Having gone through the release process, I have a couple of git drawings
> > to share. Currently the release process looks like this (you'll have to
> > view in fixed width font if it is stripped by the mail manager).
> >
> > -X master
> > \
> >  ---Y-a--b---c- release-2.10.0
> >
> > *   X: commit that updates master from 2.10.0-SNAPSHOT to
> > 2.11.0-SNAPSHOT (Python calls it 2.10.0dev, etc per lang, and we wrote a
> > script for it)
> > *   The release branch starts from the parent of X
> > *   Y: changes Python version to 2.10.0 (no dev) and you'll see why
> > *   On release branch, version is still 2.10.0-SNAPSHOT for Java
> > *   a, b, c: the gradle release plugin commits a change for Java to
> > 2.10.0 then reverts it, and tags with RC1, RC2, RC3, etc. If the RC
> > fails you have to force reset and delete the tag.
> > *   The release script also builds from fresh clones, so this is all
> > pushed to GitHub. It can really clutter the history but is otherwise
> > probably harmless. Because of issues with scripting and gpg set up I had
> > to build maybe 10 "RCs" to roll RC2.
> >
> > I think git can make this simpler. I would propose:
> >
> > -X master
> > \
> >  --- release-2.10.0
> >   \  \  \
> >a  b  c
> > *X: same
> > *Y: gone
> > *On release branch, both Java and Python are -SNAPSHOT or dev, etc.
> > (and it could be release-2.10 that advances minor version in the commit
> > after a successful RC)
> > *To build an RC, add the commits like a, b, c which remove -SNAPSHOT
> > and tag; we have a bash script that collects all the places that need
> > editing, the one that built commit X.
> > *Whether to push the commit and tag first or build the RC first
> > doesn't matter that much but anyhow now it is off the history so it is
> > fine to push.
> >
> > Have I missed something vital about the current process?
> >
> > Kenn
> >
> >
> >
> > On Thu, Jan 31, 2019 at 8:49 PM Thomas Weise  > > wrote:
> >
> > Either looks fine to me. Same content, different label :)
> >
> >
> > On Thu, Jan 31, 2019 at 6:32 PM Michael Luckey  > > wrote:
> >
> > Thx Thomas for that clarification. I tried to express, I d
> > slightly prefer to have branches
> >
> > 2.7.x
> > 2.8.x
> > 2.9.x
> >
> > and tags:
> > 2.7.0
> > 2.7.1
> > ...
> >
> > So only difference would be to be more explicit on the branch
> > name, i.e. that it embraces all the patch versions. (I do not
> > know how to better express, that '2.7.x' is a literal string and
> > should not be confused as some placeholder.)
> >
> > Regarding the versioning, I always prefer the explicit version
> > including patch version. It might make it easier to help and
> > resolve issues if it is known on which patch level a user is
> > running. I spent lot of lifetime assuming some version and
> > realising later it was 'just another snapshot' version...
> >
> > Just my 2 ct... Also fine with the previous suggestion.
> >
> >
> >
> > On Fri, Feb 1, 2019 at 3:18 AM Thomas Weise  > > wrote:
> >
> > Hi,
> >
> > As Kenn had already examplified, the suggestion was to have
> > branches:
> >
> > 2.7
> > 2.8
> > 2.9
> > ...
> >
> > and tags:
> >
> > 2.7.0
> > 2.7.1
> > ...
> > 2.8.0
> > ...
> >
> > Changes would go to the 2.7 branch, at some point release
> > 2.7.1 is created. Then more changes may accrue on the same
> > branch, maybe at some point 2.7.2 is released and so on.
> >
> > We could also consider changing the snapshot version to
> > 2.7-SNAPSHOT, instead of 2.7.{0,1,...}-SNAPSHOT.
> >
> > With that it wouldn't even be necessary to change the
> > version

Re: [VOTE] Release 2.10.0, release candidate #2

2019-02-06 Thread Robert Bradshaw
+1.

I verified the source artifacts look good, and tried the Python wheels.

On Tue, Feb 5, 2019 at 11:57 PM Kenneth Knowles  wrote:
>
> Hi everyone,
>
> Please review and vote on the release candidate #2 for the version 2.10.0, as 
> follows:
>
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
>
> The complete staging area is available for your review, which includes:
> * JIRA release notes [1],
> * the official Apache source release to be deployed to dist.apache.org [2], 
> which is signed with the key with fingerprint 6ED551A8AE02461C [3],
> * all artifacts to be deployed to the Maven Central Repository [4],
> * source code tag "v2.10.0-RC2" [5],
> * website pull request listing the release [6] and publishing the API 
> reference manual [7].
> * Python artifacts are deployed along with the source release to the 
> dist.apache.org [2].
> * Validation sheet with a tab for 2.10.0 release to help with validation [8].
>
> The vote will be open for at least 72 hours. It is adopted by majority 
> approval, with at least 3 PMC affirmative votes.
>
> Thanks,
> Kenn
>
> [1] 
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12344540
> [2] https://dist.apache.org/repos/dist/dev/beam/2.10.0/
> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> [4] https://repository.apache.org/content/repositories/orgapachebeam-1057/
> [5] https://github.com/apache/beam/tree/v2.10.0-RC2
> [6] https://github.com/apache/beam/pull/7651/files
> [7] https://github.com/apache/beam-site/pull/586
> [8] 
> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=2053422529


Re: Beam Python streaming pipeline on Flink Runner

2019-02-05 Thread Robert Bradshaw
On Tue, Feb 5, 2019 at 5:11 PM Maximilian Michels  wrote:
>
> Good points Cham.
>
> JSON seemed like the most intuitive way to specify a configuration map.
> We already use JSON in other places, e.g. to specify the environment
> configuration. It is not necessarily a contradiction to have JSON inside
> Protobuf. From the perspective of IO authors, the user-friendliness
> plays a role because they wouldn't have to deal with Protobuf.
>
> I agree that the configuration format is an implementation detail that
> will be hidden to users via easy-to-use wrappers.

JSON has the advantage that one need not know (let alone compile) the
schema to use it. I see this as a particular advantage for a "generic"
layer where the client SDK may not know all the transforms a server
may be serving. It's less useful when the client SDK already knows
about a source and has nice wrappers for it. (Nicer and less
error-prone than manually constructing JSON at least.)

For a particular source, its choice of payload is specific to that
transform. Protos are natural, and could arguably be encouraged for
things we ship with Beam, but by no means required (and, for example,
DoFn's payloads are serialized as raw bytes, not protos).

> Do we have to support UDFs for expanding existing IO? Users would still
> be able to apply UDFs via ParDo on the IO output collections. Generally
> speaking, I can see how for cross-language transforms UDF support would
> be good. For example, a Combine implementation in Java, where the
> combine UDFs come from Python.

The idea of URNs is that one would develop a large body of URNs that
are supported in many languages (e.g. SumInt64s) and then the runner
could pick the environment that goes best (e.g. according to
performance and/or opportunity for fusion or possibly inlining).

UDFs that are called from within an IO as part of its operation is
still an open question.

> I suppose the question is, do we try to solve the general case, or do we
> go with a simpler approach for enabling the use of existing IO first?
> Lack of IO seems to be the most pressing issue for the adoption of Beam
> Python. I imagine that a backwards-compatible incremental support for
> cross-language transforms (IOs first, later other transforms) would be
> possible.

I think what we have is backwards compatible. One can define and
register as many (urn, payload -> PTransform) pairs as one wants.

> On 05.02.19 03:07, Chamikara Jayalath wrote:
> >
> >
> > On Fri, Feb 1, 2019 at 6:12 AM Maximilian Michels <m...@apache.org> wrote:
> >
> > Yes, I imagine sources to implement a JsonConfigurable interface (e.g.
> > on their builders):
> >
> > JsonConfigurable {
> > // Either a json string or Map
> > apply(String jsonConfig);
> > }
> >
> > In Python we would create this transform:
> >
> > URN: JsonConfiguredSource:v1
> > payload: {
> >  environment: environment_id, // Java/Python/Go
> >  resourceIdentifier: string,  // "org.apache.beam.io.PubSubIO"
> >  configuration: json config,  // { "topic" : "my_pubsub_topic" }
> > }
> >
> >
> > Thanks Max, this is a great first step towards defining to API for
> > cross-language transforms.
> > Is there a reason why you would want to use JSON instead of a proto
> > here. I guess we'll be providing a more user friendly language wrapper
> > (for example, Python) for end-users here, so user-friendliness-wise, the
> > format we choose won't matter much (for pipeline authors).
> > If we don't support UDFs, performance difference will be negligible, but
> > UDFs might require a callback to original SDK (per-element worst case).
> > So might make sense to choose the more efficient format.
> >
> > Also, probably we need to define a more expanded definition (proto/JSON)
> > to support UDFs. For example, a payload + a set of parameter definitions
> > so that the target SDK (for example, Java) can call back the original
> > SDK where the pipeline was authored in (for example, Python) to resolve
> > UDFs at runtime.
> >
> > Thanks,
> > Cham
> >
> > That's more generic and could be used for other languages where we
> > might
> > have sources/sinks.
> >
> >  > (FWIW, I was imagining PubSubIO already had a translation into
> > BeamFnApi protos that fully specified it, and we use that same
> > format to translate back out.)
> >
> > Not that I know of.
> >
> > On 01.02.19 14:02, Robert Bradshaw wrote:
> >  > Are you

Re: [Proposal] Get Metrics API: Metric Extraction via proto RPC API.

2019-02-04 Thread Robert Bradshaw
To summarize for the list, the plan of record is:

The MonitoringInfo proto will be used again in this querying API, so the
metric format that SDKs report in will also be the format used when extracting
metrics for a job.

// Job Service for running RunnerAPI pipelines
service JobService {
  ...
  rpc GetJobMetrics (GetJobMetricsRequest) returns (GetJobMetricsResponse);
}

message GetJobMetricsRequest {
  string job_id = 1; // (required)
}

message GetJobMetricsResponse {
  // (Optional) The aggregated value of the metric based on the in-flight work.
  // SDKs optionally report these metrics in the ProcessBundleProgressResponse.
  MonitoringInfo attempted_metric_result = 1;
  // (Required) The aggregated value of the metric based on the completed work.
  // SDKs report these metrics in the ProcessBundleResponse.
  MonitoringInfo committed_metric_result = 2;
}

The new RPC goes in beam_job_metrics.proto.

SDKs will continue to implement filtering of metrics by providing their
own language-specific convenience functions to filter and obtain metrics.
In Java, for example, the MetricResult and MetricsFilter interfaces
will continue to exist as the interface for filtering metrics by step,
user namespace, name, etc.
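
For example, a sketch against the existing Java metrics API (the step and
metric names here are made up):

import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.metrics.MetricNameFilter;
import org.apache.beam.sdk.metrics.MetricQueryResults;
import org.apache.beam.sdk.metrics.MetricResult;
import org.apache.beam.sdk.metrics.MetricsFilter;

class MetricsQueryExample {
  static void printParseErrors(PipelineResult result) {
    // Filter down to one user counter on one step, then read both the
    // attempted and committed values.
    MetricQueryResults metrics =
        result
            .metrics()
            .queryMetrics(
                MetricsFilter.builder()
                    .addNameFilter(MetricNameFilter.named("my.namespace", "parseErrors"))
                    .addStep("ParseRecords")
                    .build());
    for (MetricResult<Long> counter : metrics.getCounters()) {
      System.out.println(
          counter.getName()
              + " attempted=" + counter.getAttempted()
              + " committed=" + counter.getCommitted());
    }
  }
}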


Looking forward to seeing this happen.


On Mon, Feb 4, 2019 at 8:17 PM Alex Amato  wrote:

> Done, it's on the website now.
>
> Ryan and I will move forward with the plan in this doc. If there are any
> major objections to this plan, please let us know by Wednesday. Suggestions will
> be welcome later as well, as we are happy to iterate on this. But we will
> be proceeding with some of Ryan's PRs based on this design.
>
> On Thu, Jan 31, 2019 at 12:54 PM Ismaël Mejía  wrote:
>
>> Please don't forget to add this document to the design documents webpage.
>>
>> On Thu, Jan 31, 2019 at 8:46 PM Alex Amato  wrote:
>> >
>> > Hello Beam,
>> >
>> > Robert Ryan and I have been designing a metric extraction API for Beam.
>> Please take a look at this design, I would love to get more feedback on
>> this to improve the design.
>> >
>> > https://s.apache.org/get-metrics-api
>> >
>> > The primary goal of this proposal is to offer a simple way to obtain
>> all the metrics for a job. The following issues are addressed:
>> >
>> > The current design requires implementing metric querying for every
>> runner+language combination.
>> >
>> > Duplication of MetricResult related classes in each language.
>> >
>> > The existing MetricResult format only allows querying metrics defined
>> by a namespace, name and step, and does not allow generalized labelling as
>> used by MonitoringInfos.
>> >
>> > Enhance Beam’s ability to integration test new metrics
>> >
>> >
>> > Thank for taking a look,
>> > Alex
>>
>


Re: Beam Python streaming pipeline on Flink Runner

2019-02-01 Thread Robert Bradshaw
On Fri, Feb 1, 2019 at 5:42 PM Thomas Weise  wrote:

>
> On Fri, Feb 1, 2019 at 6:17 AM Maximilian Michels  wrote:
>
>> > Max, thanks for your summary. I would like to add that we agree that
>> > the runner-specific translation via URN is a temporary solution until
>> > the wrapper transforms are written, is this correct? In any case this
>> > alternative standard expansion approach deserves a discussion of its
>> > own, as you mention.
>>
>> Correct. Wrapping existing Beam transforms should always be preferred
>> over Runner-specific translation because the latter is not portable.
>>
>>
> From a Python user perspective, this can still be exposed as a stub,
> without having to know about the URN.
>

Yep. In the long run, I'd expect many sources to be offered as their own
easy-to-use stubs.


> Also, isn't how we expose this is orthogonal to how it is being translated?
>

Yes.


> It may even be possible to switch the stub to SDF based translation once
> that is ready.
>

Yep. The expansion would change, but that's all an internal detail inside
the composite that the user doesn't care about.


>
>
>> On 01.02.19 14:25, Ismaël Mejía wrote:
>> > Thanks for the explanation Robert it makes much more sense now. (Sorry
>> > for the confusion in the mapping I mistyped the direction SDF <->
>> > Source).
>> >
>> > Status of SDF:
>> > - Support for Dynamic Work Rebalancing is WIP.
>> > - Bounded version translation is supported by all non-portable runners
>> > in a relatively naive way.
>> > - Unbounded version translation is not supported in the non-portable
>> > runners. (Let's not forget that this case may make sense too).
>> > - Portable runners translation of SDF is WIP
>> > - There is only one IO that is written based on SDF:
>> >- HBaseIO
>> > - Some other IOs should work out of the box (those based on
>> > non-splittable DoFn):
>> >- ClickhouseIO
>> >- File-based ones: TextIO, AvroIO, ParquetIO
>> >- JdbcIO
>> >- SolrIO
>> >
>> > Max, thanks for your summary. I would like to add that we agree that
>> > the runner-specific translation via URN is a temporary solution until
>> > the wrapper transforms are written, is this correct? In any case this
>> > alternative standard expansion approach deserves a discussion of its
>> > own, as you mention.
>> >
>> > On Fri, Feb 1, 2019 at 2:02 PM Robert Bradshaw 
>> wrote:
>> >>
>> >> Are you suggesting something akin to a generic
>> >>
>> >>  urn: JsonConfiguredJavaSource
>> >>  payload: some json specifying which source and which parameters
>> >>
>> >> which would expand to actually constructing and applying that source?
>> >>
>> >> (FWIW, I was imagining PubSubIO already had a translation into
>> BeamFnApi protos that fully specified it, and we use that same format to
>> translate back out.)
>> >>
>> >> On Fri, Feb 1, 2019 at 1:44 PM Maximilian Michels 
>> wrote:
>> >>>
>> >>> Recapping here:
>> >>>
>> >>> We all agree that SDF is the way to go for future implementations of
>> >>> sources. It enables us to get rid of the source interfaces. However,
>> SDF
>> >>> does not solve the lack of streaming sources in Python.
>> >>>
>> >>> The expansion PR (thanks btw!) solves the problem of
>> >>> expanding/translating URNs known to an ExpansionService. That is a
>> more
>> >>> programmatic way of replacing language-specific transforms, instead of
>> >>> relying on translators directly in the Runner.
>> >>>
>> >>> What is unsolved is the configuration of sources from a foreign
>> >>> environment. In my opinion this is the most pressing issue for Python
>> >>> sources, because what is PubSubIO worth in Python if you cannot
>> >>> configure it?
>> >>>
>> >>> What about this:
>> >>>
>> >>> I think it is worth adding a JSON configuration option for all
>> existing
>> >>> Java sources. That way, we could easily configure them as part of the
>> >>> expansion request (which would contain a JSON configuration). I'll
>> >>> probably fork a thread to discuss this in more detail, but would like
>> to
>> >>> hear your thoughts.
>> >

Re: Beam Python streaming pipeline on Flink Runner

2019-02-01 Thread Robert Bradshaw
Are you suggesting something akin to a generic

urn: JsonConfiguredJavaSource
payload: some json specifying which source and which parameters

which would expand to actually constructing and applying that source?

(FWIW, I was imagining PubSubIO already had a translation into BeamFnApi
protos that fully specified it, and we use that same format to translate
back out.)
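
For concreteness, a rough sketch of what the Java side of expanding such a
generic payload could look like. The "JsonConfiguredJavaSource" idea, the JSON
field names, and the restriction to PubsubIO below are illustrative assumptions
only, not an existing Beam API:

  import com.fasterxml.jackson.databind.JsonNode;
  import com.fasterxml.jackson.databind.ObjectMapper;
  import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
  import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
  import org.apache.beam.sdk.transforms.PTransform;
  import org.apache.beam.sdk.values.PBegin;
  import org.apache.beam.sdk.values.PCollection;

  // Hypothetical expansion of a "JsonConfiguredJavaSource" payload into a
  // concrete Java source transform.
  class JsonConfiguredSourceExpander {
    static PTransform<PBegin, PCollection<PubsubMessage>> fromJson(String jsonPayload)
        throws Exception {
      JsonNode config = new ObjectMapper().readTree(jsonPayload);
      switch (config.get("source").asText()) {
        case "PubsubIO.Read":
          PubsubIO.Read<PubsubMessage> read = PubsubIO.readMessages();
          if (config.has("topic")) {
            read = read.fromTopic(config.get("topic").asText());
          }
          if (config.has("subscription")) {
            read = read.fromSubscription(config.get("subscription").asText());
          }
          return read;
        default:
          throw new IllegalArgumentException(
              "Unsupported source: " + config.get("source"));
      }
    }
  }

A real version would also have to deal with differing output types and coders
per source, which is part of the appeal of a per-source URN plus a thin
composite wrapper instead.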

On Fri, Feb 1, 2019 at 1:44 PM Maximilian Michels  wrote:

> Recaping here:
>
> We all agree that SDF is the way to go for future implementations of
> sources. It enables us to get rid of the source interfaces. However, SDF
> does not solve the lack of streaming sources in Python.
>
> The expansion PR (thanks btw!) solves the problem of
> expanding/translating URNs known to an ExpansionService. That is a more
> programmatic way of replacing language-specific transforms, instead of
> relying on translators directly in the Runner.
>
> What is unsolved is the configuration of sources from a foreign
> environment. In my opinion this is the most pressing issue for Python
> sources, because what is PubSubIO worth in Python if you cannot
> configure it?
>
> What about this:
>
> I think it is worth adding a JSON configuration option for all existing
> Java sources. That way, we could easily configure them as part of the
> expansion request (which would contain a JSON configuration). I'll
> probably fork a thread to discuss this in more detail, but would like to
> hear your thoughts.
>
> -Max
>
> On 01.02.19 13:08, Robert Bradshaw wrote:
> > On Thu, Jan 31, 2019 at 6:25 PM Maximilian Michels  wrote:
> >
> > Ah, I thought you meant native Flink transforms.
> >
> > Exactly! The translation code is already there. The main challenge
> > is how to
> > programmatically configure the BeamIO from Python. I suppose that is
> > also an
> > unsolved problem for cross-language transforms in general.
> >
> >
> > This is what https://github.com/apache/beam/pull/7316 does.
> >
> > For a particular source, one would want to define a URN and
> > corresponding payload, then (probably) a CompositeTransform in Python
> > that takes the user's arguments, packages them into the payload, applies
> > the ExternalTransform, and returns the results. How to handle arbitrary
> > UDFs embedded in sources is still TBD.
> >
> > For Matthias' pipeline with PubSubIO we can build something
> > specific, but for
> > the general case there should be a way to initialize a Beam IO via a
> > configuration
> > map provided by an external environment.
> >
> >
> > I thought quite a bit about how we could represent expansions statically
> > (e.g. have some kind of expansion template that could be used, at least
> > in many cases, as data without firing up a separate process. May be
> > worth doing eventually, but we run into the same issues that were
> > discussed at
> > https://github.com/apache/beam/pull/7316#discussion_r249996455 ).
> >
> > If one is already using a portable runner like Flink, having the job
> > service process automatically also serve up an expansion service for
> > various URNs it knows and cares about is probably a pretty low bar.
> > Flink could serve up things it would rather get back untouched in a
> > transform with a special flink runner urn.
> >
> > As Ahmet mentions, SDF is better solution. I hope it's not that far
> > away, but even once it comes we'll likely want the above framework to
> > invoke the full suite of Java IOs even after they're running on SDF
> > themselves.
> >
> > - Robert
> >
> > On 31.01.19 17:36, Thomas Weise wrote:
> >  > Exactly, that's what I had in mind.
> >  >
> >  > A Flink runner native transform would make the existing unbounded
> > sources
> >  > available, similar to:
> >  >
> >  >
> >
> https://github.com/apache/beam/blob/2e89c1e4d35e7b5f95a622259d23d921c3d6ad1f/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkStreamingTransformTranslators.java#L167
> >  >
> >  >
> >  >
> >  >
> >  > On Thu, Jan 31, 2019 at 8:18 AM Maximilian Michels  wrote:
> >  >
> >  > Wouldn't it be even more useful for the transition period if
> > we enabled Beam IO
> >  > to be used via Fl

Re: Beam Python streaming pipeline on Flink Runner

2019-02-01 Thread Robert Bradshaw
On Thu, Jan 31, 2019 at 6:25 PM Maximilian Michels  wrote:

> Ah, I thought you meant native Flink transforms.
>
> Exactly! The translation code is already there. The main challenge is how
> to
> programmatically configure the BeamIO from Python. I suppose that is also
> an
> unsolved problem for cross-language transforms in general.
>

This is what https://github.com/apache/beam/pull/7316 does.

For a particular source, one would want to define a URN and corresponding
payload, then (probably) a CompositeTransform in Python that takes the
user's arguments, packages them into the payload, applies the
ExternalTransform, and returns the results. How to handle arbitrary UDFs
embedded in sources is still TBD.


> For Matthias' pipeline with PubSubIO we can build something specific, but
> for
> the general case there should be a way to initialize a Beam IO via a
> configuration
> map provided by an external environment.
>

I thought quite a bit about how we could represent expansions statically
(e.g. have some kind of expansion template that could be used, at least in
many cases, as data without firing up a separate process. May be worth
doing eventually, but we run into the same issues that were discussed at
https://github.com/apache/beam/pull/7316#discussion_r249996455 ).

If one is already using a portable runner like Flink, having the job
service process automatically also serve up an expansion service for
various URNs it knows and cares about is probably a pretty low bar. Flink
could serve up things it would rather get back untouched in a transform
with a special flink runner urn.

As Ahmet mentions, SDF is better solution. I hope it's not that far away,
but even once it comes we'll likely want the above framework to invoke the
full suite of Java IOs even after they're running on SDF themselves.

- Robert



> On 31.01.19 17:36, Thomas Weise wrote:
> > Exactly, that's what I had in mind.
> >
> > A Flink runner native transform would make the existing unbounded
> sources
> > available, similar to:
> >
> >
> https://github.com/apache/beam/blob/2e89c1e4d35e7b5f95a622259d23d921c3d6ad1f/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkStreamingTransformTranslators.java#L167
> >
> >
> >
> >
> > On Thu, Jan 31, 2019 at 8:18 AM Maximilian Michels  wrote:
> >
> > Wouldn't it be even more useful for the transition period if we
> enabled Beam IO
> > to be used via Flink (like in the legacy Flink Runner)? In this
> particular
> > example, Matthias wants to use PubSubIO, which is not even available
> as a
> > native
> > Flink transform.
> >
> > On 31.01.19 16:21, Thomas Weise wrote:
> >  > Until SDF is supported, we could also add Flink runner native
> transforms for
> >  > selected unbounded sources [1].
> >  >
> >  > That might be a reasonable option to unblock users that want to
> try Python
> >  > streaming on Flink.
> >  >
> >  > Thomas
> >  >
> >  > [1]
> >  >
> >
> https://github.com/lyft/beam/blob/release-2.10.0-lyft/runners/flink/src/main/java/org/apache/beam/runners/flink/LyftFlinkStreamingPortableTranslations.java
> >  >
> >  >
> >  >  > On Thu, Jan 31, 2019 at 6:51 AM Maximilian Michels  wrote:
> >  >
> >  >  > I have a hard time to imagine how can we map in a generic
> way
> >  > RestrictionTrackers into the existing
> Bounded/UnboundedSource, so I would
> >  > love to hear more about the details.
> >  >
> >  > Isn't it the other way around? The SDF is a generalization of
> > UnboundedSource.
> >  > So we would wrap UnboundedSource using SDF. I'm not saying it
> is
> > trivial, but
> >  > SDF offers all the functionality that UnboundedSource needs.
> >  >
> >  > For example, the @GetInitialRestriction method would call
> split on the
> >  > UnboundedSource and the restriction trackers would then be
> used to
> > process the
> >  > splits.
> >  >
> >  > On 31.01.19 15:16, Ismaël Mejía wrote:
> >  >  >> Not necessarily. This would be one way. Another way is
> build an SDF
> >  > wrapper for UnboundedSource. Probably the easier path for
> migration.
> >  >  >
> >  >  > That would be fantastic, I have heard about such wrapper
> multiple
> >  >  > times but so far there is not any realistic proposal. I
> have a hard
> >  >  > time to imagine how can we map in a generic way
> RestrictionTrackers
> >  >  > into the existing Bounded/UnboundedSource, so I would love
> to hear
> >  >  > more about the details.
> >  >  >
> >  >  > On Thu, Jan 31, 2019 at 3:07 PM Maximilian Michels  wrote:
> >

Re: [DISCUSS] Should File based IOs implement readAll() or just readFiles()

2019-01-30 Thread Robert Bradshaw
Yes, this is precisely the goal of SDF.


On Wed, Jan 30, 2019 at 8:41 PM Kenneth Knowles  wrote:
>
> So is the latter intended for splittable DoFn but not yet using it? The 
> promise of SDF is precisely this composability, isn't it?
>
> Kenn
>
> On Wed, Jan 30, 2019 at 10:16 AM Jeff Klukas  wrote:
>>
>> Reuven - Is TextIO.read().from() a more complex case than the topic Ismaël 
>> is bringing up in this thread? I'm surprised to hear that the two examples 
>> have different performance characteristics.
>>
>> Reading through the implementation, I guess the fundamental difference is 
>> whether a given configuration expands to TextIO.ReadAll or to io.Read. 
>> AFAICT, that detail and the subsequent performance impact is not documented.
>>
>> If the above is correct, perhaps it's an argument for IOs to provide 
>> higher-level methods in cases where they can optimize performance compared 
>> to what a user might naively put together.
>>
>> On Wed, Jan 30, 2019 at 12:35 PM Reuven Lax  wrote:
>>>
>>> Jeff, what you did here is not simply a refactoring. These two are quite 
>>> different, and will likely have different performance characteristics.
>>>
>>> The first evaluates the wildcard, and allows the runner to pick appropriate 
>>> bundling. Bundles might contain multiple files (if they are small), and the 
>>> runner can split the files as appropriate. In the case of the Dataflow 
>>> runner, these bundles can be further split dynamically.
>>>
>>> The second chops up the files inside the PTransform, and processes each 
>>> chunk in a ParDo. TextIO.readFiles currently chops up each file into 64MB 
>>> chunks (hardcoded), and then processes each chunk in a ParDo.
>>>
>>> Reuven
>>>
>>>
>>> On Wed, Jan 30, 2019 at 9:18 AM Jeff Klukas  wrote:

 I would prefer we move towards option [2]. I just tried the following 
 refactor in my own code from:

   return input
   .apply(TextIO.read().from(fileSpec));

 to:

   return input
   .apply(FileIO.match().filepattern(fileSpec))
   .apply(FileIO.readMatches())
   .apply(TextIO.readFiles());

 Yes, the latter is more verbose but not ridiculously so, and it's also 
 more instructive about what's happening.

 When I first started working with Beam, it took me a while to realize that 
 TextIO.read().from() would accept a wildcard. The more verbose version 
 involves a method called "filepattern" which makes this much more obvious. 
 It also leads me to understand that I could use the same FileIO.match() 
 machinery to do other things with filesystems other than read file 
 contents.

 On Wed, Jan 30, 2019 at 11:26 AM Ismaël Mejía  wrote:
>
> Hello,
>
> A ‘recent’ pattern of use in Beam is to have in file based IOs a
> `readAll()` implementation that basically matches a `PCollection` of
> file patterns and reads them, e.g. `TextIO`, `AvroIO`. `ReadAll` is
> implemented by an expand function that matches files with FileIO and
> then reads them using a format specific `ReadFiles` transform e.g.
> TextIO.ReadFiles, AvroIO.ReadFiles. So in the end `ReadAll` in the
> Java implementation is just an user friendly API to hide FileIO.match
> + ReadFiles.
>
> Most recent IOs do NOT implement ReadAll to encourage the more
> composable approach of File + ReadFiles, e.g. XmlIO and ParquetIO.
>
> Implementing ReadAll as a wrapper is relatively easy and is definitely
> user friendly, but it has an issue: it may be error-prone and it adds
> more code to maintain (mostly ‘repeated’ code). However `readAll` is a
> more abstract pattern that applies not only to File based IOs so it
> makes sense for example in other transforms that map a `PCollection`
> of read requests and is the basis for SDF composable style APIs like
> the recent `HBaseIO.readAll()`.
>
> So the question is should we:
>
> [1] Implement `readAll` in all file based IOs to be user friendly and
> assume the (minor) maintenance cost
>
> or
>
> [2] Deprecate `readAll` from file based IOs and encourage users to use
> FileIO + `readFiles` (less maintenance and encourage composition).
>
> I just checked quickly in the python code base but I did not find if
> the File match + ReadFiles pattern applies, but it would be nice to
> see what the python guys think on this too.
>
> This discussion comes from a recent slack conversation with Ɓukasz
> Gajowy, and we wanted to settle into one approach to make the IO
> signatures consistent, so any opinions/preferences?
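
For reference, a minimal sketch of the expansion Ismaël describes above: a
file-based readAll() is essentially FileIO matching plus the format-specific
readFiles(), so the wrapper adds convenience but no new machinery. The class
name below is illustrative only:

  import org.apache.beam.sdk.io.FileIO;
  import org.apache.beam.sdk.io.TextIO;
  import org.apache.beam.sdk.transforms.PTransform;
  import org.apache.beam.sdk.values.PCollection;

  // Roughly what a file-based readAll() expands to under the hood.
  class TextReadAllSketch extends PTransform<PCollection<String>, PCollection<String>> {
    @Override
    public PCollection<String> expand(PCollection<String> filepatterns) {
      return filepatterns
          .apply(FileIO.matchAll())      // match each incoming file pattern
          .apply(FileIO.readMatches())   // open the matched files
          .apply(TextIO.readFiles());    // format-specific ReadFiles transform
    }
  }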


Re: Portable metrics work and open questions

2019-01-30 Thread Robert Bradshaw
I think v1 of the querying API should be just "give me *all* the
metrics." Shortly thereafter, we should have a v2 that allows for
requesting just a subset of metrics, possibly pre-aggregated. (My
preference would be a filter like {URN: regex, label: [label_name:
regex]} and all matching counters would be returned. A second
parameter could control how counters are aggregated across distinct
labels (specifically some kind of many-to-one mapping of labels onto a
smaller set, also could be regex-based, useful for things like "give
me the msecs of all steps "FirstTransform/.*" or ".*/WriteToShuffle").)
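
To make the shape of such a filter concrete, a small self-contained sketch;
the MetricQueryFilter name and structure are hypothetical, not an existing
API. A v2 query request could carry something of this shape, e.g. a URN regex
of ".*execution_time.*" plus a PTRANSFORM label regex of "FirstTransform/.*":

  import java.util.HashMap;
  import java.util.Map;
  import java.util.regex.Pattern;

  // Hypothetical query filter: a URN regex plus per-label regexes. A counter
  // matches if its URN matches and every listed label matches its regex.
  class MetricQueryFilter {
    private final Pattern urnPattern;
    private final Map<String, Pattern> labelPatterns = new HashMap<>();

    MetricQueryFilter(String urnRegex, Map<String, String> labelRegexes) {
      this.urnPattern = Pattern.compile(urnRegex);
      for (Map.Entry<String, String> entry : labelRegexes.entrySet()) {
        labelPatterns.put(entry.getKey(), Pattern.compile(entry.getValue()));
      }
    }

    boolean matches(String urn, Map<String, String> labels) {
      if (!urnPattern.matcher(urn).matches()) {
        return false;
      }
      for (Map.Entry<String, Pattern> entry : labelPatterns.entrySet()) {
        String value = labels.get(entry.getKey());
        if (value == null || !entry.getValue().matcher(value).matches()) {
          return false;
        }
      }
      return true;
    }
  }

The second, aggregation parameter would then be a many-to-one mapping applied
to the labels of the matching counters before they are summed.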

On Wed, Jan 30, 2019 at 6:19 PM Alex Amato  wrote:
>
> Okay, yeah I was tossing and turning last night thinking the same thing.
>
> The querying API needs to be relatively simple, not use a structure similar 
> to the URNs/MonitoringInfo structure. But there should be a way to pass 
> through metrics so that they can be queried out. I think that is missing from 
> the doc right now. I'll iterate on that a bit. For sum_int64 and 
> distribution_int_64 this will be possible, but we should document the 
> translation formal
>
> On Wed, Jan 30, 2019 at 5:33 AM Robert Bradshaw  wrote:
>>
>> Thanks for writing this up. I left some comments in the doc, but at a
>> high level I am in favor of the "more deeply overhaul SDKs'
>> metrics/querying structures to use MonitoringInfos / URNs" option, at
>> least over the Jobs API, for consistency and completeness. The SDK can
>> provide whatever convenience wrappers over these it wants.
>>
>> On Tue, Jan 29, 2019 at 6:30 PM Ryan Williams  wrote:
>> >
>> > Alex and I have PRs out related to supporting metrics in portable-runner 
>> > code-paths:
>> >
>> > #7624 associates metrics in the SDK harness with the (pre-fusion) 
>> > PTransforms the user defined them in.
>> > #7641 sends metrics over the "Job API" (between job server and portable 
>> > runner):
>> >
>> > Flink portable-VR metrics tests pass (Java)
>> > metrics print()s work in portable wordcount (Python)
>> >
>> > Open Questions:
>> >
>> > What to do with type-specific protos (e.g. IntDistributionData vs. 
>> > DoubleDistributionData)?
>> >
>> > I think Alex and I were leaning toward only supporting the "int"-cases for 
>> > now
>> > That's what Java does in its existing metrics
>> >
>> > "MetricKey" and "MetricName" semantics:
>> >
>> > These exist in Java and Python, and I added proto versions in #7641.
>> > MetricName wraps "namespace" and "name" strings, and MetricKey wraps a 
>> > "step (ptransform) name" and a MetricName.
>> > PCollection-scoped metrics (e.g. element count) are identified by a null 
>> > "step name" in #7624 and #7641.
>> > Alex and I discussed using URNs as the source of this information instead:
>> >
>> > "step name" can instead come from a MonitoringInfo's PTRANSFORM label, 
>> > while "namespace" and "name" can be parsed from its URN.
>> > URNs could encode these over the wire, then SDKs could convert to existing 
>> > MetricKey/MetricNames for use in querying / MetricResults
>> > or: we could more deeply overhaul SDKs' metrics/querying structures to use 
>> > MonitoringInfos / URNs.
>> >
>> > at the least, SDKs should get helpers for querying for Alex's new "system 
>> > metrics" (e.g. element count, various timings) that are associated with 
>> > specific URNs
>> >
>> > Gauges: the protos have a nod to sending gauges over the wire as counters
>> >
>> > are there problems with that?
>> > #7641 should support this, for now.
>> >
>> > ExtremaData: the protos contain these, but SDKs don't support them (afaik).
>> >
>> > Alex likely has more to add, and we plan to make a doc about these 
>> > changes, but I wanted to post here first in case others have thoughts or 
>> > we are overlooking anything.
>> >
>> > Thanks!


Re: Portable metrics work and open questions

2019-01-30 Thread Robert Bradshaw
Thanks for writing this up. I left some comments in the doc, but at a
high level I am in favor of the "more deeply overhaul SDKs'
metrics/querying structures to use MonitoringInfos / URNs" option, at
least over the Jobs API, for consistency and completeness. The SDK can
provide whatever convenience wrappers over these it wants.

On Tue, Jan 29, 2019 at 6:30 PM Ryan Williams  wrote:
>
> Alex and I have PRs out related to supporting metrics in portable-runner 
> code-paths:
>
> #7624 associates metrics in the SDK harness with the (pre-fusion) PTransforms 
> the user defined them in.
> #7641 sends metrics over the "Job API" (between job server and portable 
> runner):
>
> Flink portable-VR metrics tests pass (Java)
> metrics print()s work in portable wordcount (Python)
>
> Open Questions:
>
> What to do with type-specific protos (e.g. IntDistributionData vs. 
> DoubleDistributionData)?
>
> I think Alex and I were leaning toward only supporting the "int"-cases for now
> That's what Java does in its existing metrics
>
> "MetricKey" and "MetricName" semantics:
>
> These exist in Java and Python, and I added proto versions in #7641.
> MetricName wraps "namespace" and "name" strings, and MetricKey wraps a "step 
> (ptransform) name" and a MetricName.
> PCollection-scoped metrics (e.g. element count) are identified by a null 
> "step name" in #7624 and #7641.
> Alex and I discussed using URNs as the source of this information instead:
>
> "step name" can instead come from a MonitoringInfo's PTRANSFORM label, while 
> "namespace" and "name" can be parsed from its URN.
> URNs could encode these over the wire, then SDKs could convert to existing 
> MetricKey/MetricNames for use in querying / MetricResults
> or: we could more deeply overhaul SDKs' metrics/querying structures to use 
> MonitoringInfos / URNs.
>
> at the least, SDKs should get helpers for querying for Alex's new "system 
> metrics" (e.g. element count, various timings) that are associated with 
> specific URNs
>
> Gauges: the protos have a nod to sending gauges over the wire as counters
>
> are there problems with that?
> #7641 should support this, for now.
>
> ExtremaData: the protos contain these, but SDKs don't support them (afaik).
>
> Alex likely has more to add, and we plan to make a doc about these changes, 
> but I wanted to post here first in case others have thoughts or we are 
> overlooking anything.
>
> Thanks!


Re: [VOTE] Release 2.10.0, release candidate #1

2019-01-29 Thread Robert Bradshaw
The artifacts and signatures look good. But we're missing Python wheels.


On Tue, Jan 29, 2019 at 6:08 AM Kenneth Knowles  wrote:
>
> Ah, I did not close the staging repository. Thanks for letting me know. Try 
> now.
>
> Kenn
>
> On Mon, Jan 28, 2019 at 2:31 PM Ismaël Mejía  wrote:
>>
>> I think there is an issue, [4] does not open?
>>
>> On Mon, Jan 28, 2019 at 6:24 PM Kenneth Knowles  wrote:
>> >
>> > Hi everyone,
>> >
>> > Please review and vote on the release candidate #1 for the version 2.10.0, 
>> > as follows:
>> >
>> > [ ] +1, Approve the release
>> > [ ] -1, Do not approve the release (please provide specific comments)
>> >
>> > The complete staging area is available for your review, which includes:
>> > * JIRA release notes [1],
>> > * the official Apache source release to be deployed to dist.apache.org 
>> > [2], which is signed with the key with fingerprint 6ED551A8AE02461C [3],
>> > * all artifacts to be deployed to the Maven Central Repository [4],
>> > * source code tag "v2.10.0-RC1" [5],
>> > * website pull request listing the release [6] and publishing the API 
>> > reference manual [7].
>> > * Python artifacts are deployed along with the source release to the 
>> > dist.apache.org [2].
>> > * Validation sheet with a tab for 2.10.0 release to help with validation 
>> > [8].
>> >
>> > The vote will be open for at least 72 hours. It is adopted by majority 
>> > approval, with at least 3 PMC affirmative votes.
>> >
>> > Thanks,
>> > Kenn
>> >
>> > [1] 
>> > https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12344540
>> > [2] https://dist.apache.org/repos/dist/dev/beam/2.10.0/
>> > [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>> > [4] https://repository.apache.org/content/repositories/orgapachebeam-1056/
>> > [5] https://github.com/apache/beam/tree/v2.10.0-RC1
>> > [6] https://github.com/apache/beam/pull/7651/files
>> > [7] https://github.com/apache/beam-site/pull/585
>> > [8] 
>> > https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=2053422529


Re: [DISCUSSION] UTests and embedded backends

2019-01-28 Thread Robert Bradshaw
I strongly agree with your original assessment "IMHO I believe that
having embedded backend for UTests are a lot better than mocks." Mocks
are sometimes necessary, but in my experience they are often an
expensive (in production and maintenance) way to get what amounts to
low true coverage.

On Mon, Jan 28, 2019 at 11:16 AM Etienne Chauchot  wrote:
>
> Guys,
>
> I will try using mocks where I see it is needed. As there is a current PR 
> opened on Cassandra, I will take this opportunity to add the embedded 
> cassandra server (https://github.com/jsevellec/cassandra-unit) to the UTests.
> The ticket was opened a while ago: https://issues.apache.org/jira/browse/BEAM-4164
>
> Etienne
>
> On Tuesday, January 22, 2019 at 09:26 +0100, Robert Bradshaw wrote:
>
> On Mon, Jan 21, 2019 at 10:42 PM Kenneth Knowles  wrote:
>
>
> Robert - you meant this as a mostly-automatic thing that we would engineer, 
> yes?
>
>
> Yes, something like TestPipeline that buffers up the pipelines and
>
> then executes on class teardown (details TBD).
>
>
> A lighter-weight fake, like using something in-process sharing a Java 
> interface (versus today a locally running service sharing an RPC interface) 
> is still much better than a mock.
>
>
> +1
>
>
>
> Kenn
>
>
> On Mon, Jan 21, 2019 at 7:17 AM Jean-Baptiste Onofré  
> wrote:
>
>
> Hi,
>
>
> it makes sense to use embedded backend when:
>
>
> 1. it's possible to easily embed the backend
>
> 2. when the backend is "predictable".
>
>
> If it's easy to embed and the backend behavior is predictable, then it
>
> makes sense.
>
> In other cases, we can fallback to mock.
>
>
> Regards
>
> JB
>
>
> On 21/01/2019 10:07, Etienne Chauchot wrote:
>
> Hi guys,
>
>
> Lately I have been fixing various Elasticsearch flakiness issues in the
>
> UTests by: introducing timeouts, countdown latches, force refresh,
>
> embedded cluster size decrease ...
>
>
> These flakiness issues are due to the embedded Elasticsearch not coping
>
> well with the jenkins overload. Still, IMHO I believe that having
>
> embedded backend for UTests are a lot better than mocks. Even if they
>
> are less tolerant to load, I prefer having UTests 100% representative of
>
> real backend and add countermeasures to protect against jenkins overload.
>
>
> WDYT ?
>
>
> Etienne
>
>
>
>
> --
>
> Jean-Baptiste Onofré
>
> jbono...@apache.org
>
> http://blog.nanthrax.net
>
> Talend - http://www.talend.com


Re: [ANNOUNCE] New PMC member: Etienne Chauchot

2019-01-28 Thread Robert Bradshaw
Thanks for all your great work. Congratulations and welcome!

On Mon, Jan 28, 2019 at 10:21 AM Alexey Romanenko
 wrote:
>
> Great job! Congrats, Etienne!
>
> On 28 Jan 2019, at 07:18, Ahmet Altay  wrote:
>
> Congratulations Etienne!
>
> On Sun, Jan 27, 2019 at 7:15 PM Reza Ardeshir Rokni  wrote:
>>
>> Congratulations Etienne!
>>
>> On Sat, 26 Jan 2019 at 14:16, Ismaël Mejía  wrote:
>>>
>>> Congratulations Etienne!
>>>
>>> On Sat, Jan 26, 2019 at 06:42, Reuven Lax  wrote:

 Welcome!

 On Fri, Jan 25, 2019 at 9:30 PM Pablo Estrada  wrote:
>
> Congrats Etienne :)
>
> On Fri, Jan 25, 2019, 9:24 PM Tráș§n ThĂ nh ĐáșĄt  wrote:
>>
>> Congratulations Etienne!
>>
>> On Sat, Jan 26, 2019 at 12:08 PM Thomas Weise  wrote:
>>>
>>> Congrats, félicitations!
>>>
>>>
>>> On Fri, Jan 25, 2019 at 3:06 PM Scott Wegner  wrote:

 Congrats Etienne!

 On Fri, Jan 25, 2019 at 2:34 PM Tim  wrote:
>
> Congratulations Etienne!
>
> Tim
>
> > On 25 Jan 2019, at 23:00, Kenneth Knowles  wrote:
> >
> > Hi all,
> >
> > Please join me and the rest of the Beam PMC in welcoming Etienne 
> > Chauchot to join the PMC.
> >
> > Etienne introduced himself to dev@ in September of 2017 and over 
> > the years has contributed to Beam in many ways - connectors, 
> > performance, design discussion, talks, code reviews, and I'm sure I 
> > cannot list them all. He already has a major impact on the 
> > direction of Beam.
> >
> > Thanks for being a part of Beam, Etienne!
> >
> > Kenn



 --




 Got feedback? tinyurl.com/swegner-feedback
>
>


Re: BEAM-6324 / #7340: "I've pretty much given up on the PR being merged. I use my own fork for my projects"

2019-01-28 Thread Robert Bradshaw
On Mon, Jan 28, 2019 at 10:37 AM Etienne Chauchot  wrote:
>
> Sure, it's a pity that this PR went unnoticed, and I think it is a combination 
> of factors (PR date around Christmas, the fact that the author forgot - AFAIK 
> - to ping a reviewer in either the PR or the ML).
>
> I agree with Rui's proposal to enhance visibility of the "how to get a 
> reviewed" process.
>
> IMHO, I don't think committers spend time watching new PRs coming up, but 
> they more likely act when pinged. So we may need some automation in case a 
> contributor does not use GitHub's reviewer suggestions. Auto reviewer assignment 
> seems like too much, but modifying the PR template to add a sentence such as "please 
> pick a reviewer from the proposed list" could be enough.
> WDYT ?

+1

I see two somewhat separable areas of improvement:

(1) Getting a reviewer assigned to a PR, and
(2) Expectations of feedback/timeliness from a reviewer once it has
been assigned.

The first is the motivation for this thread, but I think we're
suffering on the second as well.

Given the reactions here, it sounds like most of us are just as
unhappy this happened as the author, and would be happy to pitch in
and improve things.

I agree with Kenn that just adding more to the contributor guide won't
always help, because a contributor guide with everything one might need
to know is the least likely to actually be read in its entirety.
Rather it's useful to provide such guidance at the point that it's
needed. Specifically, I would like to see

(1) A sentence in the PR template suggesting adding a reviewer. (easy)
(2) An automated recommendation for suggesting good candidate
reviewers (if we deem Github's suggestions insufficient). (harder)
(3) A bot that follows up on PRs after 1 week(?) noting the lack of
action and suggesting (and, implicitly but importantly
permission/expectation) that the author bring the PR to the list.
(medium)

We could also have automated emails like the Beam Dependency Check
Report, but automated emails are much easier to ignore than personal
ones. Having the author ping dev@ has the added advantage that it
gives the author something they can do to move the PR forward, and
provides a clear signal that this is a PR someone cares enough about
to prioritize it above others. (It's certainly disheartening as a
reviewer to put time into reviewing a PR and then the author doesn't
bother to even respond, or (as has happened to me) be told "hey, this
wasn't ready for review yet." On the other hand it's rewarding to help
someone, especially a first time contributor, to see their change
actually get in. Improving this ratio will I think both increase the
productivity of reviews and the motivation to do them.)

> Also, I started to review the PR on Friday (thx Kenn for pinging me).
>
> Etienne
>
> On Friday, January 25, 2019 at 10:21 -0800, Rui Wang wrote:
>
> We have code contribution guidelines [1] that give useful tips for getting a PR 
> reviewed and merged. But I guess they are hidden away on the Beam website, so new 
> contributors are likely to miss them. In order to make the guidance easy to find 
> and read for new contributors, we probably can
>
> a. Move item number 5 from [1] to a separate section and name it "Tips to get 
> your PR reviewed and merged"
> b. Put the link into the GitHub pull request template, so that when a contributor 
> creates their first PR, they see the link (or can even paste text from the 
> contribution guide). That is a good chance for new contributors to read what's in 
> the pull request template.
>
>
> -Rui
>
> [1] https://beam.apache.org/contribute/#make-your-change
>
> On Fri, Jan 25, 2019 at 9:24 AM Alexey Romanenko  
> wrote:
>
> For sure, it’s a pity that this PR has not been addressed for a long time (I 
> guess, we probably have other ones like this) but, as I can see from this PR 
> history, review has not been requested explicitly by author (and this is one 
> of the our recommendations for code contribution [1]).
>
> What are the options to improve this:
>
> 1) Make it clearer for new contributors that they need to ask for a 
> review explicitly (with the help of the recommendations already provided in the 
> top-right corner of the PR page)
> 2) Create a bot (like “stale” bot that we have) to check for non-addressed 
> PRs that are more than, say, 7 days, and send notification to dev@ (or 
> dedicated, see n.3) mailing list if they are starving for review.
> 3) (Optionally) Create new mailing list called pr@ for new coming and 
> non-addressed PRs
>
> [1] https://beam.apache.org/contribute/#make-your-change
>
>
> On 25 Jan 2019, at 17:50, Ismaël Mejía  wrote:
>
> The fact that this happened is a real pity. However it is clearly an
> exception and not the rule. Very few PRs have gone a long time without
> review. Can we somehow automatically send a notification if a PR has
> no assigned reviewers, or if it has not been reviewed after some time
> as Tim suggested?
>
> On Fri, Jan 25, 2019 at 9:43 AM Tim Robertson  
> wrote:
>
>
> Thanks 

Re: Cross-language pipelines

2019-01-24 Thread Robert Bradshaw
On Fri, Jan 25, 2019 at 12:18 AM Reuven Lax  wrote:
>
> On Thu, Jan 24, 2019 at 2:38 PM Robert Bradshaw  wrote:
>>
>> On Thu, Jan 24, 2019 at 6:43 PM Reuven Lax  wrote:
>> >
>> > Keep in mind that these user-supplied lambdas are commonly used in our 
>> > IOs. One common usage is in Sink IOs, to allow dynamic destinations. e.g. 
>> > in BigQueryIO.Write, a user-supplied lambda determines what table a record 
>> > should be written to.
>>
>> This can probably be pre-computed upstream (as part of the wrapping
>> composite that does take a language-native lambda) and placed in a
>> standard format (e.g. a tuple or other schema) to be extracted by the
>> "core" sink.
>
> I'm not quite sure what you mean. How will you express a lambda as a tuple? 
> Or are you suggesting that we preapply all the lambdas and pass the result 
> down?

Exactly.

> That might work, but would be _far_ more expensive.

Calling back to the SDK on each application would likely be (a
different kind of) expensive.

> The result of the lambda is sometimes much larger than the input (e.g. the 
> result could be a fully-qualified output location string), so these IOs try 
> and delay application as much as possible; as a result, the actual 
> application is often deep inside the graph.

Batching such RPCs gets messy (though perhaps we'll have to go there).
A hybrid approach may sometimes be possible, where we compute the truly
dynamic part eagerly and delay the "boring" (known URN) application, like
prefixing with a prefix. Some applications may lend
themselves to interleaving (e.g. so the large lambda output is never
shuffled, but still crosses the data plane).

Worst case there are features that simply wouldn't be available, or at
least not cheaply, until an SDK-native source is written, but it could
still be a huge win for a lot of use cases.

As I said, we just don't have any good answers for this bit yet :).

>>
>> > Given that IOs are one of the big selling points of cross-language 
>> > support, we should think about how we can support this functionality.
>>
>> Yes. There are user-supplied lambdas that can't be as easily pre- or
>> post-applied, and though we had some brainstorming sessions (~ a year
>> ago) we're far from a (good) answer to that.
>>
>> > On Thu, Jan 24, 2019 at 8:34 AM Robert Bradshaw  
>> > wrote:
>> >>
>> >> On Thu, Jan 24, 2019 at 5:08 PM Thomas Weise  wrote:
>> >> >
>> >> > Exciting to see the cross-language train gathering steam :)
>> >> >
>> >> > It may be useful to flesh out the user facing aspects a bit more before 
>> >> > going too deep on the service / expansion side or maybe that was done 
>> >> > elsewhere?
>> >>
>> >> It's been discussed, but no resolution yet.
>> >>
>> >> > A few examples (of varying complexity) of how the shim/proxy transforms 
>> >> > would look like in the other SDKs. Perhaps Java KafkaIO in Python and 
>> >> > Go would be a good candidate?
>> >>
>> >> The core implementation would, almost by definition, be
>> >>
>> >> input.apply(ExternalTransform(URN, payload, service_address).
>> >>
>> >> Nicer shims would just be composite transforms that call this, filling
>> >> in the URNs, payloads, and possibly service details from more
>> >> user-friendly parameters.
>> >>
>> >> > One problem we discovered with custom Flink native transforms for 
>> >> > Python was handling of lambdas / functions. An example could be a user 
>> >> > defined watermark timestamp extractor that the user should be able to 
>> >> > supply in Python and the JVM cannot handle.
>> >>
>> >> Yes, this has never been resolved satisfactorily. For now, if UDFs can
>> >> be reified in terms of a commonly-understood URN + payload, it'll
>> >> work. A transform could provide a wide range of "useful" URNs for its
>> >> internal callbacks, more than that would require significant design if
>> >> it can't be pre- or post-fixed.
>> >>
>> >> > On Wed, Jan 23, 2019 at 7:04 PM Chamikara Jayalath 
>> >> >  wrote:
>> >> >>
>> >> >>
>> >> >>
>> >> >> On Wed, Jan 23, 2019 at 1:03 PM Robert Bradshaw  
>> >> >> wrote:
>> >> >>>
>> >> >>> On Wed, Jan 23, 2019 

Re: [DISCUSSION] ParDo Async Java API

2019-01-24 Thread Robert Bradshaw
That's a good point that this "IO" time should be tracked differently.

For a single level, a wrapper/utility that correctly and completely
(and transparently) implements the "naive" bit I sketched above under
the hood may be sufficient and implementable purely in user-space, and
quite useful.

On Thu, Jan 24, 2019 at 7:38 PM Scott Wegner  wrote:
>
> Makes sense to me. We should make it easier to write DoFn's in this pattern 
> that has emerged as common among I/O connectors.
>
> Enabling asynchronous task chaining across a fusion tree is more complicated 
> but not necessary for this scenario.
>
> On Thu, Jan 24, 2019 at 10:13 AM Steve Niemitz  wrote:
>>
>> It's also important to note that in many (most?) IO frameworks (gRPC, 
>> finagle, etc), asynchronous IO is typically completely non-blocking, so 
>> there generally won't be a large number of threads waiting for IO to 
>> complete.  (netty uses a small pool of threads for the Event Loop Group for 
>> example).
>>
>> But in general I agree with Reuven, runners should not count threads in use 
>> in other thread pools for IO for the purpose of autoscaling (or most kinds 
>> of accounting).
>>
>> On Thu, Jan 24, 2019 at 12:54 PM Reuven Lax  wrote:
>>>
>>> As Steve said, the main rationale for this is so that asynchronous IOs (or 
>>> in general, asynchronous remote calls) can be made. To some degree this 
>>> addresses Scott's concern: the asynchronous threads should be, for the most 
>>> part, simply waiting for IOs to complete; the reason to do the waiting 
>>> asynchronously is so that the main threadpool does not become blocked, 
>>> causing the pipeline to become IO bound. A runner like Dataflow should not 
>>> be tracking these threads for the purpose of autoscaling, as adding more 
>>> workers will (usually) not cause these calls to complete any faster.
>>>
>>> Reuven
>>>
>>> On Thu, Jan 24, 2019 at 7:28 AM Steve Niemitz  wrote:
>>>>
>>>> I think I agree with a lot of what you said here, I'm just going to 
>>>> restate my initial use-case to try to make it more clear as well.
>>>>
>>>> From my usage of beam, I feel like the big benefit of async DoFns would be 
>>>> to allow batched IO to be implemented more simply inside a DoFn.  Even in 
>>>> the Beam SDK itself, there are a lot of IOs that batch up IO operations in 
>>>> ProcessElement and wait for them to complete in FinishBundle ([1][2], 
>>>> etc).  From my experience, things like error handling, emitting outputs as 
>>>> the result of an asynchronous operation completing (in the correct window, 
>>>> with the correct timestamp, etc) get pretty tricky, and it would be great 
>>>> for the SDK to provide support natively for it.
>>>>
>>>> It's also probably good to point out that really only DoFns that do IO 
>>>> should be asynchronous, normal CPU bound DoFns have no reason to be 
>>>> asynchronous.
>>>>
>>>> A really good example of this is an IO I had written recently for 
>>>> Bigtable, it takes an input PCollection of ByteStrings representing row 
>>>> keys, and returns a PCollection of the row data from bigtable.  Naively 
>>>> this could be implemented by simply blocking on the Bigtable read inside 
>>>> the ParDo, however this would limit throughput substantially (even 
>>>> assuming an avg read latency is 1ms, thats still only 1000 QPS / instance 
>>>> of the ParDo).  My implementation batches many reads together (as they 
>>>> arrive at the DoFn), executes them once the batch is big enough (or some 
>>>> time passes), and then emits them once the batch read completes.  Emitting 
>>>> them in the correct window and handling errors gets tricky, so this is 
>>>> certainly something I'd love the framework itself to handle.
>>>>
>>>> I also don't see a big benefit of making a DoFn receive a future, if all a 
>>>> user is ever supposed to do is attach a continuation to it, that could 
>>>> just as easily be done by the runner itself, basically just invoking the 
>>>> entire ParDo as a continuation on the future (which then assumes the 
>>>> runner is even representing these tasks as futures internally).
>>>>
>>>> Making the DoFn itself actually return a future could be an option, even 
>>>> if the language itself doesn't support something like `await`, you could 
>>

Re: Cross-language pipelines

2019-01-24 Thread Robert Bradshaw
On Thu, Jan 24, 2019 at 6:43 PM Reuven Lax  wrote:
>
> Keep in mind that these user-supplied lambdas are commonly used in our IOs. 
> One common usage is in Sink IOs, to allow dynamic destinations. e.g. in 
> BigQueryIO.Write, a user-supplied lambda determines what table a record 
> should be written to.

This can probably be pre-computed upstream (as part of the wrapping
composite that does take a language-native lambda) and placed in a
standard format (e.g. a tuple or other schema) to be extracted by the
"core" sink.

> Given that IOs are one of the big selling points of cross-language support, 
> we should think about how we can support this functionality.

Yes. There are user-supplied lambdas that can't be as easily pre- or
post-applied, and though we had some brainstorming sessions (~ a year
ago) we're far from a (good) answer to that.

> On Thu, Jan 24, 2019 at 8:34 AM Robert Bradshaw  wrote:
>>
>> On Thu, Jan 24, 2019 at 5:08 PM Thomas Weise  wrote:
>> >
>> > Exciting to see the cross-language train gathering steam :)
>> >
>> > It may be useful to flesh out the user facing aspects a bit more before 
>> > going too deep on the service / expansion side or maybe that was done 
>> > elsewhere?
>>
>> It's been discussed, but no resolution yet.
>>
>> > A few examples (of varying complexity) of how the shim/proxy transforms 
>> > would look like in the other SDKs. Perhaps Java KafkaIO in Python and Go 
>> > would be a good candidate?
>>
>> The core implementation would, almost by definition, be
>>
>> input.apply(ExternalTransform(URN, payload, service_address).
>>
>> Nicer shims would just be composite transforms that call this, filling
>> in the URNs, payloads, and possibly service details from more
>> user-friendly parameters.
>>
>> > One problem we discovered with custom Flink native transforms for Python 
>> > was handling of lambdas / functions. An example could be a user defined 
>> > watermark timestamp extractor that the user should be able to supply in 
>> > Python and the JVM cannot handle.
>>
>> Yes, this has never been resolved satisfactorily. For now, if UDFs can
>> be reified in terms of a commonly-understood URN + payload, it'll
>> work. A transform could provide a wide range of "useful" URNs for its
>> internal callbacks, more than that would require significant design if
>> it can't be pre- or post-fixed.
>>
>> > On Wed, Jan 23, 2019 at 7:04 PM Chamikara Jayalath  
>> > wrote:
>> >>
>> >>
>> >>
>> >> On Wed, Jan 23, 2019 at 1:03 PM Robert Bradshaw  
>> >> wrote:
>> >>>
>> >>> On Wed, Jan 23, 2019 at 6:38 PM Maximilian Michels  
>> >>> wrote:
>> >>> >
>> >>> > Thank you for starting on the cross-language feature Robert!
>> >>> >
>> >>> > Just to recap: Each SDK runs an ExpansionService which can be 
>> >>> > contacted during
>> >>> > pipeline translation to expand transforms that are unknown to the SDK. 
>> >>> > The
>> >>> > service returns the Proto definitions to the querying process.
>> >>>
>> >>> Yep. Technically it doesn't have to be the SDK, or even if it is there
>> >>> may be a variety of services (e.g. one offering SQL, one offering
>> >>> different IOs).
>> >>>
>> >>> > There will be multiple environments such that during execution 
>> >>> > cross-language
>> >>> > pipelines select the appropriate environment for a transform.
>> >>>
>> >>> Exactly. And fuses only those steps with compatible environments 
>> >>> together.
>> >>>
>> >>> > It's not clear to me, should the expansion happen during pipeline 
>> >>> > construction
>> >>> > or during translation by the Runner?
>> >>>
>> >>> I think it need to happen as part of construction because the set of
>> >>> outputs (and their properties) can be dynamic based on the expansion.
>> >>
>> >>
>> >> Also, without expansion at pipeline construction, we'll have to define 
>> >> all composite cross-language transforms as runner-native transforms which 
>> >> won't be practical ?
>> >>
>> >>>
>> >>>
>> >>> > Thanks,
>> >>&

Re: Cross-language pipelines

2019-01-24 Thread Robert Bradshaw
On Thu, Jan 24, 2019 at 5:08 PM Thomas Weise  wrote:
>
> Exciting to see the cross-language train gathering steam :)
>
> It may be useful to flesh out the user facing aspects a bit more before going 
> too deep on the service / expansion side or maybe that was done elsewhere?

It's been discussed, but no resolution yet.

> A few examples (of varying complexity) of how the shim/proxy transforms would 
> look like in the other SDKs. Perhaps Java KafkaIO in Python and Go would be a 
> good candidate?

The core implementation would, almost by definition, be

input.apply(ExternalTransform(URN, payload, service_address).

Nicer shims would just be composite transforms that call this, filling
in the URNs, payloads, and possibly service details from more
user-friendly parameters.

> One problem we discovered with custom Flink native transforms for Python was 
> handling of lambdas / functions. An example could be a user defined watermark 
> timestamp extractor that the user should be able to supply in Python and the 
> JVM cannot handle.

Yes, this has never been resolved satisfactorily. For now, if UDFs can
be reified in terms of a commonly-understood URN + payload, it'll
work. A transform could provide a wide range of "useful" URNs for its
internal callbacks, more than that would require significant design if
it can't be pre- or post-fixed.

> On Wed, Jan 23, 2019 at 7:04 PM Chamikara Jayalath  
> wrote:
>>
>>
>>
>> On Wed, Jan 23, 2019 at 1:03 PM Robert Bradshaw  wrote:
>>>
>>> On Wed, Jan 23, 2019 at 6:38 PM Maximilian Michels  wrote:
>>> >
>>> > Thank you for starting on the cross-language feature Robert!
>>> >
>>> > Just to recap: Each SDK runs an ExpansionService which can be contacted 
>>> > during
>>> > pipeline translation to expand transforms that are unknown to the SDK. The
>>> > service returns the Proto definitions to the querying process.
>>>
>>> Yep. Technically it doesn't have to be the SDK, or even if it is there
>>> may be a variety of services (e.g. one offering SQL, one offering
>>> different IOs).
>>>
>>> > There will be multiple environments such that during execution 
>>> > cross-language
>>> > pipelines select the appropriate environment for a transform.
>>>
>>> Exactly. And fuses only those steps with compatible environments together.
>>>
>>> > It's not clear to me, should the expansion happen during pipeline 
>>> > construction
>>> > or during translation by the Runner?
>>>
>>> I think it needs to happen as part of construction because the set of
>>> outputs (and their properties) can be dynamic based on the expansion.
>>
>>
>> Also, without expansion at pipeline construction, we'll have to define all 
>> composite cross-language transforms as runner-native transforms which won't 
>> be practical ?
>>
>>>
>>>
>>> > Thanks,
>>> > Max
>>> >
>>> > On 23.01.19 04:12, Robert Bradshaw wrote:
>>> > > No, this PR simply takes an endpoint address as a parameter, expecting
>>> > > it to already be up and available. More convenient APIs, e.g. ones
>>> > > that spin up and endpoint and tear it down, or catalog and locate code
>>> > > and services offering these endpoints, could be provided as wrappers
>>> > > on top of or extensions of this.
>>> > >
>>> > > On Wed, Jan 23, 2019 at 12:19 AM Kenneth Knowles  
>>> > > wrote:
>>> > >>
>>> > >> Nice! If I recall correctly, there was mostly concern about how to 
>>> > >> launch and manage the expansion service (Docker? Vendor-specific? 
>>> > >> Etc). Does this PR take a position on that question?
>>> > >>
>>> > >> Kenn
>>> > >>
>>> > >> On Tue, Jan 22, 2019 at 1:44 PM Chamikara Jayalath 
>>> > >>  wrote:
>>> > >>>
>>> > >>>
>>> > >>>
>>> > >>> On Tue, Jan 22, 2019 at 11:35 AM Udi Meiri  wrote:
>>> > >>>>
>>> > >>>> Also debugability: collecting logs from each of these systems.
>>> > >>>
>>> > >>>
>>> > >>> Agree.
>>> > >>>
>>> > >>>>
>>> > >>>>
>>> > >>>> On Tue, Jan 22, 2019 at 10:53 AM Chamikara Jayalath 
>>>

Re: [DISCUSSION] ParDo Async Java API

2019-01-24 Thread Robert Bradshaw
If I understand correctly, the end goal is to process input elements
of a DoFn asynchronously. Were I to do this naively, I would implement
DoFns that simply take and receive [Serializable?]CompletionStages as
element types, followed by a DoFn that adds a callback to emit on
completion (possibly via a queue to avoid being-on-the-wrong-thread
issues) and whose finalize forces all completions. This would, of
course, interact poorly with processing time tracking, fusion breaks,
watermark tracking, counter attribution, window propagation, etc. so
it is desirable to make it part of the system itself.

Taking an OutputReceiver<CompletionStage<OutputT>> seems like a decent
API. The invoking of the downstream process could be chained onto
this, with all the implicit tracking and tracing set up correctly.
Taking a CompletionStage as input means a DoFn would not have to
create its output CompletionStage ex nihilo and possibly allow for
better chaining (depending on the asynchronous APIs used).

Even better might be to simply let the invocation of all
DoFn.process() methods be asynchronous, but as Java doesn't offer an
await primitive to relinquish control in the middle of a function body
this might be hard.

I think for correctness, completion would have to be forced at the end
of each bundle. If your bundles are large enough, this may not be that
big of a deal. In this case you could also start executing subsequent
bundles while waiting for prior ones to complete.
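
A minimal sketch of that naive, user-space pattern, for illustration only
(lookupAsync below is a stand-in for some non-blocking client call, and this
is not the proposed first-class API):

  import java.util.ArrayList;
  import java.util.List;
  import java.util.concurrent.CompletableFuture;
  import org.apache.beam.sdk.transforms.DoFn;
  import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
  import org.joda.time.Instant;

  // Kicks off asynchronous work per element, remembers enough context to emit
  // correctly, and forces completion at the end of the bundle.
  class NaiveAsyncDoFn extends DoFn<String, String> {
    private transient List<CompletableFuture<String>> pending;
    private transient List<Instant> timestamps;
    private transient List<BoundedWindow> windows;

    @StartBundle
    public void startBundle() {
      pending = new ArrayList<>();
      timestamps = new ArrayList<>();
      windows = new ArrayList<>();
    }

    @ProcessElement
    public void processElement(
        @Element String element, @Timestamp Instant timestamp, BoundedWindow window) {
      // Start the asynchronous call without blocking the processing thread.
      pending.add(lookupAsync(element));
      timestamps.add(timestamp);
      windows.add(window);
    }

    @FinishBundle
    public void finishBundle(FinishBundleContext context) {
      // Correctness requires forcing all completions before the bundle commits.
      for (int i = 0; i < pending.size(); i++) {
        context.output(pending.get(i).join(), timestamps.get(i), windows.get(i));
      }
    }

    private CompletableFuture<String> lookupAsync(String key) {
      // Stand-in for a real non-blocking client (gRPC, async HTTP, etc.).
      return CompletableFuture.supplyAsync(key::toUpperCase);
    }
  }

As noted above, doing this by hand interacts poorly with processing-time
tracking, counter attribution, and so on, which is the argument for making it
part of the system.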




On Wed, Jan 23, 2019 at 11:58 PM Bharath Kumara Subramanian
 wrote:
>>
>> I'd love to see something like this as well.  Also +1 to process(@Element 
>> InputT element, @Output OutputReceiver<CompletionStage<OutputT>>). I don't 
>> know if there's much benefit to passing a future in, since the framework 
>> itself could hook up the process function to complete when the future 
>> completes.
>
>
> One benefit we get by wrapping the input with CompletionStage is to 
> mandate[1] users to chain their processing logic to the input future; 
> thereby, ensuring asynchrony for the most part. However, it is still possible 
> for users to go out of their way and write blocking code.
>
> Although, I am not sure how counter intuitive it is for the runners to wrap 
> the input element into a future before passing it to the user code.
>
> Bharath
>
> [1] CompletionStage interface does not define methods for initially creating, 
> forcibly completing normally or exceptionally, probing completion status or 
> results, or awaiting completion of a stage. Implementations of 
> CompletionStage may provide means of achieving such effects, as appropriate
>
>
> On Wed, Jan 23, 2019 at 11:31 AM Kenneth Knowles  wrote:
>>
>> I think your concerns are valid but i want to clarify about "first class 
>> async APIs". Does "first class" mean that it is a well-encapsulated 
>> abstraction? or does it mean that the user can more or less do whatever they 
>> want? These are opposite but both valid meanings for "first class", to me.
>>
>> I would not want to encourage users to do explicit multi-threaded 
>> programming or control parallelism. Part of the point of Beam is to gain big 
>> data parallelism without explicit multithreading. I see asynchronous 
>> chaining of futures (or their best-approximation in your language of choice) 
>> as a highly disciplined way of doing asynchronous dependency-driven 
>> computation that is nonetheless conceptually, and readably, straight-line 
>> code. Threads are not required nor the only way to execute this code. In 
>> fact you might often want to execute without threading for a reference 
>> implementation to provide canonically correct results. APIs that leak 
>> lower-level details of threads are asking for trouble.
>>
>> One of our other ideas was to provide a dynamic parameter of type 
>> ExecutorService. The SDK harness (pre-portability: the runner) would control 
>> and observe parallelism while the user could simply register tasks. 
>> Providing a future/promise API is even more disciplined.
>>
>> Kenn
>>
>> On Wed, Jan 23, 2019 at 10:35 AM Scott Wegner  wrote:
>>>
>>> A related question is how to make execution observable such that a runner 
>>> can make proper scaling decisions. Runners decide how to schedule bundles 
>>> within and across multiple worker instances, and can use information about 
>>> execution to make dynamic scaling decisions. First-class async APIs seem 
>>> like they would encourage DoFn authors to implement their own 
>>> parallelization, rather than deferring to the runner that should be more 
>>> capable of providing the right level of parallelism.
>>>
>>> In the Dataflow worker harness, we estimate execution time to PTransform 
>>> steps by sampling execution time on the execution thread and attributing it 
>>> to the currently invoked method. This approach is fairly simple and 
>>> possible because we assume that execution happens within the thread 
>>> controlled by the runner. Some DoFn's already implement their own async 
>>> logic and break this assumptio

Re: Cross-language pipelines

2019-01-23 Thread Robert Bradshaw
On Wed, Jan 23, 2019 at 6:38 PM Maximilian Michels  wrote:
>
> Thank you for starting on the cross-language feature Robert!
>
> Just to recap: Each SDK runs an ExpansionService which can be contacted during
> pipeline translation to expand transforms that are unknown to the SDK. The
> service returns the Proto definitions to the querying process.

Yep. Technically it doesn't have to be the SDK, or even if it is there
may be a variety of services (e.g. one offering SQL, one offering
different IOs).

> There will be multiple environments such that during execution cross-language
> pipelines select the appropriate environment for a transform.

Exactly. And fuses only those steps with compatible environments together.

> It's not clear to me, should the expansion happen during pipeline construction
> or during translation by the Runner?

I think it needs to happen as part of construction because the set of
outputs (and their properties) can be dynamic based on the expansion.

> Thanks,
> Max
>
> On 23.01.19 04:12, Robert Bradshaw wrote:
> > No, this PR simply takes an endpoint address as a parameter, expecting
> > it to already be up and available. More convenient APIs, e.g. ones
> > that spin up and endpoint and tear it down, or catalog and locate code
> > and services offering these endpoints, could be provided as wrappers
> > on top of or extensions of this.
> >
> > On Wed, Jan 23, 2019 at 12:19 AM Kenneth Knowles  wrote:
> >>
> >> Nice! If I recall correctly, there was mostly concern about how to launch 
> >> and manage the expansion service (Docker? Vendor-specific? Etc). Does this 
> >> PR take a position on that question?
> >>
> >> Kenn
> >>
> >> On Tue, Jan 22, 2019 at 1:44 PM Chamikara Jayalath  
> >> wrote:
> >>>
> >>>
> >>>
> >>> On Tue, Jan 22, 2019 at 11:35 AM Udi Meiri  wrote:
> >>>>
> >>>> Also debugability: collecting logs from each of these systems.
> >>>
> >>>
> >>> Agree.
> >>>
> >>>>
> >>>>
> >>>> On Tue, Jan 22, 2019 at 10:53 AM Chamikara Jayalath 
> >>>>  wrote:
> >>>>>
> >>>>> Thanks Robert.
> >>>>>
> >>>>> On Tue, Jan 22, 2019 at 4:39 AM Robert Bradshaw  
> >>>>> wrote:
> >>>>>>
> >>>>>> Now that we have the FnAPI, I started playing around with support for
> >>>>>> cross-language pipelines. This will allow things like IOs to be shared
> >>>>>> across all languages, SQL to be invoked from non-Java, TFX tensorflow
> >>>>>> transforms to be invoked from non-Python, etc. and I think is the next
> >>>>>> step in extending (and taking advantage of) the portability layer
> >>>>>> we've developed. These are often composite transforms whose inner
> >>>>>> structure depends in non-trivial ways on their configuration.
> >>>>>
> >>>>>
> >>>>> Some additional benefits of cross-language transforms are given below.
> >>>>>
> >>>>> (1) Current large collection of Java IO connectors will be become 
> >>>>> available to other languages.
> >>>>> (2) Current Java and Python transforms will be available for Go and any 
> >>>>> other future SDKs.
> >>>>> (3) New transform authors will be able to pick their language of choice 
> >>>>> and make their transform available to all Beam SDKs. For example, this 
> >>>>> can be the language the transform author is most familiar with or the 
> >>>>> only language for which a client library is available for connecting to 
> >>>>> an external data store.
> >>>>>
> >>>>>>
> >>>>>> I created a PR [1] that basically follows the "expand via an external
> >>>>>> process" over RPC alternative from the proposals we came up with when
> >>>>>> we were discussing this last time [2]. There are still some unknowns,
> >>>>>> e.g. how to handle artifacts supplied by an alternative SDK (they
> >>>>>> currently must be provided by the environment), but I think this is a
> >>>>>> good incremental step forward that will already be useful in a large
> >>>>>> number of cases. It would be good to validate the general direction
> >>>>>> and I

Re: Dealing with expensive jenkins + Dataflow jobs

2019-01-23 Thread Robert Bradshaw
I like the idea of creating separate project(s) for load tests so as
to not compete with other tests and the standard development cycle.

As for how many workers is too many, I would take the tack "what is
it we're trying to test?" Unless you're stress-testing the shuffle
itself, much of what Beam does is linearly parallelizable with the
number of machines.
large data sets, but not every load test needs this every time. More
interesting could be to try out running at 2x and 4x the data, with 2x
and 4x the machines, and seeing where we fail to be linear.

(As an aside, 4 hours x 10 workers seems like a lot for 23GB of
data...or is it 230GB once you've fanned out?)
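
A back-of-the-envelope way to run that linearity check, as a hedged sketch
(all runtimes below are made-up placeholders, not measurements):

# Compare measured runtimes against the ideal of flat runtime when data and
# workers grow by the same factor.
baseline = {'data_gb': 23, 'workers': 10, 'runtime_min': 240}
runs = [
    {'data_gb': 46, 'workers': 20, 'runtime_min': 255},   # 2x data, 2x workers
    {'data_gb': 92, 'workers': 40, 'runtime_min': 310},   # 4x data, 4x workers
]

for run in runs:
    scale = run['data_gb'] / baseline['data_gb']
    # Efficiency 1.0 means perfectly linear scaling; lower values show where
    # the pipeline (or the shuffle) stops scaling with the workers.
    efficiency = baseline['runtime_min'] / run['runtime_min']
    print('%.0fx data on %d workers: scaling efficiency %.2f'
          % (scale, run['workers'], efficiency))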

On Wed, Jan 23, 2019 at 3:33 PM Ɓukasz Gajowy  wrote:
>
> Hi,
>
> pinging this thread (maybe some folks missed it). What do you think about 
> those concerns/ideas?
>
> Ɓukasz
>
> pon., 14 sty 2019 o 17:11 Ɓukasz Gajowy  napisaƂ(a):
>>
>> Hi all,
>>
>> one problem we need to solve while working with load tests we currently 
>> develop is that we don't really know how many GCP/Jenkins resources we can 
>> occupy. We did some initial testing with 
>> beam_Java_LoadTests_GroupByKey_Dataflow_Small[1] and it seems that for:
>>
>> - 1 000 000 000 (~ 23 GB) synthetic records
>> - 10 fanouts
>> - 10 dataflow workers (--maxNumWorkers)
>>
>> the total job time exceeds 4 hours. That seems like too much for such a small load 
>> test. Additionally, we plan to add much bigger tests for other core 
>> operations too. The proposal [2] describes only a few of them.
>>
>> The questions are:
>> 1. how many workers can we assign to this job without starving the other 
>> jobs? Are 32 workers for a single Dataflow job fine? Would 64 workers for 
>> such a job be fine as well?
>> 2. given that we plan to add more and more load tests soon, 
>> do you think it is a good idea to create a separate GCP project + separate 
>> Jenkins workers for load testing purposes only? This would avoid starvation 
>> of critical tests (post commits, pre-commits, etc). Or maybe there is 
>> another solution that will bring such isolation? Is such isolation needed?
>>
>> Ad 2: Please note that we will also need to host Flink/Spark clusters later 
>> on GKE/Dataproc (not decided yet).
>>
>> [1] 
>> https://builds.apache.org/view/A-D/view/Beam/view/All/job/beam_Java_LoadTests_GroupByKey_Dataflow_Small_PR/
>> [2] https://s.apache.org/load-test-basic-operations
>>
>>
>> Thanks,
>> Ɓukasz
>>


Re: How to use "PortableRunner" in Python SDK?

2019-01-23 Thread Robert Bradshaw
We should probably make the job endpoint mandatory for PortableRunner,
and offer a separate FlinkRunner (and others) that provides a default
endpoint and otherwise delegates everything down.
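
As a hedged illustration of that split, submitting against an explicit job
endpoint might look like the following; the flag names match the current
PortableRunner options but should be treated as assumptions in the context of
this proposal:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# The JobServer (e.g. the Flink one) is assumed to be started separately and
# listening on localhost:8099.
options = PipelineOptions([
    '--runner=PortableRunner',
    '--job_endpoint=localhost:8099',
])

with beam.Pipeline(options=options) as p:
    _ = p | beam.Create([1, 2, 3]) | beam.Map(lambda x: x * x)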

On Thu, Nov 15, 2018 at 12:07 PM Maximilian Michels  wrote:
>
> > 1) The default behavior, where PortableRunner starts a flink server. It is 
> > confusing to new users
> It does that only if no JobServer endpoint is specified. AFAIK there are
> problems with the bootstrapping; it can definitely be improved.
>
> > 2) All the related docs and inline comments.  Similarly, it could be very 
> > confusing connecting PortableRunner to Flink server.
> +1 We definitely need to improve docs and usability.
>
> > 3) [Probably no longer an issue].   I couldn't make the flink server 
> > example work.  And I could not make the example work on Java-ULR either.
> AFAIK the Java ULR hasn't received love for a long time.
>
> -Max
>
> On 14.11.18 20:57, Ruoyun Huang wrote:
> > To answer Maximilian's question.
> >
> > I am using Linux, debian distribution.
> >
> > It probably sounded like too much when I used the word 'planned merge'. What
> > I really meant entails less change than it sounds. More specifically:
> >
> > 1) The default behavior, where PortableRunner starts a flink server.  It
> > is confusing to new users.
> > 2) All the related docs and inline comments.  Similarly, it could be
> > very confusing connecting PortableRunner to Flink server.
> > 3) [Probably no longer an issue].   I couldn't make the flink server
> > example work.  And I could not make the example work on Java-ULR
> > either.  Both will require debugging to resolve.  Thus I figured
> > maybe let us only focus on one single thing: the java-ULR part, without
> > worrying about the Flink server.   Again, looks like this may not be a valid
> > concern, given the flink part is most likely due to my setup.
> >
> >
> > On Wed, Nov 14, 2018 at 3:30 AM Maximilian Michels  wrote:
> >
> > Hi Ruoyun,
> >
> > I just ran the wordcount locally using the instructions on the page.
> > I've tried the local file system and GCS. Both times it ran
> > successfully
> > and produced valid output.
> >
> > I'm assuming there is some problem with your setup. Which platform are
> > you using? I'm on MacOS.
> >
> > Could you expand on the planned merge? From my understanding we will
> > always need PortableRunner in Python to be able to submit against the
> > Beam JobServer.
> >
> > Thanks,
> > Max
> >
> > On 14.11.18 00:39, Ruoyun Huang wrote:
> >  > A quick follow-up on using current PortableRunner.
> >  >
> >  > I followed the exact three steps as Ankur and Maximilian shared in
> >  > https://beam.apache.org/roadmap/portability/#python-on-flink  The
> >  > wordcount example keeps hanging after 10 minutes.  I also tried
> >  > specifying explicit input/output args, either using gcs folder or
> > local
> >  > file system, but none of them works.
> >  >
> >  > Spent some time looking into it but no conclusion yet.  At this point
> >  > though, I guess it does not matter much any more, given we
> > already have
> >  > the plan of merging PortableRunner into using java reference runner
> >  > (i.e. :beam-runners-reference-job-server).
> >  >
> >  > Still appreciated if someone can try out the python-on-flink
> >  > instructions in case it is just due to my local machine setup.  Thanks!
> >  >
> >  >
> >  >
> >  > On Thu, Nov 8, 2018 at 5:04 PM Ruoyun Huang  wrote:
> >  >
> >  > Thanks Maximilian!
> >  >
> >  > I am working on migrating existing PortableRunner to using
> >  > java ULR (Link to Notes).
> >  > If this issue is non-trivial to solve, I would vote for removing
> >  > this default behavior as part of the consolidation.
> >  >
> >  > On Thu, Nov 8, 2018 at 2:58 AM Maximilian Michels  wrote:
> >  >
> >  > In the long run, we should get rid of the Docker-inside-Docker
> >  > approach, which was only intended for testing anyways. It would be
> >  > cleaner to start the SDK harness container alongside the JobServer
> >  > container.
> >  >
> >  > Short term, I think it should be easy to either fix the permissions of
> >  > the mounted "docker" executable or use a Docker image for the JobServer
> >  > which com

Re: Cross-language pipelines

2019-01-23 Thread Robert Bradshaw
No, this PR simply takes an endpoint address as a parameter, expecting
it to already be up and available. More convenient APIs, e.g. ones
that spin up and endpoint and tear it down, or catalog and locate code
and services offering these endpoints, could be provided as wrappers
on top of or extensions of this.
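
A hypothetical sketch of such a wrapper, purely to illustrate the shape of a
"spin up an endpoint and tear it down" helper; the jar path, port handling,
and invocation are assumptions, not an existing Beam API:

import contextlib
import socket
import subprocess
import time

@contextlib.contextmanager
def expansion_service(jar_path, port=8097):
    """Starts an expansion service subprocess and yields its address."""
    proc = subprocess.Popen(['java', '-jar', jar_path, str(port)])
    try:
        # Naive readiness check: wait until the port accepts connections.
        for _ in range(30):
            try:
                socket.create_connection(('localhost', port), timeout=1).close()
                break
            except OSError:
                time.sleep(1)
        yield 'localhost:%d' % port
    finally:
        proc.terminate()
        proc.wait()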

On Wed, Jan 23, 2019 at 12:19 AM Kenneth Knowles  wrote:
>
> Nice! If I recall correctly, there was mostly concern about how to launch and 
> manage the expansion service (Docker? Vendor-specific? Etc). Does this PR take 
> a position on that question?
>
> Kenn
>
> On Tue, Jan 22, 2019 at 1:44 PM Chamikara Jayalath  
> wrote:
>>
>>
>>
>> On Tue, Jan 22, 2019 at 11:35 AM Udi Meiri  wrote:
>>>
>>> Also debuggability: collecting logs from each of these systems.
>>
>>
>> Agree.
>>
>>>
>>>
>>> On Tue, Jan 22, 2019 at 10:53 AM Chamikara Jayalath  
>>> wrote:
>>>>
>>>> Thanks Robert.
>>>>
>>>> On Tue, Jan 22, 2019 at 4:39 AM Robert Bradshaw  
>>>> wrote:
>>>>>
>>>>> Now that we have the FnAPI, I started playing around with support for
>>>>> cross-language pipelines. This will allow things like IOs to be shared
>>>>> across all languages, SQL to be invoked from non-Java, TFX tensorflow
>>>>> transforms to be invoked from non-Python, etc. and I think is the next
>>>>> step in extending (and taking advantage of) the portability layer
>>>>> we've developed. These are often composite transforms whose inner
>>>>> structure depends in non-trivial ways on their configuration.
>>>>
>>>>
>>>> Some additional benefits of cross-language transforms are given below.
>>>>
>>>> (1) The current large collection of Java IO connectors will become 
>>>> available to other languages.
>>>> (2) Current Java and Python transforms will be available for Go and any 
>>>> other future SDKs.
>>>> (3) New transform authors will be able to pick their language of choice 
>>>> and make their transform available to all Beam SDKs. For example, this can 
>>>> be the language the transform author is most familiar with or the only 
>>>> language for which a client library is available for connecting to an 
>>>> external data store.
>>>>
>>>>>
>>>>> I created a PR [1] that basically follows the "expand via an external
>>>>> process" over RPC alternative from the proposals we came up with when
>>>>> we were discussing this last time [2]. There are still some unknowns,
>>>>> e.g. how to handle artifacts supplied by an alternative SDK (they
>>>>> currently must be provided by the environment), but I think this is a
>>>>> good incremental step forward that will already be useful in a large
>>>>> number of cases. It would be good to validate the general direction
>>>>> and I would be interested in any feedback others may have on it.
>>>>
>>>>
>>>> I think there are multiple semi-dependent problems we have to tackle to 
>>>> reach the final goal of supporting fully-fledged cross-language transforms 
>>>> in Beam. I agree with taking an incremental approach here with the overall 
>>>> vision in mind. Some other problems we have to tackle include the following.
>>>>
>>>> * Defining a user API that will allow pipelines defined in a SDK X to use 
>>>> transforms defined in SDK Y.
>>>> * Update various runners to use URN/payload based environment definition 
>>>> [1]
>>>> * Updating various runners to support starting containers for multiple 
>>>> environments/languages for the same pipeline and supporting executing 
>>>> pipeline steps in containers started for multiple environments.
>>
>>
>> I've been working with +Heejong Lee to add some of the missing pieces 
>> mentioned above.
>>
>> We created the following doc that captures some of the ongoing work related to 
>> cross-language transforms and which will hopefully serve as a knowledge base 
>> for anybody who wishes to quickly learn the context related to this.
>> Feel free to refer to this and/or add to this.
>>
>> https://docs.google.com/document/d/1H3yCyVFI9xYs1jsiF1GfrDtARgWGnLDEMwG5aQIx2AU/edit?usp=sharing
>>
>>
>>>>
>>>>
>>>> Thanks,
>>>> Cham
>>>>
>>>> [1] 
>>>> https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L952
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>>
>>>>>
>>>>> - Robert
>>>>>
>>>>> [1] https://github.com/apache/beam/pull/7316
>>>>> [2] https://s.apache.org/beam-mixed-language-pipelines


Cross-language pipelines

2019-01-22 Thread Robert Bradshaw
Now that we have the FnAPI, I started playing around with support for
cross-language pipelines. This will allow things like IOs to be shared
across all languages, SQL to be invoked from non-Java, TFX tensorflow
transforms to be invoked from non-Python, etc. and I think is the next
step in extending (and taking advantage of) the portability layer
we've developed. These are often composite transforms whose inner
structure depends in non-trivial ways on their configuration.

I created a PR [1] that basically follows the "expand via an external
process" over RPC alternative from the proposals we came up with when
we were discussing this last time [2]. There are still some unknowns,
e.g. how to handle artifacts supplied by an alternative SDK (they
currently must be provided by the environment), but I think this is a
good incremental step forward that will already be useful in a large
number of cases. It would be good to validate the general direction
and I would be interested in any feedback others may have on it.

- Robert

[1] https://github.com/apache/beam/pull/7316
[2] https://s.apache.org/beam-mixed-language-pipelines


Re: [DISCUSSION] UTests and embedded backends

2019-01-22 Thread Robert Bradshaw
On Mon, Jan 21, 2019 at 10:42 PM Kenneth Knowles  wrote:
>
> Robert - you meant this as a mostly-automatic thing that we would engineer, 
> yes?

Yes, something like TestPipeline that buffers up the pipelines and
then executes on class teardown (details TBD).
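
A minimal Python-flavored sketch of that buffering idea (Beam's actual
TestPipeline is a JUnit rule on the Java side; this only illustrates the
"collect, then run once at teardown" shape, with details still TBD):

import unittest

class BufferedPipelineTestCase(unittest.TestCase):
    _deferred_pipelines = []

    @classmethod
    def defer_pipeline(cls, pipeline):
        # Tests register their fully constructed pipelines (asserts attached)
        # instead of running them immediately.
        cls._deferred_pipelines.append(pipeline)

    @classmethod
    def tearDownClass(cls):
        # Run everything once at class teardown, amortizing the per-pipeline
        # startup cost against the expensive backend.
        for pipeline in cls._deferred_pipelines:
            pipeline.run().wait_until_finish()
        cls._deferred_pipelines.clear()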

> A lighter-weight fake, like using something in-process sharing a Java 
> interface (versus today a locally running service sharing an RPC interface) 
> is still much better than a mock.

+1

>
> Kenn
>
> On Mon, Jan 21, 2019 at 7:17 AM Jean-Baptiste Onofré  
> wrote:
>>
>> Hi,
>>
>> it makes sense to use embedded backend when:
>>
>> 1. it's possible to easily embed the backend
>> 2. when the backend is "predictable".
>>
>> If it's easy to embed and the backend behavior is predictable, then it
>> makes sense.
>> In other cases, we can fall back to mocks.
>>
>> Regards
>> JB
>>
>> On 21/01/2019 10:07, Etienne Chauchot wrote:
>> > Hi guys,
>> >
>> > Lately I have been fixing various Elasticsearch flakiness issues in the
>> > UTests by: introducing timeouts, countdown latches, force refresh,
>> > embedded cluster size decrease ...
>> >
>> > These flakiness issues are due to the embedded Elasticsearch not coping
>> > well with the jenkins overload. Still, IMHO I believe that having an
>> > embedded backend for UTests is a lot better than mocks. Even if they
>> > are less tolerant to load, I prefer having UTests 100% representative of
>> > a real backend and adding countermeasures to protect against jenkins overload.
>> >
>> > WDYT ?
>> >
>> > Etienne
>> >
>> >
>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com


Re: gradle clean causes long-running python installs

2019-01-21 Thread Robert Bradshaw
Just some background, grpcio-tools is what's used to do the proto
generation. Unfortunately it's expensive to compile and doesn't
provide very many wheels, so we want to install it once, not every
time. (It's also used in more than just tests; one needs it every time
the .proto files change.)

That being said, we could probably do a much cheaper clean.
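
For context, this is roughly the kind of invocation grpcio-tools performs for
the proto generation; the paths here are placeholders, not Beam's actual
layout or build wiring:

from grpc_tools import protoc

protoc.main([
    'protoc',  # argv[0] placeholder expected by protoc.main
    '-I', 'model/pipeline/src/main/proto',
    '--python_out=gen',
    '--grpc_python_out=gen',
    'beam_runner_api.proto',
])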

On Fri, Jan 18, 2019 at 8:56 PM Udi Meiri  wrote:
>
> grpcio-tools could probably be moved under the "test" tag in setup.py. Not 
> sure why it has to be specified in gradle configs.
>
> On Fri, Jan 18, 2019 at 11:43 AM Kenneth Knowles  wrote:
>>
>> Can you `setupVirtualEnv` just enough to run `setup.py clean` without 
>> installing gcpio-tools, etc?
>>
>> Kenn
>>
>> On Fri, Jan 18, 2019 at 11:20 AM Udi Meiri  wrote:
>>>
>>> setup.py has requirements like setuptools, which are installed in the 
>>> virtual environment.
>>> So even running the clean command requires the virtualenv to be set up.
>>>
>>> A possible fix could be to skip :beam-sdks-python:cleanPython if 
>>> setupVirtualenv has not been run. (perhaps by checking for the existence of 
>>> its output directory)
>>>
>>> On Wed, Jan 16, 2019 at 7:03 PM Kenneth Knowles  wrote:

 Filed https://issues.apache.org/jira/browse/BEAM-6459 to record the 
 conclusion. Doesn't require Beam knowledge so I labeled "starter".

 Kenn

 On Wed, Jan 16, 2019 at 12:14 AM Michael Luckey  
 wrote:
>
> This seems to be on purpose [1]
>
> AFAIU setup is done to be able to call into setup.py clean. We probably 
> should work around that.
>
> [1] 
> https://github.com/apache/beam/blob/master/buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy#L1600-L1610
>
> On Wed, Jan 16, 2019 at 7:01 AM Manu Zhang  
> wrote:
>>
>> I have the same question. Sometimes even `./gradlew clean` fails due to 
>> failure of `setupVirtualEnv` tasks.
>>
>> Manu Zhang
>> On Jan 16, 2019, 12:22 PM +0800, Kenneth Knowles , 
>> wrote:
>>
>> A global `./gradlew clean` runs various `setupVirtualEnv` tasks that 
>> invoke things such as `setup.py bdist_wheel for grpcio-tools`. Overall 
>> it took 4 minutes. Is this intended?
>>
>> Kenn


Re: [DISCUSSION] UTests and embedded backends

2019-01-21 Thread Robert Bradshaw
I am of the same opinion, this is the approach we're taking for Flink
as well. Various mitigations (e.g. capping the parallelism at 2 rather
than the default of num cores) have helped.

Several times the idea has been proposed to group unit tests together
for "expensive" backends. E.g. for self-contained tests one can create
a single pipeline that contains all the tests with their asserts, and
then run that once to amortize the overhead (which is quite
significant when you're only manipulating literally bytes of data).
Only on failure would it exercise them individually (either
sequentially, or via a binary search).
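
A minimal sketch of the "one pipeline, many asserts" variant using the Python
SDK's testing utilities; the bisection-on-failure part described above is not
shown:

import apache_beam as beam
from apache_beam.testing.util import assert_that, equal_to

with beam.Pipeline() as p:
    # Each "test case" contributes its own branch plus assertion; the whole
    # bundle runs as a single job against the expensive backend.
    squares = (p
               | 'CreateSquares' >> beam.Create([1, 2, 3])
               | 'Square' >> beam.Map(lambda x: x * x))
    assert_that(squares, equal_to([1, 4, 9]), label='CheckSquares')

    upper = (p
             | 'CreateWords' >> beam.Create(['a', 'b'])
             | 'Upper' >> beam.Map(str.upper))
    assert_that(upper, equal_to(['A', 'B']), label='CheckUpper')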

On Mon, Jan 21, 2019 at 10:07 AM Etienne Chauchot  wrote:
>
> Hi guys,
>
> Lately I have been fixing various Elasticsearch flakiness issues in the 
> UTests by: introducing timeouts, countdown latches, force refresh, embedded 
> cluster size decrease ...
>
> These flakiness issues are due to the embedded Elasticsearch not coping well 
> with the jenkins overload. Still, IMHO I believe that having an embedded backend 
> for UTests is a lot better than mocks. Even if they are less tolerant to 
> load, I prefer having UTests 100% representative of a real backend and adding 
> countermeasures to protect against jenkins overload.
>
> WDYT ?
>
> Etienne
>
>

