Re: Java 11 compatibility question

2019-10-18 Thread Luke Cwik
There are some changes in newer Java versions where the system class loader
is no longer a URLClassLoader [1].
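
For example, this common Java 8 pattern for inspecting the classpath now
fails with a ClassCastException on Java 9+ (an illustrative sketch, not
taken from Beam's code):

import java.net.URL;
import java.net.URLClassLoader;

public class ClassLoaderCheck {
  public static void main(String[] args) {
    // Java 8: the system class loader was a URLClassLoader, so this cast
    // worked. Java 9+: it is an internal AppClassLoader, so this throws
    // java.lang.ClassCastException.
    URLClassLoader loader = (URLClassLoader) ClassLoader.getSystemClassLoader();
    for (URL url : loader.getURLs()) {
      System.out.println(url);
    }
  }
}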

Also, reflection is changing such that non-public fields/methods aren't
accessible, which is something we (or our dependencies) may be relying on.
I'm not sure how our usage of bytecode generation/proxies will need to change.
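
As a rough sketch of the reflection concern (again illustrative, not Beam
code), reflective access into JDK internals that was fine on Java 8 now
emits an "illegal reflective access" warning on Java 9-11 and fails outright
once access is denied (e.g. with --illegal-access=deny):

import java.lang.reflect.Field;

public class ReflectionCheck {
  public static void main(String[] args) throws Exception {
    // Grab a non-public JDK-internal field.
    Field value = String.class.getDeclaredField("value");
    // Java 8: fine. Java 9-11: warns by default, throws
    // InaccessibleObjectException under --illegal-access=deny.
    value.setAccessible(true);
    System.out.println("accessible: " + value.isAccessible());
  }
}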

Finally, for JPMS support to actually be usable, we'll need to update our
deps to be JPMS-compatible as well.

1: https://issues.apache.org/jira/browse/BEAM-5495

On Fri, Oct 18, 2019 at 6:14 AM Łukasz Gajowy  wrote:

> Hi all,
>
> I want to contribute more actively to this and push Beam as close as
> currently possible towards Java 11, both in terms of running and compiling
> the project with it.
>
> I needed a bigger picture so I created a spreadsheet to have a clear
> roadmap for the whole process. It starts with testing existing Java 8
> artifacts (part of this is already done) and continues with providing
> compile support and later JPMS support for the project. I figured that
> before I storm JIRA with some new subtasks of BEAM-2530 it's good to have
> something like this thought through. I hope this is also helpful for others
> if they want to help migrate the project to Java 11. Here's the
> spreadsheet:
>
> https://s.apache.org/java11-support-roadmap
>
> Any comments highly appreciated. :)
>
> FWIW, grpc devs "will be looking into options" for resolving the
> above-mentioned grpc issue "this quarter":
> https://github.com/grpc/grpc-java/issues/3522
>
> Thanks!
> Łukasz
>
> Wed, Aug 21, 2019 at 8:46 PM Kenneth Knowles  wrote:
>
>>
>>
>> On Tue, Aug 20, 2019 at 8:37 AM Elliotte Rusty Harold 
>> wrote:
>>
>>>
>>>
>>> On Tue, Aug 20, 2019 at 7:51 AM Ismaël Mejía  wrote:
>>>
>>>> a per-case approach (the exception could be portable runners not based
>>>> on Java).
>>>>
>>>> Of course, other definitions of being Java 11 compatible are interesting
>>>> but probably not part of our current scope: actions like changing the
>>>> codebase to use Java 11-specific APIs/idioms, publishing Java 11-specific
>>>> artifacts, or using the Java Platform Module System (JPMS). All of these
>>>> may be nice to have but are probably less important for end users, who
>>>> may just want to be able to use Beam in its current form on Java 11 VMs.
>>>>
>>>> What do others think? Is this enough to announce Java 11 compatibility
>>>> and add the documentation to the webpage?

>>>
>>> No, it isn't, I fear. We don't have to use JPMS in Beam, but Beam really
>>> does need to be compatible with JPMS-using apps. The bare minimum here is
>>> avoiding split packages, and that needs to include all transitive
>>> dependencies, not just Beam itself. I don't think we meet that bar now.
>>>
>>
>> We definitely don't meet the basic bar ourselves, unless someone has done
>> a lot of clean up. We've had classes shuffled from jar to jar quite a lot
>> without changing their namespace appropriately. It may be mostly limited to
>> runner-facing pieces, but I expect for a number of runners (notably the
>> Direct Runner) that is enough to bite users.
>>
>> Kenn
>>
>>
>>>
>>> --
>>> Elliotte Rusty Harold
>>> elh...@ibiblio.org
>>>
>>


Re: Python SDK timestamp precision

2019-10-18 Thread Luke Cwik
Robert, it seems like you're for Plan A. Assuming we go forward with
nanoseconds and based upon your analysis in (3), wouldn't that mean we would
have to make a breaking change to the Java SDK to swap to nanosecond
precision?


On Fri, Oct 18, 2019 at 11:35 AM Robert Bradshaw 
wrote:

> TL;DR: We should just settle on nanosecond precision ubiquitously for
> timestamp/windowing in Beam.
>
>
> Re-visiting this discussion in light of cross-language transforms and
> runners, and trying to tighten up testing. I've spent some more time
> thinking about how we could make these operations granularity-agnostic, but
> just can't find a good solution. In particular, the sticking points seem to be:
>
> (1) Windows are half-open intervals, and the timestamp associated with a
> window coming out of a GBK is (by default) as large as possible but must
> live in that window. (Otherwise WindowInto + GBK + WindowInto would have
> the unforunate effect of moving aggregate values into subsequent windows,
> which is clearly not the intent.) In other words, the timestamp of a
> grouped value is basically End(Window) - epsilon. Unless we choose a
> representation able to encode "minus epsilon" we must agree on a
> granularity.
>
> (2) Unless we want to have multiple variants of all our WindowFns (e.g.
> FixedWindowMillis, FixedWindowMicros, FixedWindowNanos) we must agree on a
> granularity with which to parameterize these well-known operations. There
> are cases (e.g. side input window mapping, merging) where these Fns may be
> used downstream in contexts other than where they are applied/defined.
>
> (3) Reification of the timestamp into user-visible data, and the other way
> around, require a choice of precision to expose to the user. This means
> that the timestamp is actual data, and truncating/rounding cannot be done
> implicitly. Also round trip of reification and application of timestamps
> should hopefully be idempotent no matter the SDK.
>
> The closest I've come is possibly parameterizing the timestamp type, where
> encoding, decoding (including pulling the end out of a window?), comparison
> (against each other and a watermark), "minus epsilon", etc could be UDFs.
> Possibly we'd need the full set of arithmetic operations to implement
> FixedWindows on an unknown timestamp type. Reification would simply be
> dis-allowed (or return an opaque rather than SDK-native) type if the SDK
> did not know that window type. The fact that one might need comparison
> between timestamps of different types, or (lossless) coercion from one type
> to another, means that timestamp types need to know about each other, or
> another entity needs to know about the full cross-product, unless there is
> a common base-type (at which point we might as well always choose that).
>
> An intermediate solution is to settle on floating (decimal) point
> representation, plus a "minus-epsilon" bit. It wouldn't quite solve the
> mapping through SDK-native types (which could require rounding or errors or
> a new opaque type, and few date libraries could faithfully expose the minus
> epsilon part). It might also be more expensive (compute and storage), and
> would not allow us to use the protobuf timestamp/duration fields (or any
> standard date/time libraries).
>
> Unless we can come up with a clean solution to the issues above shortly, I
> think we should fix a precision and move forward. If this makes sense to
> everyone, then we can start talking about the specific choice of precision
> and a migration path (possibly only for portability).
>
>
> For reference, the manipulations we do on timestamps are:
>
> WindowInto: Timestamp -> Window
> TimestampCombine: Window, [Timestamp] -> Timestamp
> End(Window)
> Min(Timestamps)
> Max(Timestamps)
> PastEndOfWindow: Watermark, Window -> {True, False}
>
> [SideInput]WindowMappingFn: Window -> Window
> WindowInto(End(Window))
>
> GetTimestamp: Timestamp -> SDK Native Object
> EmitAtTimestamp: SDK Native Object -> Timestamp
>
>
>
>
>
>
> On Fri, May 10, 2019 at 1:33 PM Robert Bradshaw 
> wrote:
>
>> On Thu, May 9, 2019 at 9:32 AM Kenneth Knowles 
>> wrote:
>>
>> > From: Robert Bradshaw 
>> > Date: Wed, May 8, 2019 at 3:00 PM
>> > To: dev
>> >
>> >> From: Kenneth Knowles 
>> >> Date: Wed, May 8, 2019 at 6:50 PM
>> >> To: dev
>> >>
>> >> >> The end-of-window, for firing, can be approximate, but it seems it
>> >> >> should be exact for timestamp assignment of the result (and
>> similarly
>> >> >> with the other timestamp combiners).
>> >> >
>> >> > I was thinking that the window itself should be stored as exact
>> data, while just the firing itself is approximated, since it already is,
>> because of watermarks and timers.
>> >>
>> >> I think this works where we can compare encoded windows, but some
>> >> portable interpretation of windows is required for runner-side
>> >> implementation of merging windows (for example).
>> >
>> > But in this case, you've recognized the URN of the WindowFn anyhow, so
>> you 

Re: Python SDK timestamp precision

2019-10-18 Thread Robert Bradshaw
TL;DR: We should just settle on nanosecond precision ubiquitously for
timestamp/windowing in Beam.


Re-visiting this discussion in light of cross-language transforms and
runners, and trying to tighten up testing. I've spent some more time
thinking about how we could make these operations granularity-agnostic, but
just can't find a good solution. In particular, the sticking points seem to be:

(1) Windows are half-open intervals, and the timestamp associated with a
window coming out of a GBK is (by default) as large as possible but must
live in that window. (Otherwise WindowInto + GBK + WindowInto would have
the unfortunate effect of moving aggregate values into subsequent windows,
which is clearly not the intent.) In other words, the timestamp of a
grouped value is basically End(Window) - epsilon. Unless we choose a
representation able to encode "minus epsilon" we must agree on a
granularity.
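
A tiny worked example of why "minus epsilon" forces a granularity choice
(the numbers are arbitrary):

public class MinusEpsilon {
  public static void main(String[] args) {
    long endMillis = 1_571_400_000_000L;  // half-open window end (exclusive)
    long maxTsMillis = endMillis - 1;     // epsilon = 1 millisecond
    long endNanos = endMillis * 1_000_000L;
    long maxTsNanos = endNanos - 1;       // epsilon = 1 nanosecond
    // The "largest timestamp in the window" differs depending on the
    // granularity we agreed on: prints 999999 (nanoseconds).
    System.out.println(maxTsNanos - maxTsMillis * 1_000_000L);
  }
}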

(2) Unless we want to have multiple variants of all our WindowFns (e.g.
FixedWindowMillis, FixedWindowMicros, FixedWindowNanos) we must agree on a
granularity with which to parameterize these well-known operations. There
are cases (e.g. side input window mapping, merging) where these Fns may be
used downstream in contexts other than where they are applied/defined.
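
For instance, fixed-window assignment is just integer arithmetic in some
unit, so the unit has to travel with the WindowFn. A sketch (simplified, not
Beam's actual FixedWindows):

// ts, size, and offset must all be expressed in the same unit (millis vs.
// nanos); that unit is exactly the granularity in question here.
static long fixedWindowStart(long ts, long size, long offset) {
  return ts - Math.floorMod(ts - offset, size);
}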

(3) Reification of the timestamp into user-visible data, and the other way
around, require a choice of precision to expose to the user. This means
that the timestamp is actual data, and truncating/rounding cannot be done
implicitly. Also round trip of reification and application of timestamps
should hopefully be idempotent no matter the SDK.

The closest I've come is possibly parameterizing the timestamp type, where
encoding, decoding (including pulling the end out of a window?), comparison
(against each other and a watermark), "minus epsilon", etc could be UDFs.
Possibly we'd need the full set of arithmetic operations to implement
FixedWindows on an unknown timestamp type. Reification would simply be
dis-allowed (or return an opaque rather than SDK-native) type if the SDK
did not know that window type. The fact that one might need comparison
between timestamps of different types, or (lossless) coercion from one type
to another, means that timestamp types need to know about each other, or
another entity needs to know about the full cross-product, unless there is
a common base-type (at which point we might as well always choose that).

An intermediate solution is to settle on floating (decimal) point
representation, plus a "minus-epsilon" bit. It wouldn't quite solve the
mapping through SDK-native types (which could require rounding or errors or
a new opaque type, and few date libraries could faithfully expose the minus
epsilon part). It might also be more expensive (compute and storage), and
would not allow us to use the protobuf timestamp/duration fields (or any
standard date/time libraries).

Unless we can come up with a clean solution to the issues above shortly, I
think we should fix a precision and move forward. If this makes sense to
everyone, then we can start talking about the specific choice of precision
and a migration path (possibly only for portability).


For reference, the manipulations we do on timestamps are:

WindowInto: Timestamp -> Window
TimestampCombine: Window, [Timestamp] -> Timestamp
End(Window)
Min(Timestamps)
Max(Timestamps)
PastEndOfWindow: Watermark, Window -> {True, False}

[SideInput]WindowMappingFn: Window -> Window
WindowInto(End(Window))

GetTimestamp: Timestamp -> SDK Native Object
EmitAtTimestamp: SDK Native Object -> Timestamp
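
Rendered as a rough interface over an opaque timestamp type T (a sketch of
the parameterization idea above; the names are illustrative, not Beam APIs):

interface Window {}  // stand-in for a window type

interface TimestampDomain<T> {
  byte[] encode(T timestamp);
  T decode(byte[] bytes);
  T endOfWindow(Window window);              // End(Window), "minus epsilon"
  int compare(T a, T b);                     // for Min/Max and watermarks
  boolean pastEndOfWindow(T watermark, Window window);
  Window mapSideInputWindow(Window window);  // WindowInto(End(Window))
  Object reify(T timestamp);                 // GetTimestamp -> SDK-native
  T fromNative(Object sdkNativeObject);      // EmitAtTimestamp
}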






On Fri, May 10, 2019 at 1:33 PM Robert Bradshaw  wrote:

> On Thu, May 9, 2019 at 9:32 AM Kenneth Knowles  wrote:
>
> > From: Robert Bradshaw 
> > Date: Wed, May 8, 2019 at 3:00 PM
> > To: dev
> >
> >> From: Kenneth Knowles 
> >> Date: Wed, May 8, 2019 at 6:50 PM
> >> To: dev
> >>
> >> >> The end-of-window, for firing, can be approximate, but it seems it
> >> >> should be exact for timestamp assignment of the result (and similarly
> >> >> with the other timestamp combiners).
> >> >
> >> > I was thinking that the window itself should be stored as exact data,
> while just the firing itself is approximated, since it already is, because
> of watermarks and timers.
> >>
> >> I think this works where we can compare encoded windows, but some
> >> portable interpretation of windows is required for runner-side
> >> implementation of merging windows (for example).
> >
> > But in this case, you've recognized the URN of the WindowFn anyhow, so
> you understand its windows. Remembering that IntervalWindow is just one
> choice, and that windows themselves are totally user-defined and that
> merging logic is completely arbitrary per WindowFn (we probably should have
> some restrictions, but see https://issues.apache.org/jira/browse/BEAM-654).
> So I file this use case in the "runner knows everything about the WindowFn
> and Window type and window encoding anyhow".
>
> Being able to merge 

Re: contributor permission for Beam Jira tickets

2019-10-18 Thread Kenneth Knowles
Welcome & thanks!

On Fri, Oct 18, 2019 at 5:55 AM Ismaël Mejía  wrote:

> Hello, you were added as a contributor. Please create and assign the
> ticket. Welcome!
>
> On Fri, Oct 18, 2019 at 7:05 AM Noah Goodrich  wrote:
> >
> > Hi,
> >
> > My name is Noah Goodrich (username ngoodrich). I've recently started
> using the Beam Python SDK. I've found a first issue with the
> BigQueryFileLoads Transform and the data type of the schema parameter (see
> https://the-asf.slack.com/archives/CBDNLQZM1/p1570810453025500). I would
> like to log a bug ticket and then submit a PR with a proposed fix. Can
> someone add me as a contributor for Beam's Jira issue tracker? I would like
> to create/assign tickets for my work.
>


Re: Timeouting jobs do not notify builds@

2019-10-18 Thread Pablo Estrada
Cool. Thanks Lukasz!
-P.

On Fri, Oct 18, 2019 at 3:00 AM Łukasz Gajowy 
wrote:

>
> I agree that jobs that time out should be reported to builds@. Is there
>> any other source of data (e.g. dashboards) that gets tagged with that info?
>>
>
> AFAIK no - builds@ is the only place that gets notified about failures
> (and soon about timeouts/aborts once the PR gets merged).
>
>
>> I'm just thinking that we probably want to be able to separate failed
>> jobs from jobs that time out.
>>
>
> We'll be able to distinguish that by the email title/content (failed vs
> aborted state in the email subject). We can also customize subject and
> content for aborted job emails if needed - we can specify this using
> the jobDSL extendedEmail configuration.
>
>
>
>


Re: beam.io.BigQuerySource does not accept value providers

2019-10-18 Thread Pablo Estrada
Hi Theodore!
Display data is what's throwing the error, but even setting that issue
aside, BigQuerySource does not support value providers because it's a
Dataflow-native source. Unfortunately, this is not currently possible.
For now, you could do this by executing a BQ export job (using a DoFn) and
using fileio to consume the newly exported files. We may prioritize
building a source for that, but it is not there ATM.
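
A minimal sketch of that workaround (the table, bucket, and pattern are
made-up placeholders; error handling omitted):

import apache_beam as beam
from apache_beam.io import fileio

class ExportTable(beam.DoFn):
    """Runs a BigQuery extract job and emits the resulting GCS pattern."""
    def process(self, unused_element, gcs_pattern):
        from google.cloud import bigquery
        client = bigquery.Client()
        # Hypothetical table; the export format defaults to CSV.
        job = client.extract_table('my-project.my_dataset.my_table',
                                   gcs_pattern)
        job.result()  # block until the export finishes
        yield gcs_pattern

with beam.Pipeline() as p:
    lines = (
        p
        | beam.Create([None])
        | beam.ParDo(ExportTable(),
                     gcs_pattern='gs://my-bucket/export/part-*')
        | fileio.MatchAll()
        | fileio.ReadMatches()
        | beam.FlatMap(lambda f: f.read_utf8().splitlines()))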
Best
-P.

On Fri, Oct 18, 2019 at 6:42 AM Theodore Siu  wrote:

> Additionally, for reference
> https://stackoverflow.com/questions/46595149/dynamic-bigquery-query-in-dataflow-template
>
> On Fri, Oct 18, 2019 at 9:34 AM Theodore Siu  wrote:
>
>> Hi,
>>
>> We are attempting to build a Dataflow template in Beam Python and are
>> running into issues with using a value provider, specifically
>> with beam.io.BigQuerySource, which throws the following error:
>> ValueError: Invalid DisplayDataItem. Value RuntimeValueProvider(option:
>> input, type: str, default_value: 'test') is of an unsupported type.
>>
>> Tracing the code on GitHub, it seems like the main culprits are the
>> following:
>>
>>
>> https://github.com/apache/beam/blob/d9add564c1c34065829f712074bdd3856b2b0982/sdks/python/apache_beam/io/gcp/bigquery.py#L470
>>
>>
>> https://github.com/apache/beam/blob/d9add564c1c34065829f712074bdd3856b2b0982/sdks/python/apache_beam/transforms/display.py#L244
>>
>>
>> Do we have any idea when a fix can be made?
>>
>> -Theo
>>
>>
>>
>>


Flink fails validatesRunner on master

2019-10-18 Thread Jan Lukavský

Hi,

I'm experiencing errors in the Flink (1.8) validatesRunner suite on current
master. The failure is:


java.lang.AssertionError: processing bundle should have been called before 
finish bundle
Expected: is <true>
     but: was <false>
at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
at 
org.apache.beam.sdk.transforms.ParDoLifecycleTest$ExceptionThrowingFn.postBundle(ParDoLifecycleTest.java:379)


Does anyone else see this? Seems to affect PRs as well.

Jan



Re: beam.io.BigQuerySource does not accept value providers

2019-10-18 Thread Theodore Siu
Additionally, for reference
https://stackoverflow.com/questions/46595149/dynamic-bigquery-query-in-dataflow-template

On Fri, Oct 18, 2019 at 9:34 AM Theodore Siu  wrote:

> Hi,
>
> We are attempting to build a Dataflow template in Beam Python and are
> running into issues with using a value provider, specifically
> with beam.io.BigQuerySource, which throws the following error:
> ValueError: Invalid DisplayDataItem. Value RuntimeValueProvider(option:
> input, type: str, default_value: 'test') is of an unsupported type.
>
> Tracing the code on GitHub, it seems like the main culprits are the
> following:
>
>
> https://github.com/apache/beam/blob/d9add564c1c34065829f712074bdd3856b2b0982/sdks/python/apache_beam/io/gcp/bigquery.py#L470
>
>
> https://github.com/apache/beam/blob/d9add564c1c34065829f712074bdd3856b2b0982/sdks/python/apache_beam/transforms/display.py#L244
>
>
> Do we have any idea when a fix can be made?
>
> -Theo
>
>
>
>


beam.io.BigQuerySource does not accept value providers

2019-10-18 Thread Theodore Siu
Hi,

We are attempting to build a Dataflow template in Beam Python and are
running into issues with using a value provider, specifically
with beam.io.BigQuerySource, which throws the following error:
ValueError: Invalid DisplayDataItem. Value RuntimeValueProvider(option:
input, type: str, default_value: 'test') is of an unsupported type.
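
A minimal repro sketch (the option name matches the error above; the rest
is assumed):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class TemplateOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # Becomes a RuntimeValueProvider when building a Dataflow template.
        parser.add_value_provider_argument('--input', type=str,
                                           default='test')

options = TemplateOptions()
with beam.Pipeline(options=options) as p:
    rows = p | beam.io.Read(
        beam.io.BigQuerySource(query=options.input))  # raises the ValueError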

Tracing the code on GitHub, it seems like the main culprits are the
following:

https://github.com/apache/beam/blob/d9add564c1c34065829f712074bdd3856b2b0982/sdks/python/apache_beam/io/gcp/bigquery.py#L470

https://github.com/apache/beam/blob/d9add564c1c34065829f712074bdd3856b2b0982/sdks/python/apache_beam/transforms/display.py#L244


Do we have any idea when a fix can be made?

-Theo


Re: contributor permission for Beam Jira tickets

2019-10-18 Thread Ismaël Mejía
Hello, you were added as a contributor. Please create and assign the
ticket. Welcome!

On Fri, Oct 18, 2019 at 7:05 AM Noah Goodrich  wrote:
>
> Hi,
>
> My name is Noah Goodrich (username ngoodrich). I've recently started using 
> the Beam Python SDK. I've found a first issue with the BigQueryFileLoads 
> Transform and the data type of the schema parameter (see 
> https://the-asf.slack.com/archives/CBDNLQZM1/p1570810453025500). I would like 
> to log a bug ticket and then submit a PR with a proposed fix. Can someone add 
> me as a contributor for Beam's Jira issue tracker? I would like to 
> create/assign tickets for my work.


Re: Timeouting jobs do not notify builds@

2019-10-18 Thread Łukasz Gajowy
> I agree that jobs that time out should be reported to builds@. Is there
> any other source of data (e.g. dashboards) that gets tagged with that info?
>

AFAIK no - builds@ is the only place that gets notified about failures (and
soon about timeouts/aborts once the PR gets merged).


> I'm just thinking that we probably want to be able to separate failed jobs
> from jobs that time out.
>

We'll be able to distinguish that by the email title/content (failed vs
aborted state in the email subject). We can also customize subject and
content for aborted job emails if needed - we can specify this using the
jobDSL extendedEmail configuration.
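
Something along these lines (a rough Job DSL sketch from memory of the
email-ext plugin's trigger names - treat it as an assumption, not Beam's
actual seed job configuration):

publishers {
  extendedEmail {
    recipientList('builds@beam.apache.org')
    triggers {
      aborted {
        // Separate subject/content for timed-out (aborted) builds.
        subject('$PROJECT_NAME - Build #$BUILD_NUMBER aborted (timeout?)')
        content('Check the console log: $BUILD_URL')
        sendTo {
          recipientList()
        }
      }
    }
  }
}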