Graal instead of docker?

2018-05-03 Thread Romain Manni-Bucau
Hi guys

For some time there have been efforts to support language portability in
Beam, but I can't really find a case where the docker-based approach "works"
except on some vendor-specific infrastructure.

Current solution:

1. Is runner intrusive (which is bad for Beam and prevents adoption by big
data vendors)
2. Is based on docker (which assumes a runtime environment, is very
ops/infra intrusive, and is often too costly ($$) for what it brings)

Has anyone had a look at Graal, which seems like a way to make this feature
doable in a lighter and more optimized manner than the default JSR-223
implementations?
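
For illustration, here is a rough sketch (not Beam code, names purely
illustrative) of calling a guest-language function in-process through the
GraalVM polyglot API, which is the kind of lighter-weight boundary I have in
mind compared to shipping the function to a docker container:

  import org.graalvm.polyglot.Context;
  import org.graalvm.polyglot.Value;

  public class PolyglotSketch {
    public static void main(String[] args) {
      try (Context context = Context.create("js")) {
        // Define a tiny JavaScript "UDF" and invoke it from Java, in-process,
        // with no container or extra runtime environment involved.
        Value toUpper = context.eval("js", "(s) => s.toUpperCase()");
        System.out.println(toUpper.execute("hello beam").asString()); // HELLO BEAM
      }
    }
  }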


Re: How to create a runtime ValueProvider

2018-05-03 Thread Frank Yellin
Unfortunately, I can't use a side input, because I need to use the result
as an input to a DatastoreV1.Read object.  This is a PTransform rather than a
DoFn, so there is no place for a side input.

But you did give me the hint I needed to get something that is close
enough.  It's a bit dirty, but it's what I'll have to use until something
better comes along.

  public interface MyOptions extends PipelineOptions, DataflowPipelineOptions {
    @Description("ignore")
    @Default.Integer(0)
    ValueProvider<Integer> getIgnored();
    void setIgnored(ValueProvider<Integer> value);
  }

  ValueProvider<DateTime> now =
      NestedValueProvider.of(options.getIgnored(), ignored -> DateTime.now(DateTimeZone.UTC));

The fact that options.getIgnored() is not static means that the value of
now can't be either.  Each time I call now.get(), the value will be
re-evaluated.  This isn't quite what I wanted (I was hoping for a single
evaluation), but it will work well enough for my situation.
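
If a single evaluation mattered more, one possible (untested) refinement is to
memoize inside the translator function (SerializableFunction here is Beam's
org.apache.beam.sdk.transforms.SerializableFunction). Note this would cache per
deserialized function instance, i.e. per worker at best, not globally:

  ValueProvider<DateTime> nowOnce =
      NestedValueProvider.of(
          options.getIgnored(),
          new SerializableFunction<Integer, DateTime>() {
            private transient DateTime cached;

            @Override
            public DateTime apply(Integer ignored) {
              if (cached == null) {
                cached = DateTime.now(DateTimeZone.UTC); // first call on this instance
              }
              return cached;
            }
          });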



On Thu, May 3, 2018 at 6:01 PM, Eugene Kirpichov 
wrote:

> There is no way to achieve this using ValueProvider. Its value is either
> fixed at construction time (StaticValueProvider), or completely dynamic
> (evaluated every time you call .get()).
> You'll need to implement this using a side input. E.g. take a look at
> implementation of BigQueryIO, how it generates unique job id tokens -
> https://github.com/apache/beam/blob/bf94e36f67a8bc5d24c795e40697ad2504c8594c/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L756
>
> On Thu, May 3, 2018 at 5:42 PM Frank Yellin  wrote:
>
>> [Sorry, I accidentally hit send before I had finished typing . .]
>>
>> Is there any way to achieve what I'm looking for?  Or is this just beyond
>> the scope of ValueProvider and templates?
>>
>>
>>
>> On Thu, May 3, 2018 at 5:36 PM, Frank Yellin  wrote:
>>
>>> I'm attempting to create a dataflow template, and within the template
>>> have a variable
>>> ValueProvider<DateTime> now
>>> such that now is the time the dataflow is started, not the time that
>>> the template was created.
>>>
>>> My first attempt was
>>> ValueProvider<DateTime> now =
>>> StaticValueProvider.of(DateTime.now(DateTimeZone.UTC));
>>>
>>> My second attempt was
>>>
>>>   public interface MyOptions extends PipelineOptions, DataflowPipelineOptions {
>>>     @Description("Now")
>>>     @Default.InstanceFactory(GetNow.class)
>>>     ValueProvider<DateTime> getNow();
>>>     void setNow(ValueProvider<DateTime> value);
>>>   }
>>>
>>>   static class GetNow implements DefaultValueFactory<DateTime> {
>>>     @Override
>>>     public DateTime create(PipelineOptions options) {
>>>       return DateTime.now(DateTimeZone.UTC);
>>>     }
>>>   }
>>>
>>>   ValueProvider<DateTime> now = options.getNow();
>>>
>>> My final attempt was:
>>>
>>>   ValueProvider<SerializableFunction<Void, DateTime>> nowFn =
>>>       StaticValueProvider.of(x -> DateTime.now(DateTimeZone.UTC));
>>>
>>>   ValueProvider<DateTime> now =
>>>       NestedValueProvider.of(nowFn, x -> x.apply(null));
>>>
>>>
>>>
>>> In every case, it was clear that "now" was being set to
>>> template-creation time rather than actual runtime.
>>>
>>> I note that the documentation talks about a RuntimeValueProvider, but
>>> there is no user-visible constructor for this.
>>>
>>>
>>>
>>>
>>


Re: How to create a runtime ValueProvider

2018-05-03 Thread Eugene Kirpichov
There is no way to achieve this using ValueProvider. Its value is either
fixed at construction time (StaticValueProvider), or completely dynamic
(evaluated every time you call .get()).
You'll need to implement this using a side input. E.g. take a look at
implementation of BigQueryIO, how it generates unique job id tokens -
https://github.com/apache/beam/blob/bf94e36f67a8bc5d24c795e40697ad2504c8594c/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L756
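
For completeness, a minimal sketch (illustrative names only; "pipeline" is
assumed to be the Pipeline under construction) of the side input approach:
compute the value in a ParDo when the job actually runs and expose it as a
singleton view. Instant is used here instead of DateTime simply because Beam
registers a coder for it by default.

  import org.apache.beam.sdk.transforms.Create;
  import org.apache.beam.sdk.transforms.DoFn;
  import org.apache.beam.sdk.transforms.ParDo;
  import org.apache.beam.sdk.transforms.View;
  import org.apache.beam.sdk.values.PCollectionView;
  import org.joda.time.Instant;

  PCollectionView<Instant> startTimeView =
      pipeline
          .apply("StartSignal", Create.of("start"))
          .apply("ComputeNow", ParDo.of(new DoFn<String, Instant>() {
            @ProcessElement
            public void process(ProcessContext c) {
              c.output(Instant.now()); // evaluated at run time, not at template creation
            }
          }))
          .apply(View.asSingleton());

  // Downstream DoFns can then read it via c.sideInput(startTimeView).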


On Thu, May 3, 2018 at 5:42 PM Frank Yellin  wrote:

> [Sorry, I accidentally hit send before I had finished typing . .]
>
> Is there any way to achieve what I'm looking for?  Or is this just beyond
> the scope of ValueProvider and templates?
>
>
>
> On Thu, May 3, 2018 at 5:36 PM, Frank Yellin  wrote:
>
>> I'm attempting to create a dataflow template, and within the template
>> have a variable
>> ValueProvider<DateTime> now
>> such that now is the time the dataflow is started, not the time that the
>> template was created.
>>
>> My first attempt was
>> ValueProvider<DateTime> now =
>> StaticValueProvider.of(DateTime.now(DateTimeZone.UTC));
>>
>> My second attempt was
>>
>>   public interface MyOptions extends PipelineOptions, DataflowPipelineOptions {
>>     @Description("Now")
>>     @Default.InstanceFactory(GetNow.class)
>>     ValueProvider<DateTime> getNow();
>>     void setNow(ValueProvider<DateTime> value);
>>   }
>>
>>   static class GetNow implements DefaultValueFactory<DateTime> {
>>     @Override
>>     public DateTime create(PipelineOptions options) {
>>       return DateTime.now(DateTimeZone.UTC);
>>     }
>>   }
>>
>>   ValueProvider<DateTime> now = options.getNow();
>>
>> My final attempt was:
>>
>>   ValueProvider<SerializableFunction<Void, DateTime>> nowFn =
>>       StaticValueProvider.of(x -> DateTime.now(DateTimeZone.UTC));
>>
>>   ValueProvider<DateTime> now =
>>       NestedValueProvider.of(nowFn, x -> x.apply(null));
>>
>>
>>
>> In every case, it was clear that "now" was being set to template-creation
>> time rather than actual runtime.
>>
>> I note that the documentation talks about a RuntimeValueProvider, but
>> there is no user-visible constructor for this.
>>
>>
>>
>>
>


Re: Pubsub to Beam SQL

2018-05-03 Thread Ankur Goenka
I like the idea of exposing source timestamp in TBLPROPERTIES which is
closely tied to source (KafkaIO, KinesisIO, MqttIO, AmqpIO, unbounded
FileIO, PubSubIO).
Exposing timestamp as a top level keyword will break the symmetry between
streaming and batch pipelines.
TBLPROPERTIES gives us the flexibility to define the timestamp in a
source-specific way if needed.


On Thu, May 3, 2018 at 4:08 PM Kenneth Knowles  wrote:

> It is an interesting question for Beam DDL - since timestamps are
> fundamental to Beam's data model, should we have a DDL extension that makes
> it very explicit? Seems nice, but perhaps TBLPROPERTIES is a way to stage
> the work, getting the functionality in place first and the parsing second.
>
> What would the TIMESTAMP (let's maybe choose a term that isn't already
> reserved) metadata thing look like for e.g. KafkaIO, KinesisIO, MqttIO,
> AmqpIO, unbounded FileIO? I think a lot of these don't actually have any
> configurability so maybe it is moot. Does Calcite already have an opinion
> about timestamps on rows?
>
> Kenn
>
> On Thu, May 3, 2018 at 1:02 PM Andrew Pilloud  wrote:
>
>> I like to avoid magic too. I might not have been entirely clear in what I
>> was asking. Here is an example of what I had in mind, replacing the 
>> TBLPROPERTIES
>> with a more generic TIMESTAMP option:
>>
>> CREATE TABLE  table_name (
>>   publishTimestamp TIMESTAMP,
>>   attributes MAP(VARCHAR, VARCHAR),
>>   payload ROW (
>>name VARCHAR,
>>age INTEGER,
>>isSWE BOOLEAN,
>>tags ARRAY(VARCHAR)))
>> TIMESTAMP attributes["createTime"];
>>
>> Andrew
>>
>> On Thu, May 3, 2018 at 12:47 PM Anton Kedin  wrote:
>>
>>> I think it makes sense for the case when timestamp is provided in the
>>> payload (including pubsub message attributes).  We can mark the field as an
>>> event timestamp. But if the timestamp is internally defined by the source
>>> (pubsub message publish time) and not exposed in the event body, then we
>>> need a source-specific mechanism to extract and map the event timestamp to
>>> the schema. This is, of course, if we don't automatically add a magic
>>> timestamp field which Beam SQL can populate behind the scenes and add to
>>> the schema. I want to avoid this magic path for now.
>>>
>>> On Thu, May 3, 2018 at 11:10 AM Andrew Pilloud 
>>> wrote:
>>>
 This sounds awesome!

 Is event timestamp something that we need to specify for every source?
 If so, I would suggest we add this as a first class option on CREATE TABLE
 rather than something hidden in TBLPROPERTIES.

 Andrew

 On Wed, May 2, 2018 at 10:30 AM Anton Kedin  wrote:

> Hi
>
> I am working on adding functionality to support querying Pubsub
> messages directly from Beam SQL.
>
> *Goal*
>   Provide Beam users a pure  SQL solution to create the pipelines with
> Pubsub as a data source, without the need to set up the pipelines in
> Java before applying the query.
>
> *High level approach*
>
>-
>- Build on top of PubsubIO;
>- Pubsub source will be declared using CREATE TABLE DDL statement:
>   - Beam SQL already supports declaring sources like Kafka and
>   Text using CREATE TABLE DDL;
>   - it supports additional configuration using TBLPROPERTIES
>   clause. Currently it takes a text blob, where we can put a JSON
>   configuration;
>   - wrapping PubsubIO into a similar source looks feasible;
>- The plan is to initially support messages only with JSON payload:
>-
>   - more payload formats can be added later;
>- Messages will be fully described in the CREATE TABLE statements:
>   - event timestamps. Source of the timestamp is configurable. It
>   is required by Beam SQL to have an explicit timestamp column for 
> windowing
>   support;
>   - messages attributes map;
>   - JSON payload schema;
>- Event timestamps will be taken either from publish time or
>user-specified message attribute (configurable);
>
> Thoughts, ideas, comments?
>
> More details are in the doc here:
> https://docs.google.com/document/d/1wIXTxh-nQ3u694XbF0iEZX_7-b3yi4ad0ML2pcAxYfE
>
>
> Thank you,
> Anton
>



Re: How to create a runtime ValueProvider

2018-05-03 Thread Frank Yellin
[Sorry, I accidentally hit send before I had finished typing . .]

Is there any way to achieve what I'm looking for?  Or is this just beyond
the scope of ValueProvider and templates?



On Thu, May 3, 2018 at 5:36 PM, Frank Yellin  wrote:

> I'm attempting to create a dataflow template, and within the template have
> a variable
> ValueProvider<DateTime> now
> such that now is the time the dataflow is started, not the time that the
> template was created.
>
> My first attempt was
> ValueProvider<DateTime> now =
> StaticValueProvider.of(DateTime.now(DateTimeZone.UTC));
>
> My second attempt was
>
>   public interface MyOptions extends PipelineOptions, DataflowPipelineOptions {
>     @Description("Now")
>     @Default.InstanceFactory(GetNow.class)
>     ValueProvider<DateTime> getNow();
>     void setNow(ValueProvider<DateTime> value);
>   }
>
>   static class GetNow implements DefaultValueFactory<DateTime> {
>     @Override
>     public DateTime create(PipelineOptions options) {
>       return DateTime.now(DateTimeZone.UTC);
>     }
>   }
>
>   ValueProvider<DateTime> now = options.getNow();
>
> My final attempt was:
>
>   ValueProvider<SerializableFunction<Void, DateTime>> nowFn =
>       StaticValueProvider.of(x -> DateTime.now(DateTimeZone.UTC));
>
>   ValueProvider<DateTime> now =
>       NestedValueProvider.of(nowFn, x -> x.apply(null));
>
>
>
> In every case, it was clear that "now" was being set to template-creation
> time rather than actual runtime.
>
> I note that the documentation talks about a RuntimeValueProvider, but
> there is no user-visible constructor for this.
>
>
>
>


How to create a runtime ValueProvider

2018-05-03 Thread Frank Yellin
I'm attempting to create a dataflow template, and within the template have
a variable
ValueProvider<DateTime> now
such that now is the time the dataflow is started, not the time that the
template was created.

My first attempt was
ValueProvider<DateTime> now =
    StaticValueProvider.of(DateTime.now(DateTimeZone.UTC));

My second attempt was

  public interface MyOptions extends PipelineOptions, DataflowPipelineOptions {
    @Description("Now")
    @Default.InstanceFactory(GetNow.class)
    ValueProvider<DateTime> getNow();
    void setNow(ValueProvider<DateTime> value);
  }

  static class GetNow implements DefaultValueFactory<DateTime> {
    @Override
    public DateTime create(PipelineOptions options) {
      return DateTime.now(DateTimeZone.UTC);
    }
  }

  ValueProvider<DateTime> now = options.getNow();

My final attempt was:

  ValueProvider<SerializableFunction<Void, DateTime>> nowFn =
      StaticValueProvider.of(x -> DateTime.now(DateTimeZone.UTC));

  ValueProvider<DateTime> now =
      NestedValueProvider.of(nowFn, x -> x.apply(null));



In every case, it was clear that "now" was being set to template-creation
time rather than actual runtime.

I note that the documentation talks about a RuntimeValueProvider, but there
is no user-visible constructor for this.


Re: Pubsub to Beam SQL

2018-05-03 Thread Kenneth Knowles
It is an interesting question for Beam DDL - since timestamps are
fundamental to Beam's data model, should we have a DDL extension that makes
it very explicit? Seems nice, but perhaps TBLPROPERTIES is a way to stage
the work, getting the functionality in place first and the parsing second.

What would the TIMESTAMP (let's maybe choose a term that isn't already
reserved) metadata thing look like for e.g. KafkaIO, KinesisIO, MqttIO,
AmqpIO, unbounded FileIO? I think a lot of these don't actually have any
configurability so maybe it is moot. Does Calcite already have an opinion
about timestamps on rows?

Kenn

On Thu, May 3, 2018 at 1:02 PM Andrew Pilloud  wrote:

> I like to avoid magic too. I might not have been entirely clear in what I
> was asking. Here is an example of what I had in mind, replacing the 
> TBLPROPERTIES
> with a more generic TIMESTAMP option:
>
> CREATE TABLE  table_name (
>   publishTimestamp TIMESTAMP,
>   attributes MAP(VARCHAR, VARCHAR),
>   payload ROW (
>name VARCHAR,
>age INTEGER,
>isSWE BOOLEAN,
>tags ARRAY(VARCHAR)))
> TIMESTAMP attributes["createTime"];
>
> Andrew
>
> On Thu, May 3, 2018 at 12:47 PM Anton Kedin  wrote:
>
>> I think it makes sense for the case when timestamp is provided in the
>> payload (including pubsub message attributes).  We can mark the field as an
>> event timestamp. But if the timestamp is internally defined by the source
>> (pubsub message publish time) and not exposed in the event body, then we
>> need a source-specific mechanism to extract and map the event timestamp to
>> the schema. This is, of course, if we don't automatically add a magic
>> timestamp field which Beam SQL can populate behind the scenes and add to
>> the schema. I want to avoid this magic path for now.
>>
>> On Thu, May 3, 2018 at 11:10 AM Andrew Pilloud 
>> wrote:
>>
>>> This sounds awesome!
>>>
>>> Is event timestamp something that we need to specify for every source?
>>> If so, I would suggest we add this as a first class option on CREATE TABLE
>>> rather than something hidden in TBLPROPERTIES.
>>>
>>> Andrew
>>>
>>> On Wed, May 2, 2018 at 10:30 AM Anton Kedin  wrote:
>>>
 Hi

 I am working on adding functionality to support querying Pubsub
 messages directly from Beam SQL.

 *Goal*
   Provide Beam users a pure  SQL solution to create the pipelines with
 Pubsub as a data source, without the need to set up the pipelines in
 Java before applying the query.

 *High level approach*

-
- Build on top of PubsubIO;
- Pubsub source will be declared using CREATE TABLE DDL statement:
   - Beam SQL already supports declaring sources like Kafka and
   Text using CREATE TABLE DDL;
   - it supports additional configuration using TBLPROPERTIES
   clause. Currently it takes a text blob, where we can put a JSON
   configuration;
   - wrapping PubsubIO into a similar source looks feasible;
- The plan is to initially support messages only with JSON payload:
-
   - more payload formats can be added later;
- Messages will be fully described in the CREATE TABLE statements:
   - event timestamps. Source of the timestamp is configurable. It
   is required by Beam SQL to have an explicit timestamp column for 
 windowing
   support;
   - messages attributes map;
   - JSON payload schema;
- Event timestamps will be taken either from publish time or
user-specified message attribute (configurable);

 Thoughts, ideas, comments?

 More details are in the doc here:
 https://docs.google.com/document/d/1wIXTxh-nQ3u694XbF0iEZX_7-b3yi4ad0ML2pcAxYfE


 Thank you,
 Anton

>>>


Re: ValidatesRunner test cleanup

2018-05-03 Thread Scott Wegner
Thanks for the feedback. For methodology, I crudely went through existing
tests and looked at whether they exercise runner behavior or not. When I
wasn't sure, I opted to leave them as-is. And then I leaned on Kenn's
expertise to help categorize further :)

For the current state: here's a run of the Flink ValidatesRunner tests with
my change [1] and without [2]. We reduced the number of tests from 581 to
267 (54% reduction), and runtime from 17m to 6m22s (63% reduction). Note
that Flink runs the tests for Batch and Streaming variants. Dataflow will
have similar delta percentage-wise, but I don't have numbers yet as the
tests are currently not stable.

[1]
https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink_Gradle/314/
[2]
https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink_Gradle/315/
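
As a reference point, here is a minimal sketch of how the two annotations are
used (the test class and transform are made up for illustration):

  import org.apache.beam.sdk.testing.NeedsRunner;
  import org.apache.beam.sdk.testing.PAssert;
  import org.apache.beam.sdk.testing.TestPipeline;
  import org.apache.beam.sdk.transforms.Create;
  import org.apache.beam.sdk.values.PCollection;
  import org.junit.Rule;
  import org.junit.Test;
  import org.junit.experimental.categories.Category;

  public class ExampleTransformTest {
    @Rule public final transient TestPipeline p = TestPipeline.create();

    // This exercises only the transform, not runner semantics, so @NeedsRunner
    // is enough and it runs on a single runner (currently the DirectRunner).
    // A test of a primitive such as GroupByKey would be annotated with
    // @Category(ValidatesRunner.class) instead and run against every runner.
    @Test
    @Category(NeedsRunner.class)
    public void testCreateProducesExpectedElements() {
      PCollection<String> output = p.apply(Create.of("a", "b"));
      PAssert.that(output).containsInAnyOrder("a", "b");
      p.run().waitUntilFinish();
    }
  }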

On Thu, May 3, 2018 at 2:43 PM Kenneth Knowles  wrote:

> I think actually that the runner should have such an IT, not the core SDK.
>
> On Thu, May 3, 2018 at 11:20 AM Eugene Kirpichov 
> wrote:
>
>> Thanks Kenn! Note though that we should have VR tests for transforms that
>> have a runner specific override, such as TextIO.write() and Create that you
>> mentioned.
>>
>> Agreed that it'd be good to have a more clear packaging separation
>> between the two.
>>
>> On Thu, May 3, 2018, 10:35 AM Kenneth Knowles  wrote:
>>
>>> Since I went over the PR and dropped a lot of random opinions about what
>>> should be VR versus NR, I'll answer too:
>>>
>>> VR - all primitives: ParDo, GroupByKey, Flatten.pCollections
>>> (Flatten.iterables is an unrelated composite), Metrics
>>> VR - critical special composites: Combine
>>> VR - test infrastructure that ensures tests aren't vacuous: PAssert
>>> NR - everything else in the core SDK that needs a runner but is really
>>> only testing the transform, not the runner, notably Create, TextIO,
>>> extended bits of Combine
>>> (nothing) - everything in modules that depend on the core SDK can use
>>> TestPipeline without an annotation; personally I think NR makes sense to
>>> annotate these, but it has no effect
>>>
>>> And it is a good time to mention that it might be very cool for someone
>>> to take on the task of conceiving of a more independent runner validation
>>> suite. This framework is clever, but a bit deceptive as runner tests look
>>> like unit tests of the primitives.
>>>
>>> Kenn
>>>
>>> On Thu, May 3, 2018 at 9:24 AM Eugene Kirpichov 
>>> wrote:
>>>
 Thanks Scott, this is awesome!
 However, we should be careful when choosing what should be
 ValidatesRunner and what should be NeedsRunner.
 Could you briefly describe how you made the call and roughly what are
 the statistics before/after your PR (number of tests in both categories)?

 On Thu, May 3, 2018 at 9:18 AM Jean-Baptiste Onofré 
 wrote:

> Thanks for the update Scott. That's really a great job.
>
> I will ping you on slack about some points as I'm preparing the build
> for the release (and I have some issues ).
>
> Thanks again
> Regards
> JB
> On 3 May 2018, at 17:54, Scott Wegner  wrote:
>>
>> Note: if you don't care about Java runner tests, you can stop reading
>> now.
>>
>> tl;dr: I've made a pass over all @ValidatesRunner tests in pr/5218
>> [1] and converted many to @NeedsRunner in order to reduce post-commit
>> runtime.
>>
>> This is work that was long overdue and finally got my attention due
>> to the Gradle migration. As context, @ValidatesRunner [2] tests 
>> construct a
>> TestPipeline and exercise runner behavior through SDK constructs. The 
>> tests
>> are written runner-agnostic so that they can be run on and validate all
>> supported runners.
>>
>> The framework for these tests is great and writing them is
>> super-easy. But as a result, we have way too many of them-- over 250. 
>> These
>> tests run against all runners, and even when parallelized we see Dataflow
>> post-commit times exceeding 3-5 hours [3].
>>
>> When reading through these tests, we found many of them don't
>> actually exercise runner-specific behavior, and were simply using the
>> TestPipeline framework to validate SDK components. This is a valid 
>> pattern,
>> but tests should be annotated with @NeedsRunner instead. With this
>> annotation, the tests will run on only a single runner, currently
>> DirectRunner.
>>
>> So, PR/5218 looks at all existing @ValidatesRunner tests and
>> conservatively converts tests which don't need to validate all runners 
>> into
>> @NeedsRunner. I've also sharded out some very large test classes into
>> scenario-based sub-classes. This is because Gradle parallelizes tests at
>> the class-level, and we found a couple very large test classes 

Re: ValidatesRunner test cleanup

2018-05-03 Thread Kenneth Knowles
I think actually that the runner should have such an IT, not the core SDK.

On Thu, May 3, 2018 at 11:20 AM Eugene Kirpichov 
wrote:

> Thanks Kenn! Note though that we should have VR tests for transforms that
> have a runner specific override, such as TextIO.write() and Create that you
> mentioned.
>
> Agreed that it'd be good to have a more clear packaging separation between
> the two.
>
> On Thu, May 3, 2018, 10:35 AM Kenneth Knowles  wrote:
>
>> Since I went over the PR and dropped a lot of random opinions about what
>> should be VR versus NR, I'll answer too:
>>
>> VR - all primitives: ParDo, GroupByKey, Flatten.pCollections
>> (Flatten.iterables is an unrelated composite), Metrics
>> VR - critical special composites: Combine
>> VR - test infrastructure that ensures tests aren't vacuous: PAssert
>> NR - everything else in the core SDK that needs a runner but is really
>> only testing the transform, not the runner, notably Create, TextIO,
>> extended bits of Combine
>> (nothing) - everything in modules that depend on the core SDK can use
>> TestPipeline without an annotation; personally I think NR makes sense to
>> annotate these, but it has no effect
>>
>> And it is a good time to mention that it might be very cool for someone
>> to take on the task of conceiving of a more independent runner validation
>> suite. This framework is clever, but a bit deceptive as runner tests look
>> like unit tests of the primitives.
>>
>> Kenn
>>
>> On Thu, May 3, 2018 at 9:24 AM Eugene Kirpichov 
>> wrote:
>>
>>> Thanks Scott, this is awesome!
>>> However, we should be careful when choosing what should be
>>> ValidatesRunner and what should be NeedsRunner.
>>> Could you briefly describe how you made the call and roughly what are
>>> the statistics before/after your PR (number of tests in both categories)?
>>>
>>> On Thu, May 3, 2018 at 9:18 AM Jean-Baptiste Onofré 
>>> wrote:
>>>
 Thanks for the update Scott. That's really a great job.

 I will ping you on slack about some points as I'm preparing the build
 for the release (and I have some issues ).

 Thanks again
 Regards
 JB
 On 3 May 2018, at 17:54, Scott Wegner  wrote:
>
> Note: if you don't care about Java runner tests, you can stop reading
> now.
>
> tl;dr: I've made a pass over all @ValidatesRunner tests in pr/5218 [1]
> and converted many to @NeedsRunner in order to reduce post-commit runtime.
>
> This is work that was long overdue and finally got my attention due to
> the Gradle migration. As context, @ValidatesRunner [2] tests construct a
> TestPipeline and exercise runner behavior through SDK constructs. The 
> tests
> are written runner-agnostic so that they can be run on and validate all
> supported runners.
>
> The framework for these tests is great and writing them is super-easy.
> But as a result, we have way too many of them-- over 250. These tests run
> against all runners, and even when parallelized we see Dataflow 
> post-commit
> times exceeding 3-5 hours [3].
>
> When reading through these tests, we found many of them don't actually
> exercise runner-specific behavior, and were simply using the TestPipeline
> framework to validate SDK components. This is a valid pattern, but tests
> should be annotated with @NeedsRunner instead. With this annotation, the
> tests will run on only a single runner, currently DirectRunner.
>
> So, PR/5218 looks at all existing @ValidatesRunner tests and
> conservatively converts tests which don't need to validate all runners 
> into
> @NeedsRunner. I've also sharded out some very large test classes into
> scenario-based sub-classes. This is because Gradle parallelizes tests at
> the class-level, and we found a couple very large test classes (ParDoTest)
> became stragglers for the entire execution. Hopefully Gradle will soon
> implement dynamic splitting :)
>
> So, the action I'd like to request from others:
> 1) If you are an author of @ValidatesRunner tests, feel free to look
> over the PR and let me know if I missed anything. Kenn Knowles is also
> helping out here.
> 2) If you find yourself writing new @ValidatesRunner tests, please
> consider whether your test is validating runner-provided behavior. If not,
> use @NeedsRunner instead.
>
>
> [1] https://github.com/apache/beam/pull/5218
> [2]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/testing/ValidatesRunner.java
>
> [3]
> https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle/buildTimeTrend
>
>



Re: I want to allow a user-specified QuerySplitter for DatastoreIO

2018-05-03 Thread Lukasz Cwik
I also like the idea of doing the splitting when the pipeline is running
and not during pipeline construction. This works a lot better with things
like templates.

Do you know what Maven package contains com.google.rpc classes and what is
the transitive dependency tree of the package?

If those dependencies are already exposed (or not complex) then adding
com.google.rpc to the API surface whitelist will be a non-issue.
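
For reference, a rough sketch of the construction-time splitting approach from
option (1) in Cham's reply quoted below; splitQuery is a hypothetical
user-provided helper, and baseQuery, numSplits, projectId and pipeline are
assumed to already exist:

  List<Query> subQueries = splitQuery(baseQuery, numSplits); // hypothetical splitter
  PCollectionList<Entity> shards = PCollectionList.empty(pipeline);
  for (int i = 0; i < subQueries.size(); i++) {
    shards = shards.and(
        pipeline.apply("ReadShard" + i,
            DatastoreIO.v1().read()
                .withProjectId(projectId)
                .withQuery(subQueries.get(i))));
  }
  // Flatten the per-shard reads back into a single PCollection.
  PCollection<Entity> entities = shards.apply(Flatten.pCollections());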

On Thu, May 3, 2018 at 8:28 AM Frank Yellin  wrote:

> I actually tried (1), and ran precisely into the size limit that you
> mentioned.  Because of the size of the database, I needed to split it into
> a few hundred shards, and that was more than the request limit.
>
> I was also considering a slightly different alternative to (2), such as
> adding setQueries(), or setSplitterPTransform().  The semantics would be
> identical to that of your ReadAll, but I'd be able to reuse more of the
> code that is there.  This gave me interesting results, but it wasn't as
> powerful as what I needed.  See (2) below.
>
> The two specific use cases that were motivating me were that I needed to
> write code that could
>   (1) delete a property from all Entitys whose creationTime is between
> one month and two months ago.
>   (2) delete all Entitys whose creationTime is more than two years ago.
> I think these are common-enough operations.  For a very large database, it
> would be nice to be able to read only the small piece of it that is needed
> for your operation.
>
> The first is easy to handle.  I know the start and end of creationTime,
> and I can shard it myself.  The second requires me to consult the datastore
> to find out what the smallest creationTime is in the datastore, and then
> use it as an (advisory, not hard) lower limit; the query splitter should
> work well whether the oldest records were four years old or barely more
> than two years old.   For this to be possible, I need access to the
> Datastore object, and this Datastore object needs to be passed as some sort
> of user callback.  The QuerySplitter hook already existed and seemed to fit
> my needs perfectly.
>
> Is there a better alternative that still gives me access to the Datastore?
>
>
>
>
>
>
>
> On Thu, May 3, 2018 at 2:52 AM, Chamikara Jayalath 
> wrote:
>
>> Thanks. IMHO it might be better to perform this splitting as a part of
>> your pipeline instead of making source splitting customizable. The reason
>> is, it's easy for users to shoot themselves on the foot if we allow
>> specifying a custom splitter. A bug in a custom QuerySplitter can result in
>> a hard to catch data loss or data duplication bug. So I'd rather not make
>> it a part of the user API.
>>
>> I can think of two ways for performing this splitting as a part of your
>> pipeline.
>> (1) Split the query during job construction and create a source per
>> query. This can be followed by a Flatten transform that creates a single
>> PCollection. (One caveat is, you might run into the 10MB request size limit if
>> you create too many splits here. So try reducing the number of splits if
>> you run into this.)
>> (2) Add a ReadAll transform to DatastoreIO. This will allow you to
>> precede the step that performs reading by a ParDo step that splits your
>> query and create a PCollection of queries. You should not run into size
>> limits here since splitting happens in the data plane.
>>
>> Thanks,
>> Cham
>>
>> On Wed, May 2, 2018 at 12:50 PM Frank Yellin  wrote:
>>
>>> TLDR:
>>> Is it okay for me to expose Datastore in apache beam's DatastoreIO, and
>>> thus indirectly expose com.google.rpc.Code?
>>> Is there a better solution?
>>>
>>>
>>> As I explain in Beam 4186
>>> , I would like to be
>>> able to extend DatastoreV1.Read to have a
>>>withQuerySplitter(QuerySplitter querySplitter)
>>> method, which would use an alternative query splitter.   The standard
>>> one shards by key and is very limited.
>>>
>>> I have already written such a query splitter.  In fact, the query
>>> splitter I've written goes further than specified in the beam, and reads
>>> the minimum or maximum value of the field from the datastore if no minimum
>>> or maximum is specified in the query, and uses that value for the
>>> sharding.   I can write:
>>>SELECT * FROM ledger where type = 'purchase'
>>> and then ask it to shard on the eventTime, and it will shard nicely!  I
>>> am working with the Datastore folks to separately add my new query splitter
>>> as an option in DatastoreHelper.
>>>
>>>
>>> I have already written the code to add withQuerySplitter.
>>>
>>>https://github.com/apache/beam/pull/5246
>>>
>>> However the problem is that I am increasing the "surface API" of
>>> Dataflow.
>>>QuerySplitter exposes Datastore, which exposes DatastoreException,
>>> which exposes com.google.rpc.Code,
>>> and com.google.rpc.Code is not (yet) part of the API surface.
>>>
>>> 

Re: Pubsub to Beam SQL

2018-05-03 Thread Andrew Pilloud
I like to avoid magic too. I might not have been entirely clear in what I
was asking. Here is an example of what I had in mind, replacing the
TBLPROPERTIES
with a more generic TIMESTAMP option:

CREATE TABLE  table_name (
  publishTimestamp TIMESTAMP,
  attributes MAP(VARCHAR, VARCHAR),
  payload ROW (
   name VARCHAR,
   age INTEGER,
   isSWE BOOLEAN,
   tags ARRAY(VARCHAR)))
TIMESTAMP attributes["createTime"];

Andrew

On Thu, May 3, 2018 at 12:47 PM Anton Kedin  wrote:

> I think it makes sense for the case when timestamp is provided in the
> payload (including pubsub message attributes).  We can mark the field as an
> event timestamp. But if the timestamp is internally defined by the source
> (pubsub message publish time) and not exposed in the event body, then we
> need a source-specific mechanism to extract and map the event timestamp to
> the schema. This is, of course, if we don't automatically add a magic
> timestamp field which Beam SQL can populate behind the scenes and add to
> the schema. I want to avoid this magic path for now.
>
> On Thu, May 3, 2018 at 11:10 AM Andrew Pilloud 
> wrote:
>
>> This sounds awesome!
>>
>> Is event timestamp something that we need to specify for every source? If
>> so, I would suggest we add this as a first class option on CREATE TABLE
>> rather than something hidden in TBLPROPERTIES.
>>
>> Andrew
>>
>> On Wed, May 2, 2018 at 10:30 AM Anton Kedin  wrote:
>>
>>> Hi
>>>
>>> I am working on adding functionality to support querying Pubsub messages
>>> directly from Beam SQL.
>>>
>>> *Goal*
>>>   Provide Beam users a pure  SQL solution to create the pipelines with
>>> Pubsub as a data source, without the need to set up the pipelines in
>>> Java before applying the query.
>>>
>>> *High level approach*
>>>
>>>-
>>>- Build on top of PubsubIO;
>>>- Pubsub source will be declared using CREATE TABLE DDL statement:
>>>   - Beam SQL already supports declaring sources like Kafka and Text
>>>   using CREATE TABLE DDL;
>>>   - it supports additional configuration using TBLPROPERTIES
>>>   clause. Currently it takes a text blob, where we can put a JSON
>>>   configuration;
>>>   - wrapping PubsubIO into a similar source looks feasible;
>>>- The plan is to initially support messages only with JSON payload:
>>>-
>>>   - more payload formats can be added later;
>>>- Messages will be fully described in the CREATE TABLE statements:
>>>   - event timestamps. Source of the timestamp is configurable. It
>>>   is required by Beam SQL to have an explicit timestamp column for 
>>> windowing
>>>   support;
>>>   - messages attributes map;
>>>   - JSON payload schema;
>>>- Event timestamps will be taken either from publish time or
>>>user-specified message attribute (configurable);
>>>
>>> Thoughts, ideas, comments?
>>>
>>> More details are in the doc here:
>>> https://docs.google.com/document/d/1wIXTxh-nQ3u694XbF0iEZX_7-b3yi4ad0ML2pcAxYfE
>>>
>>>
>>> Thank you,
>>> Anton
>>>
>>


Re: Pubsub to Beam SQL

2018-05-03 Thread Reuven Lax
I believe PubSubIO already exposes the publish timestamp if no timestamp
attribute is set.
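
For context, a minimal sketch of how a wrapper could pick the event-time source
through PubsubIO's existing knobs (the topic and attribute name are
illustrative only):

  // If the table config names a timestamp attribute, pass it through;
  // otherwise PubsubIO uses the publish time as the element timestamp.
  PCollection<PubsubMessage> messages =
      pipeline.apply(
          PubsubIO.readMessagesWithAttributes()
              .fromTopic("projects/my-project/topics/events")
              .withTimestampAttribute("createTime")); // omit to use publish time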

On Thu, May 3, 2018 at 12:52 PM Anton Kedin  wrote:

> A SQL-specific wrapper+custom transforms for PubsubIO should suffice. We
> will probably need a way to expose a message publish timestamp if we
> want to use it as an event timestamp, but that will be consumed by the same
> wrapper/transform without adding anything schema or SQL-specific to
> PubsubIO itself.
>
> On Thu, May 3, 2018 at 11:44 AM Reuven Lax  wrote:
>
>> Are you planning on integrating this directly into PubSubIO, or add a
>> follow-on transform?
>>
>> On Wed, May 2, 2018 at 10:30 AM Anton Kedin  wrote:
>>
>>> Hi
>>>
>>> I am working on adding functionality to support querying Pubsub messages
>>> directly from Beam SQL.
>>>
>>> *Goal*
>>>   Provide Beam users a pure  SQL solution to create the pipelines with
>>> Pubsub as a data source, without the need to set up the pipelines in
>>> Java before applying the query.
>>>
>>> *High level approach*
>>>
>>>-
>>>- Build on top of PubsubIO;
>>>- Pubsub source will be declared using CREATE TABLE DDL statement:
>>>   - Beam SQL already supports declaring sources like Kafka and Text
>>>   using CREATE TABLE DDL;
>>>   - it supports additional configuration using TBLPROPERTIES
>>>   clause. Currently it takes a text blob, where we can put a JSON
>>>   configuration;
>>>   - wrapping PubsubIO into a similar source looks feasible;
>>>- The plan is to initially support messages only with JSON payload:
>>>-
>>>   - more payload formats can be added later;
>>>- Messages will be fully described in the CREATE TABLE statements:
>>>   - event timestamps. Source of the timestamp is configurable. It
>>>   is required by Beam SQL to have an explicit timestamp column for 
>>> windowing
>>>   support;
>>>   - messages attributes map;
>>>   - JSON payload schema;
>>>- Event timestamps will be taken either from publish time or
>>>user-specified message attribute (configurable);
>>>
>>> Thoughts, ideas, comments?
>>>
>>> More details are in the doc here:
>>> https://docs.google.com/document/d/1wIXTxh-nQ3u694XbF0iEZX_7-b3yi4ad0ML2pcAxYfE
>>>
>>>
>>> Thank you,
>>> Anton
>>>
>>


Re: Pubsub to Beam SQL

2018-05-03 Thread Anton Kedin
A SQL-specific wrapper+custom transforms for PubsubIO should suffice. We
will probably need a way to expose a message publish timestamp if we
want to use it as an event timestamp, but that will be consumed by the same
wrapper/transform without adding anything schema or SQL-specific to
PubsubIO itself.

On Thu, May 3, 2018 at 11:44 AM Reuven Lax  wrote:

> Are you planning on integrating this directly into PubSubIO, or add a
> follow-on transform?
>
> On Wed, May 2, 2018 at 10:30 AM Anton Kedin  wrote:
>
>> Hi
>>
>> I am working on adding functionality to support querying Pubsub messages
>> directly from Beam SQL.
>>
>> *Goal*
>>   Provide Beam users a pure  SQL solution to create the pipelines with
>> Pubsub as a data source, without the need to set up the pipelines in
>> Java before applying the query.
>>
>> *High level approach*
>>
>>-
>>- Build on top of PubsubIO;
>>- Pubsub source will be declared using CREATE TABLE DDL statement:
>>   - Beam SQL already supports declaring sources like Kafka and Text
>>   using CREATE TABLE DDL;
>>   - it supports additional configuration using TBLPROPERTIES clause.
>>   Currently it takes a text blob, where we can put a JSON configuration;
>>   - wrapping PubsubIO into a similar source looks feasible;
>>- The plan is to initially support messages only with JSON payload:
>>-
>>   - more payload formats can be added later;
>>- Messages will be fully described in the CREATE TABLE statements:
>>   - event timestamps. Source of the timestamp is configurable. It is
>>   required by Beam SQL to have an explicit timestamp column for windowing
>>   support;
>>   - messages attributes map;
>>   - JSON payload schema;
>>- Event timestamps will be taken either from publish time or
>>user-specified message attribute (configurable);
>>
>> Thoughts, ideas, comments?
>>
>> More details are in the doc here:
>> https://docs.google.com/document/d/1wIXTxh-nQ3u694XbF0iEZX_7-b3yi4ad0ML2pcAxYfE
>>
>>
>> Thank you,
>> Anton
>>
>


Re: Pubsub to Beam SQL

2018-05-03 Thread Anton Kedin
I think it makes sense for the case when timestamp is provided in the
payload (including pubsub message attributes).  We can mark the field as an
event timestamp. But if the timestamp is internally defined by the source
(pubsub message publish time) and not exposed in the event body, then we
need a source-specific mechanism to extract and map the event timestamp to
the schema. This is, of course, if we don't automatically add a magic
timestamp field which Beam SQL can populate behind the scenes and add to
the schema. I want to avoid this magic path for now.

On Thu, May 3, 2018 at 11:10 AM Andrew Pilloud  wrote:

> This sounds awesome!
>
> Is event timestamp something that we need to specify for every source? If
> so, I would suggest we add this as a first class option on CREATE TABLE
> rather than something hidden in TBLPROPERTIES.
>
> Andrew
>
> On Wed, May 2, 2018 at 10:30 AM Anton Kedin  wrote:
>
>> Hi
>>
>> I am working on adding functionality to support querying Pubsub messages
>> directly from Beam SQL.
>>
>> *Goal*
>>   Provide Beam users a pure  SQL solution to create the pipelines with
>> Pubsub as a data source, without the need to set up the pipelines in
>> Java before applying the query.
>>
>> *High level approach*
>>
>>-
>>- Build on top of PubsubIO;
>>- Pubsub source will be declared using CREATE TABLE DDL statement:
>>   - Beam SQL already supports declaring sources like Kafka and Text
>>   using CREATE TABLE DDL;
>>   - it supports additional configuration using TBLPROPERTIES clause.
>>   Currently it takes a text blob, where we can put a JSON configuration;
>>   - wrapping PubsubIO into a similar source looks feasible;
>>- The plan is to initially support messages only with JSON payload:
>>-
>>   - more payload formats can be added later;
>>- Messages will be fully described in the CREATE TABLE statements:
>>   - event timestamps. Source of the timestamp is configurable. It is
>>   required by Beam SQL to have an explicit timestamp column for windowing
>>   support;
>>   - messages attributes map;
>>   - JSON payload schema;
>>- Event timestamps will be taken either from publish time or
>>user-specified message attribute (configurable);
>>
>> Thoughts, ideas, comments?
>>
>> More details are in the doc here:
>> https://docs.google.com/document/d/1wIXTxh-nQ3u694XbF0iEZX_7-b3yi4ad0ML2pcAxYfE
>>
>>
>> Thank you,
>> Anton
>>
>


Re: Pubsub to Beam SQL

2018-05-03 Thread Reuven Lax
Are you planning on integrating this directly into PubSubIO, or add a
follow-on transform?

On Wed, May 2, 2018 at 10:30 AM Anton Kedin  wrote:

> Hi
>
> I am working on adding functionality to support querying Pubsub messages
> directly from Beam SQL.
>
> *Goal*
>   Provide Beam users a pure  SQL solution to create the pipelines with
> Pubsub as a data source, without the need to set up the pipelines in Java
> before applying the query.
>
> *High level approach*
>
>-
>- Build on top of PubsubIO;
>- Pubsub source will be declared using CREATE TABLE DDL statement:
>   - Beam SQL already supports declaring sources like Kafka and Text
>   using CREATE TABLE DDL;
>   - it supports additional configuration using TBLPROPERTIES clause.
>   Currently it takes a text blob, where we can put a JSON configuration;
>   - wrapping PubsubIO into a similar source looks feasible;
>- The plan is to initially support messages only with JSON payload:
>-
>   - more payload formats can be added later;
>- Messages will be fully described in the CREATE TABLE statements:
>   - event timestamps. Source of the timestamp is configurable. It is
>   required by Beam SQL to have an explicit timestamp column for windowing
>   support;
>   - messages attributes map;
>   - JSON payload schema;
>- Event timestamps will be taken either from publish time or
>user-specified message attribute (configurable);
>
> Thoughts, ideas, comments?
>
> More details are in the doc here:
> https://docs.google.com/document/d/1wIXTxh-nQ3u694XbF0iEZX_7-b3yi4ad0ML2pcAxYfE
>
>
> Thank you,
> Anton
>


Re: Google Summer of Code Project Intro

2018-05-03 Thread Andrew Pilloud
Hi Kai,

Glad to hear someone is putting more work into benchmarking Beam SQL! It
would be really cool if we had some of these running as nightly performance
test jobs so we would know when there is a performance regression. This
might be out of scope of your project, but keep it in mind.

I am working on SQL and ported some of the Nexmark benchmarks there. Feel
free to email me questions. I can also poke Kenn for you whenever he's not
responsive.

Andrew

On Thu, May 3, 2018 at 4:43 AM Kai Jiang  wrote:

> Hi Beam Dev,
>
> I am Kai. GSoC has announced selected projects last week. During community
> bonding period, I want to share some basics about this year's project with
> Apache Beam.
>
> Project abstract:
> https://summerofcode.withgoogle.com/projects/#6460770829729792
> Issue Tracker: BEAM-3783 
>
> This project will be mentored by Kenneth Knowles. Many thanks to Kenn's
> mentorship in next three months. Also, Welcome any ideas and comments from
> you!
>
> The project will mainly focus on implementing a TPC-DS benchmark on Beam
> SQL. We've seen many such benchmarks run on Spark, Hive, Pig, etc.
> It will be interesting to see what happens when it is built on Beam SQL.
> Presumably, the benchmark will be run against different runners (e.g.,
> Spark or Flink). Based on the benchmark, a performance report will be
> generated eventually.
>
> Proposal doc is here:(more details will be updated)
>
> https://docs.google.com/document/d/15oYd_jFVbkiSPGT8-XnSh7Q-R3CtZwHaizyQfmrShfo/edit?usp=sharing
>
> Once coding period starts on May 14, I will keep updating the status and
> progress of this project.
>
> Best,
> Kai
>


Re: ValidatesRunner test cleanup

2018-05-03 Thread Eugene Kirpichov
Thanks Kenn! Note though that we should have VR tests for transforms that
have a runner specific override, such as TextIO.write() and Create that you
mentioned.

Agreed that it'd be good to have a more clear packaging separation between
the two.

On Thu, May 3, 2018, 10:35 AM Kenneth Knowles  wrote:

> Since I went over the PR and dropped a lot of random opinions about what
> should be VR versus NR, I'll answer too:
>
> VR - all primitives: ParDo, GroupByKey, Flatten.pCollections
> (Flatten.iterables is an unrelated composite), Metrics
> VR - critical special composites: Combine
> VR - test infrastructure that ensures tests aren't vacuous: PAssert
> NR - everything else in the core SDK that needs a runner but is really
> only testing the transform, not the runner, notably Create, TextIO,
> extended bits of Combine
> (nothing) - everything in modules that depend on the core SDK can use
> TestPipeline without an annotation; personally I think NR makes sense to
> annotate these, but it has no effect
>
> And it is a good time to mention that it might be very cool for someone to
> take on the task of conceiving of a more independent runner validation
> suite. This framework is clever, but a bit deceptive as runner tests look
> like unit tests of the primitives.
>
> Kenn
>
> On Thu, May 3, 2018 at 9:24 AM Eugene Kirpichov 
> wrote:
>
>> Thanks Scott, this is awesome!
>> However, we should be careful when choosing what should be
>> ValidatesRunner and what should be NeedsRunner.
>> Could you briefly describe how you made the call and roughly what are the
>> statistics before/after your PR (number of tests in both categories)?
>>
>> On Thu, May 3, 2018 at 9:18 AM Jean-Baptiste Onofré 
>> wrote:
>>
>>> Thanks for the update Scott. That's really a great job.
>>>
>>> I will ping you on slack about some points as I'm preparing the build
>>> for the release (and I have some issues ).
>>>
>>> Thanks again
>>> Regards
>>> JB
>>> On 3 May 2018, at 17:54, Scott Wegner  wrote:

 Note: if you don't care about Java runner tests, you can stop reading
 now.

 tl;dr: I've made a pass over all @ValidatesRunner tests in pr/5218 [1]
 and converted many to @NeedsRunner in order to reduce post-commit runtime.

 This is work that was long overdue and finally got my attention due to
 the Gradle migration. As context, @ValidatesRunner [2] tests construct a
 TestPipeline and exercise runner behavior through SDK constructs. The tests
 are written runner-agnostic so that they can be run on and validate all
 supported runners.

 The framework for these tests is great and writing them is super-easy.
 But as a result, we have way too many of them-- over 250. These tests run
 against all runners, and even when parallelized we see Dataflow post-commit
 times exceeding 3-5 hours [3].

 When reading through these tests, we found many of them don't actually
 exercise runner-specific behavior, and were simply using the TestPipeline
 framework to validate SDK components. This is a valid pattern, but tests
 should be annotated with @NeedsRunner instead. With this annotation, the
 tests will run on only a single runner, currently DirectRunner.

 So, PR/5218 looks at all existing @ValidatesRunner tests and
 conservatively converts tests which don't need to validate all runners into
 @NeedsRunner. I've also sharded out some very large test classes into
 scenario-based sub-classes. This is because Gradle parallelizes tests at
 the class-level, and we found a couple very large test classes (ParDoTest)
 became stragglers for the entire execution. Hopefully Gradle will soon
 implement dynamic splitting :)

 So, the action I'd like to request from others:
 1) If you are an author of @ValidatesRunner tests, feel free to look
 over the PR and let me know if I missed anything. Kenn Knowles is also
 helping out here.
 2) If you find yourself writing new @ValidatesRunner tests, please
 consider whether your test is validating runner-provided behavior. If not,
 use @NeedsRunner instead.


 [1] https://github.com/apache/beam/pull/5218
 [2]
 https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/testing/ValidatesRunner.java

 [3]
 https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle/buildTimeTrend


>>>


Re: Pubsub to Beam SQL

2018-05-03 Thread Andrew Pilloud
This sounds awesome!

Is event timestamp something that we need to specify for every source? If
so, I would suggest we add this as a first class option on CREATE TABLE
rather than something hidden in TBLPROPERTIES.

Andrew

On Wed, May 2, 2018 at 10:30 AM Anton Kedin  wrote:

> Hi
>
> I am working on adding functionality to support querying Pubsub messages
> directly from Beam SQL.
>
> *Goal*
>   Provide Beam users a pure  SQL solution to create the pipelines with
> Pubsub as a data source, without the need to set up the pipelines in Java
> before applying the query.
>
> *High level approach*
>
>-
>- Build on top of PubsubIO;
>- Pubsub source will be declared using CREATE TABLE DDL statement:
>   - Beam SQL already supports declaring sources like Kafka and Text
>   using CREATE TABLE DDL;
>   - it supports additional configuration using TBLPROPERTIES clause.
>   Currently it takes a text blob, where we can put a JSON configuration;
>   - wrapping PubsubIO into a similar source looks feasible;
>- The plan is to initially support messages only with JSON payload:
>-
>   - more payload formats can be added later;
>- Messages will be fully described in the CREATE TABLE statements:
>   - event timestamps. Source of the timestamp is configurable. It is
>   required by Beam SQL to have an explicit timestamp column for windowing
>   support;
>   - messages attributes map;
>   - JSON payload schema;
>- Event timestamps will be taken either from publish time or
>user-specified message attribute (configurable);
>
> Thoughts, ideas, comments?
>
> More details are in the doc here:
> https://docs.google.com/document/d/1wIXTxh-nQ3u694XbF0iEZX_7-b3yi4ad0ML2pcAxYfE
>
>
> Thank you,
> Anton
>


Re: [SQL] Reconciling Beam SQL Environments with Calcite Schema

2018-05-03 Thread Andrew Pilloud
Ok, I've finished with this change. Didn't get reviews on the early cleanup
PRs, so I've pushed all these changes into the first cleanup PR:
https://github.com/apache/beam/pull/5224

Andrew
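
In other words, roughly the following shape; the method names come from the
plan quoted below, but the exact signatures are assumed for illustration and
are not the actual Beam SQL API:

  // Sketch only: MetaStore inheriting the TableProvider surface, with
  // buildBeamSqlTable taking the full Table object in both cases.
  public interface TableProvider {
    void createTable(Table table);
    void dropTable(String tableName);
    List<Table> listTables();
    BeamSqlTable buildBeamSqlTable(Table table); // table object in both cases
  }

  public interface MetaStore extends TableProvider {
    // MetaStore-specific extensions would remain here.
  }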

On Tue, May 1, 2018 at 10:35 AM Andrew Pilloud  wrote:

> I'm just starting to move forward on this. Looking at my team's short term
> needs for SQL, option one would be good enough, however I agree with Kenn
> that we want something like option two eventually. I also don't want to
> break existing users and it sounds like there is at least one custom
> MetaStore not in beam. So my plan is to go with option two and simplify the
> interface where functionality loss will not result.
>
> There is a common set of operations between the MetaStore and the
> TableProvider. I'd like to make MetaStore inherit the interface of
> TableProvider. Most operations we need (createTable, dropTable, listTables)
> are already identical between the two, and so this will have no impact on
> custom implementations. The buildBeamSqlTable operation does differ: the
> MetaStore takes a table name, the TableProvider takes a table object.
> However everything calling this API already has the full table object, so I
> would like to simplify this interface by passing the table object in both
> cases. Objections?
>
> Andrew
>
> On Tue, Apr 24, 2018 at 9:27 AM James  wrote:
>
>> Kenn: yes, MetaStore is user-facing. Users can choose to implement their
>> own MetaStore; currently there is only an InMemory implementation in the Beam codebase.
>>
>> Andrew: I like the second option, since it "retains the ability for DDL
>> operations to be processed by a custom MetaStore." IMO we should have the
>> DDL ability to be a fully functional SQL.
>>
>> On Tue, Apr 24, 2018 at 10:28 PM Kenneth Knowles  wrote:
>>
>>> Can you say more about how the metastore is used? I presume it is or
>>> will be user-facing, so are Beam SQL users already providing their own?
>>>
>>> I'm sure we want something like that eventually to support things like
>>> Apache Atlas and HCatalog, IIUC for the "create if needed" logic when using
>>> Beam SQL to create a derived data set. But I don't think we should build
>>> out those code paths until we have at least one non-in-memory
>>> implementation.
>>>
>>> Just a really high level $0.02.
>>>
>>> Kenn
>>>
>>> On Mon, Apr 23, 2018 at 4:56 PM Andrew Pilloud 
>>> wrote:
>>>
 I'm working on updating our Beam DDL code to use the DDL execution
 functionality that recently merged into core calcite. This enables us to
 take advantage of Calcite JDBC as a way to use Beam SQL. As part of that I
 need to reconcile the Beam SQL Environments with the Calcite Schema (which
 is calcite's environment). We currently have copies of our tables in the
 Beam meta/store, Calcite Schema, BeamSqlEnv, and BeamQueryPlanner. I have a
 pending PR which merges the latter two to just use the Calcite Schema copy.
 Merging the Beam MetaStore and Calcite Schema isn't as simple. I have
 two options I'm looking for feedback on:

 1. Make Calcite Schema authoritative and demote MetaStore to be
 something more like a Calcite TableFactory. Calcite Schema already
 implements the semantics of our InMemoryMetaStore. If the Store interface
 is just over built, this approach would result in a significant reduction
 in code. This would however eliminate the CRUD part of the interface
 leaving just the buildBeamSqlTable function.

 2. Pass the Beam MetaStore into Calcite wrapped with a class
 translating to Calcite Schema (like we do already with tables). Instead of
 copying tables into the Calcite Schema we would pass in Beam meta/store as
 the source of truth and Calcite would manipulate tables directly in the
 Beam meta/store. This is a bit more complicated but retains the ability for
 DDL operations to be processed by a custom MetaStore.

 Thoughts?

 Andrew

>>>


Re: ValidatesRunner test cleanup

2018-05-03 Thread Kenneth Knowles
Since I went over the PR and dropped a lot of random opinions about what
should be VR versus NR, I'll answer too:

VR - all primitives: ParDo, GroupByKey, Flatten.pCollections
(Flatten.iterables is an unrelated composite), Metrics
VR - critical special composites: Combine
VR - test infrastructure that ensures tests aren't vacuous: PAssert
NR - everything else in the core SDK that needs a runner but is really only
testing the transform, not the runner, notably Create, TextIO, extended
bits of Combine
(nothing) - everything in modules that depend on the core SDK can use
TestPipeline without an annotation; personally I think NR makes sense to
annotate these, but it has no effect

And it is a good time to mention that it might be very cool for someone to
take on the task of conceiving of a more independent runner validation
suite. This framework is clever, but a bit deceptive as runner tests look
like unit tests of the primitives.

Kenn

On Thu, May 3, 2018 at 9:24 AM Eugene Kirpichov 
wrote:

> Thanks Scott, this is awesome!
> However, we should be careful when choosing what should be ValidatesRunner
> and what should be NeedsRunner.
> Could you briefly describe how you made the call and roughly what are the
> statistics before/after your PR (number of tests in both categories)?
>
> On Thu, May 3, 2018 at 9:18 AM Jean-Baptiste Onofré 
> wrote:
>
>> Thanks for the update Scott. That's really a great job.
>>
>> I will ping you on slack about some points as I'm preparing the build for
>> the release (and I have some issues ).
>>
>> Thanks again
>> Regards
>> JB
>> On 3 May 2018, at 17:54, Scott Wegner  wrote:
>>>
>>> Note: if you don't care about Java runner tests, you can stop reading
>>> now.
>>>
>>> tl;dr: I've made a pass over all @ValidatesRunner tests in pr/5218 [1]
>>> and converted many to @NeedsRunner in order to reduce post-commit runtime.
>>>
>>> This is work that was long overdue and finally got my attention due to
>>> the Gradle migration. As context, @ValidatesRunner [2] tests construct a
>>> TestPipeline and exercise runner behavior through SDK constructs. The tests
>>> are written runner-agnostic so that they can be run on and validate all
>>> supported runners.
>>>
>>> The framework for these tests is great and writing them is super-easy.
>>> But as a result, we have way too many of them-- over 250. These tests run
>>> against all runners, and even when parallelized we see Dataflow post-commit
>>> times exceeding 3-5 hours [3].
>>>
>>> When reading through these tests, we found many of them don't actually
>>> exercise runner-specific behavior, and were simply using the TestPipeline
>>> framework to validate SDK components. This is a valid pattern, but tests
>>> should be annotated with @NeedsRunner instead. With this annotation, the
>>> tests will run on only a single runner, currently DirectRunner.
>>>
>>> So, PR/5218 looks at all existing @ValidatesRunner tests and
>>> conservatively converts tests which don't need to validate all runners into
>>> @NeedsRunner. I've also sharded out some very large test classes into
>>> scenario-based sub-classes. This is because Gradle parallelizes tests at
>>> the class-level, and we found a couple very large test classes (ParDoTest)
>>> became stragglers for the entire execution. Hopefully Gradle will soon
>>> implement dynamic splitting :)
>>>
>>> So, the action I'd like to request from others:
>>> 1) If you are an author of @ValidatesRunner tests, feel free to look
>>> over the PR and let me know if I missed anything. Kenn Knowles is also
>>> helping out here.
>>> 2) If you find yourself writing new @ValidatesRunner tests, please
>>> consider whether your test is validating runner-provided behavior. If not,
>>> use @NeedsRunner instead.
>>>
>>>
>>> [1] https://github.com/apache/beam/pull/5218
>>> [2]
>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/testing/ValidatesRunner.java
>>>
>>> [3]
>>> https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle/buildTimeTrend
>>>
>>>
>>


Re: ValidatesRunner test cleanup

2018-05-03 Thread Robert Burke
I am curious as to how long the suite takes with the changes you've made.
How long does a full ValidatesRunner suite take with your recategorization?

On Thu, May 3, 2018, 9:24 AM Eugene Kirpichov  wrote:

> Thanks Scott, this is awesome!
> However, we should be careful when choosing what should be ValidatesRunner
> and what should be NeedsRunner.
> Could you briefly describe how you made the call and roughly what are the
> statistics before/after your PR (number of tests in both categories)?
>
> On Thu, May 3, 2018 at 9:18 AM Jean-Baptiste Onofré 
> wrote:
>
>> Thanks for the update Scott. That's really a great job.
>>
>> I will ping you on slack about some points as I'm preparing the build for
>> the release (and I have some issues ).
>>
>> Thanks again
>> Regards
>> JB
>> On 3 May 2018, at 17:54, Scott Wegner wrote:
>>>
>>> Note: if you don't care about Java runner tests, you can stop reading
>>> now.
>>>
>>> tl;dr: I've made a pass over all @ValidatesRunner tests in pr/5218 [1]
>>> and converted many to @NeedsRunner in order to reduce post-commit runtime.
>>>
>>> This is work that was long overdue and finally got my attention due to
>>> the Gradle migration. As context, @ValidatesRunner [2] tests construct a
>>> TestPipeline and exercise runner behavior through SDK constructs. The tests
>>> are written runner-agnostic so that they can be run on and validate all
>>> supported runners.
>>>
>>> The framework for these tests is great and writing them is super-easy.
>>> But as a result, we have way too many of them-- over 250. These tests run
>>> against all runners, and even when parallelized we see Dataflow post-commit
>>> times exceeding 3-5 hours [3].
>>>
>>> When reading through these tests, we found many of them don't actually
>>> exercise runner-specific behavior, and were simply using the TestPipeline
>>> framework to validate SDK components. This is a valid pattern, but tests
>>> should be annotated with @NeedsRunner instead. With this annotation, the
>>> tests will run on only a single runner, currently DirectRunner.
>>>
>>> So, PR/5218 looks at all existing @ValidatesRunner tests and
>>> conservatively converts tests which don't need to validate all runners into
>>> @NeedsRunner. I've also sharded out some very large test classes into
>>> scenario-based sub-classes. This is because Gradle parallelizes tests at
>>> the class-level, and we found a couple very large test classes (ParDoTest)
>>> became stragglers for the entire execution. Hopefully Gradle will soon
>>> implement dynamic splitting :)
>>>
>>> So, the action I'd like to request from others:
>>> 1) If you are an author of @ValidatesRunner tests, feel free to look
>>> over the PR and let me know if I missed anything. Kenn Knowles is also
>>> helping out here.
>>> 2) If you find yourself writing new @ValidatesRunner tests, please
>>> consider whether your test is validating runner-provided behavior. If not,
>>> use @NeedsRunner instead.
>>>
>>>
>>> [1] https://github.com/apache/beam/pull/5218
>>> [2]
>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/testing/ValidatesRunner.java
>>>
>>> [3]
>>> https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle/buildTimeTrend
>>>
>>>
>>


Re: ValidatesRunner test cleanup

2018-05-03 Thread Eugene Kirpichov
Thanks Scott, this is awesome!
However, we should be careful when choosing what should be ValidatesRunner
and what should be NeedsRunner.
Could you briefly describe how you made the call and roughly what are the
statistics before/after your PR (number of tests in both categories)?

On Thu, May 3, 2018 at 9:18 AM Jean-Baptiste Onofré  wrote:

> Thanks for the update Scott. That's really a great job.
>
> I will ping you on slack about some points as I'm preparing the build for
> the release (and I have some issues ).
>
> Thanks again
> Regards
> JB
> On 3 May 2018, at 17:54, Scott Wegner wrote:
>>
>> Note: if you don't care about Java runner tests, you can stop reading now.
>>
>> tl;dr: I've made a pass over all @ValidatesRunner tests in pr/5218 [1]
>> and converted many to @NeedsRunner in order to reduce post-commit runtime.
>>
>> This is work that was long overdue and finally got my attention due to
>> the Gradle migration. As context, @ValidatesRunner [2] tests construct a
>> TestPipeline and exercise runner behavior through SDK constructs. The tests
>> are written runner-agnostic so that they can be run on and validate all
>> supported runners.
>>
>> The framework for these tests is great and writing them is super-easy.
>> But as a result, we have way too many of them-- over 250. These tests run
>> against all runners, and even when parallelized we see Dataflow post-commit
>> times exceeding 3-5 hours [3].
>>
>> When reading through these tests, we found many of them don't actually
>> exercise runner-specific behavior, and were simply using the TestPipeline
>> framework to validate SDK components. This is a valid pattern, but tests
>> should be annotated with @NeedsRunner instead. With this annotation, the
>> tests will run on only a single runner, currently DirectRunner.
>>
>> So, PR/5218 looks at all existing @ValidatesRunner tests and
>> conservatively converts tests which don't need to validate all runners into
>> @NeedsRunner. I've also sharded out some very large test classes into
>> scenario-based sub-classes. This is because Gradle parallelizes tests at
>> the class-level, and we found a couple very large test classes (ParDoTest)
>> became stragglers for the entire execution. Hopefully Gradle will soon
>> implement dynamic splitting :)
>>
>> So, the action I'd like to request from others:
>> 1) If you are an author of @ValidatesRunner tests, feel free to look over
>> the PR and let me know if I missed anything. Kenn Knowles is also helping
>> out here.
>> 2) If you find yourself writing new @ValidatesRunner tests, please
>> consider whether your test is validating runner-provided behavior. If not,
>> use @NeedsRunner instead.
>>
>>
>> [1] https://github.com/apache/beam/pull/5218
>> [2]
>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/testing/ValidatesRunner.java
>>
>> [3]
>> https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle/buildTimeTrend
>>
>>
>


Re: ValidatesRunner test cleanup

2018-05-03 Thread Jean-Baptiste Onofré
Thanks for the update Scott. That's really a great job.

I will ping you on slack about some points as I'm preparing the build for the 
release (and I have some issues ).

Thanks again
Regards
JB

On 3 May 2018 at 17:54, Scott Wegner wrote:
>Note: if you don't care about Java runner tests, you can stop reading
>now.
>
>tl;dr: I've made a pass over all @ValidatesRunner tests in pr/5218 [1]
>and
>converted many to @NeedsRunner in order to reduce post-commit runtime.
>
>This is work that was long overdue and finally got my attention due to
>the
>Gradle migration. As context, @ValidatesRunner [2] tests construct a
>TestPipeline and exercise runner behavior through SDK constructs. The
>tests
>are written runner-agnostic so that they can be run on and validate all
>supported runners.
>
>The framework for these tests is great and writing them is super-easy.
>But
>as a result, we have way too many of them-- over 250. These tests run
>against all runners, and even when parallelized we see Dataflow
>post-commit
>times exceeding 3-5 hours [3].
>
>When reading through these tests, we found many of them don't actually
>exercise runner-specific behavior, and were simply using the
>TestPipeline
>framework to validate SDK components. This is a valid pattern, but
>tests
>should be annotated with @NeedsRunner instead. With this annotation,
>the
>tests will run on only a single runner, currently DirectRunner.
>
>So, PR/5218 looks at all existing @ValidatesRunner tests and
>conservatively
>converts tests which don't need to validate all runners into
>@NeedsRunner.
>I've also sharded out some very large test classes into scenario-based
>sub-classes. This is because Gradle parallelizes tests at the
>class-level,
>and we found a couple very large test classes (ParDoTest) became
>stragglers
>for the entire execution. Hopefully Gradle will soon implement dynamic
>splitting :)
>
>So, the action I'd like to request from others:
>1) If you are an author of @ValidatesRunner tests, feel free to look
>over
>the PR and let me know if I missed anything. Kenn Knowles is also
>helping
>out here.
>2) If you find yourself writing new @ValidatesRunner tests, please
>consider
>whether your test is validating runner-provided behavior. If not, use
>@NeedsRunner instead.
>
>
>[1] https://github.com/apache/beam/pull/5218
>[2]
>https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/testing/ValidatesRunner.java
>
>[3]
>https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle/buildTimeTrend


Re: ValidatesRunner test cleanup

2018-05-03 Thread Reuven Lax
I suspect that at least some of these are because people copy/pasted other
tests, not realizing the overhead of ValidatesRunner. Is this something we
should document in the contributors guide?

On Thu, May 3, 2018 at 8:54 AM Scott Wegner  wrote:

> Note: if you don't care about Java runner tests, you can stop reading now.
>
> tl;dr: I've made a pass over all @ValidatesRunner tests in pr/5218 [1] and
> converted many to @NeedsRunner in order to reduce post-commit runtime.
>
> This is work that was long overdue and finally got my attention due to the
> Gradle migration. As context, @ValidatesRunner [2] tests construct a
> TestPipeline and exercise runner behavior through SDK constructs. The tests
> are written runner-agnostic so that they can be run on and validate all
> supported runners.
>
> The framework for these tests is great and writing them is super-easy. But
> as a result, we have way too many of them-- over 250. These tests run
> against all runners, and even when parallelized we see Dataflow post-commit
> times exceeding 3-5 hours [3].
>
> When reading through these tests, we found many of them don't actually
> exercise runner-specific behavior, and were simply using the TestPipeline
> framework to validate SDK components. This is a valid pattern, but tests
> should be annotated with @NeedsRunner instead. With this annotation, the
> tests will run on only a single runner, currently DirectRunner.
>
> So, PR/5218 looks at all existing @ValidatesRunner tests and
> conservatively converts tests which don't need to validate all runners into
> @NeedsRunner. I've also sharded out some very large test classes into
> scenario-based sub-classes. This is because Gradle parallelizes tests at
> the class-level, and we found a couple very large test classes (ParDoTest)
> became stragglers for the entire execution. Hopefully Gradle will soon
> implement dynamic splitting :)
>
> So, the action I'd like to request from others:
> 1) If you are an author of @ValidatesRunner tests, feel free to look over
> the PR and let me know if I missed anything. Kenn Knowles is also helping
> out here.
> 2) If you find yourself writing new @ValidatesRunner tests, please
> consider whether your test is validating runner-provided behavior. If not,
> use @NeedsRunner instead.
>
>
> [1] https://github.com/apache/beam/pull/5218
> [2]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/testing/ValidatesRunner.java
>
> [3]
> https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle/buildTimeTrend
>
>


ValidatesRunner test cleanup

2018-05-03 Thread Scott Wegner
Note: if you don't care about Java runner tests, you can stop reading now.

tl;dr: I've made a pass over all @ValidatesRunner tests in pr/5218 [1] and
converted many to @NeedsRunner in order to reduce post-commit runtime.

This is work that was long overdue and finally got my attention due to the
Gradle migration. As context, @ValidatesRunner [2] tests construct a
TestPipeline and exercise runner behavior through SDK constructs. The tests
are written runner-agnostic so that they can be run on and validate all
supported runners.

The framework for these tests is great and writing them is super-easy. But
as a result, we have way too many of them-- over 250. These tests run
against all runners, and even when parallelized we see Dataflow post-commit
times exceeding 3-5 hours [3].

When reading through these tests, we found many of them don't actually
exercise runner-specific behavior, and were simply using the TestPipeline
framework to validate SDK components. This is a valid pattern, but tests
should be annotated with @NeedsRunner instead. With this annotation, the
tests will run on only a single runner, currently DirectRunner.

So, PR/5218 looks at all existing @ValidatesRunner tests and conservatively
converts tests which don't need to validate all runners into @NeedsRunner.
I've also sharded out some very large test classes into scenario-based
sub-classes. This is because Gradle parallelizes tests at the class-level,
and we found a couple very large test classes (ParDoTest) became stragglers
for the entire execution. Hopefully Gradle will soon implement dynamic
splitting :)
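
One way to do that sharding, sketched here with made-up names rather than the
exact structure in the PR, is to break scenarios out into static nested test
classes; each nested class compiles to its own class file, so the test runner
can pick them up and schedule them independently:

  import org.apache.beam.sdk.testing.PAssert;
  import org.apache.beam.sdk.testing.TestPipeline;
  import org.apache.beam.sdk.testing.ValidatesRunner;
  import org.apache.beam.sdk.transforms.Create;
  import org.junit.Rule;
  import org.junit.Test;
  import org.junit.experimental.categories.Category;
  import org.junit.runner.RunWith;
  import org.junit.runners.JUnit4;

  /** Outer container; the nested classes hold the actual tests. */
  public class ExampleShardedTest {

    /** One scenario shard. */
    @RunWith(JUnit4.class)
    public static class StringScenarioTests {
      @Rule public final transient TestPipeline p = TestPipeline.create();

      @Test
      @Category(ValidatesRunner.class)
      public void testStrings() {
        PAssert.that(p.apply(Create.of("a", "b"))).containsInAnyOrder("a", "b");
        p.run();
      }
    }

    /** Another scenario shard, kept small so no single class becomes a straggler. */
    @RunWith(JUnit4.class)
    public static class IntegerScenarioTests {
      @Rule public final transient TestPipeline p = TestPipeline.create();

      @Test
      @Category(ValidatesRunner.class)
      public void testIntegers() {
        PAssert.that(p.apply(Create.of(1, 2, 3))).containsInAnyOrder(1, 2, 3);
        p.run();
      }
    }
  }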

So, the action I'd like to request from others:
1) If you are an author of @ValidatesRunner tests, feel free to look over
the PR and let me know if I missed anything. Kenn Knowles is also helping
out here.
2) If you find yourself writing new @ValidatesRunner tests, please consider
whether your test is validating runner-provided behavior. If not, use
@NeedsRunner instead.


[1] https://github.com/apache/beam/pull/5218
[2]
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/testing/ValidatesRunner.java

[3]
https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle/buildTimeTrend


Re: I want to allow a user-specified QuerySplitter for DatastoreIO

2018-05-03 Thread Frank Yellin
I actually tried (1), and ran precisely into the size limit that you
mentioned.  Because of the size of the database, I needed to split it into
a few hundred shards, and that was more than the request limit.
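
For reference, the shape of that approach was roughly the sketch below; it
assumes a Pipeline p, a projectId, a base Query, a shard count, and the usual
DatastoreIO/PCollectionList imports, and splitIntoShards(...) is an
illustrative helper, not an existing Beam API:

  // Split at pipeline-construction time, read each shard as its own source,
  // and Flatten the per-shard reads into one PCollection. Every shard query
  // travels in the job submission request, which is how a few hundred shards
  // can exceed the request size limit.
  List<Query> shards = splitIntoShards(baseQuery, numShards);
  List<PCollection<Entity>> perShard = new ArrayList<>();
  for (int i = 0; i < shards.size(); i++) {
    perShard.add(
        p.apply(
            "ReadShard" + i,
            DatastoreIO.v1().read().withProjectId(projectId).withQuery(shards.get(i))));
  }
  PCollection<Entity> entities =
      PCollectionList.of(perShard).apply(Flatten.pCollections());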

I was also considering a slightly different alternative to (2), such as
adding setQueries(), or setSplitterPTransform().  The semantics would be
identical to that of your ReadAll, but I'd be able to reuse more of the
code that is there.  This gave me interesting results, but it wasn't as
powerful as what I needed.  See (2) below.

The two specific use cases that were motivating me were that I needed to
write code that could
  (1) delete a property from all Entitys whose creationTime is between
one month and two months ago.
  (2) delete all Entitys whose creationTime is more than two years ago.
I think these are common-enough operations.  For a very large database, it
would be nice to be able to read only the small piece of it that is needed
for your operation.

The first is easy to handle.  I know the start and end of creationTime, and
I can shard it myself.  The second requires me to consult the datastore to
find out what the smallest creationTime is in the datastore, and then use
it as an advisory (not hard) lower limit; the query splitter should
work well whether the oldest records were four years old or barely more
than two years old.   For this to be possible, I need access to the
Datastore object, and that object needs to be passed to some sort
of user callback.  The QuerySplitter hook already existed and seemed to fit
my needs perfectly.
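
To make use case (2) concrete, here is a rough sketch of how it might look
with the proposed withQuerySplitter hook; CreationTimeQuerySplitter is a
hypothetical splitter (it shards on creationTime and asks Datastore for the
minimum value when the query has no lower bound), and the kind, property name
and cutoff below are made up:

  // Read entities older than the cutoff, sharded by the custom splitter,
  // then delete them. withQuerySplitter(...) is the proposed hook, not an
  // existing DatastoreIO method.
  PCollection<Entity> old =
      p.apply(
          DatastoreIO.v1().read()
              .withProjectId(projectId)
              .withLiteralGqlQuery(
                  "SELECT * FROM ledger WHERE creationTime < DATETIME('2016-05-03T00:00:00Z')")
              .withQuerySplitter(new CreationTimeQuerySplitter("creationTime")));
  old.apply(DatastoreIO.v1().deleteEntity().withProjectId(projectId));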

Is there a better alternative that still gives me access to the Datastore?

On Thu, May 3, 2018 at 2:52 AM, Chamikara Jayalath 
wrote:

> Thanks. IMHO it might be better to perform this splitting as a part of
> your pipeline instead of making source splitting customizable. The reason
> is, it's easy for users to shoot themselves in the foot if we allow
> specifying a custom splitter. A bug in a custom QuerySplitter can result in
> a hard to catch data loss or data duplication bug. So I'd rather not make
> it a part of the user API.
>
> I can think of two ways for performing this splitting as a part of your
> pipeline.
> (1) Split the query during job construction and create a source per query.
> This can be followed by a Flatten transform that creates a single
> PCollection. (One caveat is, you might run into the 10MB request size limit if
> you create too many splits here. So try reducing the number of splits if
> you ran into this).
> (2) Add a ReadAll transform to DatastoreIO. This will allow you to precede
> the step that performs reading by a ParDo step that splits your query and
> create a PCollection of queries. You should not run into size limits here
> since splitting happens in the data plane.
>
> Thanks,
> Cham
>
> On Wed, May 2, 2018 at 12:50 PM Frank Yellin  wrote:
>
>> TLDR:
>> Is it okay for me to expose Datastore in Apache Beam's DatastoreIO, and
>> thus indirectly expose com.google.rpc.Code?
>> Is there a better solution?
>>
>>
>> As I explain in BEAM-4186, I would like to be
>> able to extend DatastoreV1.Read to have a
>>    withQuerySplitter(QuerySplitter querySplitter)
>> method, which would use an alternative query splitter.   The standard one
>> shards by key and is very limited.
>>
>> I have already written such a query splitter.  In fact, the query
>> splitter I've written goes further than what is specified in the Beam issue, and reads
>> the minimum or maximum value of the field from the datastore if no minimum
>> or maximum is specified in the query, and uses that value for the
>> sharding.   I can write:
>>SELECT * FROM ledger where type = 'purchase'
>> and then ask it to shard on the eventTime, and it will shard nicely!  I
>> am working with the Datastore folks to separately add my new query splitter
>> as an option in DatastoreHelper.
>>
>>
>> I have already written the code to add withQuerySplitter.
>>
>>https://github.com/apache/beam/pull/5246
>>
>> However, the problem is that I am increasing the "API surface" of
>> Dataflow:
>>    QuerySplitter exposes Datastore, which exposes DatastoreException,
>> which exposes com.google.rpc.Code,
>> and com.google.rpc.Code is not (yet) part of the API surface.
>>
>> As a solution, I've added package com.google.rpc to the list of classes
>> exposed.  This package contains protobuf enums.  Is this okay?  Is there a
>> better solution?
>>
>> Thanks.
>>
>>


Google Summer of Code Project Intro

2018-05-03 Thread Kai Jiang
Hi Beam Dev,

I am Kai. GSoC has announced selected projects last week. During community
bonding period, I want to share some basics about this year's project with
Apache Beam.

Project abstract:
https://summerofcode.withgoogle.com/projects/#6460770829729792
Issue Tracker: BEAM-3783 

This project will be mentored by Kenneth Knowles. Many thanks to Kenn for his
mentorship over the next three months. Any ideas and comments from you are
welcome!

The project will mainly focus on implementing a TPC-DS benchmark on Beam
SQL. TPC-DS has already been run on systems like Spark, Hive, and Pig, so it
will be interesting to see how it behaves when built on Beam SQL.
Presumably, the benchmark will be run against different runners (such as
Spark or Flink). Based on the results, a performance report will eventually
be generated.

Proposal doc is here (more details will be added):
https://docs.google.com/document/d/15oYd_jFVbkiSPGT8-XnSh7Q-R3CtZwHaizyQfmrShfo/edit?usp=sharing

Once coding period starts on May 14, I will keep updating the status and
progress of this project.

Best,
Kai


Jenkins build is back to normal : beam_Release_Gradle_NightlySnapshot #27

2018-05-03 Thread Apache Jenkins Server
See