Re: Let's make Beam transforms comply with PTransform Style Guide

2017-04-07 Thread Jean-Baptiste Onofré

Hi Eugene,

thanks for the update. I volunteer to tackle some of those IOs (and make them
conform with the PTransform style guide). I'm pretty sure other people will jump on ;)


Regards
JB

On 04/08/2017 12:20 AM, Eugene Kirpichov wrote:

Hey all,

More progress has been made and we're nearing completion. ParDo, BigQueryIO
and Window are fixed; Map/FlatMapElements are in review.

The remaining unclaimed ones are all IOs of some form, and here's a list.
I've marked them all as "starter" in JIRA.

XML - https://issues.apache.org/jira/browse/BEAM-1914
TFRecordIO (Tensorflow) - https://issues.apache.org/jira/browse/BEAM-1913
KinesisIO - https://issues.apache.org/jira/browse/BEAM-1428
PubsubIO - https://issues.apache.org/jira/browse/BEAM-1415
CountingInput - https://issues.apache.org/jira/browse/BEAM-1414

https://github.com/apache/beam/pull/2149, which fixes BigQueryIO, is a
good model to follow when taking these on, as well as e.g.
https://github.com/apache/beam/pull/1927 (TextIO)
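
To make the target shape concrete, here is a rough sketch of the style these
PRs move toward - a hypothetical FooIO, not one of the transforms above - with
AutoValue-backed configuration and immutable withFoo() setters:

  import static com.google.common.base.Preconditions.checkState;

  import com.google.auto.value.AutoValue;
  import javax.annotation.Nullable;
  import org.apache.beam.sdk.coders.StringUtf8Coder;
  import org.apache.beam.sdk.transforms.Create;
  import org.apache.beam.sdk.transforms.PTransform;
  import org.apache.beam.sdk.values.PBegin;
  import org.apache.beam.sdk.values.PCollection;

  public class FooIO {
    // Static factory method; users never call a constructor directly.
    public static Read read() {
      return new AutoValue_FooIO_Read.Builder().build();
    }

    @AutoValue
    public abstract static class Read
        extends PTransform<PBegin, PCollection<String>> {
      @Nullable abstract String table();
      abstract Builder toBuilder();

      @AutoValue.Builder
      abstract static class Builder {
        abstract Builder setTable(String table);
        abstract Read build();
      }

      // Immutable wither: returns a modified copy of the transform.
      public Read withTable(String table) {
        return toBuilder().setTable(table).build();
      }

      @Override
      public PCollection<String> expand(PBegin input) {
        checkState(table() != null, "FooIO.read() requires withTable()");
        // A real IO would expand into its source/read implementation here;
        // the sketch returns an empty collection so the example compiles.
        return input.apply(Create.empty(StringUtf8Coder.of()));
      }
    }
  }

Usage then reads like p.apply(FooIO.read().withTable("users")).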

These are all actually easy to fix, but need volunteers (I do not have time
to fix all of these myself, but happy to be a reviewer - @jkff).
Let's finish this up in time for the first Beam stable release, so Beam's
stable API surface is consistent and polished!

On Fri, Mar 3, 2017 at 12:00 PM Eugene Kirpichov 
wrote:


It sounds like a great idea in general, though there are actually not that
many issues remaining - I fixed a bunch, and have PRs out for two more
(ParDo and BigQueryIO).
The remaining ones are:

BEAM-1355 (Bug, Major, OPEN) - HDFS IO should comply with PTransform style guide
BEAM-1402 (Bug, Major, OPEN) - Make TextIO and AvroIO use best-practice types
BEAM-1414 (Bug, Major, OPEN) - CountingInput should comply with PTransform style guide
BEAM-1415 (Bug, Major, OPEN) - PubsubIO should comply with PTransform style guide
BEAM-1418 (Bug, Major, OPEN) - MapElements and FlatMapElements should comply with PTransform style guide
BEAM-1425 (Bug, Major, OPEN) - Window should comply with PTransform style guide
BEAM-1428 (Bug, Major, OPEN) - KinesisIO should comply with PTransform style guide


HDFS is pending the other mailing list thread.
For TextIO/AvroIO there's a pending PR by Reuven.
Window and Map/FlatMapElements could use a brief discussion on the mailing
list as to what the proper API should be.
KinesisIO is, I think, due for a rewrite.

However, CountingInput and PubsubIO I think are great targets for people
to pick up. Any volunteers?


On Fri, Mar 3, 2017 at 11:30 AM Robert Bradshaw
 wrote:

Here's a crazy idea: what if we had a virtual fixit/hackathon to knock
these out (similar to the virtual meet-up, but with an agenda)? I find
communal hacking sessions towards a common goal are a good way to get to
know each other and get a lot done. Would there be any interest in this?

On Wed, Mar 1, 2017 at 11:30 AM, Eugene Kirpichov <
kirpic...@google.com.invalid> wrote:


Hey all,

The first couple of rounds of fixes are in. Thanks Aviem Zur for contributing
TextIO fixes and Dan Halperin for reviewing! One more fix by Reuven in
progress (https://github.com/apache/beam/pull/1927).

Follow https://issues.apache.org/jira/browse/BEAM-1353 and sub-issues for
the status.
Many of the changes are backwards-incompatible (though with only minor
changes in pipelines required). I'll make a separate announcement about
that on the users list now.

Call for help with the other sub-issues still stands! They are pretty
simple work items but pretty important, since this is a blocker for
declaring a stable Beam API.

On Wed, Feb 8, 2017 at 3:51 AM Jean-Baptiste Onofré 
wrote:

Thanks Eugene.

I will tackle some JIRAs when I'm back next week.

Regards
JB

On Feb 7, 2017, at 18:16, Eugene Kirpichov

Re: IO ITs: Hosting Docker images

2017-04-07 Thread Jean-Baptiste Onofré

Hi Stephen,

I think we should go to 1 and 4:

1. Try to use existing images providing what we need. If we don't find an
existing image, we can always ask and help the other community to provide one.
4. If we don't find a suitable image, then while waiting for one, we can store
the image in our own "IT dockerhub".


Regards
JB

On 04/08/2017 01:03 AM, Stephen Sisk wrote:

Wanted to see if anyone else had opinions on this, and to provide a quick update.

I think for both elasticsearch and HIFIO we can find existing, supported
images that could serve those purposes - HIFIO looks like it'll be able to
do so for cassandra, which was proving tricky.

So to summarize my current proposed solutions: (ordered by my preference)
1. (new) Strongly urge people to find existing docker images that meet our
image criteria - regularly updated/security checked
2. Start using helm
3. Push our docker images to docker hub
4. Host our own public container registry

S

On Tue, Apr 4, 2017 at 10:16 AM Stephen Sisk  wrote:


I'd like to hear what direction folks want to go in, and from there look
at the options. For some of these options (like running our own public
registry), Apache infra may be able to help, and it's something we should
look at, but I don't assume they have time to work on this type of issue.

S

On Tue, Apr 4, 2017 at 10:00 AM Lukasz Cwik 
wrote:

Is this something that Apache infra could help us with?

On Mon, Apr 3, 2017 at 7:22 PM, Stephen Sisk 
wrote:


Summary:

For IO ITs that use data stores that need custom docker images in order to
run, we can't currently use them in a kubernetes cluster (which is where we
host our data stores.) I have a couple options for how to solve this and am
looking for feedback from folks involved in creating IO ITs/opinions on
kubernetes.


Details:

We've discussed in the past that we'll want to allow developers to submit
just a dockerfile, and then we'll use that when creating the data store on
kubernetes. This is the case for ElasticsearchIO and I assume more data
stores in the future will want to do this. It's also looking like it'll be
necessary to use custom docker images for the HadoopInputFormatIO's
cassandra ITs - to run a cassandra cluster, there doesn't seem to be a good
image you can use out of the box.

In either case, in order to retrieve a docker image, kubernetes needs a
container registry - it will read the docker images from there. A simple
private container registry doesn't work because kubernetes config files are
static - this means that if local devs try to use the kubernetes files,
they point at the private container registry and they wouldn't be able to
retrieve the images since they don't have access. They'd have to manually
edit the files, which in theory is an option, but I don't consider that to
be acceptable since it feels pretty unfriendly (it is simple, so if we
really don't like the below options we can revisit it.)

Quick summary of the options

===

We can:

* Start using something like k8 helm - this adds more dependencies, adds a
small amount of complexity (this is my recommendation, but only by a
little)

* Start pushing images to docker hub - this means they'll be publicly
visible and raises the bar for maintenance of those images

* Host our own public container registry - this means running our own
public service with costs, etc..

Below are detailed discussions of these options. You can skip to the "My
thoughts on this" section if you're not interested in the details.


1. Templated kubernetes images

=

Kubernetes (k8) does not currently have built-in support for parameterizing
scripts - there's an issue open for this[1], but it doesn't seem to be
very active.

There are tools like Kubernetes helm that allow users to specify parameters
when running their kubernetes scripts. They also enable a lot more (they're
probably closer to a package manager like apt-get) - see this
description[3] for an overview.

I'm open to other options besides helm, but it seems to be the officially
supported one.

How the world would look using helm:

* When developing an IO IT, someone (either the developer or one of us)
would need to create a chart (the name for the helm script) - it's
basically another set of config files, but in theory is as simple as a
couple of metadata files plus a templatized version of a regular k8 script.
This should be trivial compared to the task of creating a k8 script.

* When creating an instance of a data store, the developer (or the beam CI
server) would first build the docker image for the data store and push to
their container registry, then run a command like `helm install -f
mydb.yaml --set imageRepo=1.2.3.4`

* When done running tests/developing/etc…, the developer/beam CI server
would run `helm delete -f mydb.yaml`

Upsides:

* Something like helm is pretty interesting - we talked about it as an

Re: Proposed Splittable DoFn API changes

2017-04-07 Thread Jean-Baptiste Onofré

Hi Eugene,

thanks for the update and nice example.

I plan to start to refactor/experiment on some IOs.

Regards
JB

On 04/08/2017 02:44 AM, Eugene Kirpichov wrote:

The changes are in.

Also included is a handy change that allows one to skip implementing the
NewTracker method if the restriction type implements HasDefaultTracker,
leaving ProcessElement and GetInitialRestriction as the only two required
methods.

E.g. here's what a minimal SDF example looks like now - splittably pairing
a string with every number in 0..100:

  class CountFn extends DoFn<String, KV<String, Long>> {
    @ProcessElement
    public void process(ProcessContext c, OffsetRangeTracker tracker) {
      for (long i = tracker.currentRestriction().getFrom();
          tracker.tryClaim(i);
          ++i) {
        c.output(KV.of(c.element(), i));
      }
    }

    @GetInitialRestriction
    public OffsetRange getInitialRange(String element) {
      return new OffsetRange(0, 100);
    }
  }
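
And for reference, here is roughly how a restriction type opts into a
default tracker (a sketch of the pattern, not the actual Beam source):

  // Implementing HasDefaultTracker lets the SDK construct the tracker
  // itself, which is what makes a @NewTracker method unnecessary.
  class OffsetRange implements HasDefaultTracker<OffsetRange, OffsetRangeTracker> {
    private final long from;
    private final long to;

    OffsetRange(long from, long to) {
      this.from = from;
      this.to = to;
    }

    long getFrom() { return from; }

    long getTo() { return to; }

    @Override
    public OffsetRangeTracker newTracker() {
      // The tracker that tryClaim()s offsets in [from, to).
      return new OffsetRangeTracker(this);
    }
  }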


On Thu, Apr 6, 2017 at 3:16 PM Eugene Kirpichov 
wrote:


FWIW, here's a pull request implementing these changes:
https://github.com/apache/beam/pull/2455

On Wed, Apr 5, 2017 at 4:55 PM Eugene Kirpichov 
wrote:

Hey all,

From the recent experience in continuing implementation of Splittable
DoFn, I would like to propose a few changes to its API. They get rid of a
bug, make parts of its semantics more well-defined and easier for a user to
get right, and reduce the assumptions about the runner implementation.

In short:
- Add c.updateWatermark() and report watermark continuously via this
method.
- Make SDF.@ProcessElement return void, which is simpler for users though
it doesn't allow resuming after a specified time.
- Declare that SDF.@ProcessElement must guarantee that after it returns,
the entire tracker.currentRestriction() was processed.
- Add a bool RestrictionTracker.done() method to enforce the bullet above.
- For resuming after specified time, use regular DoFn with state and
timers API.
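
For illustration, under these changes a user's @ProcessElement might look
roughly like this (updateWatermark()'s exact signature is my assumption
from the bullets above, not a finalized API):

  class TimestampedCountFn extends DoFn<String, KV<String, Long>> {
    @ProcessElement
    public void process(ProcessContext c, OffsetRangeTracker tracker) {
      for (long i = tracker.currentRestriction().getFrom();
          tracker.tryClaim(i);
          ++i) {
        c.output(KV.of(c.element(), i));
        // Report the watermark continuously while processing.
        c.updateWatermark(new Instant(i));
      }
      // Returning (void) asserts the whole restriction was processed;
      // the runner can verify this via tracker.done().
    }

    @GetInitialRestriction
    public OffsetRange getInitialRestriction(String element) {
      return new OffsetRange(0, 100);
    }
  }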

The only downside is the removal (from SDF) of the ability to suspend the call
for a certain amount of time - the suggestion is that, if you need that,
you should use a regular DoFn and the timers API.

Please see the full proposal in the following doc and comment there & vote
on this thread.

https://docs.google.com/document/d/1BGc8pM1GOvZhwR9SARSVte-20XEoBUxrGJ5gTWXdv3c/edit?usp=sharing


I am going to concurrently start prototyping some parts of this proposal,
because the current implementation is simply wrong and this proposal is the
only way to fix it that I can think of, but I will adjust my implementation
based on the discussion. I believe this proposal should not affect runner
authors - I can make all the necessary changes myself.

Thanks!






--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Combine.Global

2017-04-07 Thread Aviem Zur
I wasn't able to reproduce the issue you're experiencing.
I've created a gist with an example that works and is similar to what you
have described.
Please help us tweak the gist so it reproduces your problem:
https://gist.github.com/aviemzur/ba213d98b4484492099b3cf709ddded0

On Fri, Apr 7, 2017 at 7:25 PM Paul Gerver  wrote:

> Yes, the pipeline is quite small:
>
> pipeline.apply("source",
> Read.from(new CustomSource())).setCoder(CustomSource.coder)
> .apply("GlobalCombine", Combine.globally(new
> CustomCombineFn())).setCoder(CustomTuple.coder);
>
>
> The InputT is not the same as OutputT, so the input coder can't be used.
>
> On 2017-04-07 08:58 (-0500), Aviem Zur  wrote:
> > Have you set the coder for your input PCollection? The one on which you
> > perform the Combine?
> >
> > On Fri, Apr 7, 2017 at 4:24 PM Paul Gerver  wrote:
> >
> > > Hello All,
> > >
> > > I'm trying to test out a Combine.Globally transform which takes in a
> > > small custom class (CustomA) and outputs a secondary custom class
> > > (CustomB). I have set the coder for the resulting PCollection, but
> > > Beam is arguing that a coder for a KV type is missing (see output at
> > > bottom).
> > >
> > > Since this is a global combine, neither the input nor the output is of
> > > KV type, so I decided to take a look at the Combine code.
> > > Combine.Globally.expand() performs a perKey and groupedValues
> > > underneath the covers, which requires making an intermediate KV
> > > PCollection whose coder--according to the docs--is inferred from the
> > > CombineFn.
> > >
> > > I believe I could work around this by registering a KvCoder with the
> > > CoderRegistry, but that's not intuitive. Is there a better way to
> > > address this currently, or should something be added to the CombineFn
> > > area for setting an output coder, similar to PCollection?
> > >
> > >
> > > Output:
> > > Exception in thread "main" java.lang.IllegalStateException: Unable to
> > > return a default Coder for
> > >
> > >
> GlobalCombine/Combine.perKey(CustomTuple)/Combine.GroupedValues/ParDo(Anonymous).out
> > > [Class]. Correct one of the following root causes:
> > >   No Coder has been manually specified;  you may do so using
> .setCoder().
> > >   Inferring a Coder from the CoderRegistry failed: Unable to provide a
> > > default Coder for org.apache.beam.sdk.values.KV. Correct
> one of
> > > the following root causes:
> > >   Building a Coder using a registered CoderFactory failed: Cannot
> provide
> > > coder for parameterized type org.apache.beam.sdk.values.KV:
> > > Unable to provide a default Coder for java.lang.Object. Correct one of
> the
> > > following root causes:
> > >
> > >
> > > Stack:
> > > at
> > >
> > >
> org.apache.beam.sdk.repackaged.com.google.common.base.Preconditions.checkState(Preconditions.java:174)
> > > at
> > > org.apache.beam.sdk.values.TypedPValue.getCoder(TypedPValue.java:51)
> > > at
> > > org.apache.beam.sdk.values.PCollection.getCoder(PCollection.java:130)
> > > at
> > >
> > >
> org.apache.beam.sdk.values.TypedPValue.finishSpecifying(TypedPValue.java:90)
> > > at
> > >
> > >
> org.apache.beam.sdk.runners.TransformHierarchy.finishSpecifyingInput(TransformHierarchy.java:95)
> > > at
> org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:386)
> > > at
> org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:302)
> > > at
> > > org.apache.beam.sdk.values.PCollection.apply(PCollection.java:154)
> > > at
> > >
> org.apache.beam.sdk.transforms.Combine$Globally.expand(Combine.java:1460)
> > > at
> > >
> org.apache.beam.sdk.transforms.Combine$Globally.expand(Combine.java:1337)
> > > at
> > >
> org.apache.beam.sdk.runners.PipelineRunner.apply(PipelineRunner.java:76)
> > > at
> > >
> org.apache.beam.runners.direct.DirectRunner.apply(DirectRunner.java:296)
> > > at
> org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:388)
> > > at
> org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:318)
> > > at
> > > org.apache.beam.sdk.values.PCollection.apply(PCollection.java:167)
> > > at
> > > org.iastate.edu.CombineTestPipeline.main(CombineTestPipeline.java:110)
> > >
> > >
> > > Let me know. Thanks!
> > > -Paul G
> > >
> > > --
> > > *Paul Gerver*
> > > pfger...@gmail.com
> > >
> >
>


Re: Proposed Splittable DoFn API changes

2017-04-07 Thread Eugene Kirpichov
The changes are in.

Also included is a handy change that allows one to skip implementing the
NewTracker method if the restriction type implements HasDefaultTracker,
leaving ProcessElement and GetInitialRestriction as the only two required
methods.

E.g. here's what a minimal SDF example looks like now - splittably pairing
a string with every number in 0..100:

  class CountFn extends DoFn<String, KV<String, Long>> {
    @ProcessElement
    public void process(ProcessContext c, OffsetRangeTracker tracker) {
      for (long i = tracker.currentRestriction().getFrom();
          tracker.tryClaim(i);
          ++i) {
        c.output(KV.of(c.element(), i));
      }
    }

    @GetInitialRestriction
    public OffsetRange getInitialRange(String element) {
      return new OffsetRange(0, 100);
    }
  }


On Thu, Apr 6, 2017 at 3:16 PM Eugene Kirpichov 
wrote:

> FWIW, here's a pull request implementing these changes:
> https://github.com/apache/beam/pull/2455
>
> On Wed, Apr 5, 2017 at 4:55 PM Eugene Kirpichov 
> wrote:
>
> Hey all,
>
> From the recent experience in continuing implementation of Splittable
> DoFn, I would like to propose a few changes to its API. They get rid of a
> bug, make parts of its semantics more well-defined and easier for a user to
> get right, and reduce the assumptions about the runner implementation.
>
> In short:
> - Add c.updateWatermark() and report watermark continuously via this
> method.
> - Make SDF.@ProcessElement return void, which is simpler for users though
> it doesn't allow resuming after a specified time.
> - Declare that SDF.@ProcessElement must guarantee that after it returns,
> the entire tracker.currentRestriction() was processed.
> - Add a bool RestrictionTracker.done() method to enforce the bullet above.
> - For resuming after specified time, use regular DoFn with state and
> timers API.
>
> The only downside is the removal (from SDF) of the ability to suspend the call
> for a certain amount of time - the suggestion is that, if you need that,
> you should use a regular DoFn and the timers API.
>
> Please see the full proposal in the following doc and comment there & vote
> on this thread.
>
> https://docs.google.com/document/d/1BGc8pM1GOvZhwR9SARSVte-20XEoBUxrGJ5gTWXdv3c/edit?usp=sharing
>
>
> I am going to concurrently start prototyping some parts of this proposal,
> because the current implementation is simply wrong and this proposal is the
> only way to fix it that I can think of, but I will adjust my implementation
> based on the discussion. I believe this proposal should not affect runner
> authors - I can make all the necessary changes myself.
>
> Thanks!
>
>


[DISCUSSION] PAssert success/failure count validation for all runners

2017-04-07 Thread Aviem Zur
Currently, PAssert assertions may not actually run, and tests will pass
while silently hiding issues.

Up until now, several runners have implemented a check that the expected
number of successful assertions has actually happened, and that no failed
assertions have happened (the runners which check this are the Dataflow
runner and the Spark runner).

This has been valuable in the past to find bugs which were hidden by
passing tests.

The work to repeat this in https://issues.apache.org/jira/browse/BEAM-1726
has surfaced bugs in the Flink runner that were also hidden by passing
tests. However, with the removal of aggregators in
https://issues.apache.org/jira/browse/BEAM-1148, this ticket will be harder
to implement, since the Flink runner does not support metrics.

I believe that validating that runners do in fact support the Beam model is
a blocker for the first stable release. (BEAM-1726 was also marked as a
blocker for the Flink runner.)

I think we have one of two choices here:
1. Keep implementing this for each runner separately.
2. Implement this in a runner-agnostic way (for runners which support
metrics, use metrics; for those that do not, use a fallback implementation,
perhaps using files or some other method - see the sketch below). This
should be covered by the following ticket:
https://issues.apache.org/jira/browse/BEAM-1763
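
To illustrate option 2, a rough sketch of a metrics-based check (the names
and structure are my assumptions for illustration, not the actual BEAM-1763
design):

  import static org.hamcrest.MatcherAssert.assertThat;
  import static org.hamcrest.Matchers.equalTo;

  import org.apache.beam.sdk.PipelineResult;
  import org.apache.beam.sdk.metrics.Counter;
  import org.apache.beam.sdk.metrics.MetricNameFilter;
  import org.apache.beam.sdk.metrics.MetricQueryResults;
  import org.apache.beam.sdk.metrics.Metrics;
  import org.apache.beam.sdk.metrics.MetricsFilter;
  import org.apache.beam.sdk.transforms.DoFn;

  // PAssert's checker DoFn bumps metric counters instead of aggregators.
  class AssertCheckFn extends DoFn<Integer, Void> {
    private final Counter successes = Metrics.counter("PAssert", "successes");
    private final Counter failures = Metrics.counter("PAssert", "failures");

    @ProcessElement
    public void process(ProcessContext c) {
      try {
        assertThat(c.element(), equalTo(42)); // the actual assertion
        successes.inc();
      } catch (AssertionError e) {
        failures.inc();
        throw e;
      }
    }
  }

  // After the run, the test harness queries the counters and fails the
  // test if successes != expected or failures > 0.
  class AssertionVerifier {
    static void verify(PipelineResult result) {
      MetricQueryResults metrics = result.metrics().queryMetrics(
          MetricsFilter.builder()
              .addNameFilter(MetricNameFilter.named("PAssert", "successes"))
              .addNameFilter(MetricNameFilter.named("PAssert", "failures"))
              .build());
      // Sum metrics.counters() here and compare against the expected totals.
    }
  }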

Thoughts?


Re: Combine.Global

2017-04-07 Thread Aviem Zur
Have you set the coder for your input PCollection? The one on which you
perform the Combine?

On Fri, Apr 7, 2017 at 4:24 PM Paul Gerver  wrote:

> Hello All,
>
> I'm trying to test out a Combine.Globally transform which takes in a small
> custom class (CustomA) and outputs a secondary custom class (CustomB). I
> have set the coder for the resulting PCollection, but Beam is
> arguing that a coder for a KV type is missing (see output at bottom).
>
> Since this is a global combine, neither the input nor the output is of KV
> type, so I decided to take a look at the Combine code.
> Combine.Globally.expand() performs a perKey and groupedValues underneath
> the covers, which requires making an intermediate KV PCollection whose
> coder--according to the docs--is inferred from the CombineFn.
>
> I believe I could work around this by registering a KvCoder with the
> CoderRegistry, but that's not intuitive. Is there a better way to address
> this currently, or should something be added to the CombineFn area for
> setting an output coder, similar to PCollection?
>
>
> Output:
> Exception in thread "main" java.lang.IllegalStateException: Unable to
> return a default Coder for
>
> GlobalCombine/Combine.perKey(CustomTuple)/Combine.GroupedValues/ParDo(Anonymous).out
> [Class]. Correct one of the following root causes:
>   No Coder has been manually specified;  you may do so using .setCoder().
>   Inferring a Coder from the CoderRegistry failed: Unable to provide a
> default Coder for org.apache.beam.sdk.values.KV. Correct one of
> the following root causes:
>   Building a Coder using a registered CoderFactory failed: Cannot provide
> coder for parameterized type org.apache.beam.sdk.values.KV:
> Unable to provide a default Coder for java.lang.Object. Correct one of the
> following root causes:
>
>
> Stack:
> at
>
> org.apache.beam.sdk.repackaged.com.google.common.base.Preconditions.checkState(Preconditions.java:174)
> at
> org.apache.beam.sdk.values.TypedPValue.getCoder(TypedPValue.java:51)
> at
> org.apache.beam.sdk.values.PCollection.getCoder(PCollection.java:130)
> at
>
> org.apache.beam.sdk.values.TypedPValue.finishSpecifying(TypedPValue.java:90)
> at
>
> org.apache.beam.sdk.runners.TransformHierarchy.finishSpecifyingInput(TransformHierarchy.java:95)
> at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:386)
> at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:302)
> at
> org.apache.beam.sdk.values.PCollection.apply(PCollection.java:154)
> at
> org.apache.beam.sdk.transforms.Combine$Globally.expand(Combine.java:1460)
> at
> org.apache.beam.sdk.transforms.Combine$Globally.expand(Combine.java:1337)
> at
> org.apache.beam.sdk.runners.PipelineRunner.apply(PipelineRunner.java:76)
> at
> org.apache.beam.runners.direct.DirectRunner.apply(DirectRunner.java:296)
> at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:388)
> at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:318)
> at
> org.apache.beam.sdk.values.PCollection.apply(PCollection.java:167)
> at
> org.iastate.edu.CombineTestPipeline.main(CombineTestPipeline.java:110)
>
>
> Let me know. Thanks!
> -Paul G
>
> --
> *Paul Gerver*
> pfger...@gmail.com
>


Combine.Global

2017-04-07 Thread Paul Gerver
Hello All,

I'm trying to test out a Combine.Globally transform which takes in a small
custom class (CustomA) and outputs a secondary custom class (CustomB). I
have set the coder for the resulting PCollection, but Beam is
arguing that a coder for a KV type is missing (see output at bottom).

Since this is a global combine, neither the input nor the output is of KV
type, so I decided to take a look at the Combine code.
Combine.Globally.expand() performs a perKey and groupedValues underneath
the covers, which requires making an intermediate KV PCollection whose
coder--according to the docs--is inferred from the CombineFn.

I believe I could work around this by registering a KvCoder with the
CoderRegistry, but that's not intuitive. Is there a better way to address
this currently, or should something be added to the CombineFn area for
setting an output coder, similar to PCollection?
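
(For what it's worth, one direction is to override the coder hooks that
CombineFn already exposes, so the intermediate KV coder never has to be
inferred from the registry. A sketch with placeholder String/Long types
standing in for CustomA/CustomB:

  import org.apache.beam.sdk.coders.Coder;
  import org.apache.beam.sdk.coders.CoderRegistry;
  import org.apache.beam.sdk.coders.VarLongCoder;
  import org.apache.beam.sdk.transforms.Combine;

  // Placeholder combine (counts String inputs into a Long); the coder
  // overrides at the bottom are the relevant part.
  class CountFn extends Combine.CombineFn<String, Long, Long> {
    @Override public Long createAccumulator() { return 0L; }

    @Override public Long addInput(Long acc, String input) { return acc + 1; }

    @Override public Long mergeAccumulators(Iterable<Long> accs) {
      long sum = 0;
      for (long a : accs) { sum += a; }
      return sum;
    }

    @Override public Long extractOutput(Long acc) { return acc; }

    // Handing Beam the coders directly sidesteps registry inference.
    @Override
    public Coder<Long> getAccumulatorCoder(CoderRegistry registry,
        Coder<String> inputCoder) {
      return VarLongCoder.of();
    }

    @Override
    public Coder<Long> getDefaultOutputCoder(CoderRegistry registry,
        Coder<String> inputCoder) {
      return VarLongCoder.of();
    }
  })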


Output:
Exception in thread "main" java.lang.IllegalStateException: Unable to
return a default Coder for
GlobalCombine/Combine.perKey(CustomTuple)/Combine.GroupedValues/ParDo(Anonymous).out
[Class]. Correct one of the following root causes:
  No Coder has been manually specified;  you may do so using .setCoder().
  Inferring a Coder from the CoderRegistry failed: Unable to provide a
default Coder for org.apache.beam.sdk.values.KV. Correct one of
the following root causes:
  Building a Coder using a registered CoderFactory failed: Cannot provide
coder for parameterized type org.apache.beam.sdk.values.KV:
Unable to provide a default Coder for java.lang.Object. Correct one of the
following root causes:


Stack:
at
org.apache.beam.sdk.repackaged.com.google.common.base.Preconditions.checkState(Preconditions.java:174)
at
org.apache.beam.sdk.values.TypedPValue.getCoder(TypedPValue.java:51)
at
org.apache.beam.sdk.values.PCollection.getCoder(PCollection.java:130)
at
org.apache.beam.sdk.values.TypedPValue.finishSpecifying(TypedPValue.java:90)
at
org.apache.beam.sdk.runners.TransformHierarchy.finishSpecifyingInput(TransformHierarchy.java:95)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:386)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:302)
at
org.apache.beam.sdk.values.PCollection.apply(PCollection.java:154)
at
org.apache.beam.sdk.transforms.Combine$Globally.expand(Combine.java:1460)
at
org.apache.beam.sdk.transforms.Combine$Globally.expand(Combine.java:1337)
at
org.apache.beam.sdk.runners.PipelineRunner.apply(PipelineRunner.java:76)
at
org.apache.beam.runners.direct.DirectRunner.apply(DirectRunner.java:296)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:388)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:318)
at
org.apache.beam.sdk.values.PCollection.apply(PCollection.java:167)
at
org.iastate.edu.CombineTestPipeline.main(CombineTestPipeline.java:110)


Let me know. Thanks!
-Paul G

-- 
*Paul Gerver*
pfger...@gmail.com


Re: Update of Pei in Alibaba

2017-04-07 Thread Ismaël Mejía
Hello Basti,

Thanks a lot for answering. I imagined that all the improvements of both
JStorm and Heron wouldn't translate perfectly, but it is still a worthwhile
goal to try to have the 'common' Storm parts isolated so they can be shared
with the other runners.

Really interesting, I wish you guys the best for this work, and
welcome to the community.

Ismaël

On Thu, Apr 6, 2017 at 11:51 AM, 刘键(Basti Liu)  wrote:
> Hi Ismaël,
>
>
>
> Sorry for the late response. I am the developer of JStorm, and currently work
> with Pei on the JStorm runner.
>
> We have gone through the current Storm runner
> (https://github.com/apache/storm/commits/beam-runner). But it is a very
> early draft; several PTransforms are not supported or not fully supported,
> especially window, trigger and state.
>
>
>
> Generally, JStorm is compatible with the basic API of Storm, while providing
> improvements or new features on topology master, state manager, exactly once,
> message transfer mechanism, stage-by-stage backpressure flow control,
> metrics, etc.
>
> For basic "at most once" jobs, the JStorm runner can be reused on Storm. But
> for "window", "state" and "exactly once" jobs, unfortunately, the JStorm
> runner can't be reused. Anyway, we will figure out whether porting it to
> Storm is possible in the future.
>
>
>
> Regards
>
> Jian Liu(Basti)
>
>
>
> -----Original Message-----
> From: Ismaël Mejía [mailto:ieme...@gmail.com]
> Sent: Sunday, April 02, 2017 3:18 AM
> To: dev@beam.apache.org
> Subject: Re: Update of Pei in Alibaba
>
>
>
> Excellent news,
>
>
>
> Pei, it would be great to have a new runner. I am curious about how
> different the implementations of storm are among them, considering that
> there are already three 'versions': Storm, JStorm and Heron. I wonder if
> one runner could translate to an API that would cover all of them (of
> course maybe I am super naive; I really don't know much about JStorm or
> Heron and how much they differ from the original storm).
>
>
>
> Jingsong, I am super curious about this Galaxy project, is there any
> public information about it? Is this related to the previous Blink Alibaba
> project? I already looked a bit, but searching "Alibaba galaxy" is a
> recipe for a myriad of telephone sellers :)
>
>
>
> Nice to see that you are going to keep contributing to the project Pei.
>
>
>
> Regards,
>
> Ismaël
>
>
>
>
>
>
>
> On Sat, Apr 1, 2017 at 4:59 PM, Tibor Kiss <tibor.k...@gmail.com> wrote:
>
>> Exciting times, looking forward to trying it out!
>>
>> I shall mention that Taylor Goetz also started creating a Beam runner
>> using Storm.
>> His work is available in the storm repo:
>> https://github.com/apache/storm/commits/beam-runner
>> Maybe it's worthwhile to take a peek and see if something is reusable
>> from there.
>>
>> - Tibor
>>
>
>> On Sat, Apr 1, 2017 at 4:37 AM, JingsongLee <lzljs3620...@aliyun.com> wrote:
>>
>>> Wow, very glad to see JStorm also started building a Beam runner.
>>> I am working on Galaxy (another streaming process engine in Alibaba).
>>> I hope that we can work together to promote the use of Apache Beam in
>>> Alibaba and China.
>>>
>>> best,
>>> JingsongLee
>
>>> --
>>> From: Pei HE <pei...@gmail.com>
>>> Time: 2017 Apr 1 (Sat) 09:24
>>> To: dev <dev@beam.apache.org>
>>> Subject: Update of Pei in Alibaba
>>>
>>> Hi all,
>>> In February, I moved from Seattle to Hangzhou, China, and joined Alibaba.
>>> And I want to give an update of things here.
>>>
>>> A colleague and I have been working on a JStorm
>>> <https://github.com/alibaba/jstorm> runner. We have a prototype that
>>> works with WordCount and PAssert. (I am going to start a separate
>>> email thread about how to get it reviewed and merged in Apache Beam.)
>>> We also have Spark clusters, and are very interested in using the Spark
>>> runner.
>>>
>>> Last Saturday, I went to the China Hadoop Summit and gave a talk about
>>> the Apache Beam model. While many companies gave talks about their
>>> in-house solutions for unified batch and unified SQL, there is also a
>>> lot of interest in and enthusiasm for Beam.
>>>
>>> Looking forward to chatting more.
>>> --
>>> Pei
>>>
>>>
>
>
>> --
>> Kiss Tibor
>>
>> +36 70 275 9863
>> tibor.k...@gmail.com
>