Re: How to expose/use the External transform on Java SDK

2019-07-25 Thread Kenneth Knowles
Top posting just to address the vendoring question. We didn't have
vendoring of gRPC until very recently. I think all rationale about keeping
it off the SDK surface are obsolete now. It will probably unlock a lot of
simplification to just go for it and use gRPC in the core SDK. Notably,
runners-core-construction-java is separate for only this reason, and it is
a pretty confusing distinction* to get right sometimes. But I could be
missing another pitfall. For example we wouldn't just undo all the API
decisions in runners-core-construction-java, because we still want to
isolate the user from gRPC when it is not relevant to them, and we also
want transform authors to be able to register payload translators.

Kenn

 * sdks-java-core versus runners-core-construction-java is all
construction-time bits and you have to think about whether it is
user-facing or runner-facing / whether it depends on proto
 * runners-core-construction-java versus runners-core-java is all
runner/harness-facing but you have to think about whether it is
construction-time or run-time

On Thu, Jul 25, 2019 at 9:45 AM Ismaël Mejía  wrote:

> It seems we mostly agree that this is a ‘core’ feature, and it already
> is at least in the other SDKs (python and go). So if its place is in
> sdks/java/core then the correct path might be #1.
>
> Robert where did the discussion about merging transform translation
> happened (I probably missed it) because that could be an extra point
> to decide to do this.
>
> I thought that gRPC leaked more stuff so that's better than I expected
> and vendoring helps with possible versions conflicts. Bad part is it
> is still 14MB so this is more stuff to stage and bigger fat jars for
> deployments in the open source runners. Not to mention the potential
> repeated versions that could arise from runner repackaging (as we
> lived with guava in the past).
>
> The other thing I feel weird about is that this feels a bit like
> mixing the execution part with the definition part that is something
> that I really appreciated of the current sdks/java/core separation, at
> least in Java.
>
> Is there a way just to include just the generated clients for an gRPC
> service and not the full gRPC stuff (side not I am an gRPC newbie
> still)? That could make it at least a bit more constrained (even if
> still mixing runtime with definition).
>
> On Thu, Jul 25, 2019 at 10:20 AM Robert Bradshaw 
> wrote:
> >
> > From the portability perspective,
> >
> https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto
> > and the associated services for executing pipelines is about as "core"
> > as it gets, and eventually I'd like to see all runners being portable
> > (even if they have an option of running user code in process rather
> > than in separate docker images) and the API between SDKs and Runners
> > would be these beam model protos rather than some parallel set of Java
> > classes. This would argue for #1. (There was also some discussion
> > recently about merging the transform translation into core as well, as
> > the current structure of keeping it separate introduces a lot of extra
> > hoops and makes it difficult to define user-level transforms that have
> > proper translation, which is along the same lines.)
> >
> > I'm not quite sure I follow the downsides of leaking the vendored
> > classes into the users classpath--isn't the point of vendoring to make
> > such exposure benign (and as you'd almost always be linking in a
> > runner, you'd get this anyway).
> >
> > Finally, from a simple user's API perspective, having
> > ExternalTransform in core makes a lot of sense and it'd be unfortunate
> > to contort the API for underlying technical reasons if it can be
> > avoided.
> >
> > On Wed, Jul 24, 2019 at 9:18 PM Heejong Lee  wrote:
> > >
> > > I think it depends how we define "the core" part of the SDK. If we
> define the core as only the (abstract) data types which describe BEAM
> pipeline model then it would be more sensible to put external transform
> into a separate extension module (option 4). Otherwise, option 1 makes
> sense.
> > >
> > > On Wed, Jul 24, 2019 at 11:56 AM Chamikara Jayalath <
> chamik...@google.com> wrote:
> > >>
> > >> The idea of 'ExternalTransform' is to allow users to use transforms
> in SDK X from SDK Y. I think this should be a core part of each SDK and
> corresponding external transforms ([a] for Java, [b] for Python) should be
> released with each SDK. This will also allow us to add core external
> transforms to some of the critical transforms that are not available in
> certain SDKs. So I prefer option (1).
> > >>
> > >> Rebo, I didn't realize there's an external transform in Go SDK.
> Looking at it, seems like it's more of an interface for native transforms
> implemented in each runner, not for cross-language use-cases. Is that
> correct ? May be we can reuse it for latter as well.
> > >>
> > >> Thanks,
> > >> Cham
> > >>
> > >> [a]
> 

Re: How to expose/use the External transform on Java SDK

2019-07-25 Thread Robert Bradshaw
>From the portability perspective,
https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto
and the associated services for executing pipelines is about as "core"
as it gets, and eventually I'd like to see all runners being portable
(even if they have an option of running user code in process rather
than in separate docker images) and the API between SDKs and Runners
would be these beam model protos rather than some parallel set of Java
classes. This would argue for #1. (There was also some discussion
recently about merging the transform translation into core as well, as
the current structure of keeping it separate introduces a lot of extra
hoops and makes it difficult to define user-level transforms that have
proper translation, which is along the same lines.)

I'm not quite sure I follow the downsides of leaking the vendored
classes into the users classpath--isn't the point of vendoring to make
such exposure benign (and as you'd almost always be linking in a
runner, you'd get this anyway).

Finally, from a simple user's API perspective, having
ExternalTransform in core makes a lot of sense and it'd be unfortunate
to contort the API for underlying technical reasons if it can be
avoided.

On Wed, Jul 24, 2019 at 9:18 PM Heejong Lee  wrote:
>
> I think it depends how we define "the core" part of the SDK. If we define the 
> core as only the (abstract) data types which describe BEAM pipeline model 
> then it would be more sensible to put external transform into a separate 
> extension module (option 4). Otherwise, option 1 makes sense.
>
> On Wed, Jul 24, 2019 at 11:56 AM Chamikara Jayalath  
> wrote:
>>
>> The idea of 'ExternalTransform' is to allow users to use transforms in SDK X 
>> from SDK Y. I think this should be a core part of each SDK and corresponding 
>> external transforms ([a] for Java, [b] for Python) should be released with 
>> each SDK. This will also allow us to add core external transforms to some of 
>> the critical transforms that are not available in certain SDKs. So I prefer 
>> option (1).
>>
>> Rebo, I didn't realize there's an external transform in Go SDK. Looking at 
>> it, seems like it's more of an interface for native transforms implemented 
>> in each runner, not for cross-language use-cases. Is that correct ? May be 
>> we can reuse it for latter as well.
>>
>> Thanks,
>> Cham
>>
>> [a] 
>> https://github.com/apache/beam/blob/master/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/External.java
>> [b] 
>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/external.py
>>
>> On Wed, Jul 24, 2019 at 10:25 AM Robert Burke  wrote:
>>>
>>> Ideas inline.
>>>
>>> On Wed, Jul 24, 2019, 9:56 AM Ismaël Mejía  wrote:

 After Beam Summit EU I was curious about the External transform. I was
 interested on the scenario of using it to call python code in the
 middle of a Java pipeline. This is a potentially useful scenario for
 example to evaluate models from python ML frameworks on Java
 pipelines. In my example I did a transform to classify elements in a
 simple Python ParDo and tried to connect it via the Java External
 transform.

 I found that the ExternalTransform code was added into
 `runners/core-construction-java` as part of BEAM-6747 [1]. However
 this code is not exposed currently as part of the Beam Java SDK, so
 end users won’t be able to find it easily. I found this weird and
 thought well it will be as simple as to move it into the Java SDK and
 voila!

 But of course this could not be so easy because this transform calls
 the Expansion service via gRPC and Java SDK does not have (and
 probably should not have) gRPC in its dependencies.
 So my second reflex was to add it into Java SDK and translate it a
 generic expansion all the runners, but this may not make sense because
 the External transform is not part of the runner translation since
 this is part of the Pipeline construction process (as pointed to me by
 Max in a slack discussion).

 So the question is: How do you think this should be exposed to the end 
 users?

 1. Should we add gRPC with all its deps to SDKs Java core? (this of
 course it is not nice because we will leak our vendored gRPC and
 friends into users classpath).
>>>
>>> If there's separation between the SDK and the Harness then this makes 
>>> sense. Otherwise the portable harness depends on GRPC at present, doesn't 
>>> it? Presently the Go SDK kicks off the harness, and then carries the GRPC 
>>> dependency (Though that's separable if necessary.)

 2. Should we do the dynamic loading of classes only an runtime if the
 transform is used to avoid the big extra compile dependency (and add
 runners/core-construction-java) as a runtime dependency.
 3. Should we create a ‘shim’ module to hide the gRPC dependency and

Re: How to expose/use the External transform on Java SDK

2019-07-24 Thread Heejong Lee
I think it depends how we define "the core" part of the SDK. If we define
the core as only the (abstract) data types which describe BEAM pipeline
model then it would be more sensible to put external transform into a
separate extension module (option 4). Otherwise, option 1 makes sense.

On Wed, Jul 24, 2019 at 11:56 AM Chamikara Jayalath 
wrote:

> The idea of 'ExternalTransform' is to allow users to use transforms in SDK
> X from SDK Y. I think this should be a core part of each SDK and
> corresponding external transforms ([a] for Java, [b] for Python) should be
> released with each SDK. This will also allow us to add core external
> transforms to some of the critical transforms that are not available in
> certain SDKs. So I prefer option (1).
>
> Rebo, I didn't realize there's an external transform in Go SDK. Looking at
> it, seems like it's more of an interface for native transforms implemented
> in each runner, not for cross-language use-cases. Is that correct ? May be
> we can reuse it for latter as well.
>
> Thanks,
> Cham
>
> [a]
> https://github.com/apache/beam/blob/master/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/External.java
> [b]
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/external.py
>
> On Wed, Jul 24, 2019 at 10:25 AM Robert Burke  wrote:
>
>> Ideas inline.
>>
>> On Wed, Jul 24, 2019, 9:56 AM Ismaël Mejía  wrote:
>>
>>> After Beam Summit EU I was curious about the External transform. I was
>>> interested on the scenario of using it to call python code in the
>>> middle of a Java pipeline. This is a potentially useful scenario for
>>> example to evaluate models from python ML frameworks on Java
>>> pipelines. In my example I did a transform to classify elements in a
>>> simple Python ParDo and tried to connect it via the Java External
>>> transform.
>>>
>>> I found that the ExternalTransform code was added into
>>> `runners/core-construction-java` as part of BEAM-6747 [1]. However
>>> this code is not exposed currently as part of the Beam Java SDK, so
>>> end users won’t be able to find it easily. I found this weird and
>>> thought well it will be as simple as to move it into the Java SDK and
>>> voila!
>>>
>>> But of course this could not be so easy because this transform calls
>>> the Expansion service via gRPC and Java SDK does not have (and
>>> probably should not have) gRPC in its dependencies.
>>> So my second reflex was to add it into Java SDK and translate it a
>>> generic expansion all the runners, but this may not make sense because
>>> the External transform is not part of the runner translation since
>>> this is part of the Pipeline construction process (as pointed to me by
>>> Max in a slack discussion).
>>>
>>> So the question is: How do you think this should be exposed to the end
>>> users?
>>>
>>> 1. Should we add gRPC with all its deps to SDKs Java core? (this of
>>> course it is not nice because we will leak our vendored gRPC and
>>> friends into users classpath).
>>>
>> If there's separation between the SDK and the Harness then this makes
>> sense. Otherwise the portable harness depends on GRPC at present, doesn't
>> it? Presently the Go SDK kicks off the harness, and then carries the GRPC
>> dependency (Though that's separable if necessary.)
>>
>>> 2. Should we do the dynamic loading of classes only an runtime if the
>>> transform is used to avoid the big extra compile dependency (and add
>>> runners/core-construction-java) as a runtime dependency.
>>> 3. Should we create a ‘shim’ module to hide the gRPC dependency and
>>> load the gRPC classes dynamically on it when the External transform is
>>> part of the pipeline.
>>> 4. Should we pack it as an extension (with the same issue of needing
>>> to leak the dependencies, but with less impact for users who do not
>>> use External) ?
>>> 5. Other?
>>>
>>> The ‘purist’ me thinks we should have External in sdks/java/core but
>>> maybe it is better not to. Any other opinions or ideas?
>>>
>>
>> The Go SDK supports External in it's core transforms set  However it
>> would be the callers are able to populate the data field however they need
>> to, whether that's some "known" configuration object or something sourced
>> from another service (eg the expansion service). The important part on the
>> other side is that the runner knows what to do with it.
>>
>> The non-portable pubsubio in the Go SDK is an example [1] using External
>> currently. The Dataflow runner recognizes it, and makes the substitution.
>> Eventually once the SDK supports SDF that can generate unbounded
>> PCollections, this will likely be replaced with that kind of
>> implementation, and the the existing "External" version will be moved to
>> part of the Go SDKs Dataflow runner package.
>>
>>
>> [1]
>> https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/io/pubsubio/pubsubio.go#L65
>>
>>>
>>> Thanks,
>>> Ismaël
>>>
>>> [1] https://issues.apache.org/jira/browse/BEAM-6747
>>>
>>

Re: How to expose/use the External transform on Java SDK

2019-07-24 Thread Chamikara Jayalath
The idea of 'ExternalTransform' is to allow users to use transforms in SDK
X from SDK Y. I think this should be a core part of each SDK and
corresponding external transforms ([a] for Java, [b] for Python) should be
released with each SDK. This will also allow us to add core external
transforms to some of the critical transforms that are not available in
certain SDKs. So I prefer option (1).

Rebo, I didn't realize there's an external transform in Go SDK. Looking at
it, seems like it's more of an interface for native transforms implemented
in each runner, not for cross-language use-cases. Is that correct ? May be
we can reuse it for latter as well.

Thanks,
Cham

[a]
https://github.com/apache/beam/blob/master/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/External.java
[b]
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/external.py

On Wed, Jul 24, 2019 at 10:25 AM Robert Burke  wrote:

> Ideas inline.
>
> On Wed, Jul 24, 2019, 9:56 AM Ismaël Mejía  wrote:
>
>> After Beam Summit EU I was curious about the External transform. I was
>> interested on the scenario of using it to call python code in the
>> middle of a Java pipeline. This is a potentially useful scenario for
>> example to evaluate models from python ML frameworks on Java
>> pipelines. In my example I did a transform to classify elements in a
>> simple Python ParDo and tried to connect it via the Java External
>> transform.
>>
>> I found that the ExternalTransform code was added into
>> `runners/core-construction-java` as part of BEAM-6747 [1]. However
>> this code is not exposed currently as part of the Beam Java SDK, so
>> end users won’t be able to find it easily. I found this weird and
>> thought well it will be as simple as to move it into the Java SDK and
>> voila!
>>
>> But of course this could not be so easy because this transform calls
>> the Expansion service via gRPC and Java SDK does not have (and
>> probably should not have) gRPC in its dependencies.
>> So my second reflex was to add it into Java SDK and translate it a
>> generic expansion all the runners, but this may not make sense because
>> the External transform is not part of the runner translation since
>> this is part of the Pipeline construction process (as pointed to me by
>> Max in a slack discussion).
>>
>> So the question is: How do you think this should be exposed to the end
>> users?
>>
>> 1. Should we add gRPC with all its deps to SDKs Java core? (this of
>> course it is not nice because we will leak our vendored gRPC and
>> friends into users classpath).
>>
> If there's separation between the SDK and the Harness then this makes
> sense. Otherwise the portable harness depends on GRPC at present, doesn't
> it? Presently the Go SDK kicks off the harness, and then carries the GRPC
> dependency (Though that's separable if necessary.)
>
>> 2. Should we do the dynamic loading of classes only an runtime if the
>> transform is used to avoid the big extra compile dependency (and add
>> runners/core-construction-java) as a runtime dependency.
>> 3. Should we create a ‘shim’ module to hide the gRPC dependency and
>> load the gRPC classes dynamically on it when the External transform is
>> part of the pipeline.
>> 4. Should we pack it as an extension (with the same issue of needing
>> to leak the dependencies, but with less impact for users who do not
>> use External) ?
>> 5. Other?
>>
>> The ‘purist’ me thinks we should have External in sdks/java/core but
>> maybe it is better not to. Any other opinions or ideas?
>>
>
> The Go SDK supports External in it's core transforms set  However it would
> be the callers are able to populate the data field however they need to,
> whether that's some "known" configuration object or something sourced from
> another service (eg the expansion service). The important part on the other
> side is that the runner knows what to do with it.
>
> The non-portable pubsubio in the Go SDK is an example [1] using External
> currently. The Dataflow runner recognizes it, and makes the substitution.
> Eventually once the SDK supports SDF that can generate unbounded
> PCollections, this will likely be replaced with that kind of
> implementation, and the the existing "External" version will be moved to
> part of the Go SDKs Dataflow runner package.
>
>
> [1]
> https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/io/pubsubio/pubsubio.go#L65
>
>>
>> Thanks,
>> Ismaël
>>
>> [1] https://issues.apache.org/jira/browse/BEAM-6747
>>
>


Re: How to expose/use the External transform on Java SDK

2019-07-24 Thread Robert Burke
Ideas inline.

On Wed, Jul 24, 2019, 9:56 AM Ismaël Mejía  wrote:

> After Beam Summit EU I was curious about the External transform. I was
> interested on the scenario of using it to call python code in the
> middle of a Java pipeline. This is a potentially useful scenario for
> example to evaluate models from python ML frameworks on Java
> pipelines. In my example I did a transform to classify elements in a
> simple Python ParDo and tried to connect it via the Java External
> transform.
>
> I found that the ExternalTransform code was added into
> `runners/core-construction-java` as part of BEAM-6747 [1]. However
> this code is not exposed currently as part of the Beam Java SDK, so
> end users won’t be able to find it easily. I found this weird and
> thought well it will be as simple as to move it into the Java SDK and
> voila!
>
> But of course this could not be so easy because this transform calls
> the Expansion service via gRPC and Java SDK does not have (and
> probably should not have) gRPC in its dependencies.
> So my second reflex was to add it into Java SDK and translate it a
> generic expansion all the runners, but this may not make sense because
> the External transform is not part of the runner translation since
> this is part of the Pipeline construction process (as pointed to me by
> Max in a slack discussion).
>
> So the question is: How do you think this should be exposed to the end
> users?
>
> 1. Should we add gRPC with all its deps to SDKs Java core? (this of
> course it is not nice because we will leak our vendored gRPC and
> friends into users classpath).
>
If there's separation between the SDK and the Harness then this makes
sense. Otherwise the portable harness depends on GRPC at present, doesn't
it? Presently the Go SDK kicks off the harness, and then carries the GRPC
dependency (Though that's separable if necessary.)

> 2. Should we do the dynamic loading of classes only an runtime if the
> transform is used to avoid the big extra compile dependency (and add
> runners/core-construction-java) as a runtime dependency.
> 3. Should we create a ‘shim’ module to hide the gRPC dependency and
> load the gRPC classes dynamically on it when the External transform is
> part of the pipeline.
> 4. Should we pack it as an extension (with the same issue of needing
> to leak the dependencies, but with less impact for users who do not
> use External) ?
> 5. Other?
>
> The ‘purist’ me thinks we should have External in sdks/java/core but
> maybe it is better not to. Any other opinions or ideas?
>

The Go SDK supports External in it's core transforms set  However it would
be the callers are able to populate the data field however they need to,
whether that's some "known" configuration object or something sourced from
another service (eg the expansion service). The important part on the other
side is that the runner knows what to do with it.

The non-portable pubsubio in the Go SDK is an example [1] using External
currently. The Dataflow runner recognizes it, and makes the substitution.
Eventually once the SDK supports SDF that can generate unbounded
PCollections, this will likely be replaced with that kind of
implementation, and the the existing "External" version will be moved to
part of the Go SDKs Dataflow runner package.


[1]
https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/io/pubsubio/pubsubio.go#L65

>
> Thanks,
> Ismaël
>
> [1] https://issues.apache.org/jira/browse/BEAM-6747
>


How to expose/use the External transform on Java SDK

2019-07-24 Thread Ismaël Mejía
After Beam Summit EU I was curious about the External transform. I was
interested on the scenario of using it to call python code in the
middle of a Java pipeline. This is a potentially useful scenario for
example to evaluate models from python ML frameworks on Java
pipelines. In my example I did a transform to classify elements in a
simple Python ParDo and tried to connect it via the Java External
transform.

I found that the ExternalTransform code was added into
`runners/core-construction-java` as part of BEAM-6747 [1]. However
this code is not exposed currently as part of the Beam Java SDK, so
end users won’t be able to find it easily. I found this weird and
thought well it will be as simple as to move it into the Java SDK and
voila!

But of course this could not be so easy because this transform calls
the Expansion service via gRPC and Java SDK does not have (and
probably should not have) gRPC in its dependencies.
So my second reflex was to add it into Java SDK and translate it a
generic expansion all the runners, but this may not make sense because
the External transform is not part of the runner translation since
this is part of the Pipeline construction process (as pointed to me by
Max in a slack discussion).

So the question is: How do you think this should be exposed to the end users?

1. Should we add gRPC with all its deps to SDKs Java core? (this of
course it is not nice because we will leak our vendored gRPC and
friends into users classpath).
2. Should we do the dynamic loading of classes only an runtime if the
transform is used to avoid the big extra compile dependency (and add
runners/core-construction-java) as a runtime dependency.
3. Should we create a ‘shim’ module to hide the gRPC dependency and
load the gRPC classes dynamically on it when the External transform is
part of the pipeline.
4. Should we pack it as an extension (with the same issue of needing
to leak the dependencies, but with less impact for users who do not
use External) ?
5. Other?

The ‘purist’ me thinks we should have External in sdks/java/core but
maybe it is better not to. Any other opinions or ideas?

Thanks,
Ismaël

[1] https://issues.apache.org/jira/browse/BEAM-6747