Re: Artifact staging in cross-language pipelines

2019-12-17 Thread Robert Bradshaw
Thanks! I've left some comments on the doc.

On Tue, Dec 17, 2019, 5:03 PM Heejong Lee  wrote:

> Hi,
>
> I wrote a draft of the implementation plan[1]. The summary is on the first
> page. Any help would be appreciated!
>
> [1]:
> https://docs.google.com/document/d/1L7MJcfyy9mg2Ahfw5XPhUeBe-dyvAPMOYOiFA1-kAog/edit?usp=sharing
>
> On Thu, Dec 12, 2019 at 5:02 PM Heejong Lee  wrote:
>
>> I'm brushing up my memory by revisiting the doc[1], and it seems like we've
>> already reached consensus on the bigger picture. I'll start drafting
>> the implementation plan.
>>
>> [1]:
>> https://docs.google.com/document/d/1XaiNekAY2sptuQRIXpjGAyaYdSc-wlJ-VKjl04c8N48/edit?usp=sharing
>>
>> On Tue, Nov 26, 2019 at 3:54 AM Maximilian Michels 
>> wrote:
>>
>>> Hey Heejong,
>>>
>>> I don't think so. It would be great to push this forward.
>>>
>>> Thanks,
>>> Max
>>>
>>> On 26.11.19 02:49, Heejong Lee wrote:
>>> > Hi,
>>> >
>>> > Is anyone actively working on artifact staging extension for
>>> > cross-language pipelines? I'm thinking I can contribute to it in
>>> coming
>>> > Dec. If anyone has any progress on this and needs help, please let me
>>> know.
>>> >
>>> > Thanks,
>>> >
>>> > On Wed, Jun 12, 2019 at 2:42 AM Ismaël Mejía wrote:
>>> >
>>> > Can you please add this to the design documents webpage.
>>> > https://beam.apache.org/contribute/design-documents/
>>> >
>>> > On Wed, May 8, 2019 at 7:29 PM Chamikara Jayalath
>>> > <chamik...@google.com> wrote:
>>> >  >
>>> >  >
>>> >  >
>>> >  > On Tue, May 7, 2019 at 10:21 AM Maximilian Michels
>>> > <m...@apache.org> wrote:
>>> >  >>
>>> >  >> Here's the first draft:
>>> >  >>
>>> >
>>> https://docs.google.com/document/d/1XaiNekAY2sptuQRIXpjGAyaYdSc-wlJ-VKjl04c8N48/edit?usp=sharing
>>> >  >>
>>> >  >> It's rather high-level. We may want to add more details once
>>> we have
>>> >  >> finalized the design. Feel free to make comments and edits.
>>> >  >
>>> >  >
>>> >  > Thanks Max. Added some comments.
>>> >  >
>>> >  >>
>>> >  >>
>>> >  >> > All of this goes back to the idea that I think the listing of
>>> >  >> > artifacts (or more general dependencies) should be a property
>>> > of the
>>> >  >> > environment themselves.
>>> >  >>
>>> >  >> +1 I came to the same conclusion while thinking about how to
>>> store
>>> >  >> artifact information for deferred execution of the pipeline.
>>> >  >>
>>> >  >> -Max
>>> >  >>
>>> >  >> On 07.05.19 18:10, Robert Bradshaw wrote:
>>> >  >> > Looking forward to your writeup, Max. In the meantime, some
>>> > comments below.
>>> >  >> >
>>> >  >> >
>>> >  >> > From: Lukasz Cwik <lc...@google.com>
>>> >  >> > Date: Thu, May 2, 2019 at 6:45 PM
>>> >  >> > To: dev
>>> >  >> >
>>> >  >> >>
>>> >  >> >>
>>> >  >> >> On Thu, May 2, 2019 at 7:20 AM Robert Bradshaw
>>> > <rober...@google.com> wrote:
>>> >  >> >>>
>>> >  >> >>> On Sat, Apr 27, 2019 at 1:14 AM Lukasz Cwik
>>> > <lc...@google.com> wrote:
>>> >  >> 
>>> >  >>  We should stick with URN + payload + artifact metadata[1]
>>> > where the only mandatory one that all SDKs and expansion services
>>> > understand is the "bytes" artifact type. This allows us to add
>>> > optional URNs for file://, http://, Maven, PyPi, ... in the
>>> future.
>>> > I would make the artifact staging service use the same URN +
>>> payload
>>> > mechanism to get compatibility of artifacts across the different
>>> > services and also have the artifact staging service be able to be
>>> > queried for the list of artifact types it supports.
>>> >  >> >>>
>>> >  >> >>> +1
>>> >  >> >>>
>>> >  >>  Finally, we would need to have environments enumerate the
>>> > artifact types that they support.
>>> >  >> >>>
>>> >  >> >>> Meaning at runtime, or as another field statically set in
>>> > the proto?
>>> >  >> >>
>>> >  >> >>
>>> >  >> >> I don't believe runners/SDKs should have to know what
>>> > artifacts each environment supports at runtime and instead have
>>> > environments enumerate them explicitly in the proto. I have been
>>> > thinking about a more general "capabilities" block on environments
>>> > which allow them to enumerate URNs that the environment
>>> understands.
>>> > This would include artifact type URNs, PTransform URNs, coder URNs,
>>> > ... I haven't proposed anything specific down this line yet because
>>> > I was wondering how environment resources (CPU, min memory,
>>> hardware
>>> > like GPU, AWS/GCP/Azure/... machine types) should/could tie into
>>> this.
>>> >  >> >>
>>> >  >> >>>
>>> >  >>  Having everyone have the same "artifact" representation
>>> > would be beneficial since:
>>> >  >>  a) 

Re: Artifact staging in cross-language pipelines

2019-12-12 Thread Heejong Lee
I'm brushing up my memory by revisiting the doc[1], and it seems like we've
already reached consensus on the bigger picture. I'll start drafting
the implementation plan.

[1]:
https://docs.google.com/document/d/1XaiNekAY2sptuQRIXpjGAyaYdSc-wlJ-VKjl04c8N48/edit?usp=sharing

On Tue, Nov 26, 2019 at 3:54 AM Maximilian Michels  wrote:

> Hey Heejong,
>
> I don't think so. It would be great to push this forward.
>
> Thanks,
> Max
>
> On 26.11.19 02:49, Heejong Lee wrote:
> > Hi,
> >
> > Is anyone actively working on artifact staging extension for
> > cross-language pipelines? I'm thinking I can contribute to it in coming
> > Dec. If anyone has any progress on this and needs help, please let me
> know.
> >
> > Thanks,
> >
> > On Wed, Jun 12, 2019 at 2:42 AM Ismaël Mejía wrote:
> >
> > Can you please add this to the design documents webpage.
> > https://beam.apache.org/contribute/design-documents/
> >
> > On Wed, May 8, 2019 at 7:29 PM Chamikara Jayalath
> > <chamik...@google.com> wrote:
> >  >
> >  >
> >  >
> >  > On Tue, May 7, 2019 at 10:21 AM Maximilian Michels
> > <m...@apache.org> wrote:
> >  >>
> >  >> Here's the first draft:
> >  >>
> >
> https://docs.google.com/document/d/1XaiNekAY2sptuQRIXpjGAyaYdSc-wlJ-VKjl04c8N48/edit?usp=sharing
> >  >>
> >  >> It's rather high-level. We may want to add more details once we
> have
> >  >> finalized the design. Feel free to make comments and edits.
> >  >
> >  >
> >  > Thanks Max. Added some comments.
> >  >
> >  >>
> >  >>
> >  >> > All of this goes back to the idea that I think the listing of
> >  >> > artifacts (or more general dependencies) should be a property
> > of the
> >  >> > environment themselves.
> >  >>
> >  >> +1 I came to the same conclusion while thinking about how to
> store
> >  >> artifact information for deferred execution of the pipeline.
> >  >>
> >  >> -Max
> >  >>
> >  >> On 07.05.19 18:10, Robert Bradshaw wrote:
> >  >> > Looking forward to your writeup, Max. In the meantime, some
> > comments below.
> >  >> >
> >  >> >
> >  >> > From: Lukasz Cwik <lc...@google.com>
> >  >> > Date: Thu, May 2, 2019 at 6:45 PM
> >  >> > To: dev
> >  >> >
> >  >> >>
> >  >> >>
> >  >> >> On Thu, May 2, 2019 at 7:20 AM Robert Bradshaw
> > <rober...@google.com> wrote:
> >  >> >>>
> >  >> >>> On Sat, Apr 27, 2019 at 1:14 AM Lukasz Cwik
> > <lc...@google.com> wrote:
> >  >> 
> >  >>  We should stick with URN + payload + artifact metadata[1]
> > where the only mandatory one that all SDKs and expansion services
> > understand is the "bytes" artifact type. This allows us to add
> > optional URNs for file://, http://, Maven, PyPi, ... in the future.
> > I would make the artifact staging service use the same URN + payload
> > mechanism to get compatibility of artifacts across the different
> > services and also have the artifact staging service be able to be
> > queried for the list of artifact types it supports.
> >  >> >>>
> >  >> >>> +1
> >  >> >>>
> >  >>  Finally, we would need to have environments enumerate the
> > artifact types that they support.
> >  >> >>>
> >  >> >>> Meaning at runtime, or as another field statically set in
> > the proto?
> >  >> >>
> >  >> >>
> >  >> >> I don't believe runners/SDKs should have to know what
> > artifacts each environment supports at runtime and instead have
> > environments enumerate them explicitly in the proto. I have been
> > thinking about a more general "capabilities" block on environments
> > which allow them to enumerate URNs that the environment understands.
> > This would include artifact type URNs, PTransform URNs, coder URNs,
> > ... I haven't proposed anything specific down this line yet because
> > I was wondering how environment resources (CPU, min memory, hardware
> > like GPU, AWS/GCP/Azure/... machine types) should/could tie into
> this.
> >  >> >>
> >  >> >>>
> >  >>  Having everyone have the same "artifact" representation
> > would be beneficial since:
> >  >>  a) Python environments could install dependencies from a
> > requirements.txt file (something that the Google Cloud Dataflow
> > Python docker container allows for today)
> >  >>  b) It provides an extensible and versioned mechanism for
> > SDKs, environments, and artifact staging/retrieval services to
> > support additional artifact types
> >  >>  c) Allow for expressing a canonical representation of an
> > artifact like a Maven package so a runner could merge environments
> > that the runner deems compatible.
> >  >> 
> >  >>  The flow I could see is:
> >  >>  1) (optional) query 

Re: Artifact staging in cross-language pipelines

2019-11-26 Thread Maximilian Michels

Hey Heejong,

I don't think so. It would be great to push this forward.

Thanks,
Max

On 26.11.19 02:49, Heejong Lee wrote:

Hi,

Is anyone actively working on artifact staging extension for 
cross-language pipelines? I'm thinking I can contribute to it in coming 
Dec. If anyone has any progress on this and needs help, please let me know.


Thanks,

On Wed, Jun 12, 2019 at 2:42 AM Ismaël Mejía wrote:


Can you please add this to the design documents webpage.
https://beam.apache.org/contribute/design-documents/

On Wed, May 8, 2019 at 7:29 PM Chamikara Jayalath
<chamik...@google.com> wrote:
 >
 >
 >
 > On Tue, May 7, 2019 at 10:21 AM Maximilian Michels
<m...@apache.org> wrote:
 >>
 >> Here's the first draft:
 >>

https://docs.google.com/document/d/1XaiNekAY2sptuQRIXpjGAyaYdSc-wlJ-VKjl04c8N48/edit?usp=sharing
 >>
 >> It's rather high-level. We may want to add more details once we have
 >> finalized the design. Feel free to make comments and edits.
 >
 >
 > Thanks Max. Added some comments.
 >
 >>
 >>
 >> > All of this goes back to the idea that I think the listing of
 >> > artifacts (or more general dependencies) should be a property
of the
 >> > environment themselves.
 >>
 >> +1 I came to the same conclusion while thinking about how to store
 >> artifact information for deferred execution of the pipeline.
 >>
 >> -Max
 >>
 >> On 07.05.19 18:10, Robert Bradshaw wrote:
 >> > Looking forward to your writeup, Max. In the meantime, some
comments below.
 >> >
 >> >
 >> > From: Lukasz Cwik <lc...@google.com>
 >> > Date: Thu, May 2, 2019 at 6:45 PM
 >> > To: dev
 >> >
 >> >>
 >> >>
 >> >> On Thu, May 2, 2019 at 7:20 AM Robert Bradshaw
<rober...@google.com> wrote:
 >> >>>
 >> >>> On Sat, Apr 27, 2019 at 1:14 AM Lukasz Cwik
<lc...@google.com> wrote:
 >> 
 >>  We should stick with URN + payload + artifact metadata[1]
where the only mandatory one that all SDKs and expansion services
understand is the "bytes" artifact type. This allows us to add
optional URNs for file://, http://, Maven, PyPi, ... in the future.
I would make the artifact staging service use the same URN + payload
mechanism to get compatibility of artifacts across the different
services and also have the artifact staging service be able to be
queried for the list of artifact types it supports.
 >> >>>
 >> >>> +1
 >> >>>
 >>  Finally, we would need to have environments enumerate the
artifact types that they support.
 >> >>>
 >> >>> Meaning at runtime, or as another field statically set in
the proto?
 >> >>
 >> >>
 >> >> I don't believe runners/SDKs should have to know what
artifacts each environment supports at runtime and instead have
environments enumerate them explicitly in the proto. I have been
thinking about a more general "capabilities" block on environments
which allow them to enumerate URNs that the environment understands.
This would include artifact type URNs, PTransform URNs, coder URNs,
... I haven't proposed anything specific down this line yet because
I was wondering how environment resources (CPU, min memory, hardware
like GPU, AWS/GCP/Azure/... machine types) should/could tie into this.
 >> >>
 >> >>>
 >>  Having everyone have the same "artifact" representation
would be beneficial since:
 >>  a) Python environments could install dependencies from a
requirements.txt file (something that the Google Cloud Dataflow
Python docker container allows for today)
 >>  b) It provides an extensible and versioned mechanism for
SDKs, environments, and artifact staging/retrieval services to
support additional artifact types
 >>  c) Allow for expressing a canonical representation of an
artifact like a Maven package so a runner could merge environments
that the runner deems compatible.
 >> 
 >>  The flow I could see is:
 >>  1) (optional) query artifact staging service for supported
artifact types
 >>  2) SDK request expansion service to expand transform
passing in a list of artifact types the SDK and artifact staging
service support, the expansion service returns a list of artifact
types limited to those supported types + any supported by the
environment
 >> >>>
 >> >>> The crux of the issue seems to be how the expansion service
returns
 >> >>> the artifacts themselves. Is this going with the approach
that the
 >> >>> caller of the expansion service must host an artifact
staging service?
 >> >>
 >> >>
 >> >> The caller would not need to host an artifact staging service
(but would become effectively a proxy service, see 

Re: Artifact staging in cross-language pipelines

2019-06-12 Thread Ismaël Mejía
Can you please add this to the design documents webpage.
https://beam.apache.org/contribute/design-documents/

On Wed, May 8, 2019 at 7:29 PM Chamikara Jayalath  wrote:
>
>
>
> On Tue, May 7, 2019 at 10:21 AM Maximilian Michels  wrote:
>>
>> Here's the first draft:
>> https://docs.google.com/document/d/1XaiNekAY2sptuQRIXpjGAyaYdSc-wlJ-VKjl04c8N48/edit?usp=sharing
>>
>> It's rather high-level. We may want to add more details once we have
>> finalized the design. Feel free to make comments and edits.
>
>
> Thanks Max. Added some comments.
>
>>
>>
>> > All of this goes back to the idea that I think the listing of
>> > artifacts (or more general dependencies) should be a property of the
>> > environment themselves.
>>
>> +1 I came to the same conclusion while thinking about how to store
>> artifact information for deferred execution of the pipeline.
>>
>> -Max
>>
>> On 07.05.19 18:10, Robert Bradshaw wrote:
>> > Looking forward to your writeup, Max. In the meantime, some comments below.
>> >
>> >
>> > From: Lukasz Cwik 
>> > Date: Thu, May 2, 2019 at 6:45 PM
>> > To: dev
>> >
>> >>
>> >>
>> >> On Thu, May 2, 2019 at 7:20 AM Robert Bradshaw  
>> >> wrote:
>> >>>
>> >>> On Sat, Apr 27, 2019 at 1:14 AM Lukasz Cwik  wrote:
>> 
>>  We should stick with URN + payload + artifact metadata[1] where the 
>>  only mandatory one that all SDKs and expansion services understand is 
>>  the "bytes" artifact type. This allows us to add optional URNs for 
>>  file://, http://, Maven, PyPi, ... in the future. I would make the 
>>  artifact staging service use the same URN + payload mechanism to get 
>>  compatibility of artifacts across the different services and also have 
>>  the artifact staging service be able to be queried for the list of 
>>  artifact types it supports.
>> >>>
>> >>> +1
>> >>>
>>  Finally, we would need to have environments enumerate the artifact 
>>  types that they support.
>> >>>
>> >>> Meaning at runtime, or as another field statically set in the proto?
>> >>
>> >>
>> >> I don't believe runners/SDKs should have to know what artifacts each 
>> >> environment supports at runtime and instead have environments enumerate 
>> >> them explicitly in the proto. I have been thinking about a more general 
>> >> "capabilities" block on environments which allow them to enumerate URNs 
>> >> that the environment understands. This would include artifact type URNs, 
>> >> PTransform URNs, coder URNs, ... I haven't proposed anything specific 
>> >> down this line yet because I was wondering how environment resources 
>> >> (CPU, min memory, hardware like GPU, AWS/GCP/Azure/... machine types) 
>> >> should/could tie into this.
>> >>
>> >>>
>>  Having everyone have the same "artifact" representation would be 
>>  beneficial since:
>>  a) Python environments could install dependencies from a 
>>  requirements.txt file (something that the Google Cloud Dataflow Python 
>>  docker container allows for today)
>>  b) It provides an extensible and versioned mechanism for SDKs, 
>>  environments, and artifact staging/retrieval services to support 
>>  additional artifact types
>>  c) Allow for expressing a canonical representation of an artifact like 
>>  a Maven package so a runner could merge environments that the runner 
>>  deems compatible.
>> 
>>  The flow I could see is:
>>  1) (optional) query artifact staging service for supported artifact 
>>  types
>>  2) SDK request expansion service to expand transform passing in a list 
>>  of artifact types the SDK and artifact staging service support, the 
>>  expansion service returns a list of artifact types limited to those 
>>  supported types + any supported by the environment
>> >>>
>> >>> The crux of the issue seems to be how the expansion service returns
>> >>> the artifacts themselves. Is this going with the approach that the
>> >>> caller of the expansion service must host an artifact staging service?
>> >>
>> >>
>> >> The caller would not need to host an artifact staging service (but would 
>> >> become effectively a proxy service, see my comment below for more 
>> >> details) as I would have expected this to be part of the expansion 
>> >> service response.
>> >>
>> >>>
>> >>> There is also the question of how the returned artifacts get
>> >>> attached to the various environments, or whether they get implicitly
>> >>> applied to all returned stages (which need not have a consistent
>> >>> environment)?
>> >>
>> >>
>> >> I would suggest returning additional information that says what artifact 
>> >> is for which environment. Applying all artifacts to all environments is 
>> >> likely to cause issues since some environments may not understand certain 
>> >> artifact types or may get conflicting versions of artifacts. I would see 
>> >> this happening since an expansion service that aggregates other expansion 
>> >> services seems likely, for 

Re: Artifact staging in cross-language pipelines

2019-05-08 Thread Chamikara Jayalath
On Tue, May 7, 2019 at 10:21 AM Maximilian Michels  wrote:

> Here's the first draft:
>
> https://docs.google.com/document/d/1XaiNekAY2sptuQRIXpjGAyaYdSc-wlJ-VKjl04c8N48/edit?usp=sharing
>
> It's rather high-level. We may want to add more details once we have
> finalized the design. Feel free to make comments and edits.
>

Thanks Max. Added some comments.


>
> > All of this goes back to the idea that I think the listing of
> > artifacts (or more general dependencies) should be a property of the
> > environment themselves.
>
> +1 I came to the same conclusion while thinking about how to store
> artifact information for deferred execution of the pipeline.
>
> -Max
>
> On 07.05.19 18:10, Robert Bradshaw wrote:
> > Looking forward to your writeup, Max. In the meantime, some comments
> below.
> >
> >
> > From: Lukasz Cwik 
> > Date: Thu, May 2, 2019 at 6:45 PM
> > To: dev
> >
> >>
> >>
> >> On Thu, May 2, 2019 at 7:20 AM Robert Bradshaw 
> wrote:
> >>>
> >>> On Sat, Apr 27, 2019 at 1:14 AM Lukasz Cwik  wrote:
> 
>  We should stick with URN + payload + artifact metadata[1] where the
> only mandatory one that all SDKs and expansion services understand is the
> "bytes" artifact type. This allows us to add optional URNs for file://,
> http://, Maven, PyPi, ... in the future. I would make the artifact
> staging service use the same URN + payload mechanism to get compatibility
> of artifacts across the different services and also have the artifact
> staging service be able to be queried for the list of artifact types it
> supports.
> >>>
> >>> +1
> >>>
>  Finally, we would need to have environments enumerate the artifact
> types that they support.
> >>>
> >>> Meaning at runtime, or as another field statically set in the proto?
> >>
> >>
> >> I don't believe runners/SDKs should have to know what artifacts each
> environment supports at runtime and instead have environments enumerate
> them explicitly in the proto. I have been thinking about a more general
> "capabilities" block on environments which allow them to enumerate URNs
> that the environment understands. This would include artifact type URNs,
> PTransform URNs, coder URNs, ... I haven't proposed anything specific down
> this line yet because I was wondering how environment resources (CPU, min
> memory, hardware like GPU, AWS/GCP/Azure/... machine types) should/could
> tie into this.
> >>
> >>>
>  Having everyone have the same "artifact" representation would be
> beneficial since:
>  a) Python environments could install dependencies from a
> requirements.txt file (something that the Google Cloud Dataflow Python
> docker container allows for today)
>  b) It provides an extensible and versioned mechanism for SDKs,
> environments, and artifact staging/retrieval services to support additional
> artifact types
>  c) Allow for expressing a canonical representation of an artifact
> like a Maven package so a runner could merge environments that the runner
> deems compatible.
> 
>  The flow I could see is:
>  1) (optional) query artifact staging service for supported artifact
> types
>  2) SDK request expansion service to expand transform passing in a
> list of artifact types the SDK and artifact staging service support, the
> expansion service returns a list of artifact types limited to those
> supported types + any supported by the environment
> >>>
> >>> The crux of the issue seems to be how the expansion service returns
> >>> the artifacts themselves. Is this going with the approach that the
> >>> caller of the expansion service must host an artifact staging service?
> >>
> >>
> >> The caller would not need to host an artifact staging service (but
> would become effectively a proxy service, see my comment below for more
> details) as I would have expected this to be part of the expansion service
> response.
> >>
> >>>
> >>> There is also the question of how the returned artifacts get
> >>> attached to the various environments, or whether they get implicitly
> >>> applied to all returned stages (which need not have a consistent
> >>> environment)?
> >>
> >>
> >> I would suggest returning additional information that says what
> artifact is for which environment. Applying all artifacts to all
> environments is likely to cause issues since some environments may not
> understand certain artifact types or may get conflicting versions of
> artifacts. I would see this happening since an expansion service that
> aggregates other expansion services seems likely, for example:
> >>   /-> ExpansionService(Python)
> >> ExpansionService(Aggregator) --> ExpansionService(Java)
> >>   \-> ExpansionService(Go)
> >
> > All of this goes back to the idea that I think the listing of
> > artifacts (or more general dependencies) should be a property of the
> > environment themselves.
> >
>  3) SDK converts any artifact types that the artifact staging service
> or environment 

Re: Artifact staging in cross-language pipelines

2019-05-07 Thread Maximilian Michels
Here's the first draft: 
https://docs.google.com/document/d/1XaiNekAY2sptuQRIXpjGAyaYdSc-wlJ-VKjl04c8N48/edit?usp=sharing


It's rather high-level. We may want to add more details once we have 
finalized the design. Feel free to make comments and edits.



All of this goes back to the idea that I think the listing of
artifacts (or more general dependencies) should be a property of the
environment themselves.


+1 I came to the same conclusion while thinking about how to store 
artifact information for deferred execution of the pipeline.


-Max

On 07.05.19 18:10, Robert Bradshaw wrote:

Looking forward to your writeup, Max. In the meantime, some comments below.


From: Lukasz Cwik 
Date: Thu, May 2, 2019 at 6:45 PM
To: dev




On Thu, May 2, 2019 at 7:20 AM Robert Bradshaw  wrote:


On Sat, Apr 27, 2019 at 1:14 AM Lukasz Cwik  wrote:


We should stick with URN + payload + artifact metadata[1] where the only mandatory one 
that all SDKs and expansion services understand is the "bytes" artifact type. 
This allows us to add optional URNs for file://, http://, Maven, PyPi, ... in the future. 
I would make the artifact staging service use the same URN + payload mechanism to get 
compatibility of artifacts across the different services and also have the artifact 
staging service be able to be queried for the list of artifact types it supports.


+1


Finally, we would need to have environments enumerate the artifact types that 
they support.


Meaning at runtime, or as another field statically set in the proto?



I don't believe runners/SDKs should have to know what artifacts each environment supports 
at runtime and instead have environments enumerate them explicitly in the proto. I have 
been thinking about a more general "capabilities" block on environments which 
allow them to enumerate URNs that the environment understands. This would include 
artifact type URNs, PTransform URNs, coder URNs, ... I haven't proposed anything specific 
down this line yet because I was wondering how environment resources (CPU, min memory, 
hardware like GPU, AWS/GCP/Azure/... machine types) should/could tie into this.




Having everyone have the same "artifact" representation would be beneficial 
since:
a) Python environments could install dependencies from a requirements.txt file 
(something that the Google Cloud Dataflow Python docker container allows for 
today)
b) It provides an extensible and versioned mechanism for SDKs, environments, 
and artifact staging/retrieval services to support additional artifact types
c) Allow for expressing a canonical representation of an artifact like a Maven 
package so a runner could merge environments that the runner deems compatible.

The flow I could see is:
1) (optional) query artifact staging service for supported artifact types
2) SDK request expansion service to expand transform passing in a list of 
artifact types the SDK and artifact staging service support, the expansion 
service returns a list of artifact types limited to those supported types + any 
supported by the environment


The crux of the issue seems to be how the expansion service returns
the artifacts themselves. Is this going with the approach that the
caller of the expansion service must host an artifact staging service?



The caller would not need to host an artifact staging service (but would become 
effectively a proxy service, see my comment below for more details) as I would 
have expected this to be part of the expansion service response.



There is also the question of how the returned artifacts get
attached to the various environments, or whether they get implicitly
applied to all returned stages (which need not have a consistent
environment)?



I would suggest returning additional information that says what artifact is for 
which environment. Applying all artifacts to all environments is likely to 
cause issues since some environments may not understand certain artifact types 
or may get conflicting versions of artifacts. I would see this happening since 
an expansion service that aggregates other expansion services seems likely, for 
example:
  /-> ExpansionService(Python)
ExpansionService(Aggregator) --> ExpansionService(Java)
  \-> ExpansionService(Go)


All of this goes back to the idea that I think the listing of
artifacts (or more general dependencies) should be a property of the
environment themselves.


3) SDK converts any artifact types that the artifact staging service or environment 
doesn't understand, e.g. pulls down Maven dependencies and converts them to 
"bytes" artifacts


Here I think we're conflating two things. The "type" of an artifact is
both (1) how to fetch the bytes and (2) how to interpret them (e.g. is
this a jar file, or a pip tarball, or just some data needed by a DoFn,
or ...) Only (1) can be freely transmuted.



You're right. Thinking about this some more, general artifact conversion is 
unlikely to be practical 

Re: Artifact staging in cross-language pipelines

2019-05-07 Thread Robert Bradshaw
Looking forward to your writeup, Max. In the meantime, some comments below.


From: Lukasz Cwik 
Date: Thu, May 2, 2019 at 6:45 PM
To: dev

>
>
> On Thu, May 2, 2019 at 7:20 AM Robert Bradshaw  wrote:
>>
>> On Sat, Apr 27, 2019 at 1:14 AM Lukasz Cwik  wrote:
>> >
>> > We should stick with URN + payload + artifact metadata[1] where the only 
>> > mandatory one that all SDKs and expansion services understand is the 
>> > "bytes" artifact type. This allows us to add optional URNs for file://, 
>> > http://, Maven, PyPi, ... in the future. I would make the artifact staging 
>> > service use the same URN + payload mechanism to get compatibility of 
>> > artifacts across the different services and also have the artifact staging 
>> > service be able to be queried for the list of artifact types it supports.
>>
>> +1
>>
>> > Finally, we would need to have environments enumerate the artifact types 
>> > that they support.
>>
>> Meaning at runtime, or as another field statically set in the proto?
>
>
> I don't believe runners/SDKs should have to know what artifacts each 
> environment supports at runtime and instead have environments enumerate them 
> explicitly in the proto. I have been thinking about a more general 
> "capabilities" block on environments which allow them to enumerate URNs that 
> the environment understands. This would include artifact type URNs, 
> PTransform URNs, coder URNs, ... I haven't proposed anything specific down 
> this line yet because I was wondering how environment resources (CPU, min 
> memory, hardware like GPU, AWS/GCP/Azure/... machine types) should/could tie 
> into this.
>
>>
>> > Having everyone have the same "artifact" representation would be 
>> > beneficial since:
>> > a) Python environments could install dependencies from a requirements.txt 
>> > file (something that the Google Cloud Dataflow Python docker container 
>> > allows for today)
>> > b) It provides an extensible and versioned mechanism for SDKs, 
>> > environments, and artifact staging/retrieval services to support 
>> > additional artifact types
>> > c) Allow for expressing a canonical representation of an artifact like a 
>> > Maven package so a runner could merge environments that the runner deems 
>> > compatible.
>> >
>> > The flow I could see is:
>> > 1) (optional) query artifact staging service for supported artifact types
>> > 2) SDK request expansion service to expand transform passing in a list of 
>> > artifact types the SDK and artifact staging service support, the expansion 
>> > service returns a list of artifact types limited to those supported types 
>> > + any supported by the environment
>>
>> The crux of the issue seems to be how the expansion service returns
>> the artifacts themselves. Is this going with the approach that the
>> caller of the expansion service must host an artifact staging service?
>
>
> The caller would not need to host an artifact staging service (but would 
> become effectively a proxy service, see my comment below for more details) as 
> I would have expected this to be part of the expansion service response.
>
>>
>> There is also the question of how the returned artifacts get
>> attached to the various environments, or whether they get implicitly
>> applied to all returned stages (which need not have a consistent
>> environment)?
>
>
> I would suggest returning additional information that says what artifact is 
> for which environment. Applying all artifacts to all environments is likely 
> to cause issues since some environments may not understand certain artifact 
> types or may get conflicting versions of artifacts. I would see this 
> happening since an expansion service that aggregates other expansion services 
> seems likely, for example:
>  /-> ExpansionService(Python)
> ExpansionService(Aggregator) --> ExpansionService(Java)
>  \-> ExpansionService(Go)

All of this goes back to the idea that I think the listing of
artifacts (or more general dependencies) should be a property of the
environment themselves.
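
As a rough illustration of that idea (dependencies living on the environment itself rather than in a pipeline-level manifest), here is a minimal Python sketch; the class and field names are assumptions, not Beam's actual protos:

from dataclasses import dataclass, field
from typing import List

@dataclass
class Dependency:
    type_urn: str   # e.g. a Maven, PyPI, or "bytes" artifact type URN (illustrative)
    payload: bytes  # coordinates, requirements text, or the raw bytes themselves

@dataclass
class Environment:
    urn: str                                # e.g. a docker environment URN (illustrative)
    payload: bytes                          # e.g. the container image name
    dependencies: List[Dependency] = field(default_factory=list)

# A Java docker environment that declares its own jar dependency:
env = Environment(
    urn="beam:env:docker:v1",
    payload=b"apache/beam_java_sdk",
    dependencies=[Dependency(type_urn="maven", payload=b"org.example:my-io:1.0")])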

>> > 3) SDK converts any artifact types that the artifact staging service or 
>> > environment doesn't understand, e.g. pulls down Maven dependencies and 
>> > converts them to "bytes" artifacts
>>
>> Here I think we're conflating two things. The "type" of an artifact is
>> both (1) how to fetch the bytes and (2) how to interpret them (e.g. is
>> this a jar file, or a pip tarball, or just some data needed by a DoFn,
>> or ...) Only (1) can be freely transmuted.
>
>
> You're right. Thinking about this some more, general artifact conversion is 
> unlikely to be practical because how to interpret an artifact is environment 
> dependent. For example, a requirements.txt used to install pip packages for a 
> Python docker container depends on the filesystem layout of that specific 
> docker container. One could simulate doing a pip install on the same 
> filesystem, see the diff and then of all the packages 

Re: Artifact staging in cross-language pipelines

2019-05-02 Thread Maximilian Michels

BTW what are the next steps here ? Heejong or Max, will one of you be able to 
come up with a detailed proposal around this ?


Thank you for all the additional comments and ideas. I will try to 
capture them in a document and share it here. Of course we can continue 
the discussion in the meantime.


-Max

On 30.04.19 19:02, Chamikara Jayalath wrote:


On Fri, Apr 26, 2019 at 4:14 PM Lukasz Cwik wrote:


We should stick with URN + payload + artifact metadata[1] where the
only mandatory one that all SDKs and expansion services understand
is the "bytes" artifact type. This allows us to add optional URNs
for file://, http://, Maven, PyPi, ... in the future. I would make
the artifact staging service use the same URN + payload mechanism to
get compatibility of artifacts across the different services and
also have the artifact staging service be able to be queried for the
list of artifact types it supports. Finally, we would need to have
environments enumerate the artifact types that they support.

Having everyone have the same "artifact" representation would be
beneficial since:
a) Python environments could install dependencies from a
requirements.txt file (something that the Google Cloud Dataflow
Python docker container allows for today)
b) It provides an extensible and versioned mechanism for SDKs,
environments, and artifact staging/retrieval services to support
additional artifact types
c) Allow for expressing a canonical representation of an artifact
like a Maven package so a runner could merge environments that the
runner deems compatible.

The flow I could see is:
1) (optional) query artifact staging service for supported artifact
types
2) SDK request expansion service to expand transform passing in a
list of artifact types the SDK and artifact staging service support,
the expansion service returns a list of artifact types limited to
those supported types + any supported by the environment
3) SDK converts any artifact types that the artifact staging service
or environment doesn't understand, e.g. pulls down Maven
dependencies and converts them to "bytes" artifacts
4) SDK sends artifacts to artifact staging service
5) Artifact staging service converts any artifacts to types that the
environment understands
6) Environment is started and gets artifacts from the artifact
retrieval service.


This is a very interesting proposal. I would add:
(5.5) artifact staging service resolves conflicts/duplicates for 
artifacts needed by different transforms of the same pipeline


BTW what are the next steps here ? Heejong or Max, will one of you be 
able to come up with a detailed proposal around this ?


In the meantime I suggest we add temporary pipeline options for staging 
Java dependencies from Python (and vice versa) to unblock development 
and testing of the rest of the cross-language transforms stack. For example, 
https://github.com/apache/beam/pull/8340


Thanks,
Cham


On Wed, Apr 24, 2019 at 4:44 AM Robert Bradshaw <rober...@google.com> wrote:

On Wed, Apr 24, 2019 at 12:21 PM Maximilian Michels
<m...@apache.org> wrote:
 >
 > Good idea to let the client expose an artifact staging
service that the
 > ExpansionService could use to stage artifacts. This solves
two problems:
 >
 > (1) The Expansion Service not being able to access the Job Server
 > artifact staging service
 > (2) The client not having access to the dependencies returned
by the
 > Expansion Server
 >
 > The downside is that it adds an additional indirection. The
alternative
 > to let the client handle staging the artifacts returned by
the Expansion
 > Server is more transparent and easier to implement.

The other downside is that it may not always be possible for the
expansion service to connect to the artifact staging service (e.g.
when constructing a pipeline locally against a remote expansion
service).


Just to make sure, you're saying the expansion service would return
all the artifacts (bytes, urls, ...) as part of the response since
the expansion service wouldn't be able to connect to the SDK that is
running locally either.


 > Ideally, the Expansion Service won't return any dependencies
because the
 > environment already contains the required dependencies. We
could make it
 > a requirement for the expansion to be performed inside an
environment.
 > Then we would already ensure during expansion time that the
runtime
 > dependencies are available.

Yes, it's cleanest if the expansion service provides an environment
without all the dependencies provided. Interesting idea to make
  

Re: Artifact staging in cross-language pipelines

2019-05-02 Thread Lukasz Cwik
On Thu, May 2, 2019 at 7:20 AM Robert Bradshaw  wrote:

> On Sat, Apr 27, 2019 at 1:14 AM Lukasz Cwik  wrote:
> >
> > We should stick with URN + payload + artifact metadata[1] where the only
> mandatory one that all SDKs and expansion services understand is the
> "bytes" artifact type. This allows us to add optional URNs for file://,
> http://, Maven, PyPi, ... in the future. I would make the artifact
> staging service use the same URN + payload mechanism to get compatibility
> of artifacts across the different services and also have the artifact
> staging service be able to be queried for the list of artifact types it
> supports.
>
> +1
>
> > Finally, we would need to have environments enumerate the artifact types
> that they support.
>
> Meaning at runtime, or as another field statically set in the proto?
>

I don't believe runners/SDKs should have to know what artifacts each
environment supports at runtime and instead have environments enumerate
them explicitly in the proto. I have been thinking about a more general
"capabilities" block on environments which allow them to enumerate URNs
that the environment understands. This would include artifact type URNs,
PTransform URNs, coder URNs, ... I haven't proposed anything specific down
this line yet because I was wondering how environment resources (CPU, min
memory, hardware like GPU, AWS/GCP/Azure/... machine types) should/could
tie into this.
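
A rough sketch of how such a capabilities block could be used (Python; the URN strings and names here are illustrative assumptions, not existing Beam URNs):

# An environment statically declares the URNs it understands (artifact types,
# transforms, coders, ...); a runner or SDK filters against that declaration
# instead of probing the environment at runtime.
ENVIRONMENT_CAPABILITIES = {
    "python-docker-env": {
        "beam:artifact:type:bytes:v1",
        "beam:artifact:type:pip_requirements:v1",
    },
}

def supported_artifact_types(env_id, candidate_type_urns):
    """Return the subset of artifact type URNs the environment declares."""
    declared = ENVIRONMENT_CAPABILITIES.get(env_id, set())
    return [urn for urn in candidate_type_urns if urn in declared]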


> > Having everyone have the same "artifact" representation would be
> beneficial since:
> > a) Python environments could install dependencies from a
> requirements.txt file (something that the Google Cloud Dataflow Python
> docker container allows for today)
> > b) It provides an extensible and versioned mechanism for SDKs,
> environments, and artifact staging/retrieval services to support additional
> artifact types
> > c) Allow for expressing a canonical representation of an artifact like a
> Maven package so a runner could merge environments that the runner deems
> compatible.
> >
> > The flow I could see is:
> > 1) (optional) query artifact staging service for supported artifact types
> > 2) SDK request expansion service to expand transform passing in a list
> of artifact types the SDK and artifact staging service support, the
> expansion service returns a list of artifact types limited to those
> supported types + any supported by the environment
>
> The crux of the issue seems to be how the expansion service returns
> the artifacts themselves. Is this going with the approach that the
> caller of the expansion service must host an artifact staging service?
>

The caller would not need to host an artifact staging service (but would
become effectively a proxy service, see my comment below for more details)
as I would have expected this to be part of the expansion service response.
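
A sketch of the proxying being described, assuming a hypothetical response shape (none of these names are an existing Beam API): the expansion response carries the artifacts, and the SDK simply relays them to whatever artifact staging service the runner exposes.

def relay_expanded_artifacts(expansion_response, staging_client):
    # The SDK never interprets the artifacts; it forwards urn + payload +
    # metadata for each environment to the runner's staging service.
    for env_id, artifacts in expansion_response.artifacts_by_environment.items():
        for artifact in artifacts:
            staging_client.put(environment=env_id,
                               type_urn=artifact.type_urn,
                               payload=artifact.payload,
                               metadata=artifact.metadata)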


> There is also the question of how the returned artifacts get
> attached to the various environments, or whether they get implicitly
> applied to all returned stages (which need not have a consistent
> environment)?
>

I would suggest returning additional information that says what artifact is
for which environment. Applying all artifacts to all environments is likely
to cause issues since some environments may not understand certain artifact
types or may get conflicting versions of artifacts. I would see this
happening since an expansion service that aggregates other expansion
services seems likely, for example:
 /-> ExpansionService(Python)
ExpansionService(Aggregator) --> ExpansionService(Java)
 \-> ExpansionService(Go)
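
A sketch of why the artifact-to-environment mapping has to survive aggregation (same hypothetical shapes as above):

def merge_expansion_responses(responses):
    """Merge per-language expansion responses without flattening the mapping."""
    merged = {}  # environment id -> list of artifacts
    for response in responses:  # e.g. from the Python, Java and Go services
        for env_id, artifacts in response.artifacts_by_environment.items():
            # Keeping the key prevents a Python environment from receiving
            # jars, or a Java environment from receiving pip packages.
            merged.setdefault(env_id, []).extend(artifacts)
    return merged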

> 3) SDK converts any artifact types that the artifact staging service or
> environment doesn't understand, e.g. pulls down Maven dependencies and
> converts them to "bytes" artifacts
>
> Here I think we're conflating two things. The "type" of an artifact is
> both (1) how to fetch the bytes and (2) how to interpret them (e.g. is
> this a jar file, or a pip tarball, or just some data needed by a DoFn,
> or ...) Only (1) can be freely transmuted.


You're right. Thinking about this some more, general artifact conversion is
unlikely to be practical because how to interpret an artifact is
environment dependent. For example, a requirements.txt used to install pip
packages for a Python docker container depends on the filesystem layout of
that specific docker container. One could simulate doing a pip install on
the same filesystem, see the diff and then of all the packages in
requirements.txt but this quickly becomes impractical.


> > 4) SDK sends artifacts to artifact staging service
> > 5) Artifact staging service converts any artifacts to types that the
> environment understands
> > 6) Environment is started and gets artifacts from the artifact retrieval
> service.
> >
> > On Wed, Apr 24, 2019 at 4:44 AM Robert Bradshaw 
> wrote:
> >>
> >> On Wed, Apr 24, 2019 at 12:21 PM Maximilian Michels 
> wrote:
> >> >
> >> > Good idea to let the client expose an 

Re: Artifact staging in cross-language pipelines

2019-05-02 Thread Robert Bradshaw
On Sat, Apr 27, 2019 at 1:14 AM Lukasz Cwik  wrote:
>
> We should stick with URN + payload + artifact metadata[1] where the only 
> mandatory one that all SDKs and expansion services understand is the "bytes" 
> artifact type. This allows us to add optional URNs for file://, http://, 
> Maven, PyPi, ... in the future. I would make the artifact staging service use 
> the same URN + payload mechanism to get compatibility of artifacts across the 
> different services and also have the artifact staging service be able to be 
> queried for the list of artifact types it supports.

+1

> Finally, we would need to have environments enumerate the artifact types that 
> they support.

Meaning at runtime, or as another field statically set in the proto?

> Having everyone have the same "artifact" representation would be beneficial 
> since:
> a) Python environments could install dependencies from a requirements.txt 
> file (something that the Google Cloud Dataflow Python docker container allows 
> for today)
> b) It provides an extensible and versioned mechanism for SDKs, environments, 
> and artifact staging/retrieval services to support additional artifact types
> c) Allow for expressing a canonical representation of an artifact like a 
> Maven package so a runner could merge environments that the runner deems 
> compatible.
>
> The flow I could see is:
> 1) (optional) query artifact staging service for supported artifact types
> 2) SDK request expansion service to expand transform passing in a list of 
> artifact types the SDK and artifact staging service support, the expansion 
> service returns a list of artifact types limited to those supported types + 
> any supported by the environment

The crux of the issue seems to be how the expansion service returns
the artifacts themselves. Is this going with the approach that the
caller of the expansion service must host an artifact staging service?
There is also the question of how the returned artifacts get
attached to the various environments, or whether they get implicitly
applied to all returned stages (which need not have a consistent
environment)?

> 3) SDK converts any artifact types that the artifact staging service or 
> environment doesn't understand, e.g. pulls down Maven dependencies and 
> converts them to "bytes" artifacts

Here I think we're conflating two things. The "type" of an artifact is
both (1) how to fetch the bytes and (2) how to interpret them (e.g. is
this a jar file, or a pip tarball, or just some data needed by a DoFn,
or ...). Only (1) can be freely transmuted.
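
One way to picture that split (purely illustrative Python, not a proposal for the actual proto): keep "how to fetch" and "how to interpret" as separate fields, so only the former is ever transmuted.

from dataclasses import dataclass

@dataclass
class StagedArtifact:
    fetch_urn: str       # (1) how to get the bytes: file://, http://, "bytes", ...
    fetch_payload: bytes
    role_urn: str        # (2) how to interpret them: jar, pip package, DoFn data, ...
    role_payload: bytes

def transmute_to_bytes(artifact, fetcher):
    # Safe conversion: only the fetch half changes; the role is untouched.
    raw = fetcher(artifact.fetch_urn, artifact.fetch_payload)
    return StagedArtifact("bytes", raw, artifact.role_urn, artifact.role_payload)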

> 4) SDK sends artifacts to artifact staging service
> 5) Artifact staging service converts any artifacts to types that the 
> environment understands
> 6) Environment is started and gets artifacts from the artifact retrieval 
> service.
>
> On Wed, Apr 24, 2019 at 4:44 AM Robert Bradshaw  wrote:
>>
>> On Wed, Apr 24, 2019 at 12:21 PM Maximilian Michels  wrote:
>> >
>> > Good idea to let the client expose an artifact staging service that the
>> > ExpansionService could use to stage artifacts. This solves two problems:
>> >
>> > (1) The Expansion Service not being able to access the Job Server
>> > artifact staging service
>> > (2) The client not having access to the dependencies returned by the
>> > Expansion Server
>> >
>> > The downside is that it adds an additional indirection. The alternative
>> > to let the client handle staging the artifacts returned by the Expansion
>> > Server is more transparent and easier to implement.
>>
>> The other downside is that it may not always be possible for the
>> expansion service to connect to the artifact staging service (e.g.
>> when constructing a pipeline locally against a remote expansion
>> service).
>
> Just to make sure, you're saying the expansion service would return all the 
> artifacts (bytes, urls, ...) as part of the response since the expansion 
> service wouldn't be able to connect to the SDK that is running locally either.

Yes. Well, more I'm asking how the expansion service would return any
artifacts.

What we have is

Runner <--- SDK ---> Expansion service.

Where the unidirectional arrow means "instantiates a connection with"
and the other direction (and missing arrows) may not be possible.

>> > Ideally, the Expansion Service won't return any dependencies because the
>> > environment already contains the required dependencies. We could make it
>> > a requirement for the expansion to be performed inside an environment.
>> > Then we would already ensure during expansion time that the runtime
>> > dependencies are available.
>>
>> Yes, it's cleanest if the expansion service provides an environment
>> without all the dependencies provided. Interesting idea to make this a
>> property of the expansion service itself.
>
> I had thought this too but an opaque docker container that was built on top 
> of a base Beam docker container would be very difficult for a runner to 
> introspect and check to see if it's compatible to 

Re: Artifact staging in cross-language pipelines

2019-04-30 Thread Lukasz Cwik
Agreed on adding 5.5; the resolution of conflicts/duplicates could be
done by either the runner or the artifact staging service.
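
For instance, wherever that responsibility ends up, deduplication could be as simple as keying artifacts by type and content hash. A sketch under that assumption (conflict policy for same name with different contents is left open here):

import hashlib

def dedupe_artifacts(artifacts):
    """Drop exact duplicates staged by different transforms of one pipeline."""
    seen = {}
    for artifact in artifacts:
        key = (artifact.type_urn, hashlib.sha256(artifact.payload).hexdigest())
        seen.setdefault(key, artifact)
    return list(seen.values())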

On Tue, Apr 30, 2019 at 10:03 AM Chamikara Jayalath 
wrote:

>
> On Fri, Apr 26, 2019 at 4:14 PM Lukasz Cwik  wrote:
>
>> We should stick with URN + payload + artifact metadata[1] where the only
>> mandatory one that all SDKs and expansion services understand is the
>> "bytes" artifact type. This allows us to add optional URNs for file://,
>> http://, Maven, PyPi, ... in the future. I would make the artifact
>> staging service use the same URN + payload mechanism to get compatibility
>> of artifacts across the different services and also have the artifact
>> staging service be able to be queried for the list of artifact types it
>> supports. Finally, we would need to have environments enumerate the
>> artifact types that they support.
>>
>> Having everyone have the same "artifact" representation would be
>> beneficial since:
>> a) Python environments could install dependencies from a requirements.txt
>> file (something that the Google Cloud Dataflow Python docker container
>> allows for today)
>> b) It provides an extensible and versioned mechanism for SDKs,
>> environments, and artifact staging/retrieval services to support additional
>> artifact types
>> c) Allow for expressing a canonical representation of an artifact like a
>> Maven package so a runner could merge environments that the runner deems
>> compatible.
>>
>> The flow I could see is:
>> 1) (optional) query artifact staging service for supported artifact types
>> 2) SDK request expansion service to expand transform passing in a list of
>> artifact types the SDK and artifact staging service support, the expansion
>> service returns a list of artifact types limited to those supported types +
>> any supported by the environment
>> 3) SDK converts any artifact types that the artifact staging service or
>> environment doesn't understand, e.g. pulls down Maven dependencies and
>> converts them to "bytes" artifacts
>> 4) SDK sends artifacts to artifact staging service
>> 5) Artifact staging service converts any artifacts to types that the
>> environment understands
>> 6) Environment is started and gets artifacts from the artifact retrieval
>> service.
>>
>
> This is a very interesting proposal. I would add:
> (5.5) artifact staging service resolves conflicts/duplicates for artifacts
> needed by different transforms of the same pipeline
>
> BTW what are the next steps here ? Heejong or Max, will one of you be able
> to come up with a detailed proposal around this ?
>
> In the meantime I suggest we add temporary pipeline options for staging
> Java dependencies from Python (and vice versa) to unblock development and
> testing of the rest of the cross-language transforms stack. For example,
> https://github.com/apache/beam/pull/8340
>
> Thanks,
> Cham
>
>
>>
>> On Wed, Apr 24, 2019 at 4:44 AM Robert Bradshaw 
>> wrote:
>>
>>> On Wed, Apr 24, 2019 at 12:21 PM Maximilian Michels 
>>> wrote:
>>> >
>>> > Good idea to let the client expose an artifact staging service that the
>>> > ExpansionService could use to stage artifacts. This solves two
>>> problems:
>>> >
>>> > (1) The Expansion Service not being able to access the Job Server
>>> > artifact staging service
>>> > (2) The client not having access to the dependencies returned by the
>>> > Expansion Server
>>> >
>>> > The downside is that it adds an additional indirection. The alternative
>>> > to let the client handle staging the artifacts returned by the
>>> Expansion
>>> > Server is more transparent and easier to implement.
>>>
>>> The other downside is that it may not always be possible for the
>>> expansion service to connect to the artifact staging service (e.g.
>>> when constructing a pipeline locally against a remote expansion
>>> service).
>>>
>>
>> Just to make sure, you're saying the expansion service would return all the
>> artifacts (bytes, urls, ...) as part of the response since the expansion
>> service wouldn't be able to connect to the SDK that is running locally
>> either.
>>
>>
>>> > Ideally, the Expansion Service won't return any dependencies because
>>> the
>>> > environment already contains the required dependencies. We could make
>>> it
>>> > a requirement for the expansion to be performed inside an environment.
>>> > Then we would already ensure during expansion time that the runtime
>>> > dependencies are available.
>>>
>>> Yes, it's cleanest if the expansion service provides an environment
>>> without all the dependencies provided. Interesting idea to make this a
>>> property of the expansion service itself.
>>>
>>
>> I had thought this too but an opaque docker container that was built on
>> top of a base Beam docker container would be very difficult for a runner to
>> introspect and check to see if it's compatible to allow for fusion across
>> PTransforms. I think artifacts need to be communicated in their canonical
>> representation.
>>
>>
>>> > > In 

Re: Artifact staging in cross-language pipelines

2019-04-30 Thread Chamikara Jayalath
On Fri, Apr 26, 2019 at 4:14 PM Lukasz Cwik  wrote:

> We should stick with URN + payload + artifact metadata[1] where the only
> mandatory one that all SDKs and expansion services understand is the
> "bytes" artifact type. This allows us to add optional URNs for file://,
> http://, Maven, PyPi, ... in the future. I would make the artifact
> staging service use the same URN + payload mechanism to get compatibility
> of artifacts across the different services and also have the artifact
> staging service be able to be queried for the list of artifact types it
> supports. Finally, we would need to have environments enumerate the
> artifact types that they support.
>
> Having everyone have the same "artifact" representation would be
> beneficial since:
> a) Python environments could install dependencies from a requirements.txt
> file (something that the Google Cloud Dataflow Python docker container
> allows for today)
> b) It provides an extensible and versioned mechanism for SDKs,
> environments, and artifact staging/retrieval services to support additional
> artifact types
> c) Allow for expressing a canonical representation of an artifact like a
> Maven package so a runner could merge environments that the runner deems
> compatible.
>
> The flow I could see is:
> 1) (optional) query artifact staging service for supported artifact types
> 2) SDK request expansion service to expand transform passing in a list of
> artifact types the SDK and artifact staging service support, the expansion
> service returns a list of artifact types limited to those supported types +
> any supported by the environment
> 3) SDK converts any artifact types that the artifact staging service or
> environment doesn't understand, e.g. pulls down Maven dependencies and
> converts them to "bytes" artifacts
> 4) SDK sends artifacts to artifact staging service
> 5) Artifact staging service converts any artifacts to types that the
> environment understands
> 6) Environment is started and gets artifacts from the artifact retrieval
> service.
>

This is a very interesting proposal. I would add:
(5.5) artifact staging service resolves conflicts/duplicates for artifacts
needed by different transforms of the same pipeline
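
Stringing the quoted steps together (1 through 6, plus the 5.5 above), the client-side flow could look roughly like this; every call here is hypothetical glue, not an existing Beam API:

def stage_cross_language(pipeline, sdk, staging, expansion, environment):
    supported = staging.supported_artifact_types()                 # 1)
    expanded = expansion.expand(pipeline, supported)               # 2)
    artifacts = [sdk.convert_if_unsupported(a, supported)          # 3)
                 for a in expanded.artifacts]
    staging.put_all(artifacts)                                     # 4)
    staging.resolve_duplicates()                                   # 5.5)
    staging.convert_for(environment)                               # 5)
    environment.start(retrieval=staging.retrieval_endpoint())      # 6)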

BTW, what are the next steps here? Heejong or Max, will one of you be able
to come up with a detailed proposal around this?

In the meantime I suggest we add temporary pipeline options for staging
Java dependencies from Python (and vice versa) to unblock development and
testing of the rest of the cross-language transforms stack. For example,
https://github.com/apache/beam/pull/8340

Thanks,
Cham


>
> On Wed, Apr 24, 2019 at 4:44 AM Robert Bradshaw 
> wrote:
>
>> On Wed, Apr 24, 2019 at 12:21 PM Maximilian Michels 
>> wrote:
>> >
>> > Good idea to let the client expose an artifact staging service that the
>> > ExpansionService could use to stage artifacts. This solves two problems:
>> >
>> > (1) The Expansion Service not being able to access the Job Server
>> > artifact staging service
>> > (2) The client not having access to the dependencies returned by the
>> > Expansion Server
>> >
>> > The downside is that it adds an additional indirection. The alternative
>> > to let the client handle staging the artifacts returned by the Expansion
>> > Server is more transparent and easier to implement.
>>
>> The other downside is that it may not always be possible for the
>> expansion service to connect to the artifact staging service (e.g.
>> when constructing a pipeline locally against a remote expansion
>> service).
>>
>
> Just to make sure, you're saying the expansion service would return all the
> artifacts (bytes, urls, ...) as part of the response since the expansion
> service wouldn't be able to connect to the SDK that is running locally
> either?
>
>
>> > Ideally, the Expansion Service won't return any dependencies because the
>> > environment already contains the required dependencies. We could make it
>> > a requirement for the expansion to be performed inside an environment.
>> > Then we would already ensure during expansion time that the runtime
>> > dependencies are available.
>>
>> Yes, it's cleanest if the expansion service provides an environment
>> without all the dependencies provided. Interesting idea to make this a
>> property of the expansion service itself.
>>
>
> I had thought this too, but an opaque docker container that was built on
> top of a base Beam docker container would be very difficult for a runner to
> introspect and check to see if it's compatible to allow for fusion across
> PTransforms. I think artifacts need to be communicated in their canonical
> representation.
>
>
>> > > In this case, the runner would (as
>> > > requested by its configuration) be free to merge environments it
>> > > deemed compatible, including swapping out beam-java-X for
>> > > beam-java-embedded if it considers itself compatible with the
>> > > dependency list.
>> >
>> > Could you explain how that would work in practice?
>>
>> Say 

Re: Artifact staging in cross-language pipelines

2019-04-26 Thread Lukasz Cwik
We should stick with URN + payload + artifact metadata[1] where the only
mandatory one that all SDKs and expansion services understand is the
"bytes" artifact type. This allows us to add optional URNs for file://,
http://, Maven, PyPi, ... in the future. I would make the artifact staging
service use the same URN + payload mechanism to get compatibility of
artifacts across the different services and also have the artifact staging
service be able to be queried for the list of artifact types it supports.
Finally, we would need to have environments enumerate the artifact types
that they support.

Having everyone have the same "artifact" representation would be beneficial
since:
a) Python environments could install dependencies from a requirements.txt
file (something that the Google Cloud Dataflow Python docker container
allows for today)
b) It provides an extensible and versioned mechanism for SDKs,
environments, and artifact staging/retrieval services to support additional
artifact types
c) Allow for expressing a canonical representation of an artifact like a
Maven package so a runner could merge environments that the runner deems
compatible.

The flow I could see is:
1) (optional) query artifact staging service for supported artifact types
2) SDK request expansion service to expand transform passing in a list of
artifact types the SDK and artifact staging service support, the expansion
service returns a list of artifact types limited to those supported types +
any supported by the environment
3) SDK converts any artifact types that the artifact staging service or
environment doesn't understand, e.g. pulls down Maven dependencies and
converts them to "bytes" artifacts
4) SDK sends artifacts to artifact staging service
5) Artifact staging service converts any artifacts to types that the
environment understands
6) Environment is started and gets artifacts from the artifact retrieval
service.
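
To make the flow above a bit more concrete, here is a rough Python sketch of step 3 under the URN + payload scheme. The URNs and helper names are invented for illustration and are not the actual Beam protos:

from dataclasses import dataclass

# Hypothetical artifact type URNs; only the "bytes" type would be mandatory
# for every SDK, expansion service and artifact staging service.
BYTES_URN = "beam:artifact:type:bytes:v1"
FILE_URN = "beam:artifact:type:file:v1"
MAVEN_URN = "beam:artifact:type:maven:v1"

@dataclass
class Artifact:
    type_urn: str
    payload: bytes  # interpreted according to type_urn

def to_supported_type(artifact, supported_urns, materialize):
    """Convert an artifact the staging service or environment does not
    understand into the universally supported "bytes" type.

    `materialize` is a caller-supplied function that resolves e.g. a Maven
    coordinate or file path into raw bytes."""
    if artifact.type_urn in supported_urns:
        return artifact
    return Artifact(BYTES_URN, materialize(artifact))

# Example: the staging service only advertises "bytes" and "file", so a
# Maven dependency is pulled down and re-expressed as bytes.
maven_dep = Artifact(MAVEN_URN, b"org.apache.kafka:kafka-clients:2.0.0")
converted = to_supported_type(
    maven_dep, {BYTES_URN, FILE_URN},
    materialize=lambda a: b"<downloaded jar contents>")
assert converted.type_urn == BYTES_URN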

On Wed, Apr 24, 2019 at 4:44 AM Robert Bradshaw  wrote:

> On Wed, Apr 24, 2019 at 12:21 PM Maximilian Michels 
> wrote:
> >
> > Good idea to let the client expose an artifact staging service that the
> > ExpansionService could use to stage artifacts. This solves two problems:
> >
> > (1) The Expansion Service not being able to access the Job Server
> > artifact staging service
> > (2) The client not having access to the dependencies returned by the
> > Expansion Server
> >
> > The downside is that it adds an additional indirection. The alternative
> > to let the client handle staging the artifacts returned by the Expansion
> > Server is more transparent and easier to implement.
>
> The other downside is that it may not always be possible for the
> expansion service to connect to the artifact staging service (e.g.
> when constructing a pipeline locally against a remote expansion
> service).
>

Just to make sure, you're saying the expansion service would return all the
artifacts (bytes, urls, ...) as part of the response since the expansion
service wouldn't be able to connect to the SDK that is running locally
either?


> > Ideally, the Expansion Service won't return any dependencies because the
> > environment already contains the required dependencies. We could make it
> > a requirement for the expansion to be performed inside an environment.
> > Then we would already ensure during expansion time that the runtime
> > dependencies are available.
>
> Yes, it's cleanest if the expansion service provides an environment
> without all the dependencies provided. Interesting idea to make this a
> property of the expansion service itself.
>

I had thought this too, but an opaque docker container that was built on top
of a base Beam docker container would be very difficult for a runner to
introspect and check to see if it's compatible to allow for fusion across
PTransforms. I think artifacts need to be communicated in their canonical
representation.


> > > In this case, the runner would (as
> > > requested by its configuration) be free to merge environments it
> > > deemed compatible, including swapping out beam-java-X for
> > > beam-java-embedded if it considers itself compatible with the
> > > dependency list.
> >
> > Could you explain how that would work in practice?
>
> Say one has a pipeline with environments
>
> A: beam-java-sdk-2.12-docker
> B: beam-java-sdk-2.12-docker + dep1
> C: beam-java-sdk-2.12-docker + dep2
> D: beam-java-sdk-2.12-docker + dep3
>
> A runner could (conceivably) be intelligent enough to know that dep1
> and dep2 are indeed compatible, and run A, B, and C in a single
> beam-java-sdk-2.12-docker + dep1 + dep2 environment (with the
> corresponding fusion and lower overhead benefits). If a certain
> pipeline option is set, it might further note that dep1 and dep2 are
> compatible with its own workers, which are built against sdk-2.12, and
> choose to run these in an embedded + dep1 + dep2 environment.
>

We have been talking about the expansion service and cross language
transforms a lot lately but I believe it 

Re: Artifact staging in cross-language pipelines

2019-04-24 Thread Robert Bradshaw
On Wed, Apr 24, 2019 at 12:21 PM Maximilian Michels  wrote:
>
> Good idea to let the client expose an artifact staging service that the
> ExpansionService could use to stage artifacts. This solves two problems:
>
> (1) The Expansion Service not being able to access the Job Server
> artifact staging service
> (2) The client not having access to the dependencies returned by the
> Expansion Server
>
> The downside is that it adds an additional indirection. The alternative
> to let the client handle staging the artifacts returned by the Expansion
> Server is more transparent and easier to implement.

The other downside is that it may not always be possible for the
expansion service to connect to the artifact staging service (e.g.
when constructing a pipeline locally against a remote expansion
service).

> Ideally, the Expansion Service won't return any dependencies because the
> environment already contains the required dependencies. We could make it
> a requirement for the expansion to be performed inside an environment.
> Then we would already ensure during expansion time that the runtime
> dependencies are available.

Yes, it's cleanest if the expansion service provides an environment
without all the dependencies provided. Interesting idea to make this a
property of the expansion service itself.

> > In this case, the runner would (as
> > requested by its configuration) be free to merge environments it
> > deemed compatible, including swapping out beam-java-X for
> > beam-java-embedded if it considers itself compatible with the
> > dependency list.
>
> Could you explain how that would work in practice?

Say one has a pipeline with environments

A: beam-java-sdk-2.12-docker
B: beam-java-sdk-2.12-docker + dep1
C: beam-java-sdk-2.12-docker + dep2
D: beam-java-sdk-2.12-docker + dep3

A runner could (conceivably) be intelligent enough to know that dep1
and dep2 are indeed compatible, and run A, B, and C in a single
beam-java-sdk-2.12-docker + dep1 + dep2 environment (with the
corresponding fusion and lower overhead benefits). If a certain
pipeline option is set, it might further note that dep1 and dep2 are
compatible with its own workers, which are built against sdk-2.12, and
choose to run these in an embedded + dep1 + dep2 environment.
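
For illustration, a small Python sketch of such merging, with environments modeled as a base image plus a set of dependency names and the compatibility check left to the runner; everything here is hypothetical, not existing runner code:

def merge_environments(envs, deps_compatible):
    """Greedily merge environments that share a base image and whose
    combined dependency set passes the runner-supplied check.

    `envs` is a list of (base_image, frozenset_of_deps) pairs."""
    merged = []
    for image, deps in envs:
        for i, (m_image, m_deps) in enumerate(merged):
            union = m_deps | deps
            if m_image == image and deps_compatible(union):
                merged[i] = (m_image, union)
                break
        else:
            merged.append((image, deps))
    return merged

# The A/B/C/D example above: dep1 and dep2 are deemed compatible, dep3 not.
envs = [
    ("beam-java-sdk-2.12-docker", frozenset()),
    ("beam-java-sdk-2.12-docker", frozenset({"dep1"})),
    ("beam-java-sdk-2.12-docker", frozenset({"dep2"})),
    ("beam-java-sdk-2.12-docker", frozenset({"dep3"})),
]
print(merge_environments(
    envs, deps_compatible=lambda deps: "dep3" not in deps or deps == {"dep3"}))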


Re: Artifact staging in cross-language pipelines

2019-04-24 Thread Maximilian Michels
Good idea to let the client expose an artifact staging service that the 
ExpansionService could use to stage artifacts. This solves two problems:


(1) The Expansion Service not being able to access the Job Server 
artifact staging service
(2) The client not having access to the dependencies returned by the 
Expansion Server


The downside is that it adds an additional indirection. The alternative 
to let the client handle staging the artifacts returned by the Expansion 
Server is more transparent and easier to implement.


Ideally, the Expansion Service won't return any dependencies because the 
environment already contains the required dependencies. We could make it 
a requirement for the expansion to be performed inside an environment. 
Then we would already ensure during expansion time that the runtime 
dependencies are available.



In this case, the runner would (as
requested by its configuration) be free to merge environments it
deemed compatible, including swapping out beam-java-X for
beam-java-embedded if it considers itself compatible with the
dependency list.


Could you explain how that would work in practice?

-Max

On 24.04.19 04:11, Heejong Lee wrote:



On Tue, Apr 23, 2019 at 2:07 AM Robert Bradshaw  wrote:


I've been out, so coming a bit late to the discussion, but here's my
thoughts.

The expansion service absolutely needs to be able to provide the
dependencies for the transform(s) it expands. It seems the default,
foolproof way of doing this is via the environment, which can be a
docker image with all the required dependencies. More than this is an
(arguably important, but possibly messy) optimization.

The standard way to provide artifacts outside of the environment is
via the artifact staging service. Of course, the expansion service may
not have access to the (final) artifact staging service (due to
permissions, locality, or it may not even be started up yet) but the
SDK invoking the expansion service could offer an artifact staging
environment for the SDK to publish artifacts to. However, there are
some difficulties here, in particular avoiding name collisions with
staged artifacts, assigning semantic meaning to the artifacts (e.g.
should jar files get automatically placed in the classpath, or Python
packages recognized and installed at startup). The alternative is
going with a (type, pointer) scheme for naming dependencies; if we go
this route I think we should consider migrating all artifact staging
to this style. I am concerned that the "file" version will be less
than useful for what will become the most convenient expansion
services (namely, hosted and docker image). I am still at a loss,
however, as to how to solve the diamond dependency problem among
dependencies--perhaps the information is there if one walks
maven/pypi/go modules/... but do we expect every runner to know about
every packaging platform? This also wouldn't solve the issue if fat
jars are used as dependencies. The only safe thing to do here is to
force distinct dependency sets to live in different environments,
which could be too conservative.

This all leads me to think that perhaps the environment itself should
be docker image (often one of "vanilla" beam-java-x.y ones) +
dependency list, rather than have the dependency/artifact list as some
kind of data off to the side. In this case, the runner would (as
requested by its configuration) be free to merge environments it
deemed compatible, including swapping out beam-java-X for
beam-java-embedded if it considers itself compatible with the
dependency list.


I like this idea of building multiple docker environments on top of a bare
minimum SDK harness container and allowing runners to pick a suitable one
based on a dependency list.




I agree with Thomas that we'll want to make expansion services, and
the transforms they offer, more discoverable. The whole lifetime cycle
of expansion services is something that has yet to be fully fleshed
out, and may influence some of these decisions.

As for adding --jar_package to the Python SDK, this seems really
specific to calling java-from-python (would we have O(n^2) such
options?) as well as out-of-place for a Python user to specify. I
would really hope we can figure out a more generic solution. If we
need this option in the meantime, let's at least make it clear
(probably in the name) that it's temporary.


Good points. I second that we need a more generic solution than a
python-to-java specific option. I think instead of naming it differently we
can make --jar_package a secondary option under --experiment in the
meantime. WDYT?



On Tue, Apr 23, 2019 at 1:08 AM Thomas Weise  wrote:
 >
 > One more suggestion:
 >
 > It would be nice to be able to select the environment for the
  

Re: Artifact staging in cross-language pipelines

2019-04-23 Thread Heejong Lee
On Tue, Apr 23, 2019 at 2:07 AM Robert Bradshaw  wrote:

> I've been out, so coming a bit late to the discussion, but here's my
> thoughts.
>
> The expansion service absolutely needs to be able to provide the
> dependencies for the transform(s) it expands. It seems the default,
> foolproof way of doing this is via the environment, which can be a
> docker image with all the required dependencies. More than this is an
> (arguably important, but possibly messy) optimization.
>
> The standard way to provide artifacts outside of the environment is
> via the artifact staging service. Of course, the expansion service may
> not have access to the (final) artifact staging service (due to
> permissions, locality, or it may not even be started up yet) but the
> SDK invoking the expansion service could offer an artifact staging
> environment for the SDK to publish artifacts to. However, there are
> some difficulties here, in particular avoiding name collisions with
> staged artifacts, assigning semantic meaning to the artifacts (e.g.
> should jar files get automatically placed in the classpath, or Python
> packages recognized and installed at startup). The alternative is
> going with a (type, pointer) scheme for naming dependencies; if we go
> this route I think we should consider migrating all artifact staging
> to this style. I am concerned that the "file" version will be less
> than useful for what will become the most convenient expansion
> services (namely, hosted and docker image). I am still at a loss,
> however, as to how to solve the diamond dependency problem among
> dependencies--perhaps the information is there if one walks
> maven/pypi/go modules/... but do we expect every runner to know about
> every packaging platform? This also wouldn't solve the issue if fat
> jars are used as dependencies. The only safe thing to do here is to
> force distinct dependency sets to live in different environments,
> which could be too conservative.
>
> This all leads me to think that perhaps the environment itself should
> be docker image (often one of "vanilla" beam-java-x.y ones) +
> dependency list, rather than have the dependency/artifact list as some
> kind of data off to the side. In this case, the runner would (as
> requested by its configuration) be free to merge environments it
> deemed compatible, including swapping out beam-java-X for
> beam-java-embedded if it considers itself compatible with the
> dependency list.


I like this idea of building multiple docker environments on top of a bare
minimum SDK harness container and allowing runners to pick a suitable one
based on a dependency list.


>
> I agree with Thomas that we'll want to make expansion services, and
> the transforms they offer, more discoverable. The whole lifetime cycle
> of expansion services is something that has yet to be fully fleshed
> out, and may influence some of these decisions.
>
> As for adding --jar_package to the Python SDK, this seems really
> specific to calling java-from-python (would we have O(n^2) such
> options?) as well as out-of-place for a Python user to specify. I
> would really hope we can figure out a more generic solution. If we
> need this option in the meantime, let's at least make it clear
> (probably in the name) that it's temporary.
>

Good points. I second that we need a more generic solution than a
python-to-java specific option. I think instead of naming it differently we
can make --jar_package a secondary option under --experiment in the
meantime. WDYT?


> On Tue, Apr 23, 2019 at 1:08 AM Thomas Weise  wrote:
> >
> > One more suggestion:
> >
> > It would be nice to be able to select the environment for the external
> transforms. For example, I would like to be able to use EMBEDDED for Flink.
> That's implicit for sources which are runner native unbounded read
> translations, but it should also be possible for writes. That would then be
> similar to how pipelines are packaged and run with the "legacy" runner.
> >
> > Thomas
> >
> >
> > On Mon, Apr 22, 2019 at 1:18 PM Ankur Goenka  wrote:
> >>
> >> Great discussion!
> >> I have a few points around the structure of proto but that is less
> important as it can evolve.
> >> However, I think that artifact compatibility is another important
> >> aspect to look at.
> >> Example: TransformA needs Guava between 1.6 and 1.7, TransformB between
> >> 1.8 and 1.9, and TransformC between 1.6 and 1.8. Since the SDK provides
> >> the environment for each transform, it cannot simply say EnvironmentJava
> >> for both TransformA and TransformB because their dependencies are not
> >> compatible. We should have a separate environment associated with
> >> TransformA and TransformB in this case.
> >>
> >> To support this case, we need 2 things.
> >> 1: Granular metadata about the dependency including type.
> >> 2: Complete list of the transforms to be expanded.
> >>
> >> Elaboration:
> >> The compatibility check can be done in a crude way if we provide all
> >> the metadata about the dependency to the expansion service.
> >> Also, the expansion service should expand 

Re: Artifact staging in cross-language pipelines

2019-04-23 Thread Robert Bradshaw
I've been out, so coming a bit late to the discussion, but here's my thoughts.

The expansion service absolutely needs to be able to provide the
dependencies for the transform(s) it expands. It seems the default,
foolproof way of doing this is via the environment, which can be a
docker image with all the required dependencies. More than this is an
(arguably important, but possibly messy) optimization.

The standard way to provide artifacts outside of the environment is
via the artifact staging service. Of course, the expansion service may
not have access to the (final) artifact staging service (due to
permissions, locality, or it may not even be started up yet) but the
SDK invoking the expansion service could offer an artifact staging
environment for the SDK to publish artifacts to. However, there are
some difficulties here, in particular avoiding name collisions with
staged artifacts, assigning semantic meaning to the artifacts (e.g.
should jar files get automatically placed in the classpath, or Python
packages recognized and installed at startup). The alternative is
going with a (type, pointer) scheme for naming dependencies; if we go
this route I think we should consider migrating all artifact staging
to this style. I am concerned that the "file" version will be less
than useful for what will become the most convenient expansion
services (namely, hosted and docker image). I am still at a loss,
however, as to how to solve the diamond dependency problem among
dependencies--perhaps the information is there if one walks
maven/pypi/go modules/... but do we expect every runner to know about
every packaging platform? This also wouldn't solve the issue if fat
jars are used as dependencies. The only safe thing to do here is to
force distinct dependency sets to live in different environments,
which could be too conservative.

This all leads me to think that perhaps the environment itself should
be docker image (often one of "vanilla" beam-java-x.y ones) +
dependency list, rather than have the dependency/artifact list as some
kind of data off to the side. In this case, the runner would (as
requested by its configuration) be free to merge environments it
deemed compatible, including swapping out beam-java-X for
beam-java-embedded if it considers itself compatible with the
dependency list.
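
As a purely illustrative sketch of that shape (not the actual proto), an environment could carry its dependency list as part of its own definition, so a runner has a canonical view to compare or swap against; all field names and URNs here are invented:

# Hypothetical structure only; field names are invented for illustration.
environment = {
    "urn": "beam:env:docker:v1",
    "payload": "beam-java-sdk-2.12-docker",   # the (often vanilla) image
    "dependencies": [                          # canonical, introspectable list
        "maven://org.apache.kafka:kafka-clients:2.0.0",
        "maven://com.google.guava:guava:20.0",
    ],
}

# A runner configured to trust its own sdk-2.12 workers could keep the
# dependency list but swap the docker payload for an embedded environment.
embedded = dict(environment, urn="beam:env:embedded:v1", payload="")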

I agree with Thomas that we'll want to make expansion services, and
the transforms they offer, more discoverable. The whole lifetime cycle
of expansion services is something that has yet to be fully fleshed
out, and may influence some of these decisions.

As for adding --jar_package to the Python SDK, this seems really
specific to calling java-from-python (would we have O(n^2) such
options?) as well as out-of-place for a Python user to specify. I
would really hope we can figure out a more generic solution. If we
need this option in the meantime, let's at least make it clear
(probably in the name) that it's temporary.

On Tue, Apr 23, 2019 at 1:08 AM Thomas Weise  wrote:
>
> One more suggestion:
>
> It would be nice to be able to select the environment for the external 
> transforms. For example, I would like to be able to use EMBEDDED for Flink. 
> That's implicit for sources which are runner native unbounded read 
> translations, but it should also be possible for writes. That would then be 
> similar to how pipelines are packaged and run with the "legacy" runner.
>
> Thomas
>
>
> On Mon, Apr 22, 2019 at 1:18 PM Ankur Goenka  wrote:
>>
>> Great discussion!
>> I have a few points around the structure of proto but that is less important 
>> as it can evolve.
>> However, I think that artifact compatibility is another important aspect to 
>> look at.
>> Example: TransformA needs Guava between 1.6 and 1.7, TransformB between 1.8
>> and 1.9, and TransformC between 1.6 and 1.8. Since the SDK provides the
>> environment for each transform, it cannot simply say EnvironmentJava for
>> both TransformA and TransformB because their dependencies are not compatible.
>> We should have a separate environment associated with TransformA and
>> TransformB in this case.
>>
>> To support this case, we need 2 things.
>> 1: Granular metadata about the dependency including type.
>> 2: Complete list of the transforms to be expanded.
>>
>> Elaboration:
>> The compatibility check can be done in a crude way if we provide all the
>> metadata about the dependency to the expansion service.
>> Also, the expansion service should expand all the applicable transforms in a
>> single call so that it knows about the incompatibility and can create
>> separate environments for these transforms. So in the above example, the
>> expansion service will associate EnvA with TransformA, EnvB with TransformB,
>> and EnvA with TransformC. This will of course require changes to the
>> Expansion service proto, but giving all the information to the expansion
>> service will make it support more cases and make it a bit more future-proof.
>>
>>
>> On Mon, Apr 22, 2019 at 10:16 AM Maximilian Michels  wrote:
>>>
>>> Thanks for the summary Cham. 

Re: Artifact staging in cross-language pipelines

2019-04-22 Thread Thomas Weise
One more suggestion:

It would be nice to be able to select the environment for the external
transforms. For example, I would like to be able to use EMBEDDED for Flink.
That's implicit for sources which are runner native unbounded read
translations, but it should also be possible for writes. That would then be
similar to how pipelines are packaged and run with the "legacy" runner.

Thomas


On Mon, Apr 22, 2019 at 1:18 PM Ankur Goenka  wrote:

> Great discussion!
> I have a few points around the structure of proto but that is less
> important as it can evolve.
> However, I think that artifact compatibility is another important aspect
> to look at.
> Example: TransformA needs Guava between 1.6 and 1.7, TransformB between 1.8
> and 1.9, and TransformC between 1.6 and 1.8. Since the SDK provides the
> environment for each transform, it cannot simply say EnvironmentJava for
> both TransformA and TransformB because their dependencies are not
> compatible. We should have a separate environment associated with
> TransformA and TransformB in this case.
>
> To support this case, we need 2 things.
> 1: Granular metadata about the dependency including type.
> 2: Complete list of the transforms to be expanded.
>
> Elaboration:
> The compatibility check can be done in a crude way if we provide all the
> metadata about the dependency to the expansion service.
> Also, the expansion service should expand all the applicable transforms in
> a single call so that it knows about the incompatibility and can create
> separate environments for these transforms. So in the above example, the
> expansion service will associate EnvA with TransformA, EnvB with TransformB,
> and EnvA with TransformC. This will of course require changes to the
> Expansion service proto, but giving all the information to the expansion
> service will make it support more cases and make it a bit more future-proof.
>
>
> On Mon, Apr 22, 2019 at 10:16 AM Maximilian Michels 
> wrote:
>
>> Thanks for the summary Cham. All makes sense. I agree that we want to
>> keep the option to manually specify artifacts.
>>
>> > There are few unanswered questions though.
>> > (1) In what form will a transform author specify dependencies ? For
>> example, URL to a Maven repo, URL to a local file, blob ?
>>
>> Going forward, we probably want to support multiple ways. For now, we
>> could stick with a URL-based approach with support for different file
>> systems. In the future a list of packages to retrieve from Maven/PyPi
>> would be useful.
>>
>> We can ask user for (type, metadata). For maven it can be something like
> (MAVEN, {groupId:com.google.guava, artifactId: guava, version: 19}) or
> (FILE, file://myfile)
> To begin with, we can only support a few types like File and can add more
> types in future.
>
>> > (2) How will dependencies be included in the expansion response proto ?
>> String (URL), bytes (blob) ?
>>
>> I'd go for a list of Protobuf strings first but the format would have to
>> evolve for other dependency types.
>>
>> Here also (type, payload) should suffice. We can have interpreter for
> each type to translate the payload.
>
>> > (3) How will we manage/share transitive dependencies required at
>> runtime ?
>>
>> I'd say transitive dependencies have to be included in the list. In case
>> of fat jars, they are reduced to a single jar.
>>
> Makes sense.
>
>>
>> > (4) How will dependencies be staged for various runner/SDK combinations
>> ? (for example, portable runner/Flink, Dataflow runner)
>>
>> Staging should be no different than it is now, i.e. go through Beam's
>> artifact staging service. As long as the protocol is stable, there could
>> also be different implementations.
>>
> Makes sense.
>
>>
>> -Max
>>
>> On 20.04.19 03:08, Chamikara Jayalath wrote:
>> > OK, sounds like this is a good path forward then.
>> >
>> > * When starting up the expansion service, the user (that starts up the
>> > service) provides dependencies necessary to expand transforms. We will
>> > later add support for adding new transforms to an already running
>> > expansion service.
>> > * As a part of transform configuration, the transform author has the
>> > option of providing a list of dependencies that will be needed to run
>> > the transform.
>> > * These dependencies will be sent back to the pipeline SDK as a part of
>> > the expansion response and the pipeline SDK will stage these resources.
>> > * The pipeline author has the option of specifying the dependencies
>> > using a pipeline option. (for example,
>> > https://github.com/apache/beam/pull/8340)
>> >
>> > I think the last option is important to (1) make existing transforms
>> > easily available for cross-language usage without additional
>> > configurations (2) allow pipeline authors to override dependency
>> > versions specified in the transform configuration (for example, to
>> > apply security patches) without updating the expansion service.
>> >
>> > There are few unanswered questions though.
>> > (1) In what form will a transform author specify dependencies ? For
>> > example, URL to a Maven 

Re: Artifact staging in cross-language pipelines

2019-04-22 Thread Ankur Goenka
Great discussion!
I have a few points around the structure of proto but that is less
important as it can evolve.
However, I think that artifact compatibility is another important aspect to
look at.
Example: TransformA needs Guava between 1.6 and 1.7, TransformB between 1.8
and 1.9, and TransformC between 1.6 and 1.8. Since the SDK provides the
environment for each transform, it cannot simply say EnvironmentJava for
both TransformA and TransformB because their dependencies are not
compatible. We should have a separate environment associated with
TransformA and TransformB in this case.

To support this case, we need 2 things.
1: Granular metadata about the dependency including type.
2: Complete list of the transforms to be expanded.

Elaboration:
The compatibility check can be done in a crude way if we provide all the
metadata about the dependency to the expansion service.
Also, the expansion service should expand all the applicable transforms in
a single call so that it knows about the incompatibility and can create
separate environments for these transforms. So in the above example, the
expansion service will associate EnvA with TransformA, EnvB with TransformB,
and EnvA with TransformC. This will of course require changes to the
Expansion service proto, but giving all the information to the expansion
service will make it support more cases and make it a bit more future-proof.
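
A rough sketch of what such a crude check could look like, grouping transforms whose required version ranges for a shared dependency still overlap; the names and the range representation are invented for illustration:

def assign_environments(requirements):
    """Group transforms so that every group shares a non-empty intersection
    of its required version ranges for one shared dependency.

    `requirements` maps transform name -> (low, high) version bounds."""
    environments = []  # list of ((low, high), [transform names])
    for name, (lo, hi) in requirements.items():
        for i, ((elo, ehi), members) in enumerate(environments):
            nlo, nhi = max(elo, lo), min(ehi, hi)
            if nlo < nhi:  # ranges overlap: the transforms can share an env
                environments[i] = ((nlo, nhi), members + [name])
                break
        else:
            environments.append(((lo, hi), [name]))
    return environments

# The Guava example above: A and C can share an environment, B cannot.
print(assign_environments({
    "TransformA": (1.6, 1.7),
    "TransformB": (1.8, 1.9),
    "TransformC": (1.6, 1.8),
}))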


On Mon, Apr 22, 2019 at 10:16 AM Maximilian Michels  wrote:

> Thanks for the summary Cham. All makes sense. I agree that we want to
> keep the option to manually specify artifacts.
>
> > There are few unanswered questions though.
> > (1) In what form will a transform author specify dependencies ? For
> example, URL to a Maven repo, URL to a local file, blob ?
>
> Going forward, we probably want to support multiple ways. For now, we
> could stick with a URL-based approach with support for different file
> systems. In the future a list of packages to retrieve from Maven/PyPi
> would be useful.
>
> We can ask the user for (type, metadata). For Maven it can be something like
(MAVEN, {groupId:com.google.guava, artifactId: guava, version: 19}) or
(FILE, file://myfile).
To begin with, we can only support a few types like File and can add more
types in the future.
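
For illustration, the (type, metadata) entries and the per-type interpreters could look roughly like this in Python; the shapes follow the examples above but are otherwise made up:

# Dependency entries as they might appear in an expansion response.
dependencies = [
    {"type": "MAVEN",
     "payload": {"groupId": "com.google.guava",
                 "artifactId": "guava",
                 "version": "19"}},
    {"type": "FILE",
     "payload": {"uri": "file://myfile"}},
]

# One interpreter per type; an SDK or staging service would register only
# the types it supports and convert or reject the rest.
interpreters = {
    "MAVEN": lambda p: "%s:%s:%s" % (p["groupId"], p["artifactId"], p["version"]),
    "FILE": lambda p: p["uri"],
}

for dep in dependencies:
    print(interpreters[dep["type"]](dep["payload"]))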

> > (2) How will dependencies be included in the expansion response proto ?
> String (URL), bytes (blob) ?
>
> I'd go for a list of Protobuf strings first but the format would have to
> evolve for other dependency types.
>
> Here also (type, payload) should suffice. We can have an interpreter for each
type to translate the payload.

> > (3) How will we manage/share transitive dependencies required at runtime
> ?
>
> I'd say transitive dependencies have to be included in the list. In case
> of fat jars, they are reduced to a single jar.
>
Makes sense.

>
> > (4) How will dependencies be staged for various runner/SDK combinations
> ? (for example, portable runner/Flink, Dataflow runner)
>
> Staging should be no different than it is now, i.e. go through Beam's
> artifact staging service. As long as the protocol is stable, there could
> also be different implementations.
>
Makes sense.

>
> -Max
>
> On 20.04.19 03:08, Chamikara Jayalath wrote:
> > OK, sounds like this is a good path forward then.
> >
> > * When starting up the expansion service, the user (that starts up the
> > service) provides dependencies necessary to expand transforms. We will
> > later add support for adding new transforms to an already running
> > expansion service.
> > * As a part of transform configuration, the transform author has the
> > option of providing a list of dependencies that will be needed to run the
> > transform.
> > * These dependencies will be sent back to the pipeline SDK as a part of
> > the expansion response and the pipeline SDK will stage these resources.
> > * The pipeline author has the option of specifying the dependencies using
> > a pipeline option. (for example, https://github.com/apache/beam/pull/8340)
> >
> > I think the last option is important to (1) make existing transforms
> > easily available for cross-language usage without additional
> > configurations (2) allow pipeline authors to override dependency versions
> > specified in the transform configuration (for example, to apply security
> > patches) without updating the expansion service.
> >
> > There are few unanswered questions though.
> > (1) In what form will a transform author specify dependencies ? For
> > example, URL to a Maven repo, URL to a local file, blob ?
> > (2) How will dependencies be included in the expansion response proto ?
> > String (URL), bytes (blob) ?
> > (3) How will we manage/share transitive dependencies required at runtime
> ?
> > (4) How will dependencies be staged for various runner/SDK combinations
> > ? (for example, portable runner/Flink, Dataflow runner)
> >
> > Thanks,
> > Cham
> >
> > On Fri, Apr 19, 2019 at 4:49 AM Maximilian Michels  wrote:
> >
> > Thank you for your replies.
> >
> > I did not suggest that the Expansion Service does the staging, but 

Re: Artifact staging in cross-language pipelines

2019-04-22 Thread Maximilian Michels
Thanks for the summary Cham. All makes sense. I agree that we want to 
keep the option to manually specify artifacts.



There are few unanswered questions though.
(1) In what form will a transform author specify dependencies ? For example, 
URL to a Maven repo, URL to a local file, blob ?


Going forward, we probably want to support multiple ways. For now, we 
could stick with a URL-based approach with support for different file 
systems. In the future a list of packages to retrieve from Maven/PyPi 
would be useful.



(2) How will dependencies be included in the expansion response proto ? String 
(URL), bytes (blob) ?


I'd go for a list of Protobuf strings first but the format would have to 
evolve for other dependency types.


(3) How will we manage/share transitive dependencies required at runtime ? 


I'd say transitive dependencies have to be included in the list. In case 
of fat jars, they are reduced to a single jar.



(4) How will dependencies be staged for various runner/SDK combinations ? (for 
example, portable runner/Flink, Dataflow runner)


Staging should be no different than it is now, i.e. go through Beam's 
artifact staging service. As long as the protocol is stable, there could 
also be different implementations.


-Max

On 20.04.19 03:08, Chamikara Jayalath wrote:

OK, sounds like this is a good path forward then.

* When starting up the expansion service, the user (that starts up the
service) provides dependencies necessary to expand transforms. We will
later add support for adding new transforms to an already running
expansion service.
* As a part of transform configuration, the transform author has the option
of providing a list of dependencies that will be needed to run the
transform.
* These dependencies will be sent back to the pipeline SDK as a part of the
expansion response and the pipeline SDK will stage these resources.
* The pipeline author has the option of specifying the dependencies using a
pipeline option. (for example, https://github.com/apache/beam/pull/8340)


I think the last option is important to (1) make existing transforms easily
available for cross-language usage without additional configurations (2)
allow pipeline authors to override dependency versions specified in the
transform configuration (for example, to apply security patches) without
updating the expansion service.


There are few unanswered questions though.
(1) In what form will a transform author specify dependencies ? For 
example, URL to a Maven repo, URL to a local file, blob ?
(2) How will dependencies be included in the expansion response proto ? 
String (URL), bytes (blob) ?

(3) How will we manage/share transitive dependencies required at runtime ?
(4) How will dependencies be staged for various runner/SDK combinations 
? (for example, portable runner/Flink, Dataflow runner)


Thanks,
Cham

On Fri, Apr 19, 2019 at 4:49 AM Maximilian Michels  wrote:


Thank you for your replies.

I did not suggest that the Expansion Service does the staging, but it
would return the required resources (e.g. jars) for the external
transform's runtime environment. The client then has to take care of
staging the resources.

The Expansion Service itself also needs resources to do the
expansion. I
assumed those to be provided when starting the expansion service. I
consider it less important but we could also provide a way to add new
transforms to the Expansion Service after startup.

Good point on Docker vs externally provided environments. For the PR
[1]
it will suffice then to add Kafka to the container dependencies. The
"--jar_package" pipeline option is ok for now but I'd like to see work
towards staging resources for external transforms via information
returned by the Expansion Service. That avoids users having to take
care
of including the correct jars in their pipeline options.

These issues are related and we could discuss them in separate threads:

* Auto-discovery of Expansion Service and its external transforms
* Credentials required during expansion / runtime

Thanks,
Max

[1] https://github.com/apache/beam/pull/8322


On 19.04.19 07:35, Thomas Weise wrote:
 > Good discussion :)
 >
 > Initially the expansion service was considered a user
responsibility,
 > but I think that isn't necessarily the case. I can also see the
 > expansion service provided as part of the infrastructure and the
user
 > not wanting to deal with it at all. For example, users may want
to write
 > Python transforms and use external IOs, without being concerned how
 > these IOs are provided. Under such scenario it would be good if:
 >
 > * Expansion service(s) can be auto-discovered via the job service
endpoint
 > * Available external transforms can be discovered via the expansion
 > service(s)
 > * 

Re: Artifact staging in cross-language pipelines

2019-04-19 Thread Chamikara Jayalath
OK, sounds like this is a good path forward then.

* When starting up the expansion service, the user (that starts up the
service) provides dependencies necessary to expand transforms. We will later
add support for adding new transforms to an already running expansion service.
* As a part of transform configuration, the transform author has the option of
providing a list of dependencies that will be needed to run the transform.
* These dependencies will be sent back to the pipeline SDK as a part of the
expansion response and the pipeline SDK will stage these resources.
* The pipeline author has the option of specifying the dependencies using a
pipeline option. (for example, https://github.com/apache/beam/pull/8340)

I think the last option is important to (1) make existing transforms easily
available for cross-language usage without additional configurations (2)
allow pipeline authors to override dependency versions specified in the
transform configuration (for example, to apply security patches) without
updating the expansion service.
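
A tiny sketch of how the SDK could combine the two sources of dependencies, with the pipeline-option values winning so that e.g. a patched version overrides what the expansion service returned; the names and the location format are illustrative only:

def resolve_dependencies(from_expansion, from_pipeline_options):
    """Combine dependencies returned by the expansion service with any the
    pipeline author supplied via a pipeline option; the author's choice wins
    so security patches can be applied without updating the expansion
    service. Keys are artifact names, values are locations."""
    resolved = dict(from_expansion)
    resolved.update(from_pipeline_options)  # author-specified overrides
    return resolved

print(resolve_dependencies(
    {"kafka-clients": "maven://org.apache.kafka:kafka-clients:2.0.0"},
    {"kafka-clients": "maven://org.apache.kafka:kafka-clients:2.0.1"}))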

There are few unanswered questions though.
(1) In what form will a transform author specify dependencies ? For
example, URL to a Maven repo, URL to a local file, blob ?
(2) How will dependencies be included in the expansion response proto ?
String (URL), bytes (blob) ?
(3) How will we manage/share transitive dependencies required at runtime ?
(4) How will dependencies be staged for various runner/SDK combinations ?
(for example, portable runner/Flink, Dataflow runner)

Thanks,
Cham

On Fri, Apr 19, 2019 at 4:49 AM Maximilian Michels  wrote:

> Thank you for your replies.
>
> I did not suggest that the Expansion Service does the staging, but it
> would return the required resources (e.g. jars) for the external
> transform's runtime environment. The client then has to take care of
> staging the resources.
>
> The Expansion Service itself also needs resources to do the expansion. I
> assumed those to be provided when starting the expansion service. I
> consider it less important but we could also provide a way to add new
> transforms to the Expansion Service after startup.
>
> Good point on Docker vs externally provided environments. For the PR [1]
> it will suffice then to add Kafka to the container dependencies. The
> "--jar_package" pipeline option is ok for now but I'd like to see work
> towards staging resources for external transforms via information
> returned by the Expansion Service. That avoids users having to take care
> of including the correct jars in their pipeline options.
>
> These issues are related and we could discuss them in separate threads:
>
> * Auto-discovery of Expansion Service and its external transforms
> * Credentials required during expansion / runtime
>
> Thanks,
> Max
>
> [1] https://github.com/apache/beam/pull/8322
>
> On 19.04.19 07:35, Thomas Weise wrote:
> > Good discussion :)
> >
> > Initially the expansion service was considered a user responsibility,
> > but I think that isn't necessarily the case. I can also see the
> > expansion service provided as part of the infrastructure and the user
> > not wanting to deal with it at all. For example, users may want to write
> > Python transforms and use external IOs, without being concerned how
> > these IOs are provided. Under such scenario it would be good if:
> >
> > * Expansion service(s) can be auto-discovered via the job service
> endpoint
> > * Available external transforms can be discovered via the expansion
> > service(s)
> > * Dependencies for external transforms are part of the metadata returned
> > by expansion service
> >
> > Dependencies could then be staged either by the SDK client or the
> > expansion service. The expansion service could provide the locations to
> > stage to the SDK, it would still be transparent to the user.
> >
> > I also agree with Luke regarding the environments. Docker is the choice
> > for generic deployment. Other environments are used when the flexibility
> > offered by Docker isn't needed (or gets into the way). Then the
> > dependencies are provided in different ways. Whether these are Python
> > packages or jar files, by opting out of Docker the decision is made to
> > manage dependencies externally.
> >
> > Thomas
> >
> >
> > On Thu, Apr 18, 2019 at 6:01 PM Chamikara Jayalath  wrote:
> >
> >
> >
> > On Thu, Apr 18, 2019 at 5:21 PM Chamikara Jayalath  wrote:
> >
> > Thanks for raising the concern about credentials Ankur, I agree
> > that this is a significant issue.
> >
> > On Thu, Apr 18, 2019 at 4:23 PM Lukasz Cwik  wrote:
> >
> > I can understand the concern about credentials, the same
> > access concern will exist for several cross language
> > transforms (mostly IOs) since some will need access to
> > credentials to read/write to an external service.
> >
> > Are there any ideas 

Re: Artifact staging in cross-language pipelines

2019-04-19 Thread Maximilian Michels

Thank you for your replies.

I did not suggest that the Expansion Service does the staging, but it 
would return the required resources (e.g. jars) for the external 
transform's runtime environment. The client then has to take care of 
staging the resources.


The Expansion Service itself also needs resources to do the expansion. I 
assumed those to be provided when starting the expansion service. I 
consider it less important but we could also provide a way to add new 
transforms to the Expansion Service after startup.


Good point on Docker vs externally provided environments. For the PR [1] 
it will suffice then to add Kafka to the container dependencies. The 
"--jar_package" pipeline option is ok for now but I'd like to see work 
towards staging resources for external transforms via information 
returned by the Expansion Service. That avoids users having to take care 
of including the correct jars in their pipeline options.


These issues are related and we could discuss them in separate threads:

* Auto-discovery of Expansion Service and its external transforms
* Credentials required during expansion / runtime

Thanks,
Max

[1] https://github.com/apache/beam/pull/8322

On 19.04.19 07:35, Thomas Weise wrote:

Good discussion :)

Initially the expansion service was considered a user responsibility, 
but I think that isn't necessarily the case. I can also see the 
expansion service provided as part of the infrastructure and the user 
not wanting to deal with it at all. For example, users may want to write 
Python transforms and use external IOs, without being concerned how 
these IOs are provided. Under such scenario it would be good if:


* Expansion service(s) can be auto-discovered via the job service endpoint
* Available external transforms can be discovered via the expansion 
service(s)
* Dependencies for external transforms are part of the metadata returned 
by expansion service


Dependencies could then be staged either by the SDK client or the 
expansion service. The expansion service could provide the locations to 
stage to the SDK, it would still be transparent to the user.


I also agree with Luke regarding the environments. Docker is the choice 
for generic deployment. Other environments are used when the flexibility 
offered by Docker isn't needed (or gets into the way). Then the 
dependencies are provided in different ways. Whether these are Python 
packages or jar files, by opting out of Docker the decision is made to 
manage dependencies externally.


Thomas


On Thu, Apr 18, 2019 at 6:01 PM Chamikara Jayalath  wrote:




On Thu, Apr 18, 2019 at 5:21 PM Chamikara Jayalath  wrote:

Thanks for raising the concern about credentials Ankur, I agree
that this is a significant issue.

On Thu, Apr 18, 2019 at 4:23 PM Lukasz Cwik  wrote:

I can understand the concern about credentials, the same
access concern will exist for several cross language
transforms (mostly IOs) since some will need access to
credentials to read/write to an external service.

Are there any ideas on how credential propagation could work
to these IOs?


There are some cases where existing IO transforms need
credentials to access remote resources, for example, size
estimation, validation, etc. But usually these are optional (or
transform can be configured to not perform these functions).


To clarify, I'm only talking about transform expansion here. Many IO
transforms need read/write access to remote services at run time. So
probably we need to figure out a way to propagate these credentials
anyways.

Can we use these mechanisms for staging? 



I think we'll have to find a way to do one of (1) propagate
credentials to other SDKs (2) allow users to configure SDK
containers to have necessary credentials (3) do the artifact
staging from the pipeline SDK environment which already has
credentials. I prefer (1) or (2) since this will give a
transform the same feature set whether used directly (in the same
SDK language as the transform) or remotely but it might be hard
to do this for an arbitrary service that a transform might
connect to considering the number of ways users can configure
credentials (after an offline discussion with Ankur).


On Thu, Apr 18, 2019 at 3:47 PM Ankur Goenka  wrote:

I agree that the Expansion service knows about the
artifacts required for a cross language transform and
having a prepackage folder/Zip for transforms based on
language makes sense.

One thing to note here is that the expansion service might
not have the same access privilege as 

Re: Artifact staging in cross-language pipelines

2019-04-18 Thread Thomas Weise
Good discussion :)

Initially the expansion service was considered a user responsibility, but I
think that isn't necessarily the case. I can also see the expansion service
provided as part of the infrastructure and the user not wanting to deal
with it at all. For example, users may want to write Python transforms and
use external IOs, without being concerned how these IOs are provided. Under
such scenario it would be good if:

* Expansion service(s) can be auto-discovered via the job service endpoint
* Available external transforms can be discovered via the expansion
service(s)
* Dependencies for external transforms are part of the metadata returned by
expansion service

Dependencies could then be staged either by the SDK client or the expansion
service. The expansion service could provide the locations to stage to the
SDK, it would still be transparent to the user.

I also agree with Luke regarding the environments. Docker is the choice for
generic deployment. Other environments are used when the flexibility
offered by Docker isn't needed (or gets into the way). Then the
dependencies are provided in different ways. Whether these are Python
packages or jar files, by opting out of Docker the decision is made to
manage dependencies externally.

Thomas



On Thu, Apr 18, 2019 at 6:01 PM Chamikara Jayalath 
wrote:

>
>
> On Thu, Apr 18, 2019 at 5:21 PM Chamikara Jayalath 
> wrote:
>
>> Thanks for raising the concern about credentials Ankur, I agree that this
>> is a significant issue.
>>
>> On Thu, Apr 18, 2019 at 4:23 PM Lukasz Cwik  wrote:
>>
>>> I can understand the concern about credentials, the same access concern
>>> will exist for several cross language transforms (mostly IOs) since some
>>> will need access to credentials to read/write to an external service.
>>>
>>> Are there any ideas on how credential propagation could work to these
>>> IOs?
>>>
>>
>> There are some cases where existing IO transforms need credentials to
>> access remote resources, for example, size estimation, validation, etc. But
>> usually these are optional (or transform can be configured to not perform
>> these functions).
>>
>
> To clarify, I'm only talking about transform expansion here. Many IO
> transforms need read/write access to remote services at run time. So
> probably we need to figure out a way to propagate these credentials anyways.
>
>
>>
>>
>>> Can we use these mechanisms for staging?
>>>
>>
>> I think we'll have to find a way to do one of (1) propagate credentials
>> to other SDKs (2) allow users to configure SDK containers to have necessary
>> credentials (3) do the artifact staging from the pipeline SDK environment
>> which already has credentials. I prefer (1) or (2) since this will give a
>> transform the same feature set whether used directly (in the same SDK language
>> as the transform) or remotely but it might be hard to do this for an
>> arbitrary service that a transform might connect to considering the number
>> of ways users can configure credentials (after an offline discussion with
>> Ankur).
>>
>>
>>>
>>>
>>
>>> On Thu, Apr 18, 2019 at 3:47 PM Ankur Goenka  wrote:
>>>
 I agree that the Expansion service knows about the artifacts required
 for a cross language transform and having a prepackage folder/Zip for
 transforms based on language makes sense.

 One thing to note here is that the expansion service might not have the
 same access privilege as the pipeline author and hence might not be able to
 stage artifacts by itself.
 Keeping this in mind I am leaning towards making Expansion service
 provide all the required artifacts to the user and let the user stage the
 artifacts as regular artifacts.
 At this time, we only have Beam File System based artifact staging
 which uses local credentials to access different file systems. Even a
 docker based expansion service running on local machine might not have the
 same access privileges.

 In brief this is what I am leaning toward.
 User call for pipeline submission -> Expansion service provide cross
 language transforms and relevant artifacts to the Sdk -> Sdk Submits the
 pipeline to Jobserver and Stages user and cross language artifacts to
 artifacts staging service


 On Thu, Apr 18, 2019 at 2:33 PM Chamikara Jayalath <
 chamik...@google.com> wrote:

>
>
> On Thu, Apr 18, 2019 at 2:12 PM Lukasz Cwik  wrote:
>
>> Note that Max did ask whether making the expansion service do the
>> staging made sense, and my first line was agreeing with that direction 
>> and
>> expanding on how it could be done (so this is really Max's idea or from
>> whomever he got the idea from).
>>
>
> +1 to what Max said then :)
>
>
>>
>> I believe a lot of the value of the expansion service is not having
>> users need to be aware of all the SDK specific dependencies when they are
>> trying to create a pipeline, 

Re: Artifact staging in cross-language pipelines

2019-04-18 Thread Chamikara Jayalath
On Thu, Apr 18, 2019 at 5:21 PM Chamikara Jayalath 
wrote:

> Thanks for raising the concern about credentials Ankur, I agree that this
> is a significant issue.
>
> On Thu, Apr 18, 2019 at 4:23 PM Lukasz Cwik  wrote:
>
>> I can understand the concern about credentials, the same access concern
>> will exist for several cross language transforms (mostly IOs) since some
>> will need access to credentials to read/write to an external service.
>>
>> Are there any ideas on how credential propagation could work to these IOs?
>>
>
> There are some cases where existing IO transforms need credentials to
> access remote resources, for example, size estimation, validation, etc. But
> usually these are optional (or transform can be configured to not perform
> these functions).
>

To clarify, I'm only talking about transform expansion here. Many IO
transforms need read/write access to remote services at run time. So
probably we need to figure out a way to propagate these credentials anyways.


>
>
>> Can we use these mechanisms for staging?
>>
>
> I think we'll have to find a way to do one of (1) propagate credentials to
> other SDKs (2) allow users to configure SDK containers to have necessary
> credentials (3) do the artifact staging from the pipeline SDK environment
> which already has credentials. I prefer (1) or (2) since this will give a
> transform the same feature set whether used directly (in the same SDK language
> as the transform) or remotely but it might be hard to do this for an
> arbitrary service that a transform might connect to considering the number
> of ways users can configure credentials (after an offline discussion with
> Ankur).
>
>
>>
>>
>
>> On Thu, Apr 18, 2019 at 3:47 PM Ankur Goenka  wrote:
>>
>>> I agree that the Expansion service knows about the artifacts required
>>> for a cross language transform and having a prepackage folder/Zip for
>>> transforms based on language makes sense.
>>>
>>> One thing to note here is that the expansion service might not have the same
>>> access privilege as the pipeline author and hence might not be able to
>>> stage artifacts by itself.
>>> Keeping this in mind I am leaning towards making Expansion service
>>> provide all the required artifacts to the user and let the user stage the
>>> artifacts as regular artifacts.
>>> At this time, we only have Beam File System based artifact staging which
>>> uses local credentials to access different file systems. Even a docker
>>> based expansion service running on local machine might not have the same
>>> access privileges.
>>>
>>> In brief this is what I am leaning toward.
>>> User call for pipeline submission -> Expansion service provide cross
>>> language transforms and relevant artifacts to the Sdk -> Sdk Submits the
>>> pipeline to Jobserver and Stages user and cross language artifacts to
>>> artifacts staging service
>>>
>>>
>>> On Thu, Apr 18, 2019 at 2:33 PM Chamikara Jayalath 
>>> wrote:
>>>


 On Thu, Apr 18, 2019 at 2:12 PM Lukasz Cwik  wrote:

> Note that Max did ask whether making the expansion service do the
> staging made sense, and my first line was agreeing with that direction and
> expanding on how it could be done (so this is really Max's idea or from
> whomever he got the idea from).
>

 +1 to what Max said then :)


>
> I believe a lot of the value of the expansion service is that users do not
> need to be aware of all the SDK-specific dependencies when they are trying
> to create a pipeline; only the "user" who is launching the expansion
> service may need to be. And in that case we can have a prepackaged expansion
> service application that does what most users would want (e.g. expansion
> service as a docker container, a single bundled jar, ...). We (the Apache
> Beam community) could choose to host a default implementation of the
> expansion service as well.
>

 I'm not against this, but I think this is a secondary, more advanced
 use-case. For a Beam user who needs to use a Java transform that they
 already have in a Python pipeline, we should provide a way to allow
 starting up an expansion service (with the dependencies needed for that) and
 running a pipeline that uses this external Java transform (with the
 dependencies that are needed at runtime). Probably it'll be enough to
 allow providing all dependencies when starting up the expansion service and
 to allow the expansion service to do the staging of jars as well. I don't
 see a need to include the list of jars in the ExpansionResponse sent to the
 Python SDK.


>
> On Thu, Apr 18, 2019 at 2:02 PM Chamikara Jayalath <
> chamik...@google.com> wrote:
>
>> I think there are two kinds of dependencies we have to consider.
>>
>> (1) Dependencies that are needed to expand the transform.
>>
>> These have to be provided when we start the expansion service so that
>> available 

Re: Artifact staging in cross-language pipelines

2019-04-18 Thread Chamikara Jayalath
Thanks for raising the concern about credentials, Ankur. I agree that this
is a significant issue.

On Thu, Apr 18, 2019 at 4:23 PM Lukasz Cwik  wrote:

> I can understand the concern about credentials; the same access concern
> will exist for several cross-language transforms (mostly IOs), since some
> will need access to credentials to read/write to an external service.
>
> Are there any ideas on how credential propagation could work for these IOs?
>

There are some cases where existing IO transforms need credentials to
access remote resources, for example for size estimation, validation, etc.
But usually these are optional (or the transform can be configured not to
perform these functions).


> Can we use these mechanisms for staging?
>

I think we'll have to find a way to do one of: (1) propagate credentials to
other SDKs, (2) allow users to configure SDK containers to have the necessary
credentials, or (3) do the artifact staging from the pipeline SDK environment,
which already has credentials. I prefer (1) or (2) since this will give a
transform the same feature set whether it is used directly (in the same SDK
language as the transform) or remotely, but it might be hard to do this for
an arbitrary service that a transform might connect to, considering the
number of ways users can configure credentials (after an offline discussion
with Ankur).


>
>

> On Thu, Apr 18, 2019 at 3:47 PM Ankur Goenka  wrote:
>
>> I agree that the expansion service knows about the artifacts required for
>> a cross-language transform, and having a prepackaged folder/zip for
>> transforms based on language makes sense.
>>
>> One thing to note here is that the expansion service might not have the
>> same access privileges as the pipeline author and hence might not be able
>> to stage artifacts by itself.
>> Keeping this in mind, I am leaning towards making the expansion service
>> provide all the required artifacts to the user and letting the user stage
>> the artifacts as regular artifacts.
>> At this time, we only have Beam FileSystem-based artifact staging, which
>> uses local credentials to access different file systems. Even a docker-
>> based expansion service running on a local machine might not have the same
>> access privileges.
>>
>> In brief, this is what I am leaning toward:
>> the user calls for pipeline submission -> the expansion service provides
>> the cross-language transforms and relevant artifacts to the SDK -> the SDK
>> submits the pipeline to the job server and stages both the user and
>> cross-language artifacts to the artifact staging service
>>
>>
>> On Thu, Apr 18, 2019 at 2:33 PM Chamikara Jayalath 
>> wrote:
>>
>>>
>>>
>>> On Thu, Apr 18, 2019 at 2:12 PM Lukasz Cwik  wrote:
>>>
 Note that Max did ask whether making the expansion service do the
 staging made sense, and my first line was agreeing with that direction and
 expanding on how it could be done (so this is really Max's idea or from
 whomever he got the idea from).

>>>
>>> +1 to what Max said then :)
>>>
>>>

 I believe a lot of the value of the expansion service is that users do not
 need to be aware of all the SDK-specific dependencies when they are
 trying to create a pipeline; only the "user" who is launching the expansion
 service may need to be. And in that case we can have a prepackaged expansion
 service application that does what most users would want (e.g. expansion
 service as a docker container, a single bundled jar, ...). We (the Apache
 Beam community) could choose to host a default implementation of the
 expansion service as well.

>>>
>>> I'm not against this, but I think this is a secondary, more advanced
>>> use-case. For a Beam user who needs to use a Java transform that they
>>> already have in a Python pipeline, we should provide a way to allow
>>> starting up an expansion service (with the dependencies needed for that)
>>> and running a pipeline that uses this external Java transform (with the
>>> dependencies that are needed at runtime). Probably it'll be enough to
>>> allow providing all dependencies when starting up the expansion service and
>>> to allow the expansion service to do the staging of jars as well. I don't
>>> see a need to include the list of jars in the ExpansionResponse sent to the
>>> Python SDK.
>>>
>>>

 On Thu, Apr 18, 2019 at 2:02 PM Chamikara Jayalath <
 chamik...@google.com> wrote:

> I think there are two kinds of dependencies we have to consider.
>
> (1) Dependencies that are needed to expand the transform.
>
> These have to be provided when we start the expansion service so that
> available external transforms are correctly registered with the expansion
> service.
>
> (2) Dependencies that are not needed at expansion but may be needed at
> runtime.
>
> I think in both cases, users have to provide these dependencies either
> when the expansion service is started or when a pipeline is being executed.
>
> Max, I'm not sure why the expansion service will need to provide

Re: Artifact staging in cross-language pipelines

2019-04-18 Thread Lukasz Cwik
I can understand the concern about credentials; the same access concern
will exist for several cross-language transforms (mostly IOs), since some
will need access to credentials to read/write to an external service.

Are there any ideas on how credential propagation could work for these IOs?
Can we use these mechanisms for staging?

On Thu, Apr 18, 2019 at 3:47 PM Ankur Goenka  wrote:

> I agree that the expansion service knows about the artifacts required for
> a cross-language transform, and having a prepackaged folder/zip for
> transforms based on language makes sense.
>
> One thing to note here is that the expansion service might not have the
> same access privileges as the pipeline author and hence might not be able
> to stage artifacts by itself.
> Keeping this in mind, I am leaning towards making the expansion service
> provide all the required artifacts to the user and letting the user stage
> the artifacts as regular artifacts.
> At this time, we only have Beam FileSystem-based artifact staging, which
> uses local credentials to access different file systems. Even a docker-
> based expansion service running on a local machine might not have the same
> access privileges.
>
> In brief, this is what I am leaning toward:
> the user calls for pipeline submission -> the expansion service provides
> the cross-language transforms and relevant artifacts to the SDK -> the SDK
> submits the pipeline to the job server and stages both the user and
> cross-language artifacts to the artifact staging service
>
>
> On Thu, Apr 18, 2019 at 2:33 PM Chamikara Jayalath 
> wrote:
>
>>
>>
>> On Thu, Apr 18, 2019 at 2:12 PM Lukasz Cwik  wrote:
>>
>>> Note that Max did ask whether making the expansion service do the
>>> staging made sense, and my first line was agreeing with that direction and
>>> expanding on how it could be done (so this is really Max's idea or from
>>> whomever he got the idea from).
>>>
>>
>> +1 to what Max said then :)
>>
>>
>>>
>>> I believe a lot of the value of the expansion service is that users do not
>>> need to be aware of all the SDK-specific dependencies when they are
>>> trying to create a pipeline; only the "user" who is launching the expansion
>>> service may need to be. And in that case we can have a prepackaged expansion
>>> service application that does what most users would want (e.g. expansion
>>> service as a docker container, a single bundled jar, ...). We (the Apache
>>> Beam community) could choose to host a default implementation of the
>>> expansion service as well.
>>>
>>
>> I'm not against this, but I think this is a secondary, more advanced
>> use-case. For a Beam user who needs to use a Java transform that they
>> already have in a Python pipeline, we should provide a way to allow
>> starting up an expansion service (with the dependencies needed for that)
>> and running a pipeline that uses this external Java transform (with the
>> dependencies that are needed at runtime). Probably it'll be enough to
>> allow providing all dependencies when starting up the expansion service and
>> to allow the expansion service to do the staging of jars as well. I don't
>> see a need to include the list of jars in the ExpansionResponse sent to the
>> Python SDK.
>>
>>
>>>
>>> On Thu, Apr 18, 2019 at 2:02 PM Chamikara Jayalath 
>>> wrote:
>>>
 I think there are two kinds of dependencies we have to consider.

 (1) Dependencies that are needed to expand the transform.

 These have to be provided when we start the expansion service so that
 available external transforms are correctly registered with the expansion
 service.

 (2) Dependencies that are not needed at expansion but may be needed at
 runtime.

 I think in both cases, users have to provide these dependencies either
 when the expansion service is started or when a pipeline is being executed.

 Max, I'm not sure why the expansion service will need to provide
 dependencies to the user, since the user will already be aware of these. Are
 you talking about an expansion service that is readily available and will be
 used by many Beam users? I think such a (possibly long-running) service
 will have to maintain a repository of transforms and should have a mechanism
 for registering new transforms and discovering already registered
 transforms, etc. I think there's more design work needed to make the
 transform expansion service support such use-cases. Currently, I think
 allowing the pipeline author to provide the jars when starting the expansion
 service and when executing the pipeline will be adequate.

 Regarding the entity that will perform the staging, I like Luke's idea
 of allowing the expansion service to do the staging (of jars provided by the
 user). The notion of artifacts and how they are extracted/represented is
 SDK-dependent, so if the pipeline SDK tries to do this we have to add
 n x (n-1) configurations (for n SDKs).

 - Cham

 On Thu, Apr 18, 2019 at 11:45 AM Lukasz Cwik  wrote:

> 

Re: Artifact staging in cross-language pipelines

2019-04-18 Thread Ankur Goenka
I agree that the expansion service knows about the artifacts required for a
cross-language transform, and having a prepackaged folder/zip for transforms
based on language makes sense.

One thing to note here is that the expansion service might not have the same
access privileges as the pipeline author and hence might not be able to
stage artifacts by itself.
Keeping this in mind, I am leaning towards making the expansion service
provide all the required artifacts to the user and letting the user stage
the artifacts as regular artifacts.
At this time, we only have Beam FileSystem-based artifact staging, which
uses local credentials to access different file systems. Even a docker-based
expansion service running on a local machine might not have the same
access privileges.

In brief, this is what I am leaning toward:
the user calls for pipeline submission -> the expansion service provides the
cross-language transforms and relevant artifacts to the SDK -> the SDK
submits the pipeline to the job server and stages both the user and
cross-language artifacts to the artifact staging service
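
To make that flow concrete, here is a rough sketch of the SDK-side steps;
all of the helper names below are hypothetical pseudocode and only mark
where existing SDK machinery would plug in:

    # Illustrative pseudocode for the proposed flow (no such helpers exist today).
    def submit_pipeline(pipeline, user_artifacts, expansion_service, job_server):
        # 1. The user calls for pipeline submission; external transforms are
        #    expanded by the expansion service, which also reports the
        #    artifacts (e.g. jars) those transforms need at runtime.
        expanded_pipeline, external_artifacts = expansion_service.expand(pipeline)

        # 2. The SDK, running with the pipeline author's credentials, stages
        #    both the user's artifacts and the cross-language artifacts through
        #    the regular artifact staging service.
        staging = job_server.artifact_staging_service()
        staging_token = staging.stage(user_artifacts + external_artifacts)

        # 3. The pipeline is submitted to the job server with that staging
        #    token, exactly as for a single-language pipeline.
        return job_server.run(expanded_pipeline, staging_token=staging_token)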


On Thu, Apr 18, 2019 at 2:33 PM Chamikara Jayalath 
wrote:

>
>
> On Thu, Apr 18, 2019 at 2:12 PM Lukasz Cwik  wrote:
>
>> Note that Max did ask whether making the expansion service do the staging
>> made sense, and my first line was agreeing with that direction and
>> expanding on how it could be done (so this is really Max's idea or from
>> whomever he got the idea from).
>>
>
> +1 to what Max said then :)
>
>
>>
>> I believe a lot of the value of the expansion service is that users do not
>> need to be aware of all the SDK-specific dependencies when they are trying
>> to create a pipeline; only the "user" who is launching the expansion
>> service may need to be. And in that case we can have a prepackaged expansion
>> service application that does what most users would want (e.g. expansion
>> service as a docker container, a single bundled jar, ...). We (the Apache
>> Beam community) could choose to host a default implementation of the
>> expansion service as well.
>>
>
> I'm not against this, but I think this is a secondary, more advanced
> use-case. For a Beam user who needs to use a Java transform that they
> already have in a Python pipeline, we should provide a way to allow
> starting up an expansion service (with the dependencies needed for that) and
> running a pipeline that uses this external Java transform (with the
> dependencies that are needed at runtime). Probably it'll be enough to
> allow providing all dependencies when starting up the expansion service and
> to allow the expansion service to do the staging of jars as well. I don't
> see a need to include the list of jars in the ExpansionResponse sent to the
> Python SDK.
>
>
>>
>> On Thu, Apr 18, 2019 at 2:02 PM Chamikara Jayalath 
>> wrote:
>>
>>> I think there are two kinds of dependencies we have to consider.
>>>
>>> (1) Dependencies that are needed to expand the transform.
>>>
>>> These have to be provided when we start the expansion service so that
>>> available external transforms are correctly registered with the expansion
>>> service.
>>>
>>> (2) Dependencies that are not needed at expansion but may be needed at
>>> runtime.
>>>
>>> I think in both cases, users have to provide these dependencies either
>>> when the expansion service is started or when a pipeline is being executed.
>>>
>>> Max, I'm not sure why the expansion service will need to provide
>>> dependencies to the user, since the user will already be aware of these.
>>> Are you talking about an expansion service that is readily available and
>>> will be used by many Beam users? I think such a (possibly long-running)
>>> service will have to maintain a repository of transforms and should have a
>>> mechanism for registering new transforms and discovering already registered
>>> transforms, etc. I think there's more design work needed to make the
>>> transform expansion service support such use-cases. Currently, I think
>>> allowing the pipeline author to provide the jars when starting the
>>> expansion service and when executing the pipeline will be adequate.
>>>
>>> Regarding the entity that will perform the staging, I like Luke's idea
>>> of allowing the expansion service to do the staging (of jars provided by
>>> the user). The notion of artifacts and how they are extracted/represented
>>> is SDK-dependent, so if the pipeline SDK tries to do this we have to add
>>> n x (n-1) configurations (for n SDKs).
>>>
>>> - Cham
>>>
>>> On Thu, Apr 18, 2019 at 11:45 AM Lukasz Cwik  wrote:
>>>
 We can expose the artifact staging endpoint and artifact token to allow
 the expansion service to upload any resources its environment may need. For
 example, the expansion service for the Beam Java SDK would be able to
 upload jars.

 In the "docker" environment, the Apache Beam Java SDK harness container
 would fetch the relevant artifacts for itself and be able to execute the
 pipeline. (Note that a docker environment could skip all this artifact
 staging if the docker 

Re: Artifact staging in cross-language pipelines

2019-04-18 Thread Chamikara Jayalath
On Thu, Apr 18, 2019 at 2:12 PM Lukasz Cwik  wrote:

> Note that Max did ask whether making the expansion service do the staging
> made sense, and my first line was agreeing with that direction and
> expanding on how it could be done (so this is really Max's idea or from
> whomever he got the idea from).
>

+1 to what Max said then :)


>
> I believe a lot of the value of the expansion service is that users do not
> need to be aware of all the SDK-specific dependencies when they are trying
> to create a pipeline; only the "user" who is launching the expansion
> service may need to be. And in that case we can have a prepackaged expansion
> service application that does what most users would want (e.g. expansion
> service as a docker container, a single bundled jar, ...). We (the Apache
> Beam community) could choose to host a default implementation of the
> expansion service as well.
>

I'm not against this, but I think this is a secondary, more advanced
use-case. For a Beam user who needs to use a Java transform that they
already have in a Python pipeline, we should provide a way to allow
starting up an expansion service (with the dependencies needed for that) and
running a pipeline that uses this external Java transform (with the
dependencies that are needed at runtime). Probably it'll be enough to
allow providing all dependencies when starting up the expansion service and
to allow the expansion service to do the staging of jars as well. I don't
see a need to include the list of jars in the ExpansionResponse sent to the
Python SDK.
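
For example, "providing all dependencies when starting up the expansion
service" could be as simple as putting the transform's jars on the service's
classpath. A rough sketch (the jar paths and port are placeholders, and the
ExpansionService entry point should be checked against the Beam version in
use):

    # Sketch: start a Java expansion service whose classpath already contains
    # every jar the external transform needs; those same jars are then what it
    # could stage (or hand back) for use at runtime.
    import subprocess

    EXPANSION_PORT = '8097'  # placeholder port
    classpath = ':'.join([
        '/path/to/beam-sdks-java-expansion-service.jar',  # placeholder jar paths
        '/path/to/beam-sdks-java-io-kafka.jar',
        '/path/to/kafka-clients.jar',
    ])
    expansion_service = subprocess.Popen(
        ['java', '-cp', classpath,
         'org.apache.beam.sdk.expansion.service.ExpansionService',  # verify class for your version
         EXPANSION_PORT])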


>
> On Thu, Apr 18, 2019 at 2:02 PM Chamikara Jayalath 
> wrote:
>
>> I think there are two kinds of dependencies we have to consider.
>>
>> (1) Dependencies that are needed to expand the transform.
>>
>> These have to be provided when we start the expansion service so that
>> available external transforms are correctly registered with the expansion
>> service.
>>
>> (2) Dependencies that are not needed at expansion but may be needed at
>> runtime.
>>
>> I think in both cases, users have to provide these dependencies either
>> when the expansion service is started or when a pipeline is being executed.
>>
>> Max, I'm not sure why the expansion service will need to provide
>> dependencies to the user, since the user will already be aware of these.
>> Are you talking about an expansion service that is readily available and
>> will be used by many Beam users? I think such a (possibly long-running)
>> service will have to maintain a repository of transforms and should have a
>> mechanism for registering new transforms and discovering already registered
>> transforms, etc. I think there's more design work needed to make the
>> transform expansion service support such use-cases. Currently, I think
>> allowing the pipeline author to provide the jars when starting the expansion
>> service and when executing the pipeline will be adequate.
>>
>> Regarding the entity that will perform the staging, I like Luke's idea of
>> allowing the expansion service to do the staging (of jars provided by the
>> user). The notion of artifacts and how they are extracted/represented is
>> SDK-dependent, so if the pipeline SDK tries to do this we have to add
>> n x (n-1) configurations (for n SDKs).
>>
>> - Cham
>>
>> On Thu, Apr 18, 2019 at 11:45 AM Lukasz Cwik  wrote:
>>
>>> We can expose the artifact staging endpoint and artifact token to allow
>>> the expansion service to upload any resources its environment may need. For
>>> example, the expansion service for the Beam Java SDK would be able to
>>> upload jars.
>>>
>>> In the "docker" environment, the Apache Beam Java SDK harness container
>>> would fetch the relevant artifacts for itself and be able to execute the
>>> pipeline. (Note that a docker environment could skip all this artifact
>>> staging if the docker environment contained all necessary artifacts).
>>>
>>> For the existing "external" environment, it should already come with all
>>> the resources prepackaged wherever "external" points to. The "process"
>>> based environment could choose to use the artifact staging service to fetch
>>> those resources associated with its process or it could follow the same
>>> pattern that "external" would do and already contain all the prepackaged
>>> resources. Note that both "external" and "process" will require the
>>> instance of the expansion service to be specialized for those environments
>>> which is why the default for the expansion service should be the
>>> "docker" environment.
>>>
>>> Note that a major reason for going with docker containers as the
>>> environment that all runners should support is that containers provide a
>>> solution for this exact issue. Both the "process" and "external"
>>> environments are explicitly limiting, and expanding their capabilities will
>>> quickly have us building something like a docker container, because we'll
>>> quickly find ourselves solving the same problems that docker containers
>>> solve (resources, file layout, permissions, ...)
>>>
>>>
>>>

Re: Artifact staging in cross-language pipelines

2019-04-18 Thread Lukasz Cwik
Note that Max did ask whether making the expansion service do the staging
made sense, and my first line was agreeing with that direction and
expanding on how it could be done (so this is really Max's idea or from
whomever he got the idea from).

I believe a lot of the value of the expansion service is that users do not
need to be aware of all the SDK-specific dependencies when they are trying
to create a pipeline; only the "user" who is launching the expansion
service may need to be. And in that case we can have a prepackaged expansion
service application that does what most users would want (e.g. expansion
service as a docker container, a single bundled jar, ...). We (the Apache
Beam community) could choose to host a default implementation of the
expansion service as well.

On Thu, Apr 18, 2019 at 2:02 PM Chamikara Jayalath 
wrote:

> I think there are two kinds of dependencies we have to consider.
>
> (1) Dependencies that are needed to expand the transform.
>
> These have to be provided when we start the expansion service so that
> available external transforms are correctly registered with the expansion
> service.
>
> (2) Dependencies that are not needed at expansion but may be needed at
> runtime.
>
> I think in both cases, users have to provide these dependencies either
> when the expansion service is started or when a pipeline is being executed.
>
> Max, I'm not sure why the expansion service will need to provide
> dependencies to the user, since the user will already be aware of these.
> Are you talking about an expansion service that is readily available and
> will be used by many Beam users? I think such a (possibly long-running)
> service will have to maintain a repository of transforms and should have a
> mechanism for registering new transforms and discovering already registered
> transforms, etc. I think there's more design work needed to make the
> transform expansion service support such use-cases. Currently, I think
> allowing the pipeline author to provide the jars when starting the expansion
> service and when executing the pipeline will be adequate.
>
> Regarding the entity that will perform the staging, I like Luke's idea of
> allowing the expansion service to do the staging (of jars provided by the
> user). The notion of artifacts and how they are extracted/represented is
> SDK-dependent, so if the pipeline SDK tries to do this we have to add
> n x (n-1) configurations (for n SDKs).
>
> - Cham
>
> On Thu, Apr 18, 2019 at 11:45 AM Lukasz Cwik  wrote:
>
>> We can expose the artifact staging endpoint and artifact token to allow
>> the expansion service to upload any resources its environment may need. For
>> example, the expansion service for the Beam Java SDK would be able to
>> upload jars.
>>
>> In the "docker" environment, the Apache Beam Java SDK harness container
>> would fetch the relevant artifacts for itself and be able to execute the
>> pipeline. (Note that a docker environment could skip all this artifact
>> staging if the docker environment contained all necessary artifacts).
>>
>> For the existing "external" environment, it should already come with all
>> the resources prepackaged wherever "external" points to. The "process"
>> based environment could choose to use the artifact staging service to fetch
>> those resources associated with its process or it could follow the same
>> pattern that "external" would do and already contain all the prepackaged
>> resources. Note that both "external" and "process" will require the
>> instance of the expansion service to be specialized for those environments
>> which is why the default for the expansion service should be the
>> "docker" environment.
>>
>> Note that a major reason for going with docker containers as the
>> environment that all runners should support is that containers provide a
>> solution for this exact issue. Both the "process" and "external"
>> environments are explicitly limiting, and expanding their capabilities will
>> quickly have us building something like a docker container, because we'll
>> quickly find ourselves solving the same problems that docker containers
>> solve (resources, file layout, permissions, ...)
>>
>>
>>
>>
>> On Thu, Apr 18, 2019 at 11:21 AM Maximilian Michels 
>> wrote:
>>
>>> Hi everyone,
>>>
>>> We have previously merged support for configuring transforms across
>>> languages. Please see Cham's summary on the discussion [1]. There is
>>> also a design document [2].
>>>
>>> Subsequently, we've added wrappers for cross-language transforms to the
>>> Python SDK, i.e. GenerateSequence, ReadFromKafka, and there is a pending
>>> PR [1] for WriteToKafka. All of them utilize Java transforms via
>>> cross-language configuration.
>>>
>>> That is all pretty exciting :)
>>>
>>> We still have some issues to solve, one being how to stage artifacts from
>>> a foreign environment. When we run external transforms which are part of
>>> Beam's core (e.g. GenerateSequence), we have them available in the SDK
>>> Harness. However, when they 

Re: Artifact staging in cross-language pipelines

2019-04-18 Thread Chamikara Jayalath
I think there are two kinds of dependencies we have to consider.

(1) Dependencies that are needed to expand the transform.

These have to be provided when we start the expansion service so that
available external transforms are correctly registered with the expansion
service.

(2) Dependencies that are not needed at expansion but may be needed at
runtime.

I think in both cases, users have to provide these dependencies either when
the expansion service is started or when a pipeline is being executed.

Max, I'm not sure why the expansion service will need to provide dependencies
to the user, since the user will already be aware of these. Are you talking
about an expansion service that is readily available and will be used by
many Beam users? I think such a (possibly long-running) service will have
to maintain a repository of transforms and should have a mechanism for
registering new transforms and discovering already registered transforms,
etc. I think there's more design work needed to make the transform expansion
service support such use-cases. Currently, I think allowing the pipeline
author to provide the jars when starting the expansion service and when
executing the pipeline will be adequate.

Regarding the entity that will perform the staging, I like Luke's idea of
allowing the expansion service to do the staging (of jars provided by the
user). The notion of artifacts and how they are extracted/represented is
SDK-dependent, so if the pipeline SDK tries to do this we have to add
n x (n-1) configurations (for n SDKs).

- Cham

On Thu, Apr 18, 2019 at 11:45 AM Lukasz Cwik  wrote:

> We can expose the artifact staging endpoint and artifact token to allow
> the expansion service to upload any resources its environment may need. For
> example, the expansion service for the Beam Java SDK would be able to
> upload jars.
>
> In the "docker" environment, the Apache Beam Java SDK harness container
> would fetch the relevant artifacts for itself and be able to execute the
> pipeline. (Note that a docker environment could skip all this artifact
> staging if the docker environment contained all necessary artifacts).
>
> For the existing "external" environment, it should already come with all
> the resources prepackaged wherever "external" points to. The "process"
> based environment could choose to use the artifact staging service to fetch
> those resources associated with its process or it could follow the same
> pattern that "external" would do and already contain all the prepackaged
> resources. Note that both "external" and "process" will require the
> instance of the expansion service to be specialized for those environments
> which is why the default for the expansion service should be the
> "docker" environment.
>
> Note that a major reason for going with docker containers as the
> environment that all runners should support is that containers provide a
> solution for this exact issue. Both the "process" and "external"
> environments are explicitly limiting, and expanding their capabilities will
> quickly have us building something like a docker container, because we'll
> quickly find ourselves solving the same problems that docker containers
> solve (resources, file layout, permissions, ...)
>
>
>
>
> On Thu, Apr 18, 2019 at 11:21 AM Maximilian Michels 
> wrote:
>
>> Hi everyone,
>>
>> We have previously merged support for configuring transforms across
>> languages. Please see Cham's summary on the discussion [1]. There is
>> also a design document [2].
>>
>> Subsequently, we've added wrappers for cross-language transforms to the
>> Python SDK, i.e. GenerateSequence, ReadFromKafka, and there is a pending
>> PR [1] for WriteToKafka. All of them utilize Java transforms via
>> cross-language configuration.
>>
>> That is all pretty exciting :)
>>
>> We still have some issues to solve, one being how to stage artifacts from
>> a foreign environment. When we run external transforms which are part of
>> Beam's core (e.g. GenerateSequence), we have them available in the SDK
>> Harness. However, when they are not (e.g. KafkaIO) we need to stage the
>> necessary files.
>>
>> For my PR [3] I've naively added ":beam-sdks-java-io-kafka" to the SDK
>> Harness which caused dependency problems [4]. Those could be resolved
>> but the bigger question is how to stage artifacts for external
>> transforms programmatically?
>>
>> Heejong has solved this by adding a "--jar_package" option to the Python
>> SDK to stage Java files [5]. I think that is a better solution than
>> adding required Jars to the SDK Harness directly, but it is not very
>> convenient for users.
>>
>> I've discussed this today with Thomas and we both figured that the
>> expansion service needs to provide a list of required Jars with the
>> ExpansionResponse it provides. It's not entirely clear how we determine
>> which artifacts are necessary for an external transform. We could just
>> dump the entire classpath like we do in PipelineResources for Java
>> pipelines. This provides many 

Re: Artifact staging in cross-language pipelines

2019-04-18 Thread Lukasz Cwik
We can expose the artifact staging endpoint and artifact token to allow the
expansion service to upload any resources its environment may need. For
example, the expansion service for the Beam Java SDK would be able to
upload jars.

In the "docker" environment, the Apache Beam Java SDK harness container
would fetch the relevant artifacts for itself and be able to execute the
pipeline. (Note that a docker environment could skip all this artifact
staging if the docker environment contained all necessary artifacts).

For the existing "external" environment, it should already come with all
the resources prepackaged wherever "external" points to. The "process"
based environment could choose to use the artifact staging service to fetch
those resources associated with its process or it could follow the same
pattern that "external" would do and already contain all the prepackaged
resources. Note that both "external" and "process" will require the
instance of the expansion service to be specialized for those environments
which is why the default for the expansion service should be the
"docker" environment.

Note that a major reason for going with docker containers as the
environment that all runners should support is that containers provide a
solution for this exact issue. Both the "process" and "external"
environments are explicitly limiting, and expanding their capabilities will
quickly have us building something like a docker container, because we'll
quickly find ourselves solving the same problems that docker containers
solve (resources, file layout, permissions, ...)




On Thu, Apr 18, 2019 at 11:21 AM Maximilian Michels  wrote:

> Hi everyone,
>
> We have previously merged support for configuring transforms across
> languages. Please see Cham's summary on the discussion [1]. There is
> also a design document [2].
>
> Subsequently, we've added wrappers for cross-language transforms to the
> Python SDK, i.e. GenerateSequence, ReadFromKafka, and there is a pending
> PR [1] for WriteToKafka. All of them utilize Java transforms via
> cross-language configuration.
>
> That is all pretty exciting :)
>
> We still have some issues to solve, one being how to stage artifacts from
> a foreign environment. When we run external transforms which are part of
> Beam's core (e.g. GenerateSequence), we have them available in the SDK
> Harness. However, when they are not (e.g. KafkaIO) we need to stage the
> necessary files.
>
> For my PR [3] I've naively added ":beam-sdks-java-io-kafka" to the SDK
> Harness which caused dependency problems [4]. Those could be resolved
> but the bigger question is how to stage artifacts for external
> transforms programmatically?
>
> Heejong has solved this by adding a "--jar_package" option to the Python
> SDK to stage Java files [5]. I think that is a better solution than
> adding required Jars to the SDK Harness directly, but it is not very
> convenient for users.
>
> I've discussed this today with Thomas and we both figured that the
> expansion service needs to provide a list of required Jars with the
> ExpansionResponse it provides. It's not entirely clear how we determine
> which artifacts are necessary for an external transform. We could just
> dump the entire classpath like we do in PipelineResources for Java
> pipelines. This provides many unneeded classes but would work.
>
> Do you think it makes sense for the expansion service to provide the
> artifacts? Perhaps you have a better idea how to resolve the staging
> problem in cross-language pipelines?
>
> Thanks,
> Max
>
> [1]
>
> https://lists.apache.org/thread.html/b99ba8527422e31ec7bb7ad9dc3a6583551ea392ebdc5527b5fb4a67@%3Cdev.beam.apache.org%3E
>
> [2] https://s.apache.org/beam-cross-language-io
>
> [3] https://github.com/apache/beam/pull/8322#discussion_r276336748
>
> [4] Dependency graph for beam-runners-direct-java:
>
> beam-runners-direct-java -> sdks-java-harness -> beam-sdks-java-io-kafka
> -> beam-runners-direct-java ... the cycle continues
>
> Beam-runners-direct-java depends on sdks-java-harness due
> to the infamous Universal Local Runner. Beam-sdks-java-io-kafka depends
> on beam-runners-direct-java for running tests.
>
> [5] https://github.com/apache/beam/pull/8340
>
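
(For reference, the "--jar_package" option from [5] mentioned above would
look roughly like the following from the user's side. The option name is
taken from the message; its exact spelling and semantics should be checked
against that PR, and the paths below are placeholders.)

    # Sketch: the pipeline author lists the extra Java jars to stage alongside
    # the usual Python pipeline options (option name/semantics per [5]).
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions([
        '--runner=PortableRunner',
        '--job_endpoint=localhost:8099',                       # placeholder endpoint
        '--jar_package=/path/to/beam-sdks-java-io-kafka.jar',  # placeholder jar path
    ])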