Re: Artifact staging in cross-language pipelines
Good discussion :) Initially the expansion service was considered a user responsibility, but I think that isn't necessarily the case. I can also see the expansion service provided as part of the infrastructure and the user not wanting to deal with it at all. For example, users may want to write Python transforms and use external IOs without being concerned with how these IOs are provided. Under such a scenario it would be good if:

* Expansion service(s) can be auto-discovered via the job service endpoint
* Available external transforms can be discovered via the expansion service(s)
* Dependencies for external transforms are part of the metadata returned by the expansion service

Dependencies could then be staged either by the SDK client or the expansion service (a rough sketch of such metadata follows below). The expansion service could provide the locations to stage to the SDK; it would still be transparent to the user.

I also agree with Luke regarding the environments. Docker is the choice for generic deployment. Other environments are used when the flexibility offered by Docker isn't needed (or gets in the way). Then the dependencies are provided in different ways. Whether these are Python packages or jar files, by opting out of Docker the decision is made to manage dependencies externally.

Thomas

On Thu, Apr 18, 2019 at 6:01 PM Chamikara Jayalath wrote: > > > On Thu, Apr 18, 2019 at 5:21 PM Chamikara Jayalath > wrote: > >> Thanks for raising the concern about credentials Ankur, I agree that this >> is a significant issue. >> >> On Thu, Apr 18, 2019 at 4:23 PM Lukasz Cwik wrote: >> >>> I can understand the concern about credentials, the same access concern >>> will exist for several cross language transforms (mostly IOs) since some >>> will need access to credentials to read/write to an external service. >>> >>> Are there any ideas on how credential propagation could work to these >>> IOs? >>> >> >> There are some cases where existing IO transforms need credentials to >> access remote resources, for example, size estimation, validation, etc. But >> usually these are optional (or transform can be configured to not perform >> these functions). >> > > To clarify, I'm only talking about transform expansion here. Many IO > transforms need read/write access to remote services at run time. So > probably we need to figure out a way to propagate these credentials anyways. > > >> >> >>> Can we use these mechanisms for staging? >>> >> >> I think we'll have to find a way to do one of (1) propagate credentials >> to other SDKs (2) allow users to configure SDK containers to have necessary >> credentials (3) do the artifact staging from the pipeline SDK environment >> which already have credentials. I prefer (1) or (2) since this will given a >> transform same feature set whether used directly (in the same SDK language >> as the transform) or remotely but it might be hard to do this for an >> arbitrary service that a transform might connect to considering the number >> of ways users can configure credentials (after an offline discussion with >> Ankur). >> >> >>> >>> >> >>> On Thu, Apr 18, 2019 at 3:47 PM Ankur Goenka wrote: >>> I agree that the Expansion service knows about the artifacts required for a cross language transform and having a prepackage folder/Zip for transforms based on language makes sense. One think to note here is that expansion service might not have the same access privilege as the pipeline author and hence might not be able to stage artifacts by itself.
Keeping this in mind I am leaning towards making Expansion service provide all the required artifacts to the user and let the user stage the artifacts as regular artifacts. At this time, we only have Beam File System based artifact staging which users local credentials to access different file systems. Even a docker based expansion service running on local machine might not have the same access privileges. In brief this is what I am leaning toward. User call for pipeline submission -> Expansion service provide cross language transforms and relevant artifacts to the Sdk -> Sdk Submits the pipeline to Jobserver and Stages user and cross language artifacts to artifacts staging service On Thu, Apr 18, 2019 at 2:33 PM Chamikara Jayalath < chamik...@google.com> wrote: > > > On Thu, Apr 18, 2019 at 2:12 PM Lukasz Cwik wrote: > >> Note that Max did ask whether making the expansion service do the >> staging made sense, and my first line was agreeing with that direction >> and >> expanding on how it could be done (so this is really Max's idea or from >> whomever he got the idea from). >> > > +1 to what Max said then :) > > >> >> I believe a lot of the value of the expansion service is not having >> users need to be aware of all the SDK specific dependencies when they are >> trying to create a pipeline,
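A rough illustration of the dependency-metadata idea above (hypothetical, not an existing Beam API): the expansion service could return, for each external transform, the list of artifacts it requires, and either the SDK client or the service itself could then stage them.

    import java.util.List;

    // Hypothetical shape only: metadata an expansion service could return
    // alongside the expanded transform -- a URN identifying the transform
    // plus the artifacts (e.g. jar locations) it needs at runtime.
    class ExternalTransformMetadata {
      final String urn;
      final List<String> dependencies;

      ExternalTransformMetadata(String urn, List<String> dependencies) {
        this.urn = urn;
        this.dependencies = dependencies;
      }
    }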
Re: investigating python precommit wordcount_it failure
I am working on a postcommit wordcount_it failure in BEAM-7063. On Thu, Apr 18, 2019 at 6:05 PM Udi Meiri wrote: > Correction: it's a postcommit failure > > On Thu, Apr 18, 2019 at 5:43 PM Udi Meiri wrote: > >> in https://issues.apache.org/jira/browse/BEAM-7111 >> >> If anyone has state please lmk >> >
Re: investigating python precommit wordcount_it failure
Correction: it's a postcommit failure On Thu, Apr 18, 2019 at 5:43 PM Udi Meiri wrote: > in https://issues.apache.org/jira/browse/BEAM-7111 > > If anyone has state please lmk >
Re: Artifact staging in cross-language pipelines
On Thu, Apr 18, 2019 at 5:21 PM Chamikara Jayalath wrote: > Thanks for raising the concern about credentials Ankur, I agree that this > is a significant issue. > > On Thu, Apr 18, 2019 at 4:23 PM Lukasz Cwik wrote: > >> I can understand the concern about credentials, the same access concern >> will exist for several cross language transforms (mostly IOs) since some >> will need access to credentials to read/write to an external service. >> >> Are there any ideas on how credential propagation could work to these IOs? >> > > There are some cases where existing IO transforms need credentials to > access remote resources, for example, size estimation, validation, etc. But > usually these are optional (or transform can be configured to not perform > these functions). > To clarify, I'm only talking about transform expansion here. Many IO transforms need read/write access to remote services at run time. So probably we need to figure out a way to propagate these credentials anyways. > > >> Can we use these mechanisms for staging? >> > > I think we'll have to find a way to do one of (1) propagate credentials to > other SDKs (2) allow users to configure SDK containers to have necessary > credentials (3) do the artifact staging from the pipeline SDK environment > which already have credentials. I prefer (1) or (2) since this will given a > transform same feature set whether used directly (in the same SDK language > as the transform) or remotely but it might be hard to do this for an > arbitrary service that a transform might connect to considering the number > of ways users can configure credentials (after an offline discussion with > Ankur). > > >> >> > >> On Thu, Apr 18, 2019 at 3:47 PM Ankur Goenka wrote: >> >>> I agree that the Expansion service knows about the artifacts required >>> for a cross language transform and having a prepackage folder/Zip for >>> transforms based on language makes sense. >>> >>> One think to note here is that expansion service might not have the same >>> access privilege as the pipeline author and hence might not be able to >>> stage artifacts by itself. >>> Keeping this in mind I am leaning towards making Expansion service >>> provide all the required artifacts to the user and let the user stage the >>> artifacts as regular artifacts. >>> At this time, we only have Beam File System based artifact staging which >>> users local credentials to access different file systems. Even a docker >>> based expansion service running on local machine might not have the same >>> access privileges. >>> >>> In brief this is what I am leaning toward. >>> User call for pipeline submission -> Expansion service provide cross >>> language transforms and relevant artifacts to the Sdk -> Sdk Submits the >>> pipeline to Jobserver and Stages user and cross language artifacts to >>> artifacts staging service >>> >>> >>> On Thu, Apr 18, 2019 at 2:33 PM Chamikara Jayalath >>> wrote: >>> On Thu, Apr 18, 2019 at 2:12 PM Lukasz Cwik wrote: > Note that Max did ask whether making the expansion service do the > staging made sense, and my first line was agreeing with that direction and > expanding on how it could be done (so this is really Max's idea or from > whomever he got the idea from). > +1 to what Max said then :) > > I believe a lot of the value of the expansion service is not having > users need to be aware of all the SDK specific dependencies when they are > trying to create a pipeline, only the "user" who is launching the > expansion > service may need to. 
And in that case we can have a prepackaged expansion > service application that does what most users would want (e.g. expansion > service as a docker container, a single bundled jar, ...). We (the Apache > Beam community) could choose to host a default implementation of the > expansion service as well. > I'm not against this. But I think this is a secondary more advanced use-case. For a Beam users that needs to use a Java transform that they already have in a Python pipeline, we should provide a way to allow starting up a expansion service (with dependencies needed for that) and running a pipeline that uses this external Java transform (with dependencies that are needed at runtime). Probably, it'll be enough to allow providing all dependencies when starting up the expansion service and allow expansion service to do the staging of jars are well. I don't see a need to include the list of jars in the ExpansionResponse sent to the Python SDK. > > On Thu, Apr 18, 2019 at 2:02 PM Chamikara Jayalath < > chamik...@google.com> wrote: > >> I think there are two kind of dependencies we have to consider. >> >> (1) Dependencies that are needed to expand the transform. >> >> These have to be provided when we start the expansion service so that >> available
investigating python precommit wordcount_it failure
in https://issues.apache.org/jira/browse/BEAM-7111 If anyone has state please lmk
Re: Artifact staging in cross-language pipelines
Thanks for raising the concern about credentials Ankur, I agree that this is a significant issue. On Thu, Apr 18, 2019 at 4:23 PM Lukasz Cwik wrote: > I can understand the concern about credentials, the same access concern > will exist for several cross language transforms (mostly IOs) since some > will need access to credentials to read/write to an external service. > > Are there any ideas on how credential propagation could work to these IOs? > There are some cases where existing IO transforms need credentials to access remote resources, for example, size estimation, validation, etc. But usually these are optional (or transform can be configured to not perform these functions). > Can we use these mechanisms for staging? > I think we'll have to find a way to do one of (1) propagate credentials to other SDKs (2) allow users to configure SDK containers to have necessary credentials (3) do the artifact staging from the pipeline SDK environment which already have credentials. I prefer (1) or (2) since this will given a transform same feature set whether used directly (in the same SDK language as the transform) or remotely but it might be hard to do this for an arbitrary service that a transform might connect to considering the number of ways users can configure credentials (after an offline discussion with Ankur). > > > On Thu, Apr 18, 2019 at 3:47 PM Ankur Goenka wrote: > >> I agree that the Expansion service knows about the artifacts required for >> a cross language transform and having a prepackage folder/Zip for >> transforms based on language makes sense. >> >> One think to note here is that expansion service might not have the same >> access privilege as the pipeline author and hence might not be able to >> stage artifacts by itself. >> Keeping this in mind I am leaning towards making Expansion service >> provide all the required artifacts to the user and let the user stage the >> artifacts as regular artifacts. >> At this time, we only have Beam File System based artifact staging which >> users local credentials to access different file systems. Even a docker >> based expansion service running on local machine might not have the same >> access privileges. >> >> In brief this is what I am leaning toward. >> User call for pipeline submission -> Expansion service provide cross >> language transforms and relevant artifacts to the Sdk -> Sdk Submits the >> pipeline to Jobserver and Stages user and cross language artifacts to >> artifacts staging service >> >> >> On Thu, Apr 18, 2019 at 2:33 PM Chamikara Jayalath >> wrote: >> >>> >>> >>> On Thu, Apr 18, 2019 at 2:12 PM Lukasz Cwik wrote: >>> Note that Max did ask whether making the expansion service do the staging made sense, and my first line was agreeing with that direction and expanding on how it could be done (so this is really Max's idea or from whomever he got the idea from). >>> >>> +1 to what Max said then :) >>> >>> I believe a lot of the value of the expansion service is not having users need to be aware of all the SDK specific dependencies when they are trying to create a pipeline, only the "user" who is launching the expansion service may need to. And in that case we can have a prepackaged expansion service application that does what most users would want (e.g. expansion service as a docker container, a single bundled jar, ...). We (the Apache Beam community) could choose to host a default implementation of the expansion service as well. >>> >>> I'm not against this. But I think this is a secondary more advanced >>> use-case. 
For a Beam users that needs to use a Java transform that they >>> already have in a Python pipeline, we should provide a way to allow >>> starting up a expansion service (with dependencies needed for that) and >>> running a pipeline that uses this external Java transform (with >>> dependencies that are needed at runtime). Probably, it'll be enough to >>> allow providing all dependencies when starting up the expansion service and >>> allow expansion service to do the staging of jars are well. I don't see a >>> need to include the list of jars in the ExpansionResponse sent to the >>> Python SDK. >>> >>> On Thu, Apr 18, 2019 at 2:02 PM Chamikara Jayalath < chamik...@google.com> wrote: > I think there are two kind of dependencies we have to consider. > > (1) Dependencies that are needed to expand the transform. > > These have to be provided when we start the expansion service so that > available external transforms are correctly registered with the expansion > service. > > (2) Dependencies that are not needed at expansion but may be needed at > runtime. > > I think in both cases, users have to provide these dependencies either > when expansion service is started or when a pipeline is being executed. > > Max, I'm not sure why expansion service will need to provide
Re: Artifact staging in cross-language pipelines
I can understand the concern about credentials, the same access concern will exist for several cross language transforms (mostly IOs) since some will need access to credentials to read/write to an external service. Are there any ideas on how credential propagation could work to these IOs? Can we use these mechanisms for staging? On Thu, Apr 18, 2019 at 3:47 PM Ankur Goenka wrote: > I agree that the Expansion service knows about the artifacts required for > a cross language transform and having a prepackage folder/Zip for > transforms based on language makes sense. > > One think to note here is that expansion service might not have the same > access privilege as the pipeline author and hence might not be able to > stage artifacts by itself. > Keeping this in mind I am leaning towards making Expansion service provide > all the required artifacts to the user and let the user stage the artifacts > as regular artifacts. > At this time, we only have Beam File System based artifact staging which > users local credentials to access different file systems. Even a docker > based expansion service running on local machine might not have the same > access privileges. > > In brief this is what I am leaning toward. > User call for pipeline submission -> Expansion service provide cross > language transforms and relevant artifacts to the Sdk -> Sdk Submits the > pipeline to Jobserver and Stages user and cross language artifacts to > artifacts staging service > > > On Thu, Apr 18, 2019 at 2:33 PM Chamikara Jayalath > wrote: > >> >> >> On Thu, Apr 18, 2019 at 2:12 PM Lukasz Cwik wrote: >> >>> Note that Max did ask whether making the expansion service do the >>> staging made sense, and my first line was agreeing with that direction and >>> expanding on how it could be done (so this is really Max's idea or from >>> whomever he got the idea from). >>> >> >> +1 to what Max said then :) >> >> >>> >>> I believe a lot of the value of the expansion service is not having >>> users need to be aware of all the SDK specific dependencies when they are >>> trying to create a pipeline, only the "user" who is launching the expansion >>> service may need to. And in that case we can have a prepackaged expansion >>> service application that does what most users would want (e.g. expansion >>> service as a docker container, a single bundled jar, ...). We (the Apache >>> Beam community) could choose to host a default implementation of the >>> expansion service as well. >>> >> >> I'm not against this. But I think this is a secondary more advanced >> use-case. For a Beam users that needs to use a Java transform that they >> already have in a Python pipeline, we should provide a way to allow >> starting up a expansion service (with dependencies needed for that) and >> running a pipeline that uses this external Java transform (with >> dependencies that are needed at runtime). Probably, it'll be enough to >> allow providing all dependencies when starting up the expansion service and >> allow expansion service to do the staging of jars are well. I don't see a >> need to include the list of jars in the ExpansionResponse sent to the >> Python SDK. >> >> >>> >>> On Thu, Apr 18, 2019 at 2:02 PM Chamikara Jayalath >>> wrote: >>> I think there are two kind of dependencies we have to consider. (1) Dependencies that are needed to expand the transform. These have to be provided when we start the expansion service so that available external transforms are correctly registered with the expansion service. 
(2) Dependencies that are not needed at expansion but may be needed at runtime. I think in both cases, users have to provide these dependencies either when expansion service is started or when a pipeline is being executed. Max, I'm not sure why expansion service will need to provide dependencies to the user since user will already be aware of these. Are you talking about a expansion service that is readily available that will be used by many Beam users ? I think such a (possibly long running) service will have to maintain a repository of transforms and should have mechanism for registering new transforms and discovering already registered transforms etc. I think there's more design work needed to make transform expansion service support such use-cases. Currently, I think allowing pipeline author to provide the jars when starting the expansion service and when executing the pipeline will be adequate. Regarding the entity that will perform the staging, I like Luke's idea of allowing expansion service to do the staging (of jars provided by the user). Notion of artifacts and how they are extracted/represented is SDK dependent. So if the pipeline SDK tries to do this we have to add n x (n -1) configurations (for n SDKs). - Cham On Thu, Apr 18, 2019 at 11:45 AM Lukasz Cwik wrote: >
Re: Artifact staging in cross-language pipelines
I agree that the Expansion service knows about the artifacts required for a cross-language transform and that having a prepackaged folder/Zip for transforms based on language makes sense.

One thing to note here is that the expansion service might not have the same access privileges as the pipeline author and hence might not be able to stage artifacts by itself. Keeping this in mind, I am leaning towards making the Expansion service provide all the required artifacts to the user and letting the user stage the artifacts as regular artifacts. At this time, we only have Beam File System based artifact staging, which uses local credentials to access different file systems. Even a docker based expansion service running on a local machine might not have the same access privileges.

In brief, this is what I am leaning toward:
User call for pipeline submission -> Expansion service provides cross-language transforms and relevant artifacts to the SDK -> SDK submits the pipeline to the Job server and stages user and cross-language artifacts to the artifact staging service

On Thu, Apr 18, 2019 at 2:33 PM Chamikara Jayalath wrote: > > > On Thu, Apr 18, 2019 at 2:12 PM Lukasz Cwik wrote: > >> Note that Max did ask whether making the expansion service do the staging >> made sense, and my first line was agreeing with that direction and >> expanding on how it could be done (so this is really Max's idea or from >> whomever he got the idea from). >> > > +1 to what Max said then :) > > >> >> I believe a lot of the value of the expansion service is not having users >> need to be aware of all the SDK specific dependencies when they are trying >> to create a pipeline, only the "user" who is launching the expansion >> service may need to. And in that case we can have a prepackaged expansion >> service application that does what most users would want (e.g. expansion >> service as a docker container, a single bundled jar, ...). We (the Apache >> Beam community) could choose to host a default implementation of the >> expansion service as well. >> > > I'm not against this. But I think this is a secondary more advanced > use-case. For a Beam users that needs to use a Java transform that they > already have in a Python pipeline, we should provide a way to allow > starting up a expansion service (with dependencies needed for that) and > running a pipeline that uses this external Java transform (with > dependencies that are needed at runtime). Probably, it'll be enough to > allow providing all dependencies when starting up the expansion service and > allow expansion service to do the staging of jars are well. I don't see a > need to include the list of jars in the ExpansionResponse sent to the > Python SDK. > > >> >> On Thu, Apr 18, 2019 at 2:02 PM Chamikara Jayalath >> wrote: >> >>> I think there are two kind of dependencies we have to consider. >>> >>> (1) Dependencies that are needed to expand the transform. >>> >>> These have to be provided when we start the expansion service so that >>> available external transforms are correctly registered with the expansion >>> service. >>> >>> (2) Dependencies that are not needed at expansion but may be needed at >>> runtime. >>> >>> I think in both cases, users have to provide these dependencies either >>> when expansion service is started or when a pipeline is being executed. >>> >>> Max, I'm not sure why expansion service will need to provide >>> dependencies to the user since user will already be aware of these.
Are you >>> talking about a expansion service that is readily available that will be >>> used by many Beam users ? I think such a (possibly long running) service >>> will have to maintain a repository of transforms and should have mechanism >>> for registering new transforms and discovering already registered >>> transforms etc. I think there's more design work needed to make transform >>> expansion service support such use-cases. Currently, I think allowing >>> pipeline author to provide the jars when starting the expansion service and >>> when executing the pipeline will be adequate. >>> >>> Regarding the entity that will perform the staging, I like Luke's idea >>> of allowing expansion service to do the staging (of jars provided by the >>> user). Notion of artifacts and how they are extracted/represented is SDK >>> dependent. So if the pipeline SDK tries to do this we have to add n x (n >>> -1) configurations (for n SDKs). >>> >>> - Cham >>> >>> On Thu, Apr 18, 2019 at 11:45 AM Lukasz Cwik wrote: >>> We can expose the artifact staging endpoint and artifact token to allow the expansion service to upload any resources its environment may need. For example, the expansion service for the Beam Java SDK would be able to upload jars. In the "docker" environment, the Apache Beam Java SDK harness container would fetch the relevant artifacts for itself and be able to execute the pipeline. (Note that a docker environment could skip all this artifact staging if the docker
Re: Artifact staging in cross-language pipelines
On Thu, Apr 18, 2019 at 2:12 PM Lukasz Cwik wrote: > Note that Max did ask whether making the expansion service do the staging > made sense, and my first line was agreeing with that direction and > expanding on how it could be done (so this is really Max's idea or from > whomever he got the idea from). > +1 to what Max said then :) > > I believe a lot of the value of the expansion service is not having users > need to be aware of all the SDK specific dependencies when they are trying > to create a pipeline, only the "user" who is launching the expansion > service may need to. And in that case we can have a prepackaged expansion > service application that does what most users would want (e.g. expansion > service as a docker container, a single bundled jar, ...). We (the Apache > Beam community) could choose to host a default implementation of the > expansion service as well. > I'm not against this. But I think this is a secondary more advanced use-case. For a Beam users that needs to use a Java transform that they already have in a Python pipeline, we should provide a way to allow starting up a expansion service (with dependencies needed for that) and running a pipeline that uses this external Java transform (with dependencies that are needed at runtime). Probably, it'll be enough to allow providing all dependencies when starting up the expansion service and allow expansion service to do the staging of jars are well. I don't see a need to include the list of jars in the ExpansionResponse sent to the Python SDK. > > On Thu, Apr 18, 2019 at 2:02 PM Chamikara Jayalath > wrote: > >> I think there are two kind of dependencies we have to consider. >> >> (1) Dependencies that are needed to expand the transform. >> >> These have to be provided when we start the expansion service so that >> available external transforms are correctly registered with the expansion >> service. >> >> (2) Dependencies that are not needed at expansion but may be needed at >> runtime. >> >> I think in both cases, users have to provide these dependencies either >> when expansion service is started or when a pipeline is being executed. >> >> Max, I'm not sure why expansion service will need to provide dependencies >> to the user since user will already be aware of these. Are you talking >> about a expansion service that is readily available that will be used by >> many Beam users ? I think such a (possibly long running) service will have >> to maintain a repository of transforms and should have mechanism for >> registering new transforms and discovering already registered transforms >> etc. I think there's more design work needed to make transform expansion >> service support such use-cases. Currently, I think allowing pipeline author >> to provide the jars when starting the expansion service and when executing >> the pipeline will be adequate. >> >> Regarding the entity that will perform the staging, I like Luke's idea of >> allowing expansion service to do the staging (of jars provided by the >> user). Notion of artifacts and how they are extracted/represented is SDK >> dependent. So if the pipeline SDK tries to do this we have to add n x (n >> -1) configurations (for n SDKs). >> >> - Cham >> >> On Thu, Apr 18, 2019 at 11:45 AM Lukasz Cwik wrote: >> >>> We can expose the artifact staging endpoint and artifact token to allow >>> the expansion service to upload any resources its environment may need. For >>> example, the expansion service for the Beam Java SDK would be able to >>> upload jars. 
>>> >>> In the "docker" environment, the Apache Beam Java SDK harness container >>> would fetch the relevant artifacts for itself and be able to execute the >>> pipeline. (Note that a docker environment could skip all this artifact >>> staging if the docker environment contained all necessary artifacts). >>> >>> For the existing "external" environment, it should already come with all >>> the resources prepackaged wherever "external" points to. The "process" >>> based environment could choose to use the artifact staging service to fetch >>> those resources associated with its process or it could follow the same >>> pattern that "external" would do and already contain all the prepackaged >>> resources. Note that both "external" and "process" will require the >>> instance of the expansion service to be specialized for those environments >>> which is why the default should for the expansion service to be the >>> "docker" environment. >>> >>> Note that a major reason for going with docker containers as the >>> environment that all runners should support is that containers provides a >>> solution for this exact issue. Both the "process" and "external" >>> environments are explicitly limiting and expanding their capabilities will >>> quickly have us building something like a docker container because we'll >>> quickly find ourselves solving the same problems that docker containers >>> provide (resources, file layout, permissions, ...) >>> >>> >>>
Re: Artifact staging in cross-language pipelines
Note that Max did ask whether making the expansion service do the staging made sense, and my first line was agreeing with that direction and expanding on how it could be done (so this is really Max's idea or from whomever he got the idea from). I believe a lot of the value of the expansion service is not having users need to be aware of all the SDK specific dependencies when they are trying to create a pipeline, only the "user" who is launching the expansion service may need to. And in that case we can have a prepackaged expansion service application that does what most users would want (e.g. expansion service as a docker container, a single bundled jar, ...). We (the Apache Beam community) could choose to host a default implementation of the expansion service as well. On Thu, Apr 18, 2019 at 2:02 PM Chamikara Jayalath wrote: > I think there are two kind of dependencies we have to consider. > > (1) Dependencies that are needed to expand the transform. > > These have to be provided when we start the expansion service so that > available external transforms are correctly registered with the expansion > service. > > (2) Dependencies that are not needed at expansion but may be needed at > runtime. > > I think in both cases, users have to provide these dependencies either > when expansion service is started or when a pipeline is being executed. > > Max, I'm not sure why expansion service will need to provide dependencies > to the user since user will already be aware of these. Are you talking > about a expansion service that is readily available that will be used by > many Beam users ? I think such a (possibly long running) service will have > to maintain a repository of transforms and should have mechanism for > registering new transforms and discovering already registered transforms > etc. I think there's more design work needed to make transform expansion > service support such use-cases. Currently, I think allowing pipeline author > to provide the jars when starting the expansion service and when executing > the pipeline will be adequate. > > Regarding the entity that will perform the staging, I like Luke's idea of > allowing expansion service to do the staging (of jars provided by the > user). Notion of artifacts and how they are extracted/represented is SDK > dependent. So if the pipeline SDK tries to do this we have to add n x (n > -1) configurations (for n SDKs). > > - Cham > > On Thu, Apr 18, 2019 at 11:45 AM Lukasz Cwik wrote: > >> We can expose the artifact staging endpoint and artifact token to allow >> the expansion service to upload any resources its environment may need. For >> example, the expansion service for the Beam Java SDK would be able to >> upload jars. >> >> In the "docker" environment, the Apache Beam Java SDK harness container >> would fetch the relevant artifacts for itself and be able to execute the >> pipeline. (Note that a docker environment could skip all this artifact >> staging if the docker environment contained all necessary artifacts). >> >> For the existing "external" environment, it should already come with all >> the resources prepackaged wherever "external" points to. The "process" >> based environment could choose to use the artifact staging service to fetch >> those resources associated with its process or it could follow the same >> pattern that "external" would do and already contain all the prepackaged >> resources. 
Note that both "external" and "process" will require the >> instance of the expansion service to be specialized for those environments >> which is why the default should for the expansion service to be the >> "docker" environment. >> >> Note that a major reason for going with docker containers as the >> environment that all runners should support is that containers provides a >> solution for this exact issue. Both the "process" and "external" >> environments are explicitly limiting and expanding their capabilities will >> quickly have us building something like a docker container because we'll >> quickly find ourselves solving the same problems that docker containers >> provide (resources, file layout, permissions, ...) >> >> >> >> >> On Thu, Apr 18, 2019 at 11:21 AM Maximilian Michels >> wrote: >> >>> Hi everyone, >>> >>> We have previously merged support for configuring transforms across >>> languages. Please see Cham's summary on the discussion [1]. There is >>> also a design document [2]. >>> >>> Subsequently, we've added wrappers for cross-language transforms to the >>> Python SDK, i.e. GenerateSequence, ReadFromKafka, and there is a pending >>> PR [1] for WriteToKafka. All of them utilize Java transforms via >>> cross-language configuration. >>> >>> That is all pretty exciting :) >>> >>> We still have some issues to solve, one being how to stage artifact from >>> a foreign environment. When we run external transforms which are part of >>> Beam's core (e.g. GenerateSequence), we have them available in the SDK >>> Harness. However, when they
Re: Artifact staging in cross-language pipelines
I think there are two kind of dependencies we have to consider. (1) Dependencies that are needed to expand the transform. These have to be provided when we start the expansion service so that available external transforms are correctly registered with the expansion service. (2) Dependencies that are not needed at expansion but may be needed at runtime. I think in both cases, users have to provide these dependencies either when expansion service is started or when a pipeline is being executed. Max, I'm not sure why expansion service will need to provide dependencies to the user since user will already be aware of these. Are you talking about a expansion service that is readily available that will be used by many Beam users ? I think such a (possibly long running) service will have to maintain a repository of transforms and should have mechanism for registering new transforms and discovering already registered transforms etc. I think there's more design work needed to make transform expansion service support such use-cases. Currently, I think allowing pipeline author to provide the jars when starting the expansion service and when executing the pipeline will be adequate. Regarding the entity that will perform the staging, I like Luke's idea of allowing expansion service to do the staging (of jars provided by the user). Notion of artifacts and how they are extracted/represented is SDK dependent. So if the pipeline SDK tries to do this we have to add n x (n -1) configurations (for n SDKs). - Cham On Thu, Apr 18, 2019 at 11:45 AM Lukasz Cwik wrote: > We can expose the artifact staging endpoint and artifact token to allow > the expansion service to upload any resources its environment may need. For > example, the expansion service for the Beam Java SDK would be able to > upload jars. > > In the "docker" environment, the Apache Beam Java SDK harness container > would fetch the relevant artifacts for itself and be able to execute the > pipeline. (Note that a docker environment could skip all this artifact > staging if the docker environment contained all necessary artifacts). > > For the existing "external" environment, it should already come with all > the resources prepackaged wherever "external" points to. The "process" > based environment could choose to use the artifact staging service to fetch > those resources associated with its process or it could follow the same > pattern that "external" would do and already contain all the prepackaged > resources. Note that both "external" and "process" will require the > instance of the expansion service to be specialized for those environments > which is why the default should for the expansion service to be the > "docker" environment. > > Note that a major reason for going with docker containers as the > environment that all runners should support is that containers provides a > solution for this exact issue. Both the "process" and "external" > environments are explicitly limiting and expanding their capabilities will > quickly have us building something like a docker container because we'll > quickly find ourselves solving the same problems that docker containers > provide (resources, file layout, permissions, ...) > > > > > On Thu, Apr 18, 2019 at 11:21 AM Maximilian Michels > wrote: > >> Hi everyone, >> >> We have previously merged support for configuring transforms across >> languages. Please see Cham's summary on the discussion [1]. There is >> also a design document [2]. 
>> >> Subsequently, we've added wrappers for cross-language transforms to the >> Python SDK, i.e. GenerateSequence, ReadFromKafka, and there is a pending >> PR [1] for WriteToKafka. All of them utilize Java transforms via >> cross-language configuration. >> >> That is all pretty exciting :) >> >> We still have some issues to solve, one being how to stage artifact from >> a foreign environment. When we run external transforms which are part of >> Beam's core (e.g. GenerateSequence), we have them available in the SDK >> Harness. However, when they are not (e.g. KafkaIO) we need to stage the >> necessary files. >> >> For my PR [3] I've naively added ":beam-sdks-java-io-kafka" to the SDK >> Harness which caused dependency problems [4]. Those could be resolved >> but the bigger question is how to stage artifacts for external >> transforms programmatically? >> >> Heejong has solved this by adding a "--jar_package" option to the Python >> SDK to stage Java files [5]. I think that is a better solution than >> adding required Jars to the SDK Harness directly, but it is not very >> convenient for users. >> >> I've discussed this today with Thomas and we both figured that the >> expansion service needs to provide a list of required Jars with the >> ExpansionResponse it provides. It's not entirely clear, how we determine >> which artifacts are necessary for an external transform. We could just >> dump the entire classpath like we do in PipelineResources for Java >> pipelines. This provides many
Hazelcast Jet Runner
Hi. We at Hazelcast Jet have been working for a while now to implement a Java Beam Runner (non-portable) based on Hazelcast Jet ( https://jet.hazelcast.org/). The process is still ongoing ( https://github.com/hazelcast/hazelcast-jet-beam-runner), but we are aiming for a fully functional, reliable Runner which can proudly join the Capability Matrix. For that purpose I would like to ask what’s your process of validating runners? We are already running the @ValidatesRunner tests and the Nexmark test suite, but beyond that what other steps do we need to take to get our Runner to the level it needs to be at?
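For context on the @ValidatesRunner suites mentioned above: these are the SDK's runner-agnostic tests selected by the JUnit category org.apache.beam.sdk.testing.ValidatesRunner and executed against the runner under test. A minimal sketch of the shape of such a test, assuming the standard Beam Java testing utilities (it is not taken from the Jet runner itself):

    import org.apache.beam.sdk.testing.PAssert;
    import org.apache.beam.sdk.testing.TestPipeline;
    import org.apache.beam.sdk.testing.ValidatesRunner;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.values.PCollection;
    import org.junit.Rule;
    import org.junit.Test;
    import org.junit.experimental.categories.Category;

    public class ExampleValidatesRunnerTest {
      // TestPipeline picks up the runner under test from the test pipeline options.
      @Rule public final transient TestPipeline p = TestPipeline.create();

      @Test
      @Category(ValidatesRunner.class)
      public void testCreateAndAssert() {
        PCollection<Integer> out = p.apply(Create.of(1, 2, 3));
        PAssert.that(out).containsInAnyOrder(1, 2, 3);
        p.run().waitUntilFinish();
      }
    }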
Re: Artifact staging in cross-language pipelines
We can expose the artifact staging endpoint and artifact token to allow the expansion service to upload any resources its environment may need. For example, the expansion service for the Beam Java SDK would be able to upload jars. In the "docker" environment, the Apache Beam Java SDK harness container would fetch the relevant artifacts for itself and be able to execute the pipeline. (Note that a docker environment could skip all this artifact staging if the docker environment contained all necessary artifacts). For the existing "external" environment, it should already come with all the resources prepackaged wherever "external" points to. The "process" based environment could choose to use the artifact staging service to fetch those resources associated with its process or it could follow the same pattern that "external" would do and already contain all the prepackaged resources. Note that both "external" and "process" will require the instance of the expansion service to be specialized for those environments which is why the default should for the expansion service to be the "docker" environment. Note that a major reason for going with docker containers as the environment that all runners should support is that containers provides a solution for this exact issue. Both the "process" and "external" environments are explicitly limiting and expanding their capabilities will quickly have us building something like a docker container because we'll quickly find ourselves solving the same problems that docker containers provide (resources, file layout, permissions, ...) On Thu, Apr 18, 2019 at 11:21 AM Maximilian Michels wrote: > Hi everyone, > > We have previously merged support for configuring transforms across > languages. Please see Cham's summary on the discussion [1]. There is > also a design document [2]. > > Subsequently, we've added wrappers for cross-language transforms to the > Python SDK, i.e. GenerateSequence, ReadFromKafka, and there is a pending > PR [1] for WriteToKafka. All of them utilize Java transforms via > cross-language configuration. > > That is all pretty exciting :) > > We still have some issues to solve, one being how to stage artifact from > a foreign environment. When we run external transforms which are part of > Beam's core (e.g. GenerateSequence), we have them available in the SDK > Harness. However, when they are not (e.g. KafkaIO) we need to stage the > necessary files. > > For my PR [3] I've naively added ":beam-sdks-java-io-kafka" to the SDK > Harness which caused dependency problems [4]. Those could be resolved > but the bigger question is how to stage artifacts for external > transforms programmatically? > > Heejong has solved this by adding a "--jar_package" option to the Python > SDK to stage Java files [5]. I think that is a better solution than > adding required Jars to the SDK Harness directly, but it is not very > convenient for users. > > I've discussed this today with Thomas and we both figured that the > expansion service needs to provide a list of required Jars with the > ExpansionResponse it provides. It's not entirely clear, how we determine > which artifacts are necessary for an external transform. We could just > dump the entire classpath like we do in PipelineResources for Java > pipelines. This provides many unneeded classes but would work. > > Do you think it makes sense for the expansion service to provide the > artifacts? Perhaps you have a better idea how to resolve the staging > problem in cross-language pipelines? 
> > Thanks, > Max > > [1] > > https://lists.apache.org/thread.html/b99ba8527422e31ec7bb7ad9dc3a6583551ea392ebdc5527b5fb4a67@%3Cdev.beam.apache.org%3E > > [2] https://s.apache.org/beam-cross-language-io > > [3] https://github.com/apache/beam/pull/8322#discussion_r276336748 > > [4] Dependency graph for beam-runners-direct-java: > > beam-runners-direct-java -> sdks-java-harness -> beam-sdks-java-io-kafka > -> beam-runners-direct-java ... the cycle continues > > Beam-runners-direct-java depends on sdks-java-harness due > to the infamous Universal Local Runner. Beam-sdks-java-io-kafka depends > on beam-runners-direct-java for running tests. > > [5] https://github.com/apache/beam/pull/8340 >
Artifact staging in cross-language pipelines
Hi everyone, We have previously merged support for configuring transforms across languages. Please see Cham's summary on the discussion [1]. There is also a design document [2]. Subsequently, we've added wrappers for cross-language transforms to the Python SDK, i.e. GenerateSequence, ReadFromKafka, and there is a pending PR [1] for WriteToKafka. All of them utilize Java transforms via cross-language configuration. That is all pretty exciting :) We still have some issues to solve, one being how to stage artifact from a foreign environment. When we run external transforms which are part of Beam's core (e.g. GenerateSequence), we have them available in the SDK Harness. However, when they are not (e.g. KafkaIO) we need to stage the necessary files. For my PR [3] I've naively added ":beam-sdks-java-io-kafka" to the SDK Harness which caused dependency problems [4]. Those could be resolved but the bigger question is how to stage artifacts for external transforms programmatically? Heejong has solved this by adding a "--jar_package" option to the Python SDK to stage Java files [5]. I think that is a better solution than adding required Jars to the SDK Harness directly, but it is not very convenient for users. I've discussed this today with Thomas and we both figured that the expansion service needs to provide a list of required Jars with the ExpansionResponse it provides. It's not entirely clear, how we determine which artifacts are necessary for an external transform. We could just dump the entire classpath like we do in PipelineResources for Java pipelines. This provides many unneeded classes but would work. Do you think it makes sense for the expansion service to provide the artifacts? Perhaps you have a better idea how to resolve the staging problem in cross-language pipelines? Thanks, Max [1] https://lists.apache.org/thread.html/b99ba8527422e31ec7bb7ad9dc3a6583551ea392ebdc5527b5fb4a67@%3Cdev.beam.apache.org%3E [2] https://s.apache.org/beam-cross-language-io [3] https://github.com/apache/beam/pull/8322#discussion_r276336748 [4] Dependency graph for beam-runners-direct-java: beam-runners-direct-java -> sdks-java-harness -> beam-sdks-java-io-kafka -> beam-runners-direct-java ... the cycle continues Beam-runners-direct-java depends on sdks-java-harness due to the infamous Universal Local Runner. Beam-sdks-java-io-kafka depends on beam-runners-direct-java for running tests. [5] https://github.com/apache/beam/pull/8340
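As a reference point for the "dump the entire classpath" option mentioned above, here is a minimal sketch of how an expansion service could enumerate candidate jars to report or stage, analogous in spirit to what PipelineResources does for Java pipelines (the class and method names are made up for illustration):

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;

    class ClasspathArtifacts {
      // Collects the jar entries on the JVM classpath. A real implementation
      // would filter this down to what the expanded transform actually needs.
      static List<String> candidateJars() {
        List<String> jars = new ArrayList<>();
        for (String entry :
            System.getProperty("java.class.path").split(File.pathSeparator)) {
          if (entry.endsWith(".jar")) {
            jars.add(new File(entry).getAbsolutePath());
          }
        }
        return jars;
      }
    }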
Re: SNAPSHOTS have not been updated since february
The origin build nodes were updated in Jan 24 and the nexus credentials were removed from the filesystem because they are not supposed to be on external build nodes (nodes Infra does not own). We now need to set up the role account on the new Beam JNLP nodes. I am still contacting Infra to bring the snapshot back. Yifan On Thu, Apr 18, 2019 at 10:09 AM Lukasz Cwik wrote: > The permissions issue is that the credentials needed to publish to the > maven repository are only deployed on machines managed by Apache Infra. Now > that the machines have been given back to each project to manage Yifan was > investigating some other way to get the permissions on to the machine. > > On Thu, Apr 18, 2019 at 10:06 AM Boyuan Zhang wrote: > >> There is a test target >> https://builds.apache.org/job/beam_Release_NightlySnapshot/ in beam, >> which builds and pushes snapshot to maven every day. Current failure is >> like, the jenkin machine cannot publish artifacts into maven owing to some >> weird permission issue. I think +Yifan Zou is >> working on it actively. >> >> On Thu, Apr 18, 2019 at 9:44 AM Ismaël Mejía wrote: >> >>> And is there a way we can detect SNAPSHOTS not been published daily in >>> the future? >>> >>> On Thu, Apr 18, 2019 at 6:37 PM Ismaël Mejía wrote: >>> > >>> > Any progress on this? >>> > >>> > On Wed, Mar 27, 2019 at 5:38 AM Daniel Oliveira < >>> danolive...@google.com> wrote: >>> > > >>> > > I made a bug for this specific issue (artifacts not publishing to >>> the Apache Maven repo): https://issues.apache.org/jira/browse/BEAM-6919 >>> > > >>> > > While I was gathering info for the bug report I also noticed +Yifan >>> Zou has an experimental PR testing a fix: >>> https://github.com/apache/beam/pull/8148 >>> > > >>> > > On Tue, Mar 26, 2019 at 11:42 AM Boyuan Zhang >>> wrote: >>> > >> >>> > >> +Daniel Oliveira >>> > >> >>> > >> On Tue, Mar 26, 2019 at 9:57 AM Boyuan Zhang >>> wrote: >>> > >>> >>> > >>> Sorry for the typo. Ideally, the snapshot publish is independent >>> from postrelease_snapshot. >>> > >>> >>> > >>> On Tue, Mar 26, 2019 at 9:55 AM Boyuan Zhang >>> wrote: >>> > >>> > Hey, >>> > >>> > I'm trying to publish the artifacts by commenting "Run Gradle >>> Publish" in my PR, but there are several errors saying "cannot write >>> artifacts into dir", anyone has idea on it? Ideally, the snapshot publish >>> is dependent from postrelease_snapshot. The publish task is to build and >>> publish artifacts and the postrelease_snapshot is to verify whether the >>> snapshot works. >>> > >>> > On Tue, Mar 26, 2019 at 8:45 AM Ahmet Altay >>> wrote: >>> > > >>> > > I believe this is related to >>> https://issues.apache.org/jira/browse/BEAM-6840 and +Boyuan Zhang has a >>> fix in progress https://github.com/apache/beam/pull/8132 >>> > > >>> > > On Tue, Mar 26, 2019 at 7:09 AM Ismaël Mejía >>> wrote: >>> > >> >>> > >> I was trying to validate a fix on the Spark runner and realized >>> that >>> > >> Beam SNAPSHOTS have not been updated since February 24 ! >>> > >> >>> > >> >>> https://repository.apache.org/content/repositories/snapshots/org/apache/beam/beam-sdks-java-core/2.12.0-SNAPSHOT/ >>> > >> >>> > >> Can somebody please take a look at why this is not been updated? >>> > >> >>> > >> Thanks, >>> > >> Ismaël >>> >>
Re: SNAPSHOTS have not been updated since february
The permissions issue is that the credentials needed to publish to the maven repository are only deployed on machines managed by Apache Infra. Now that the machines have been given back to each project to manage Yifan was investigating some other way to get the permissions on to the machine. On Thu, Apr 18, 2019 at 10:06 AM Boyuan Zhang wrote: > There is a test target > https://builds.apache.org/job/beam_Release_NightlySnapshot/ in beam, > which builds and pushes snapshot to maven every day. Current failure is > like, the jenkin machine cannot publish artifacts into maven owing to some > weird permission issue. I think +Yifan Zou is > working on it actively. > > On Thu, Apr 18, 2019 at 9:44 AM Ismaël Mejía wrote: > >> And is there a way we can detect SNAPSHOTS not been published daily in >> the future? >> >> On Thu, Apr 18, 2019 at 6:37 PM Ismaël Mejía wrote: >> > >> > Any progress on this? >> > >> > On Wed, Mar 27, 2019 at 5:38 AM Daniel Oliveira >> wrote: >> > > >> > > I made a bug for this specific issue (artifacts not publishing to the >> Apache Maven repo): https://issues.apache.org/jira/browse/BEAM-6919 >> > > >> > > While I was gathering info for the bug report I also noticed +Yifan >> Zou has an experimental PR testing a fix: >> https://github.com/apache/beam/pull/8148 >> > > >> > > On Tue, Mar 26, 2019 at 11:42 AM Boyuan Zhang >> wrote: >> > >> >> > >> +Daniel Oliveira >> > >> >> > >> On Tue, Mar 26, 2019 at 9:57 AM Boyuan Zhang >> wrote: >> > >>> >> > >>> Sorry for the typo. Ideally, the snapshot publish is independent >> from postrelease_snapshot. >> > >>> >> > >>> On Tue, Mar 26, 2019 at 9:55 AM Boyuan Zhang >> wrote: >> > >> > Hey, >> > >> > I'm trying to publish the artifacts by commenting "Run Gradle >> Publish" in my PR, but there are several errors saying "cannot write >> artifacts into dir", anyone has idea on it? Ideally, the snapshot publish >> is dependent from postrelease_snapshot. The publish task is to build and >> publish artifacts and the postrelease_snapshot is to verify whether the >> snapshot works. >> > >> > On Tue, Mar 26, 2019 at 8:45 AM Ahmet Altay >> wrote: >> > > >> > > I believe this is related to >> https://issues.apache.org/jira/browse/BEAM-6840 and +Boyuan Zhang has a >> fix in progress https://github.com/apache/beam/pull/8132 >> > > >> > > On Tue, Mar 26, 2019 at 7:09 AM Ismaël Mejía >> wrote: >> > >> >> > >> I was trying to validate a fix on the Spark runner and realized >> that >> > >> Beam SNAPSHOTS have not been updated since February 24 ! >> > >> >> > >> >> https://repository.apache.org/content/repositories/snapshots/org/apache/beam/beam-sdks-java-core/2.12.0-SNAPSHOT/ >> > >> >> > >> Can somebody please take a look at why this is not been updated? >> > >> >> > >> Thanks, >> > >> Ismaël >> >
Re: SNAPSHOTS have not been updated since february
There is a test target https://builds.apache.org/job/beam_Release_NightlySnapshot/ in beam, which builds and pushes snapshot to maven every day. Current failure is like, the jenkin machine cannot publish artifacts into maven owing to some weird permission issue. I think +Yifan Zou is working on it actively. On Thu, Apr 18, 2019 at 9:44 AM Ismaël Mejía wrote: > And is there a way we can detect SNAPSHOTS not been published daily in > the future? > > On Thu, Apr 18, 2019 at 6:37 PM Ismaël Mejía wrote: > > > > Any progress on this? > > > > On Wed, Mar 27, 2019 at 5:38 AM Daniel Oliveira > wrote: > > > > > > I made a bug for this specific issue (artifacts not publishing to the > Apache Maven repo): https://issues.apache.org/jira/browse/BEAM-6919 > > > > > > While I was gathering info for the bug report I also noticed +Yifan > Zou has an experimental PR testing a fix: > https://github.com/apache/beam/pull/8148 > > > > > > On Tue, Mar 26, 2019 at 11:42 AM Boyuan Zhang > wrote: > > >> > > >> +Daniel Oliveira > > >> > > >> On Tue, Mar 26, 2019 at 9:57 AM Boyuan Zhang > wrote: > > >>> > > >>> Sorry for the typo. Ideally, the snapshot publish is independent > from postrelease_snapshot. > > >>> > > >>> On Tue, Mar 26, 2019 at 9:55 AM Boyuan Zhang > wrote: > > > > Hey, > > > > I'm trying to publish the artifacts by commenting "Run Gradle > Publish" in my PR, but there are several errors saying "cannot write > artifacts into dir", anyone has idea on it? Ideally, the snapshot publish > is dependent from postrelease_snapshot. The publish task is to build and > publish artifacts and the postrelease_snapshot is to verify whether the > snapshot works. > > > > On Tue, Mar 26, 2019 at 8:45 AM Ahmet Altay > wrote: > > > > > > I believe this is related to > https://issues.apache.org/jira/browse/BEAM-6840 and +Boyuan Zhang has a > fix in progress https://github.com/apache/beam/pull/8132 > > > > > > On Tue, Mar 26, 2019 at 7:09 AM Ismaël Mejía > wrote: > > >> > > >> I was trying to validate a fix on the Spark runner and realized > that > > >> Beam SNAPSHOTS have not been updated since February 24 ! > > >> > > >> > https://repository.apache.org/content/repositories/snapshots/org/apache/beam/beam-sdks-java-core/2.12.0-SNAPSHOT/ > > >> > > >> Can somebody please take a look at why this is not been updated? > > >> > > >> Thanks, > > >> Ismaël >
Re: SNAPSHOTS have not been updated since february
And is there a way we can detect SNAPSHOTS not been published daily in the future? On Thu, Apr 18, 2019 at 6:37 PM Ismaël Mejía wrote: > > Any progress on this? > > On Wed, Mar 27, 2019 at 5:38 AM Daniel Oliveira > wrote: > > > > I made a bug for this specific issue (artifacts not publishing to the > > Apache Maven repo): https://issues.apache.org/jira/browse/BEAM-6919 > > > > While I was gathering info for the bug report I also noticed +Yifan Zou has > > an experimental PR testing a fix: https://github.com/apache/beam/pull/8148 > > > > On Tue, Mar 26, 2019 at 11:42 AM Boyuan Zhang wrote: > >> > >> +Daniel Oliveira > >> > >> On Tue, Mar 26, 2019 at 9:57 AM Boyuan Zhang wrote: > >>> > >>> Sorry for the typo. Ideally, the snapshot publish is independent from > >>> postrelease_snapshot. > >>> > >>> On Tue, Mar 26, 2019 at 9:55 AM Boyuan Zhang wrote: > > Hey, > > I'm trying to publish the artifacts by commenting "Run Gradle Publish" > in my PR, but there are several errors saying "cannot write artifacts > into dir", anyone has idea on it? Ideally, the snapshot publish is > dependent from postrelease_snapshot. The publish task is to build and > publish artifacts and the postrelease_snapshot is to verify whether the > snapshot works. > > On Tue, Mar 26, 2019 at 8:45 AM Ahmet Altay wrote: > > > > I believe this is related to > > https://issues.apache.org/jira/browse/BEAM-6840 and +Boyuan Zhang has a > > fix in progress https://github.com/apache/beam/pull/8132 > > > > On Tue, Mar 26, 2019 at 7:09 AM Ismaël Mejía wrote: > >> > >> I was trying to validate a fix on the Spark runner and realized that > >> Beam SNAPSHOTS have not been updated since February 24 ! > >> > >> https://repository.apache.org/content/repositories/snapshots/org/apache/beam/beam-sdks-java-core/2.12.0-SNAPSHOT/ > >> > >> Can somebody please take a look at why this is not been updated? > >> > >> Thanks, > >> Ismaël
Re: SNAPSHOTS have not been updated since February
Any progress on this? On Wed, Mar 27, 2019 at 5:38 AM Daniel Oliveira wrote: > > I made a bug for this specific issue (artifacts not publishing to the Apache > Maven repo): https://issues.apache.org/jira/browse/BEAM-6919 > > While I was gathering info for the bug report I also noticed +Yifan Zou has > an experimental PR testing a fix: https://github.com/apache/beam/pull/8148 > > On Tue, Mar 26, 2019 at 11:42 AM Boyuan Zhang wrote: >> >> +Daniel Oliveira >> >> On Tue, Mar 26, 2019 at 9:57 AM Boyuan Zhang wrote: >>> >>> Sorry for the typo. Ideally, the snapshot publish is independent from >>> postrelease_snapshot. >>> >>> On Tue, Mar 26, 2019 at 9:55 AM Boyuan Zhang wrote: Hey, I'm trying to publish the artifacts by commenting "Run Gradle Publish" in my PR, but there are several errors saying "cannot write artifacts into dir", anyone has idea on it? Ideally, the snapshot publish is dependent from postrelease_snapshot. The publish task is to build and publish artifacts and the postrelease_snapshot is to verify whether the snapshot works. On Tue, Mar 26, 2019 at 8:45 AM Ahmet Altay wrote: > > I believe this is related to > https://issues.apache.org/jira/browse/BEAM-6840 and +Boyuan Zhang has a > fix in progress https://github.com/apache/beam/pull/8132 > > On Tue, Mar 26, 2019 at 7:09 AM Ismaël Mejía wrote: >> >> I was trying to validate a fix on the Spark runner and realized that >> Beam SNAPSHOTS have not been updated since February 24 ! >> >> https://repository.apache.org/content/repositories/snapshots/org/apache/beam/beam-sdks-java-core/2.12.0-SNAPSHOT/ >> >> Can somebody please take a look at why this is not been updated? >> >> Thanks, >> Ismaël
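For anyone who, like Ismaël, wants to validate a fix against the nightly artifacts once publishing resumes: below is a minimal sketch of pulling the snapshot referenced above into a Maven build. The repository URL and the artifact coordinates come from the link in this thread; the repository id is just an illustrative name, not something the project prescribes.

    <!-- Sketch only: illustrative repository id; URL and coordinates taken from the thread. -->
    <repositories>
      <repository>
        <id>apache-snapshots</id>
        <url>https://repository.apache.org/content/repositories/snapshots/</url>
        <releases><enabled>false</enabled></releases>
        <snapshots><enabled>true</enabled></snapshots>
      </repository>
    </repositories>

    <dependencies>
      <dependency>
        <groupId>org.apache.beam</groupId>
        <artifactId>beam-sdks-java-core</artifactId>
        <version>2.12.0-SNAPSHOT</version>
      </dependency>
    </dependencies>

With something like this in place, Maven resolves the most recent timestamped snapshot from that repository (subject to its snapshot update policy), which is why a stalled nightly publish shows up as silently stale dependencies rather than a hard build error.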
Re: CassandraIO breakage
Interesting. Let me try a full rebuild; maybe some dependency is not getting rebuilt. I was getting errors like this: /Users/relax/beam/sdks/java/io/cassandra/src/main/java/org/apache/beam/sdk/io/cassandra/CassandraServiceImpl.java:74: error: incompatible types: ValueProvider<List<String>> cannot be converted to List<String> source.spec.hosts(), On Thu, Apr 18, 2019 at 8:36 AM Jean-Baptiste Onofré wrote: > It builds fine on my machine. > > Let me check on Jenkins. > > Regards > JB > > On 17/04/2019 21:48, Reuven Lax wrote: > > Did something break with CassandraIO? It no longer seems to compile. > > -- > Jean-Baptiste Onofré > jbono...@apache.org > http://blog.nanthrax.net > Talend - http://www.talend.com >
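For context on that compiler error, here is a minimal Java sketch of the mismatch, assuming (as the message suggests) that spec.hosts() now returns a ValueProvider<List<String>> instead of a plain List<String>. The class and method names below are hypothetical; only ValueProvider and its get() method are existing Beam API.

    import java.util.List;
    import org.apache.beam.sdk.options.ValueProvider;

    class HostsSketch {
      // A ValueProvider<List<String>> cannot be assigned to a List<String>,
      // which is exactly the "incompatible types" error quoted above:
      //   List<String> hosts = spec.hosts();   // does not compile
      //
      // Resolving the provider with get() compiles; in Beam, get() is meant to
      // be called at pipeline execution time, not at graph construction time.
      static List<String> resolveHosts(ValueProvider<List<String>> hosts) {
        return hosts.get();
      }
    }

If that is indeed the cause, the call site in CassandraServiceImpl would presumably need to defer the hosts() lookup to execution time rather than expecting a List<String> directly.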
Re: CassandraIO breakage
It builds fine on my machine. Let me check on Jenkins. Regards JB On 17/04/2019 21:48, Reuven Lax wrote: > Did something break with CassandraIO? It no longer seems to compile. -- Jean-Baptiste Onofré jbono...@apache.org http://blog.nanthrax.net Talend - http://www.talend.com
Re: CassandraIO breakage
Let me check if it works on my machine. Regards JB On 17/04/2019 21:48, Reuven Lax wrote: > Did something break with CassandraIO? It no longer seems to compile. -- Jean-Baptiste Onofré jbono...@apache.org http://blog.nanthrax.net Talend - http://www.talend.com
Re: Go SDK status
Hi Robert, Thanks a bunch for providing this comprehensive update. This is exactly the kind of perspective I was looking for, even if overall it means that, for potential users, the Go SDK is at an even earlier stage than I had hoped. For more context, my interest was primarily on the streaming side. Of the missing features you listed, State + Timers + Triggers would probably be the highest priority. Unfortunately I won't be able to contribute to the Go SDK anytime soon, so this is mostly FYI in case anyone else can. On improving the IOs, I think it would make a lot of sense to focus on the cross-language route. There has been some work lately to make existing Beam Java IOs available on the Flink runner (Max would be able to share more details on that). Thanks! Thomas On Wed, Apr 17, 2019 at 9:56 PM Robert Burke wrote: > Oh dang. Thanks for mentioning that! Here's an open copy of the versioning > thoughts doc, though there shouldn't be any surprises from the points I > mentioned above. > > > https://docs.google.com/document/d/1ZjP30zNLWTu_WzkWbgY8F_ZXlA_OWAobAD9PuohJxPg/edit#heading=h.drpipq762xi7 > > On Wed, 17 Apr 2019 at 21:20, Nathan Fisher > wrote: > >> Hi Robert, >> >> Great summary on the current state of play. FYI the referenced G doc >> doesn't appear to people outside the org as a default. >> >> Great to hear the Go SDK is still getting love. I last looked at it in >> September-October of last year. >> >> Cheers, >> Nathan >> >> On Wed, 17 Apr 2019 at 20:27, Lukasz Cwik wrote: >> >>> Thanks for the in-depth summary. >>> >>> On Mon, Apr 15, 2019 at 4:19 PM Robert Burke wrote: >>> Hi Thomas! I'm so glad you asked! The status of the Go SDK is complicated, so this email can't be brief. There are several dimensions to consider: as a Go Open Source Project, User Libraries and Experience, and on Beam Features. I'm going to be updating the roadmap later this month when I have a spare moment. *tl;dr;* I would *love* help in improving the Go SDK, especially around interactions with Java/Python/Flink. Java and I do not have a good working relationship for operational purposes, and the last time I used Python, I had to re-image my machine. There's lots to do, but shouting out tasks to the void is rarely as productive as it is cathartic. If there's an offer to help, and a preference for/experience with something to work on, I'm willing to find something useful to get started on for you. (Note: The following are simply my opinion as someone who works with the project weekly as a Go programmer, and should not be treated as demands or gospel. I just don't have anyone to talk about Go SDK issues with, and my previous discussions have largely seemed to fall on uninterested ears.) *The SDK can be considered Alpha when all of the following are true:* * The SDK is tested by the Beam project on a ULR and on Flink as well as Dataflow. * The IOs have received some love to ensure they can scale (either through SDF or reshuffles), and be portable to different environments (e.g. using the Go Cloud Development Kit (CDK) libraries). * Cross-Language IO support would also be acceptable. * The SDK is using Go Modules for dependency management, marking it as version 0.Minor (where Minor should probably track the mainline Beam minor version for now). *We can move to calling it Beta when all of the following are true:* * All implemented Beam features are meaningfully tested on the portable runners (e.g.
a proper "Validates Runner" suite exists in Go) * The SDK is properly documented on the Beam site, and in it's Go Docs. After this, I'll be more comfortable recommending it as something folks can use for production. That said, there are happy paths that are useable today in batch situations. *Intro* The Go SDK is a purely Beam Portable SDK. If it runs on a distributed system at all, it's being run portably. Currently it's regularly tested on Google Cloud Dataflow (though Dataflow doesn't officially support the SDK at this time), and on it's own single bundle Direct Runner (intended for unit testing purposes). In addition, it's being tested at scale within Google, on an internal runner, where it presently satisfies our performance benchmarks, and correctness tests. I've been working on cases to make the SDK suitable for data processing within Google. This unfortunately makes my contributions more towards general SDK usability, documentation, and performance, rather than "making it usable outside Google". Note this also precludes necessary work to resolve issues with running Go SDK pipelines on Google Cloud Dataflow. I believe that the SDK must become a good member of the Go ecosystem, the Beam ecosystem.
Re: CassandraIO breakage
How does it fail? No issue for me with a local build against master. > On 17 Apr 2019, at 21:48, Reuven Lax wrote: > > Did something break with CassandraIO? It no longer seems to compile.