I'm running into the issue Kyle points out when I try to run a pipeline that does not use artifact staging:

2019-11-23 01:09:18,442 WARN org.apache.beam.runners.fnexecution.artifact.AbstractArtifactRetrievalService - GetManifest for /tmp/beam-artifact-staging/job_53cad419-a8c0-472c-8486-f795cc88a80f/MANIFEST failed.
java.util.concurrent.ExecutionException: java.io.FileNotFoundException: /tmp/beam-artifact-staging/job_53cad419-a8c0-472c-8486-f795cc88a80f/MANIFEST (No such file or directory)
    at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:531)
    at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:492)
    at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:83)
    at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:196)
    at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2312)

This happens when I use /opt/apache/beam/boot to start the worker in the
process environment, as it will attempt to retrieve artifacts. The same
would be the case for the worker pool.

Thomas

On Tue, Nov 12, 2019 at 5:07 PM Robert Bradshaw <rober...@google.com> wrote:

> FWIW, there are also discussions of adding a preparation phase for sdk
> harness (docker) images, such that artifacts could be staged (and
> installed, compiled etc.) ahead of time and shipped as part of the sdk
> image rather than via a side channel (and on every worker). Anyone not
> using these images is probably shipping dependencies in another way
> anyways.
>
> On Tue, Nov 12, 2019 at 5:03 PM Robert Bradshaw <rober...@google.com>
> wrote:
> >
> > Certainly there's a lot to be re-thought in terms of artifact staging,
> > especially when it comes to cross-language pipelines. I think it would
> > make sense to have a special retrieval token for the "empty"
> > manifest, which would mean a staging directory would never have to be
> > set up if no artifacts happened to be staged.
> >
> > The UberJar avoids any artifact staging overhead as well.
> >
> > On Tue, Nov 12, 2019 at 3:30 PM Kyle Weaver <kcwea...@google.com> wrote:
> > >
> > > Hi Beamers,
> > >
> > > We can use artifact staging to make sure SDK workers have access to a
> > > pipeline's dependencies. However, artifact staging is not always
> > > necessary. For example, one can make sure that the environment
> > > contains all the dependencies ahead of time. However, regardless of
> > > whether or not artifacts are used, my understanding is an artifact
> > > manifest will be written and read anyway. For example:
> > >
> > > INFO AbstractArtifactRetrievalService: GetManifest for
> > > /tmp/beam-artifact-staging/.../MANIFEST -> 0 artifacts
> > >
> > > This can be a hassle, because users must set up a staging directory
> > > that all workers can access, even if it isn't used aside from the
> > > (empty) manifest [1]. Thomas mentioned that at Lyft they bypass
> > > artifact staging altogether [2]. So I was wondering, do you all think
> > > it would be reasonable or useful to create an "off switch" for
> > > artifact staging?
> > >
> > > Thanks,
> > > Kyle
> > >
> > > [1]
> > > https://lists.apache.org/thread.html/d293b4158f266be1cb6c99c968535706f491fdfcd4bb20c4e30939bb@%3Cdev.beam.apache.org%3E
> > > [2]
> > > https://issues.apache.org/jira/browse/BEAM-5187?focusedCommentId=16972715&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16972715
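
For illustration, the "empty manifest" retrieval token idea from the quoted thread could look roughly like the sketch below. Everything in it is an assumption made for the example: EMPTY_MANIFEST_TOKEN, get_manifest, and the JSON manifest layout are hypothetical names and are not part of Beam's artifact retrieval API.

```python
import json

# Hypothetical sentinel value; not an actual Beam constant.
EMPTY_MANIFEST_TOKEN = "__no_artifacts__"


def get_manifest(retrieval_token):
    """Return the list of staged artifacts for a retrieval token.

    The sentinel token short-circuits to an empty manifest, so no staging
    directory or MANIFEST file has to exist when nothing was staged.
    """
    if retrieval_token == EMPTY_MANIFEST_TOKEN:
        return []
    # Assumption for this sketch: any other token is a path to a JSON file
    # of the form {"artifacts": [...]}.
    with open(retrieval_token) as f:
        return json.load(f)["artifacts"]
```

With a token like this, a worker started in the process environment would get zero artifacts back without touching the filesystem, whereas a normal token pointing at a missing MANIFEST would still fail as in the log above.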