[
https://issues.apache.org/jira/browse/BEAM-12875?focusedWorklogId=651927&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-651927
]
ASF GitHub Bot logged work on BEAM-12875:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 16/Sep/21 18:22
Start Date: 16/Sep/21 18:22
Worklog Time Spent: 10m
Work Description: ibzib commented on pull request #15502:
URL: https://github.com/apache/beam/pull/15502#issuecomment-921139167
> Also, the call to FileSystems.setDefaultPipelineOptions seems like a bit
of a hack that is sprinkled all across the codebase; I'm not sure what the plan
is for improving this so sorry for this PR just making it worse.
I don't think there's a plan to improve this. FWIW I think
`FileSystems.setDefaultPipelineOptions` is inadequately named; it should be
probably be called `FileSystems.registerFileSystems` or something that more
clearly conveys its purpose.
I don't think createServerInfo is the best place to do this.
createServerInfo is runner-agnostic, and it shouldn't have to know anything
about file systems or the runner's JVM setup. It's better to put it up front in
SparkExecutableStageFunction::call.
https://github.com/apache/beam/blob/f969f9a3435ed4165f87c61d577dc5dd5de8ab47/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/SparkExecutableStageFunction.java#L125
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Issue Time Tracking
-------------------
Worklog Id: (was: 651927)
Time Spent: 40m (was: 0.5h)
> File systems are not registered when ArtifactRetrievalService is created by
> Spark runner
> ----------------------------------------------------------------------------------------
>
> Key: BEAM-12875
> URL: https://issues.apache.org/jira/browse/BEAM-12875
> Project: Beam
> Issue Type: Improvement
> Components: runner-spark
> Affects Versions: 2.32.0
> Reporter: Rogan Morrow
> Priority: P2
> Time Spent: 40m
> Remaining Estimate: 0h
>
> I am new to this codebase so apologies if I have any misunderstandings, but
> from what I can tell when {{SparkExecutableStageFunction}} is called an
> {{ArtifactRetrievalService}} is created (if the job bundle factory's
> environment cache is cold) to be called by the worker harness.
> The issue is that {{FileSystems.setDefaultPipelineOptions}} is not called
> before this, so no filesystems are registered. If one is using cloud storage
> such as S3 to stage artifacts, then the {{ArtifactRetrievalService}} will not
> be able to retrieve the artifacts and throw an exception:
> {{java.lang.IllegalArgumentException: No filesystem found for scheme s3}}
> This doesn't affect other runners such as the Flink runner because it calls
> {{FileSystems.setDefaultPipelineOptions}} [in its executable stage function
> |https://github.com/apache/beam/blob/v2.32.0/runners/flink/src/main/java/org/apache/beam/runners/flink/translation/functions/FlinkExecutableStageFunction.java#L151]
--
This message was sent by Atlassian Jira
(v8.3.4#803005)