[ 
https://issues.apache.org/jira/browse/BEAM-12088?focusedWorklogId=576356&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-576356
 ]

ASF GitHub Bot logged work on BEAM-12088:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 02/Apr/21 23:37
            Start Date: 02/Apr/21 23:37
    Worklog Time Spent: 10m 
      Work Description: iemejia commented on pull request #14417:
URL: https://github.com/apache/beam/pull/14417#issuecomment-812755129


   Regarding the question about the exact cause: the method used in the 
Structured Streaming runner differed from the Classic/Portable implementation 
in two aspects:
   (1) It did not first check that the files to stage exist, so it could end 
up trying to stage nonexistent files, and (2) if the user had not set a 
tempLocation it would fail, instead of falling back to a default tmp 
directory as this implementation does:
   ```java
       if (!isLocalSparkMaster(options)) {
         List<String> filesToStage =
             options.getFilesToStage().stream()
                 .map(File::new)
                 .filter(File::exists)
                 .map(
                     file -> {
                       return file.getAbsolutePath();
                     })
                 .collect(Collectors.toList());
         options.setFilesToStage(
             PipelineResources.prepareFilesForStaging(
                 filesToStage,
                 MoreObjects.firstNonNull(
                     options.getTempLocation(), System.getProperty("java.io.tmpdir"))));
       }
   ```
   vs
   ```java
       if (!PipelineTranslator.isLocalSparkMaster(options)) {
         options.setFilesToStage(
             PipelineResources.prepareFilesForStaging(
                 options.getFilesToStage(), options.getTempLocation()));
       }
   ```
   I suppose falling back to a local temp directory could be improved, but so 
far this is the only way I have found to run this on YARN, which does not seem 
to follow Beam's staging pattern exactly.
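
   The two differences described above can be isolated in a minimal sketch that 
uses only the JDK. The class `StagingSketch` and its method names are 
hypothetical, made up for illustration; they are not part of Beam, and the 
sketch leaves out `PipelineResources.prepareFilesForStaging` entirely:

```java
import java.io.File;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only; not a Beam class.
final class StagingSketch {
  private StagingSketch() {}

  // Difference (1): drop paths that do not exist before staging,
  // and normalize the remaining ones to absolute paths.
  static List<String> existingFiles(List<String> candidates) {
    return candidates.stream()
        .map(File::new)
        .filter(File::exists)
        .map(File::getAbsolutePath)
        .collect(Collectors.toList());
  }

  // Difference (2): use the configured temp location when it is set,
  // otherwise fall back to the JVM's default temp directory.
  static String tempLocationOrDefault(String configured) {
    return configured != null ? configured : System.getProperty("java.io.tmpdir");
  }
}
```

   With these helpers, a call such as `StagingSketch.tempLocationOrDefault(null)` 
returns the value of `java.io.tmpdir` instead of failing, which is the 
fallback behaviour the first snippet obtains through `MoreObjects.firstNonNull`.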


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 576356)
    Time Spent: 1h 10m  (was: 1h)

> Make file staging uniform among Spark Runners
> ---------------------------------------------
>
>                 Key: BEAM-12088
>                 URL: https://issues.apache.org/jira/browse/BEAM-12088
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-spark
>            Reporter: Ismaël Mejía
>            Assignee: Ismaël Mejía
>            Priority: P2
>             Fix For: 2.30.0
>
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Both the Spark Classic and Portable runners share the file staging logic, but 
> the Structured Streaming runner uses different logic even though the 
> process should in principle be the same. This shows up as exceptions related 
> to file staging when trying to deploy pipelines via Hadoop YARN.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)