lostluck opened a new issue, #27839:
URL: https://github.com/apache/beam/issues/27839

   ### What happened?
   
   Workers fail to boot for a Dataflow customer who passes a large number of 
`--filesToStage` entries. The failure is 
`Java exited: fork/exec /opt/java/openjdk/bin/java: argument list too long`.
   
   Investigation showed that on Linux, environment variables count toward the 
same size limit as the command-line arguments:
   
   
https://stackoverflow.com/questions/28865473/setting-environment-variable-to-a-large-value-argument-list-too-long
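To make the limit described above concrete, here is a minimal Python sketch 
that approximates how many bytes `execve()` has to copy for the arguments plus 
the environment. The helper name `exec_payload_bytes`, the staged file paths, 
and the variable name `PIPELINE_OPTIONS` are assumptions for illustration; the 
estimate ignores pointer overhead, and the real limit depends on the kernel 
and stack size settings.

```python
import json
import os


def exec_payload_bytes(argv, env):
    """Approximate the bytes execve() copies for argv plus the environment."""
    # Each argument is stored as a NUL-terminated string.
    arg_bytes = sum(len(os.fsencode(a)) + 1 for a in argv)
    # Each variable is stored as "KEY=VALUE" plus a terminating NUL.
    env_bytes = sum(
        len(os.fsencode(k)) + len(os.fsencode(v)) + 2 for k, v in env.items()
    )
    return arg_bytes + env_bytes


# Hypothetical options with many staged files; the paths are made up.
options = {"filesToStage": [f"/tmp/staged/dep-{i}.jar" for i in range(5000)]}
# The variable name below is an assumption, not taken from the issue text.
env = {"PIPELINE_OPTIONS": json.dumps(options)}

size = exec_payload_bytes(["/opt/java/openjdk/bin/java", "-cp", "app.jar"], env)
print(f"argv + env is roughly {size} bytes")
```

The command line itself stays short here; nearly all of the size comes from 
the single environment variable holding the serialized options.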
   
   Beam Java serializes the pipeline options as JSON into an environment 
variable:
   
   
https://github.com/apache/beam/blob/release-2.49.0/sdks/java/container/boot.go#L128
   
   The Python container does the same, although no failures have been 
reported for it yet:
   
   
https://github.com/apache/beam/blob/90809097260ec4252b746b97bd849efc412950f5/sdks/python/container/boot.go#L206
   
   Previous work to resolve this was here, focused on the Java class path: 
https://github.com/apache/beam/issues/25582
   
   That work helped, but large pipeline options remain a problem.
   
   The proposed fix, for Java at least, is to set another environment 
variable, `PIPELINE_OPTIONS_LOCATION`, containing the path of a file that 
holds the JSON-encoded pipeline options, similar to what was done for the 
pathing jar.
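The container-side half of this proposal could look roughly like the sketch 
below. It is written in Python for brevity, whereas the real boot code is Go; 
the function name `stage_pipeline_options` and the temp-file naming are 
hypothetical. Only `PIPELINE_OPTIONS_LOCATION` comes from the proposal.

```python
import json
import os
import tempfile


def stage_pipeline_options(options, env, directory=None):
    """Write options as JSON to a file and record only its path in env."""
    fd, path = tempfile.mkstemp(
        prefix="pipeline_options_", suffix=".json", dir=directory
    )
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(options, f)
    # The environment now carries a short path instead of the full JSON.
    env["PIPELINE_OPTIONS_LOCATION"] = path
    return path


# Hypothetical usage: the child process environment stays small.
child_env = {}
options_path = stage_pipeline_options({"filesToStage": ["a.jar", "b.jar"]}, child_env)
print(child_env["PIPELINE_OPTIONS_LOCATION"])
os.remove(options_path)
```

The size of the environment entry no longer grows with the number of staged 
files, which is the same idea as the pathing jar.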
   
   The portable SDK harness should check for this environment variable and, 
if it is set, read the JSON pipeline options from the file it names. 
Otherwise, it should fall back to the existing behavior.
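The harness-side lookup with its fallback can be sketched as follows. This is 
an illustration in Python rather than the actual harness code; the function 
name `load_pipeline_options` is hypothetical, and `PIPELINE_OPTIONS` is 
assumed to be the name of the existing inline variable.

```python
import json
import os
import tempfile


def load_pipeline_options(env):
    """Prefer the options file if a location is set, else the inline JSON."""
    location = env.get("PIPELINE_OPTIONS_LOCATION")
    if location:
        with open(location, encoding="utf-8") as f:
            return json.load(f)
    # Existing behavior: options are inlined in an environment variable
    # (the name PIPELINE_OPTIONS is an assumption here).
    return json.loads(env.get("PIPELINE_OPTIONS", "{}"))


# Hypothetical usage covering both paths.
with tempfile.NamedTemporaryFile(
    "w", suffix=".json", delete=False, encoding="utf-8"
) as tmp:
    json.dump({"source": "file"}, tmp)

from_file = load_pipeline_options({"PIPELINE_OPTIONS_LOCATION": tmp.name})
from_env = load_pipeline_options({"PIPELINE_OPTIONS": '{"source": "env"}'})
os.remove(tmp.name)
print(from_file, from_env)
```

Because the fallback path is unchanged, an older container that never sets 
the new variable keeps working with a newer harness.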
   
   This tolerates a slight mismatch between container versions and Beam 
versions for users who aren't experiencing this issue.
   
   ### Issue Priority
   
   Priority: 1 (data loss / total loss of function)
   
   ### Issue Components
   
   - [X] Component: Python SDK
   - [X] Component: Java SDK
   - [ ] Component: Go SDK
   - [ ] Component: Typescript SDK
   - [ ] Component: IO connector
   - [ ] Component: Beam examples
   - [ ] Component: Beam playground
   - [ ] Component: Beam katas
   - [ ] Component: Website
   - [ ] Component: Spark Runner
   - [ ] Component: Flink Runner
   - [ ] Component: Samza Runner
   - [ ] Component: Twister2 Runner
   - [ ] Component: Hazelcast Jet Runner
   - [ ] Component: Google Cloud Dataflow Runner

