dongjoon-hyun opened a new pull request #30253:
URL: https://github.com/apache/spark/pull/30253


   ### What changes were proposed in this pull request?
   
   This is a backport of https://github.com/apache/spark/pull/30059 .
   
   This PR aims to use a pre-built Docker image in the GitHub Actions PySpark jobs. To isolate the changes, the `pyspark` jobs are split from the main job. The Docker image is built from the following.
   
   | Item | URL |
   | --- | --- |
   | Dockerfile | https://github.com/dongjoon-hyun/ApacheSparkGitHubActionImage/blob/main/Dockerfile |
   | Builder | https://github.com/dongjoon-hyun/ApacheSparkGitHubActionImage/blob/main/.github/workflows/build.yml |
   | Image Location | https://hub.docker.com/r/dongjoon/apache-spark-github-action-image |
   
   Please note the following.
   1. The community will still use `build_and_test.yml` to add new features, as we have done until now. The `Dockerfile` will be updated regularly.
   2. When Apache Spark gets an official Docker repository location, we will use it.
   3. Also, it would be best to keep this Dockerfile and builder script in a new Apache Spark dev branch instead of an outside GitHub repository.
   
   ### Why are the changes needed?
   
   Currently, the two `pyspark` test jobs each consistently take over an hour and a half, for a total of 3 hours 14 minutes.
   - https://github.com/apache/spark/runs/1240470628 (1 hour 35 mins)
   - https://github.com/apache/spark/runs/1240470634 (1 hour 39 mins)
   
   This PR removes the package installation steps, which take 16 minutes and cause flakiness. Note that the `Python 3.6 package installation` step is not included in the pre-built image; it takes only `20s`.
   
   **BEFORE**
   ![Screen Shot 2020-10-15 at 10 32 17 AM](https://user-images.githubusercontent.com/9700541/96165634-be625080-0ed1-11eb-974b-940c112152e9.png)
   
   **AFTER**
   ![Screen Shot 2020-10-15 at 10 58 17 AM](https://user-images.githubusercontent.com/9700541/96168262-5d3c7c00-0ed5-11eb-83c5-e9dc189a156b.png)
   
   In short, the `pyspark` GitHub jobs now take less time: 2 hours 23 minutes in total, down from 3 hours 14 minutes.
   - https://github.com/apache/spark/pull/30059/checks?check_run_id=1260512568 (1 hour 18 mins)
   - https://github.com/apache/spark/pull/30059/checks?check_run_id=1260512582 (1 hour 5 mins)
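
As a quick sanity check on the totals quoted above, the per-job durations can be summed with a few lines of Python. This is only an illustrative sketch: the durations are copied from the job links in this description, and the helper names `total` and `fmt` are made up for this example and are not part of the Spark build scripts.

```python
from datetime import timedelta

# Per-job durations as reported in this PR description.
before = [timedelta(hours=1, minutes=35), timedelta(hours=1, minutes=39)]
after = [timedelta(hours=1, minutes=18), timedelta(hours=1, minutes=5)]


def total(durations):
    """Sum a list of timedeltas (hypothetical helper for this example)."""
    return sum(durations, timedelta())


def fmt(duration):
    """Render a timedelta as 'H hours M minutes' (hypothetical helper)."""
    minutes = int(duration.total_seconds()) // 60
    return f"{minutes // 60} hours {minutes % 60} minutes"


total_before = total(before)
total_after = total(after)
saved = total_before - total_after

print("before:", fmt(total_before))  # before: 3 hours 14 minutes
print("after: ", fmt(total_after))   # after:  2 hours 23 minutes
print("saved: ", fmt(saved))         # saved:  0 hours 51 minutes
```

By this arithmetic, the two jobs together finish about 51 minutes sooner in the runs linked above.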
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Pass the GitHub Actions on this PR without the package installation steps.

