imbajin opened a new issue, #723:
URL: https://github.com/apache/hugegraph-toolchain/issues/723
Observed while rerunning `loader-ci` during PR #716 review.
## Problem
The `Prepare env and service` step in `loader-ci` appears to spend a large
amount of time repeatedly downloading or rebuilding external dependencies on
each run, even when the versions do not change.
From the current workflow:
- `.github/workflows/loader-ci.yml` only caches `~/.m2`
- `hugegraph-loader/assembly/travis/install-hadoop.sh` always downloads
`hadoop-2.8.5.tar.gz` from `archive.apache.org`
- `hugegraph-loader/assembly/travis/install-mysql.sh` always runs
`docker pull mysql:5.7`
- `hugegraph-loader/assembly/travis/install-hugegraph-from-source.sh` always
clones `apache/hugegraph` and rebuilds the server package from source
The screenshot from the failing/re-run workflow shows
`Prepare env and service` taking about 19 minutes, with a large Hadoop tarball
download dominating the step.
```text
loader-ci
└─ Prepare env and service
   ├─ install-hadoop.sh
   │  └─ wget hadoop-2.8.5.tar.gz (large tarball, repeated)
   ├─ install-mysql.sh
   │  └─ docker pull mysql:5.7 (repeated image pull)
   └─ install-hugegraph-from-source.sh
      └─ git clone + mvn package (repeated source build)
```
## Why this matters
- CI duration is much longer than necessary
- CI becomes more fragile because it depends on multiple external downloads
during the test phase
- Re-runs are expensive even when the code change is unrelated to loader
integration environments
- Current cache coverage (`~/.m2` only) likely does not match the real
bottlenecks: the Hadoop download, the MySQL image pull, and the server source
build
## Suggested directions
### Prefer official artifacts / containers over ad-hoc install scripts
- Replace the MySQL setup script with a GitHub Actions `services` container
or another pinned official image
- Replace the Hadoop local install script with a pinned container/image or
other official prebuilt artifact if possible
- For HugeGraph server, prefer a reusable prebuilt tarball/artifact for the
pinned commit/version instead of cloning and packaging from source on every CI
run
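
As an illustration of reusing a server package for a pinned commit, here is a minimal sketch. It assumes the workflow caches a directory between runs; `SERVER_CACHE_DIR`, `reuse_or_build`, and the artifact file name are hypothetical and not taken from the existing scripts. The build command stands in for whatever clone-and-package logic is used today.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical directory that the workflow would cache between runs.
SERVER_CACHE_DIR="${SERVER_CACHE_DIR:-$HOME/.cache/hugegraph-server}"

# Print the path of the packaged server for a commit, building it only when
# no package for that exact commit exists yet.
# Usage: reuse_or_build <commit> <build-command> [args...]
# The build command receives the target path as its last argument and is
# expected to write the package there.
reuse_or_build() {
  local commit="$1"
  shift
  local artifact="$SERVER_CACHE_DIR/hugegraph-server-$commit.tar.gz"
  mkdir -p "$SERVER_CACHE_DIR"
  if [ -s "$artifact" ]; then
    echo "reusing package for $commit" >&2
  else
    echo "building package for $commit" >&2
    # Build into a temporary name so an aborted build is never mistaken
    # for a finished package; build output goes to stderr.
    "$@" "$artifact.part" >&2
    mv "$artifact.part" "$artifact"
  fi
  echo "$artifact"
}
```

Because the file name contains the commit, bumping the pinned commit naturally triggers one fresh build, while unrelated re-runs reuse the existing package.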
### If scripts must remain, make them cache-aware and idempotent
- Add cache coverage for downloaded tarballs or extracted runtime
directories if we still use script-based setup
- Skip `wget` / `docker pull` / clone+build when the required artifact is
already available
- Make the scripts check for existing files/directories before
re-downloading or rebuilding
- Verify whether GitHub Actions cache is currently missing the relevant
paths, or whether restore keys are ineffective for this use case
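
The check-before-download idea above could look roughly like this. It is a sketch only: `CACHE_DIR`, `fetch_once`, `pull_image_once`, and the `DOWNLOADER` hook are hypothetical names, and the cache directory is assumed to be restored by the workflow's cache step.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical cache location; the workflow would need to cache this path.
CACHE_DIR="${CACHE_DIR:-$HOME/.cache/loader-ci}"

# Default download command: <url> <output-path>.
default_downloader() {
  wget -q -O "$2" "$1"
}

# Download <url> to CACHE_DIR/<name> only if the file is not already there,
# then print the local path. DOWNLOADER (an assumption) may override wget.
fetch_once() {
  local url="$1" name="$2"
  local target="$CACHE_DIR/$name"
  mkdir -p "$CACHE_DIR"
  if [ -s "$target" ]; then
    echo "cache hit: $name" >&2
  else
    echo "cache miss: downloading $name" >&2
    # Write to a temporary name first so an interrupted download never
    # leaves a truncated file that later looks like a cache hit.
    "${DOWNLOADER:-default_downloader}" "$url" "$target.part"
    mv "$target.part" "$target"
  fi
  echo "$target"
}

# Pull a Docker image only when it is not already present locally.
pull_image_once() {
  local image="$1"
  if docker image inspect "$image" >/dev/null 2>&1; then
    echo "image present: $image" >&2
  else
    docker pull "$image"
  fi
}
```

A script could then call, for example, `tarball=$(fetch_once "$HADOOP_URL" hadoop-2.8.5.tar.gz)`, where `HADOOP_URL` is a placeholder for the URL the script already uses. The image check only saves time when the image is already on the machine, so on a fresh runner it would need to be combined with some form of image reuse.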
## Possible scope
- `.github/workflows/loader-ci.yml`
- `hugegraph-loader/assembly/travis/install-hadoop.sh`
- `hugegraph-loader/assembly/travis/install-mysql.sh`
- `hugegraph-loader/assembly/travis/install-hugegraph-from-source.sh`
## Expected outcome
- Repeated `loader-ci` runs should not re-download the same Hadoop tarball
every time
- MySQL setup should rely on a reusable/pinned container path rather than
always pulling inside the script
- HugeGraph server setup should reuse a stable artifact or cacheable output
where possible
- `Prepare env and service` time should drop significantly and become more
stable
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]