[
https://issues.apache.org/jira/browse/MNG-7389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17472359#comment-17472359
]
Thomas Skjølberg commented on MNG-7389:
---------------------------------------
Another way to explain this is a multi-level cache. I agree that ideally all
artifact stores are located in the same datacenter, but for many, they are not,
for various practical reasons. We buy CI and the artifact store (Artifactory)
as cloud services, which is still cost-effective even though they are not
hosted in the same datacenter. Some shops do not have an artifact store at
all and rely on the global artifact store. Our CI is priced per minute, the
artifact store per GB. CI cache storage capacity is 'free' (the only cost is
the load/save time).
So for a practical example of an incremental cache, assume we have the levels
1. CI internal cache
* 50 MB/s read
* 30 MB/s write
2. Nexus style repo manager (i.e. for us Artifactory)
* 5 MB/s read
3. Global artifact store (i.e. Maven Central)
* 1 MB/s read
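To make the difference between the levels concrete, here is a minimal Python sketch that turns the read speeds above into transfer times. The 500 MB repository size is a hypothetical figure chosen for illustration, and per-request latency is ignored.

```python
# Read throughput per cache level in MB/s, taken from the list above.
READ_MB_PER_S = {
    "ci_cache": 50.0,
    "repo_manager": 5.0,
    "global_store": 1.0,
}


def read_seconds(size_mb: float, level: str) -> float:
    """Time in seconds to read size_mb megabytes from the given level."""
    return size_mb / READ_MB_PER_S[level]


if __name__ == "__main__":
    repo_size_mb = 500.0  # hypothetical size of the .m2 repository
    for level in READ_MB_PER_S:
        print(f"{level}: {read_seconds(repo_size_mb, level):.0f} s")
```

With these assumed numbers, a full load from the CI cache takes seconds, while falling through to the lower levels takes minutes.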
So let's do a new build. The first git commit results in:
1. cache key is generated
2. CI cache cannot be loaded (it does not exist)
3. build populates .m2 repository directory during build
4. .m2 repository directory is saved to the CI cache
The second commit does *not* modify the pom:
1. cache key is generated
2. CI cache is loaded
3. build (all artifacts are present)
4. .m2 repository directory is not saved (it did not have any changes)
The third commit modifies the pom, bumping an artifact version from *2.0.0 to
2.0.1*
1. cache key is generated
2. CI cache is loaded
3. the build populates the .m2 repository with artifact version 2.0.1,
adding a minor delay
4. .m2 repository is saved (containing both *2.0.0 and 2.0.1*)
The fourth commit does *not* modify the pom:
1. cache key is generated
2. CI cache is loaded, containing both 2.0.0 and 2.0.1
3. build (all artifacts are present)
4. .m2 repository directory is not saved (it did not have any changes)
The fifth commit modifies the pom, bumping an artifact version from *2.0.1 to
2.0.2*
1. cache key is generated
2. CI cache is loaded, containing both 2.0.0 and 2.0.1
3. the build populates the .m2 repository with the bumped artifact version,
again adding a minor delay
4. .m2 repository is saved (containing all of *2.0.0, 2.0.1 and 2.0.2*)
So now the CI cache contains more data than necessary. Loading and saving the
cache takes longer than before. Developers see build times increase, possibly
making them question their build and/or the health of the cache. In the
following weeks, build times steadily increase at the hands of tools like
Renovate, which bump at least a handful of artifacts each day, eventually
forcing a manual intervention (cache reset).
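The growth described above can be put into rough numbers with a small Python sketch. It reuses the 50 MB/s load and 30 MB/s save speeds from the example. The function name, the starting size and the size added per bump are hypothetical, and the model assumes nothing is ever evicted.

```python
LOAD_MB_PER_S = 50.0  # CI cache read speed from the example
SAVE_MB_PER_S = 30.0  # CI cache write speed from the example


def cache_overhead(base_mb: float, added_mb_per_bump: float, bumps: int) -> list[tuple[float, float]]:
    """Return (cache size in MB, load + save seconds) for each pom bump.

    Models an incremental cache without eviction: every bump loads the
    existing cache, adds new artifact versions, and saves the larger cache.
    """
    result = []
    size = base_mb
    for _ in range(bumps):
        load_s = size / LOAD_MB_PER_S   # load the cache as it was saved last time
        size += added_mb_per_bump       # old versions stay, new ones are added
        save_s = size / SAVE_MB_PER_S   # save the grown cache
        result.append((size, load_s + save_s))
    return result


if __name__ == "__main__":
    # Hypothetical: 300 MB to start with, 30 MB of new versions per bump.
    for size, seconds in cache_overhead(300.0, 30.0, 5):
        print(f"{size:.0f} MB -> {seconds:.1f} s spent on cache load/save")
```

The overhead grows linearly with every bump, which is the slow build-time creep developers end up noticing.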
In a nutshell, the eviction policy makes it easier to stay at cache level 1,
keeping build time low and stable ('within acceptable constraints'). This also
mostly holds even if there is no Nexus-style repo manager.
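As a rough illustration of such an eviction step, the Python sketch below deletes every file in a repository directory that was not recorded as used during the build. It is not the implementation from the proof of concept: the function name `evict_unused` and the idea of passing the used paths in as a set are assumptions for illustration, and how artifact access gets recorded is left out.

```python
from pathlib import Path


def evict_unused(repo_dir: Path, used: set[str]) -> list[str]:
    """Delete every file under repo_dir whose relative path is not in `used`.

    `used` holds POSIX-style paths relative to repo_dir (hypothetical input,
    e.g. collected while the build ran). Returns the removed paths, sorted.
    """
    removed = []
    # sorted() materializes the listing before anything is deleted.
    for path in sorted(repo_dir.rglob("*")):
        if path.is_file():
            rel = path.relative_to(repo_dir).as_posix()
            if rel not in used:
                path.unlink()
                removed.append(rel)
    # Remove directories left empty, children before their parents.
    for path in sorted(repo_dir.rglob("*"), reverse=True):
        if path.is_dir() and not any(path.iterdir()):
            path.rmdir()
    return removed
```

Run just before the CI cache is saved, a step like this would keep the saved cache limited to what the latest build actually used.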
> Incremental .m2 cache cleanup for CI
> ------------------------------------
>
> Key: MNG-7389
> URL: https://issues.apache.org/jira/browse/MNG-7389
> Project: Maven
> Issue Type: New Feature
> Components: Dependencies
> Reporter: Thomas Skjølberg
> Priority: Minor
>
> One or more popular continuous integration services are unable to properly
> manage the .m2 repository cache, resulting in wasted resources in the form of
> increased CI runtime and bandwidth consumption.
> *CircleCI cache behaviour:*
> - immutable cache entries
> - default behaviour is to wipe the cache each time a pom file is modified
> (i.e. using pom hash as a cache key)
> - cache entries TTL > weeks
> So CircleCI always has a cache containing only the necessary artifacts, but
> has to download all dependencies every time the pom file changes.
> *GitHub Actions cache behaviour:*
> - (effectively) mutable cache entries
> - incremental cache (if it gets too big, it is wiped).
> - cache entries TTL 1 week
> So GitHub Actions works well if the cache entries expire from time to time;
> otherwise the cache keeps growing.
> *Summary*
> Perhaps this does not look so bad at first glance, but for a project under
> active development, with a lot of artifacts, the pom file changes often. For
> example we have apps with 100 dependencies and automatic dependency bumping
> via Renovate, in addition to a hierarchy of libraries.
> Key takeaway: time is wasted
> - saving caches in CI
> - loading cache in CI
> - loading artifacts from external artifact store
> This happens quite a lot. From the artifact store perspective, this probably
> multiplies the load by a factor of 10.
> Possible solution: A way to define a "transaction" for artifact use, i.e.
> 1. run command to mark start of transaction
> 2. run one or more maven commands
> 3. run command to mark end of transaction, deleting artifacts not in use.
> For reference, Gradle has the same problem.
> Proof of concept:
> * CircleCI: [https://github.com/entur/maven-orb]
> * GitHub Actions: [https://github.com/skjolber/tidy-cache-github-action]
> The implementation uses instrumentation to record artifact access, then
> deletes the artifacts not recorded.
> *Alternatives:*
> I tried the last-accessed file timestamp first, but it turns out most CI
> filesystems are mounted without that option. However, it should also be
> possible to update the modified timestamp and/or record read access in some
> existing metadata file.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)