[ 
https://issues.apache.org/jira/browse/MNG-7389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17472217#comment-17472217
 ] 

Thomas Skjølberg commented on MNG-7389:
---------------------------------------

The CI knows when to clean the cache, but has no good way to actually perform 
the operation it desires. What is lacking is a goal like 
[dependency:purge-local-repository|https://maven.apache.org/plugins/maven-dependency-plugin/purge-local-repository-mojo.html],
 but one which deletes only unused dependencies. Here "unused" means not used 
by any of a set of Maven invocations, rather than by a single one. Surely a 
solution must involve the file system in some way.

A very simple approach would be to update the last-modified timestamp when an 
artifact is accessed during a build. Then a script or plugin goal could delete 
all artifacts with an old last-modified timestamp. 
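
A minimal sketch of that timestamp idea, in Python for brevity. The function 
names mark_used and purge_stale and the age threshold are hypothetical; 
nothing like this exists in Maven today, and a real version would need 
Maven itself to call the equivalent of mark_used whenever it resolves an 
artifact.

```python
import os
import time
from pathlib import Path
from typing import List, Optional


def mark_used(artifact: Path) -> None:
    """Set the access and modification times of the artifact to now."""
    os.utime(artifact, None)


def purge_stale(repo: Path, max_age_seconds: float,
                now: Optional[float] = None) -> List[Path]:
    """Delete regular files under repo not modified within max_age_seconds.

    Returns the deleted paths. Empty directories are left in place.
    """
    reference = time.time() if now is None else now
    cutoff = reference - max_age_seconds
    removed = []
    # sorted() materialises the listing before any file is deleted.
    for path in sorted(repo.rglob("*")):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed
```

A CI job could then run something like purge_stale(Path.home() / ".m2" / 
"repository", 7 * 24 * 3600) just before saving the cache. The one-week 
threshold is only an example value.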

The PoC actually records each artifact access in a text file, then scans 
through the cache and deletes the files not listed there. I am guessing that 
there is some better way of doing this.
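
The record-then-sweep approach can be sketched roughly as follows. This is 
not the PoC's code: the log format (one repository-relative path per line) 
and the function name purge_unrecorded are assumptions made for 
illustration.

```python
from pathlib import Path
from typing import List


def purge_unrecorded(repo: Path, access_log: Path) -> List[Path]:
    """Delete files under repo whose relative path is not in access_log.

    Assumed log format: one repository-relative path per line, using
    forward slashes. Keep the log outside repo, or it is swept as well.
    """
    lines = access_log.read_text().splitlines()
    used = {line.strip() for line in lines if line.strip()}
    removed = []
    # sorted() materialises the listing before any file is deleted.
    for path in sorted(repo.rglob("*")):
        if path.is_file() and path.relative_to(repo).as_posix() not in used:
            path.unlink()
            removed.append(path)
    return removed
```

Unlike the timestamp variant, this needs no write access to the artifacts 
during the build, but it does need something that produces the log.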

> Incremental .m2 cache cleanup for CI
> ------------------------------------
>
>                 Key: MNG-7389
>                 URL: https://issues.apache.org/jira/browse/MNG-7389
>             Project: Maven
>          Issue Type: New Feature
>          Components: Dependencies
>            Reporter: Thomas Skjølberg
>            Priority: Minor
>
> One or more popular continuous integration services are unable to properly 
> manage the .m2 repository cache, resulting in wasted resources in the form 
> of increased CI runtime and bandwidth consumption.
> *CircleCI cache behaviour:*
>  - immutable cache entries
>  - default behaviour is to wipe the cache each time a pom file is modified 
> (i.e. using pom hash as a cache key)
>  - cache entries TTL > weeks
> So CircleCI always has a cache containing only the necessary artifacts, but 
> has to download all dependencies every time the pom file changes.
> *GitHub Actions cache behaviour:*
>  - (effectively) mutable cache entries
>  - incremental cache (if it gets too big, it is wiped).
>  - cache entries TTL 1 week
> So GitHub Actions works well if the cache entries expire from time to time; 
> otherwise the cache keeps growing.
> *Summary*
> Perhaps this does not look so bad at first glance, but for a project under 
> active development, with a lot of artifacts, the pom file changes often. For 
> example we have apps with 100 dependencies and automatic dependency bumping 
> via Renovate, in addition to a hierarchy of libraries.
> Key takeaway: time is wasted on
>  - saving caches in CI
>  - loading cache in CI
>  - loading artifacts from external artifact store
> This happens quite a lot. From the artifact store perspective, this probably 
> multiplies the load by a factor of 10.
> Possible solution: A way to define a "transaction" for artifact use, i.e.
> 1. run command to mark start of transaction 
> 2. run one or more maven commands
> 3. run command to mark end of transaction, deleting artifacts not in use.
> For reference, Gradle has the same problem.
> Proof of concept:
>  * CircleCI : [https://github.com/entur/maven-orb]
>  * GitHub Actions: [https://github.com/skjolber/tidy-cache-github-action]
> The implementation uses instrumentation to record artifact access, then 
> deletes the artifacts not recorded. 
> *Alternatives:*
> I did try the last-accessed file timestamp first, but it turns out most CI 
> filesystems are mounted without that option. However, it should also be 
> possible to update the last-modified timestamp and/or record read access in 
> some existing metadata file. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
