This is an automated email from the ASF dual-hosted git repository.

potiuk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/airflow.git


The following commit(s) were added to refs/heads/master by this push:
     new cea9e82  Improves deletion of old artifacts. (#11079)
cea9e82 is described below

commit cea9e829b302931d170e64ba5b08e9642c8bc82e
Author: Jarek Potiuk <[email protected]>
AuthorDate: Tue Sep 22 14:31:14 2020 +0200

    Improves deletion of old artifacts. (#11079)
    
    We introduced deletion of the old artifacts as this was
    the suspected culprit of Kubernetes Job failures. It turned out
    eventually that those Kubernetes Job failures were caused by
    the #11017 change, but it's good to do housekeeping of the
    artifacts anyway.
    
    The delete workflow action introduced in a hurry had three problems:
    
    * it runs for every fork if they sync master. This is a bit
      too invasive
    
    * it fails continuously after 10 - 30 minutes every time
      as we have too many old artifacts to delete (GitHub has
      a 90-day retention policy, so we likely have tens of
      thousands of artifacts to delete)
    
    * it runs every hour and it causes occasional API rate limit
      exhaustion (because we have too many artifacts to loop through)
    
    This PR restricts the workflow to the apache/airflow repository,
    reduces the frequency of deletion to 4 times a day, and adds a
    script that we are running manually to delete the excess artifacts
    now. A back-of-the-envelope calculation puts the 4-times-a-day
    schedule at no more than 2500 artifacts to delete per run, so there
    is low risk of reaching the rate limit of 5000 API calls/hr.
    Eventually, when the number of artifacts goes down, the regular job
    should delete maybe a few hundred artifacts appearing within each
    6-hour window in normal circumstances, and it should stop failing.
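
The back-of-the-envelope estimate in the commit message above can be reproduced with a few lines of shell arithmetic. This is only a sketch: the function name `estimate_api_calls` is hypothetical, and the page size of 30 is an assumption about how many artifacts each listing request returns. The 2500 and 5000 figures are the ones quoted in the message.

```shell
#!/usr/bin/env bash
# Rough API-call budget for one cleanup run (illustrative numbers only).
set -euo pipefail

# Estimate API calls: one DELETE per artifact plus one GET per listing page.
# The page size is an assumption; pass the value you actually request.
function estimate_api_calls() {
    local artifacts=$1
    local per_page=$2
    local pages=$(( (artifacts + per_page - 1) / per_page ))  # ceiling division
    echo $(( artifacts + pages ))
}

RATE_LIMIT_PER_HOUR=5000   # figure quoted in the commit message
ARTIFACTS_PER_RUN=2500     # upper estimate quoted in the commit message
PER_PAGE=30                # assumed page size

calls=$(estimate_api_calls "${ARTIFACTS_PER_RUN}" "${PER_PAGE}")
echo "Estimated calls per run: ${calls} of ${RATE_LIMIT_PER_HOUR} allowed per hour"
```

Under these assumptions the estimate is 2584 calls (2500 deletions plus 84 listing requests), roughly half of the hourly limit.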
---
 .github/workflows/delete_old_artifacts.yml |  3 +-
 CI.rst                                     | 10 ++++
 dev/remove_artifacts.sh                    | 84 ++++++++++++++++++++++++++++++
 3 files changed, 96 insertions(+), 1 deletion(-)

diff --git a/.github/workflows/delete_old_artifacts.yml b/.github/workflows/delete_old_artifacts.yml
index c5c9da0..53d43b0 100644
--- a/.github/workflows/delete_old_artifacts.yml
+++ b/.github/workflows/delete_old_artifacts.yml
@@ -1,11 +1,12 @@
 name: 'Delete old artifacts'
 on:
   schedule:
-    - cron: '0 * * * *' # every hour
+    - cron: '27 */6 * * *' # run every 6 hours
 
 jobs:
   delete-artifacts:
     runs-on: ubuntu-latest
+    if: github.repository == 'apache/airflow'
     steps:
       - uses: kolpav/purge-artifacts-action@v1
         with:
diff --git a/CI.rst b/CI.rst
index 8cdd62e..4f518a7 100644
--- a/CI.rst
+++ b/CI.rst
@@ -615,6 +615,16 @@ This is manually triggered workflow (via GitHub UI manual run) that should only
 When triggered, it will force-push the "apache/airflow" master to the fork's master. It's the easiest
 way to sync your fork master to the Apache Airflow's one.
 
+Delete old artifacts
+--------------------
+
+This workflow is introduced, to delete old artifacts from the Github Actions build. We set it to
+delete old artifacts that are > 7 days old. It only runs for the 'apache/airflow' repository.
+
+We also have a script that can help to clean-up the old artifacts:
+`remove_artifacts.sh <dev/remove_artifacts.sh>`_
+
+
 Naming conventions for stored images
 ====================================
 
diff --git a/dev/remove_artifacts.sh b/dev/remove_artifacts.sh
new file mode 100755
index 0000000..d387eb6
--- /dev/null
+++ b/dev/remove_artifacts.sh
@@ -0,0 +1,84 @@
+#!/usr/bin/env bash
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+set -euo pipefail
+
+# Parameters:
+#
+# GITHUB_REPO - repository to delete the artifacts
+# GITHUB_USER - your personal user name
+# GITHUB_TOKEN - your personal token with `repo` scope
+#
+GITHUB_REPO=https://api.github.com/repos/apache/airflow
+readonly GITHUB_REPO
+
+if [[ -z ${GITHUB_USER:-} ]]; then
+    echo >&2
+    echo >&2 "Set GITHUB_USER variable to your user"
+    echo >&2
+    exit 1
+fi
+readonly GITHUB_USER
+
+if [[ -z ${GITHUB_TOKEN:-} ]]; then
+    echo >&2
+    echo >&2 "Set GITHUB_TOKEN variable to a token with 'repo' scope"
+    echo >&2
+    exit 2
+fi
+GITHUB_TOKEN=${GITHUB_TOKEN}
+readonly GITHUB_TOKEN
+
+function github_api_call() {
+    curl --silent --location --user "${GITHUB_USER}:${GITHUB_TOKEN}" "$@"
+}
+
+# A temporary file which receives HTTP response headers.
+TEMPFILE=$(mktemp)
+readonly TEMPFILE
+
+function loop_through_artifacts_and_delete() {
+
+    # Process all artifacts on this repository, loop on returned "pages".
+    artifact_url=${GITHUB_REPO}/actions/artifacts
+
+    while [[ -n "${artifact_url}" ]]; do
+        # Get current page, get response headers in a temporary file.
+        json=$(github_api_call --dump-header "${TEMPFILE}" "$artifact_url")
+
+        # Get artifact_url of next page. Will be empty if we are at the last page.
+        artifact_url=$(grep '^Link:' "$TEMPFILE" | tr ',' '\n' | \
+            grep 'rel="next"' | head -1 | sed -e 's/.*<//' -e 's/>.*//')
+        rm -f "${TEMPFILE}"
+
+        # Number of artifacts on this page:
+        count=$(($(jq <<<"${json}" -r '.artifacts | length')))
+
+        # Loop on all artifacts on this page.
+        for ((i = 0; "${i}" < "${count}"; i++)); do
+            # Get the name of artifact and count instances of this name
+            name=$(jq <<<"${json}" -r ".artifacts[$i].name?")
+            id=$(jq <<<"${json}" -r ".artifacts[$i].id?")
+            size=$(($(jq <<<"${json}" -r ".artifacts[$i].size_in_bytes?")))
+            printf "Deleting '%s': [%s] : %'d bytes\n" "${name}" "${id}" "${size}"
+            github_api_call -X DELETE "${GITHUB_REPO}/actions/artifacts/${id}"
+            sleep 1 # There is a Github API limit of 5000 calls/hr. This is to limit the API calls below that
+        done
+    done
+}
+
+loop_through_artifacts_and_delete
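
The pagination step in the script above relies on pulling the rel="next" URL out of the `Link` response header. The following is a minimal, offline sketch of that extraction, not the committed code: the function name `next_page_url` and the sample headers are made up for illustration. It differs from the script in two assumed-useful ways: it matches the header name case-insensitively, and it tolerates a missing match so that `set -e -o pipefail` does not abort on the last page.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print the rel="next" URL from HTTP response headers read on stdin,
# or print nothing when there is no next page.
function next_page_url() {
    # -i: header names may arrive lower-cased; "|| true" keeps a missing
    # match from aborting the script under "set -e -o pipefail".
    { grep -i '^link:' || true; } | tr ',' '\n' |
        { grep 'rel="next"' || true; } | head -1 |
        sed -e 's/.*<//' -e 's/>.*//'
}

# Hypothetical sample headers, made up for this example.
sample_headers='HTTP/2 200
link: <https://api.github.com/repositories/1/actions/artifacts?page=2>; rel="next", <https://api.github.com/repositories/1/actions/artifacts?page=9>; rel="last"'

next_page_url <<<"${sample_headers}"
```

For the sample headers this prints the `page=2` URL; for headers without a `Link` line it prints nothing, which is the condition the `while` loop in the script uses to stop.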
