DCausse has submitted this change and it was merged. ( https://gerrit.wikimedia.org/r/388345 )

Change subject: Deploy MjoLniR with scap3
......................................................................


Deploy MjoLniR with scap3

Based roughly on the deployments for ORES and Striker. Each machine
the repo is deployed to builds a virtualenv from the provided wheels
and then installs MjoLniR into it. MjoLniR itself isn't deployed as a
wheel; this allows for some laziness in versioning, since a release
only requires a submodule bump (and possibly new wheels, if
dependencies change).

Python wheels are essentially zip files, so they are deployed to
archiva with maven and a stdlib-only python script. Wheels are placed
in the mirrored repository under org.wikimedia.python, with the
artifact ID set to the package name and the version set to the
remainder of the wheel filename. Re-deploying existing assets could
potentially cause issues, so the upload script refuses to do so, and
instead re-downloads the wheel from archiva when the SHA-1s
don't match.
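As an illustration of that naming scheme, the mapping from a wheel filename to its archiva URL can be sketched as follows. This mirrors make_url() in upload_wheels.py in this change; the repository URL and group id are taken from that script, and the function names here are illustrative only:

```python
import os

# Values taken from upload_wheels.py in this change; illustrative only.
REPO = 'https://archiva.wikimedia.org/repository/python/'
GROUP_ID = 'python'


def wheel_coordinates(path):
    # The package name is everything before the first '-'; the remainder of
    # the filename (version plus python/abi/platform tags) becomes the
    # maven "version".
    fname = os.path.basename(path)
    artifact, version = os.path.splitext(fname)[0].split('-', 1)
    return artifact, version


def make_url(artifact, version):
    return '%s%s/%s/%s/%s-%s.whl' % (
        REPO, GROUP_ID.replace('.', '/'), artifact, version, artifact, version)


artifact, version = wheel_coordinates('artifacts/six-1.11.0-py2.py3-none-any.whl')
print(artifact)   # six
print(version)    # 1.11.0-py2.py3-none-any
print(make_url(artifact, version))
# https://archiva.wikimedia.org/repository/python/python/six/1.11.0-py2.py3-none-any/six-1.11.0-py2.py3-none-any.whl
```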

Change-Id: I420c8942507fc06c516df09e74e17b90ebf7d2c8
---
A .gitattributes
A .gitfat
A .gitignore
A .gitmodules
A .gitreview
A README.md
A artifacts/PyYAML-3.12-cp27-cp27mu-linux_x86_64.whl
A artifacts/certifi-2017.11.5-py2.py3-none-any.whl
A artifacts/chardet-3.0.4-py2.py3-none-any.whl
A artifacts/clickmodels-1.0.2-cp27-none-any.whl
A artifacts/decorator-4.1.2-py2.py3-none-any.whl
A artifacts/future-0.16.0-cp27-none-any.whl
A artifacts/hyperopt-0.1-cp27-none-any.whl
A artifacts/idna-2.6-py2.py3-none-any.whl
A artifacts/kafka-1.3.5-py2.py3-none-any.whl
A artifacts/networkx-1.11-py2.py3-none-any.whl
A artifacts/nose-1.3.7-py2-none-any.whl
A artifacts/numpy-1.13.3-cp27-cp27mu-manylinux1_x86_64.whl
A artifacts/py4j-0.10.6-py2.py3-none-any.whl
A artifacts/pymongo-3.5.1-cp27-cp27mu-manylinux1_x86_64.whl
A artifacts/requests-2.18.4-py2.py3-none-any.whl
A artifacts/scipy-1.0.0-cp27-cp27mu-manylinux1_x86_64.whl
A artifacts/six-1.11.0-py2.py3-none-any.whl
A artifacts/urllib3-1.22-py2.py3-none-any.whl
A make_wheels.sh
A requirements-frozen.txt
A scap/checks.yaml
A scap/checks/virtualenv.sh
A scap/scap.cfg
A spark.yaml
A src
A upload_wheels.py
32 files changed, 436 insertions(+), 0 deletions(-)

Approvals:
  DCausse: Verified; Looks good to me, approved



diff --git a/.gitattributes b/.gitattributes
new file mode 100644
index 0000000..de078af
--- /dev/null
+++ b/.gitattributes
@@ -0,0 +1 @@
+*.whl filter=fat -text
diff --git a/.gitfat b/.gitfat
new file mode 100644
index 0000000..bb2c45d
--- /dev/null
+++ b/.gitfat
@@ -0,0 +1,3 @@
+[rsync]
+       remote = archiva.wikimedia.org::archiva/git-fat
+       options = --copy-links --verbose
diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..ff82abc
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,5 @@
+.*.sw?
+/_build
+*.pyc
+__pycache__/
+
diff --git a/.gitmodules b/.gitmodules
new file mode 100644
index 0000000..da5937d
--- /dev/null
+++ b/.gitmodules
@@ -0,0 +1,3 @@
+[submodule "MjoLniR"]
+       path = src
+       url = https://gerrit.wikimedia.org/r/search/MjoLniR
diff --git a/.gitreview b/.gitreview
new file mode 100644
index 0000000..c9e9736
--- /dev/null
+++ b/.gitreview
@@ -0,0 +1,6 @@
+[gerrit]
+host=gerrit.wikimedia.org
+port=29418
+project=search/MjoLniR/deploy.git
+defaultbranch=master
+defaultrebase=0
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..1a1ba47
--- /dev/null
+++ b/README.md
@@ -0,0 +1,4 @@
+scap3 deployment configuration for MjoLniR data pipeline
+
+Wheel upload requires setting up maven with archiva configuration.
+See https://wikitech.wikimedia.org/wiki/Archiva for instructions.
diff --git a/artifacts/PyYAML-3.12-cp27-cp27mu-linux_x86_64.whl b/artifacts/PyYAML-3.12-cp27-cp27mu-linux_x86_64.whl
new file mode 100644
index 0000000..780d8e6
--- /dev/null
+++ b/artifacts/PyYAML-3.12-cp27-cp27mu-linux_x86_64.whl
@@ -0,0 +1 @@
+#$# git-fat b131857bb1a402f3686e3e2a07256d6515c6b066                44225
diff --git a/artifacts/certifi-2017.11.5-py2.py3-none-any.whl b/artifacts/certifi-2017.11.5-py2.py3-none-any.whl
new file mode 100644
index 0000000..a0559eb
--- /dev/null
+++ b/artifacts/certifi-2017.11.5-py2.py3-none-any.whl
@@ -0,0 +1 @@
+#$# git-fat 0e09a6f85d5f93ec1012ae8de014af2fecb7e2ea               330630
diff --git a/artifacts/chardet-3.0.4-py2.py3-none-any.whl b/artifacts/chardet-3.0.4-py2.py3-none-any.whl
new file mode 100644
index 0000000..9083f3d
--- /dev/null
+++ b/artifacts/chardet-3.0.4-py2.py3-none-any.whl
@@ -0,0 +1 @@
+#$# git-fat 96faab7de7e9a71b37f22adb64daf2898e967e3e               133356
diff --git a/artifacts/clickmodels-1.0.2-cp27-none-any.whl b/artifacts/clickmodels-1.0.2-cp27-none-any.whl
new file mode 100644
index 0000000..072552b
--- /dev/null
+++ b/artifacts/clickmodels-1.0.2-cp27-none-any.whl
@@ -0,0 +1 @@
+#$# git-fat fc78d454351523d076a4127065a4811637c953c6                17131
diff --git a/artifacts/decorator-4.1.2-py2.py3-none-any.whl b/artifacts/decorator-4.1.2-py2.py3-none-any.whl
new file mode 100644
index 0000000..f69366e
--- /dev/null
+++ b/artifacts/decorator-4.1.2-py2.py3-none-any.whl
@@ -0,0 +1 @@
+#$# git-fat fa479ca92654d5325e224f857a3984953968252f                 9052
diff --git a/artifacts/future-0.16.0-cp27-none-any.whl b/artifacts/future-0.16.0-cp27-none-any.whl
new file mode 100644
index 0000000..242596d
--- /dev/null
+++ b/artifacts/future-0.16.0-cp27-none-any.whl
@@ -0,0 +1 @@
+#$# git-fat 95a3a2896c8f579d7def995d8d48958a43a5a8c1               500616
diff --git a/artifacts/hyperopt-0.1-cp27-none-any.whl b/artifacts/hyperopt-0.1-cp27-none-any.whl
new file mode 100644
index 0000000..ebc8c0e
--- /dev/null
+++ b/artifacts/hyperopt-0.1-cp27-none-any.whl
@@ -0,0 +1 @@
+#$# git-fat 02ef38d5ea21e69375984d3a971d6dccc9882802               116361
diff --git a/artifacts/idna-2.6-py2.py3-none-any.whl b/artifacts/idna-2.6-py2.py3-none-any.whl
new file mode 100644
index 0000000..4dee030
--- /dev/null
+++ b/artifacts/idna-2.6-py2.py3-none-any.whl
@@ -0,0 +1 @@
+#$# git-fat a75f31778ea0bbf218d7ae085f4f961d004d6ff2                56450
diff --git a/artifacts/kafka-1.3.5-py2.py3-none-any.whl b/artifacts/kafka-1.3.5-py2.py3-none-any.whl
new file mode 100644
index 0000000..5a75394
--- /dev/null
+++ b/artifacts/kafka-1.3.5-py2.py3-none-any.whl
@@ -0,0 +1 @@
+#$# git-fat 28587187768dbcef65be4843d5c4c1ae55f4a0b8               207188
diff --git a/artifacts/networkx-1.11-py2.py3-none-any.whl b/artifacts/networkx-1.11-py2.py3-none-any.whl
new file mode 100644
index 0000000..441dfa5
--- /dev/null
+++ b/artifacts/networkx-1.11-py2.py3-none-any.whl
@@ -0,0 +1 @@
+#$# git-fat 3209bca45fb613b7a4507cff1927b1fd44622e6c              1317927
diff --git a/artifacts/nose-1.3.7-py2-none-any.whl b/artifacts/nose-1.3.7-py2-none-any.whl
new file mode 100644
index 0000000..4924c1c
--- /dev/null
+++ b/artifacts/nose-1.3.7-py2-none-any.whl
@@ -0,0 +1 @@
+#$# git-fat 0b50cb2fd834bc4e206eea830af6445135cc7a5c               154663
diff --git a/artifacts/numpy-1.13.3-cp27-cp27mu-manylinux1_x86_64.whl b/artifacts/numpy-1.13.3-cp27-cp27mu-manylinux1_x86_64.whl
new file mode 100644
index 0000000..ac2d457
--- /dev/null
+++ b/artifacts/numpy-1.13.3-cp27-cp27mu-manylinux1_x86_64.whl
@@ -0,0 +1 @@
+#$# git-fat a37b66da64a4b5626c837acdaadbd29f2b4b7d84             16645838
diff --git a/artifacts/py4j-0.10.6-py2.py3-none-any.whl b/artifacts/py4j-0.10.6-py2.py3-none-any.whl
new file mode 100644
index 0000000..5065b14
--- /dev/null
+++ b/artifacts/py4j-0.10.6-py2.py3-none-any.whl
@@ -0,0 +1 @@
+#$# git-fat 8a81705af03037a0f4d327df4e3bcaabbaa10da8               189601
diff --git a/artifacts/pymongo-3.5.1-cp27-cp27mu-manylinux1_x86_64.whl b/artifacts/pymongo-3.5.1-cp27-cp27mu-manylinux1_x86_64.whl
new file mode 100644
index 0000000..b278f85
--- /dev/null
+++ b/artifacts/pymongo-3.5.1-cp27-cp27mu-manylinux1_x86_64.whl
@@ -0,0 +1 @@
+#$# git-fat 00677115f8c9d488c0e9e2b21099401fd96ffd28               368294
diff --git a/artifacts/requests-2.18.4-py2.py3-none-any.whl b/artifacts/requests-2.18.4-py2.py3-none-any.whl
new file mode 100644
index 0000000..a1b949d
--- /dev/null
+++ b/artifacts/requests-2.18.4-py2.py3-none-any.whl
@@ -0,0 +1 @@
+#$# git-fat 52ccdd6ee808bddd0c6eabc6eda79e79381266df                88704
diff --git a/artifacts/scipy-1.0.0-cp27-cp27mu-manylinux1_x86_64.whl b/artifacts/scipy-1.0.0-cp27-cp27mu-manylinux1_x86_64.whl
new file mode 100644
index 0000000..642ba1d
--- /dev/null
+++ b/artifacts/scipy-1.0.0-cp27-cp27mu-manylinux1_x86_64.whl
@@ -0,0 +1 @@
+#$# git-fat b6179c66388f877320afa8be7999a9a2768e4155             46719666
diff --git a/artifacts/six-1.11.0-py2.py3-none-any.whl b/artifacts/six-1.11.0-py2.py3-none-any.whl
new file mode 100644
index 0000000..4d82f7c
--- /dev/null
+++ b/artifacts/six-1.11.0-py2.py3-none-any.whl
@@ -0,0 +1 @@
+#$# git-fat fa2683a24d4a7422add33400048fc375b2afe57b                10702
diff --git a/artifacts/urllib3-1.22-py2.py3-none-any.whl b/artifacts/urllib3-1.22-py2.py3-none-any.whl
new file mode 100644
index 0000000..769bfb5
--- /dev/null
+++ b/artifacts/urllib3-1.22-py2.py3-none-any.whl
@@ -0,0 +1 @@
+#$# git-fat ae6715ae61c34b72d5e0c3241abfb20c2c4d1313               132332
diff --git a/make_wheels.sh b/make_wheels.sh
new file mode 100755
index 0000000..7ad71c5
--- /dev/null
+++ b/make_wheels.sh
@@ -0,0 +1,30 @@
+#!/usr/bin/env bash
+
+set -e
+set -o errexit
+set -o nounset
+set -o pipefail
+
+BASE="$(realpath $(dirname $0))"
+BUILD="${BASE}/_build"
+VENV="${BUILD}/venv"
+MJOLNIR="${BASE}/src"
+WHEEL_DIR="${BASE}/artifacts"
+REQUIREMENTS="${BASE}/requirements-frozen.txt"
+
+PIP="${VENV}/bin/pip"
+
+# Used by wheel >= 0.25 to normalize timestamps. Timestamp
+# taken from original debian patch:
+# https://bugs.debian.org/cgi-bin/bugreport.cgi?att=1;bug=776026;filename=wheel_reproducible.patch;msg=5
+export SOURCE_DATE_EPOCH=315576060
+
+mkdir -p "${VENV}"
+virtualenv --python python2.7 $VENV || /bin/true
+$PIP install "${MJOLNIR}"
+$PIP freeze --local | grep -v mjolnir | grep -v pkg-resources > $REQUIREMENTS
+$PIP install wheel
+$PIP wheel --find-links "${WHEEL_DIR}" \
+        --wheel-dir "${WHEEL_DIR}" \
+        --requirement "${REQUIREMENTS}"
+
diff --git a/requirements-frozen.txt b/requirements-frozen.txt
new file mode 100644
index 0000000..32a4c7c
--- /dev/null
+++ b/requirements-frozen.txt
@@ -0,0 +1,18 @@
+certifi==2017.11.5
+chardet==3.0.4
+clickmodels==1.0.2
+decorator==4.1.2
+future==0.16.0
+hyperopt==0.1
+idna==2.6
+kafka==1.3.5
+networkx==1.11
+nose==1.3.7
+numpy==1.13.3
+py4j==0.10.6
+pymongo==3.5.1
+PyYAML==3.12
+requests==2.18.4
+scipy==1.0.0
+six==1.11.0
+urllib3==1.22
diff --git a/scap/checks.yaml b/scap/checks.yaml
new file mode 100644
index 0000000..8255555
--- /dev/null
+++ b/scap/checks.yaml
@@ -0,0 +1,8 @@
+checks:
+    virtualenv:
+        type: command
+        stage: promote
+        timeout: 300
+        group: default
+        command: bash /srv/deployment/search/mjolnir/deploy/scap/checks/virtualenv.sh
+
diff --git a/scap/checks/virtualenv.sh b/scap/checks/virtualenv.sh
new file mode 100644
index 0000000..eb05944
--- /dev/null
+++ b/scap/checks/virtualenv.sh
@@ -0,0 +1,34 @@
+set -e
+set -o errexit
+set -o nounset
+set -o pipefail
+
+BASE_DIR="/srv/deployment/search/mjolnir"
+VENV="${BASE_DIR}/venv"
+DEPLOY_DIR="${BASE_DIR}/deploy"
+MJOLNIR_DIR="${BASE_DIR}/deploy/src"
+WHEEL_DIR="${DEPLOY_DIR}/artifacts"
+REQUIREMENTS="${DEPLOY_DIR}/requirements-frozen.txt"
+MJOLNIR_ZIP="${BASE_DIR}/mjolnir_venv.zip"
+
+PIP="${VENV}/bin/pip"
+
+# Ensure that the virtual environment exists
+mkdir -p "$VENV"
+virtualenv --never-download --python python2.7 $VENV || /bin/true
+
+# Install or upgrade our packages
+$PIP install \
+    --no-cache-dir \
+    --no-index \
+    --find-links "${WHEEL_DIR}" \
+    --upgrade \
+    --force-reinstall \
+    -r "${REQUIREMENTS}"
+
+$PIP install --upgrade --no-deps "${MJOLNIR_DIR}"
+
+# Build a .zip of the virtualenv that can be shipped to spark workers
+cd "${VENV}"
+zip -qr ${MJOLNIR_ZIP}.tmp .
+mv -T ${MJOLNIR_ZIP}.tmp ${MJOLNIR_ZIP}
diff --git a/scap/scap.cfg b/scap/scap.cfg
new file mode 100644
index 0000000..23f370e
--- /dev/null
+++ b/scap/scap.cfg
@@ -0,0 +1,8 @@
+[global]
+git_repo: search/MjoLniR/deploy
+ssh_user: deploy-service
+server_groups: analytics, relforge
+analytics_dsh_targets: discovery-analytics
+relforge_dsh_targets: relforge
+git_submodules: True
+git_fat: True
diff --git a/spark.yaml b/spark.yaml
new file mode 100644
index 0000000..3e687bd
--- /dev/null
+++ b/spark.yaml
@@ -0,0 +1,193 @@
+# Configuration shared by all training groups and commands
+working_dir: /srv/deployment/search/mjolnir
+global:
+    environment:
+        PYSPARK_PYTHON: venv/bin/python
+        SPARK_HOME: "/usr/lib/spark2"
+    template_vars:
+        spark_version: 2.1.2
+        # Path to the spark-submit application
+        spark_submit: "%(SPARK_HOME)s/bin/spark-submit"
+        # Local path to zip'd virtualenv which will be shipped to executors
+        mjolnir_venv_zip: "%(working_dir)s/mjolnir_venv.zip"
+        # Local path to python script for running mjolnir utilities
+        mjolnir_utility_path: "%(working_dir)s/venv/bin/mjolnir-utilities.py"
+        # Path inside hdfs to the training data
+        training_data_path: "user/%(USER)s/mjolnir/%(marker)s"
+        # Fully qualified HDFS path to the training data
+        hdfs_training_data_path: "hdfs://analytics-hadoop/%(training_data_path)s"
+        # Fully qualified local path to the training data
+        local_training_data_path: "/mnt/hdfs/%(training_data_path)s"
+        # Base directory used to build path to write training output to
+        base_training_output_dir: "%(HOME)s/training_size"
+        # Number of cpu cores to assign per task. Must be a multiple of
+        # cores_per_executor. Spark can't take advantage of this being > 1, but
+        # xgboost can.
+        cores_per_task: 1
+        # Number of cpu cores to request per executor. If cores_per_task is
+        # less than this spark will run multiple tasks per executor in separate
+        # threads.
+        cores_per_executor: 1
+        # Size of JVM heap on executors
+        executor_memory: 2G
+        # Additional memory allocated by yarn beyond executor_memory. This
+        # accounts for off-heap data structures both in the JVM itself, and
+        # those created via JNI for xgboost. Primarily this is set here so it
+        # can be overridden from the command line.
+        executor_memory_overhead: 512
+        # Used by the data pipeline to decide the minimum number of sessions
+        # to group together. Setting this too high on low volume wikis will
+        # result in little to no training data.
+        min_sessions_per_query: 10
+    # Files that must exist to run
+    paths:
+        file_exist: !!set
+            ? "%(mjolnir_venv_zip)s"
+            ? "%(mjolnir_utility_path)s"
+            ? "%(spark_submit)s"
+            ? "%(PYSPARK_PYTHON)s"
+    spark_args:
+        master: yarn
+        # TODO: When is this necessary?
+        files: /usr/lib/libhdfs.so.0.0.0
+        # Ship the mjolnir virtualenv to executors and decompress it to ./venv
+        archives: "%(mjolnir_venv_zip)s#venv"
+        executor-cores: "%(cores_per_executor)s"
+        executor-memory: "%(executor_memory)s"
+        # Source our jvm dependencies from archiva.
+        repositories: https://archiva.wikimedia.org/repository/releases,https://archiva.wikimedia.org/repository/snapshots,https://archiva.wikimedia.org/repository/mirrored
+        packages: ml.dmlc:xgboost4j-spark:0.7-wmf-1,org.wikimedia.search:mjolnir:0.2,org.apache.spark:spark-streaming-kafka-0-8_2.11:%(spark_version)s
+    spark_conf:
+        spark.task.cpus: "%(cores_per_task)s"
+        spark.yarn.executor.memoryOverhead: "%(executor_memory_overhead)s"
+        # While undesirable, we can't disable the public (maven central) repository
+        # until spark 2.2, which depends on java 8 (and our cluster is still on java 7)
+        spark.driver.extraJavaOptions: "-Dhttp.proxyHost=webproxy.eqiad.wmnet -Dhttp.proxyPort=8080 -Dhttps.proxyHost=webproxy.eqiad.wmnet -Dhttps.proxyPort=8080"
+    commands:
+        pyspark:
+            spark_command: "%(SPARK_HOME)s/bin/pyspark"
+        # Shell used to test model training
+        pyspark_train:
+            spark_command: "%(SPARK_HOME)s/bin/pyspark"
+            template_vars:
+                cores_per_executor: 4
+                cores_per_task: 4
+                executor_memory: 2G
+                executor_memory_overhead: 6144
+        data_pipeline:
+            spark_command: "%(SPARK_HOME)s/bin/spark-submit"
+            mjolnir_utility_path: "%(mjolnir_utility_path)s"
+            mjolnir_utility: data_pipeline
+            paths:
+                dir_not_exist: !!set
+                    ? "%(local_training_data_path)s"
+            cmd_args:
+                input: hdfs://analytics-hadoop/wmf/data/discovery/query_clicks/daily/year=*/month=*/day=*
+                output-dir: "%(hdfs_training_data_path)s"
+                # Maximum number of training observations per-wiki. We usually
+                # get a bit less than requested; 35M turns into 25 or 30M.
+                samples-per-wiki: 35000000
+                search-cluster: codfw
+                min-sessions: "%(min_sessions_per_query)s"
+        training_pipeline:
+            spark_command: "%(SPARK_HOME)s/bin/spark-submit"
+            mjolnir_utility_path: "%(mjolnir_utility_path)s"
+            mjolnir_utility: training_pipeline
+            paths:
+                dir_exist: !!set
+                    # TODO: Would be nice if we could specify paths.dir_exist
+                    # for the input training data, but it's evaluated before
+                    # data_pipeline is called when doing collect_and_train.
+                    ? "%(base_training_output_dir)s"
+            spark_args:
+                driver-memory: 3G
+            spark_conf:
+                # Disabling auto broadcast join prevents memory explosion when
+                # spark mis-predicts the size of a dataframe. (where does this
+                # happen?)
+                spark.sql.autoBroadcastJoinThreshold: -1
+                # Adjusting the executor idle timeout up from 60s to 180s is a
+                # bit greedy, but prevents a whole bunch of log spam from
+                # spark killing executors between CV runs
+                spark.dynamicAllocation.executorIdleTimeout: 180s
+            cmd_args:
+                input: "%(hdfs_training_data_path)s"
+                output: "%(base_training_output_dir)s/%(marker)s_%(profile_name)s"
+
+# Individual training groups
+profiles:
+    large:
+        # 12M to 30M observations. 4M to 12M per executor.
+        # Approximately 63 executors, 378 cores, 756GB memory
+        wikis:
+            - enwiki
+            - dewiki
+        commands:
+            training_pipeline:
+                template_vars:
+                    cores_per_executor: 6
+                    cores_per_task: 6
+                    executor_memory: 3G
+                    executor_memory_overhead: 9216
+                spark_conf:
+                    spark.dynamicAllocation.maxExecutors: 65
+                cmd_args:
+                    workers: 3
+                    cv-jobs: 22
+                    folds: 3
+                    final-trees: 100
+
+    medium:
+        # 4M to 12M observations per executor.
+        # Approximately 70 executors, 420 cores, 840GB memory
+        wikis:
+            - itwiki
+            - ptwiki
+            - frwiki
+            - ruwiki
+        commands:
+            training_pipeline:
+                template_vars:
+                    cores_per_executor: 6
+                    cores_per_task: 6
+                    executor_memory: 3G
+                    executor_memory_overhead: 9216
+                spark_conf:
+                    spark.dynamicAllocation.maxExecutors: 75
+                cmd_args:
+                    workers: 1
+                    cv-jobs: 70
+                    folds: 5
+                    final-trees: 100
+
+    small:
+        # 100k to 4M observations per executor. Way overprovisioned
+        # Approximately 100 executors, 400 cores, 800G memory.
+        wikis:
+            - svwiki
+            - fawiki
+            - idwiki
+            - viwiki
+            - nowiki
+            - hewiki
+            - kowiki
+            - fiwiki
+            - jawiki
+            - arwiki
+            - itwiki
+            - nlwiki
+            - zhwiki
+            - plwiki
+        commands:
+            training_pipeline:
+                template_vars:
+                    cores_per_executor: 4
+                    cores_per_task: 4
+                    executor_memory: 2G
+                    executor_memory_overhead: 6144
+                spark_conf:
+                    spark.dynamicAllocation.maxExecutors: 105
+                cmd_args:
+                    workers: 1
+                    cv-jobs: 100
+                    folds: 5
+                    final-trees: 500
diff --git a/src b/src
new file mode 160000
index 0000000..96337a0
--- /dev/null
+++ b/src
@@ -0,0 +1 @@
+Subproject commit 96337a0ab1931278f93b752ca3be5f30e8124762
diff --git a/upload_wheels.py b/upload_wheels.py
new file mode 100755
index 0000000..e16453b
--- /dev/null
+++ b/upload_wheels.py
@@ -0,0 +1,104 @@
+#!/usr/bin/env python
+"""
+Uploads python wheels to archiva
+
+Usage:
+    upload-wheels.py wheels/*.whl
+"""
+
+from __future__ import print_function
+import hashlib
+import os
+import subprocess
+import sys
+import urllib
+
+try:
+    # python 2.x
+    import urllib2
+except ImportError:
+    # python 3.x. This isn't the entirety
+    # of urllib2, but it's enough for us.
+    import urllib.request as urllib2
+
+
+DRY_RUN=False
+REPO='https://archiva.wikimedia.org/repository/python/'
+GROUP_ID='python'
+
+
+def make_url(artifact, version):
+    return '%s%s/%s/%s/%s-%s.whl' % (
+        REPO, GROUP_ID.replace('.', '/'), artifact, version, artifact, version)
+
+
+def url_exists(url):
+    request = urllib2.Request(url)
+    request.get_method = lambda: 'HEAD'
+    try:
+        response = urllib2.urlopen(request)
+        return response.code == 200
+    except urllib2.HTTPError:
+        # 404, others?
+        return False
+
+
+def fetch_url(url):
+    res = urllib2.urlopen(url)
+    return res.read()
+
+
+def calc_sha1(path):
+    sha1 = hashlib.sha1()
+    with open(path, 'rb') as f:
+        while True:
+            data = f.read(65536)
+            if not data:
+                break
+            sha1.update(data)
+    return sha1.hexdigest()
+
+
+def mvn_deploy_file(**kwargs):
+    cmd = ['mvn', 'deploy:deploy-file'] + ['-D%s=%s' % x for x in kwargs.items()]
+    print(cmd)
+    if not DRY_RUN:
+        subprocess.check_call(cmd)
+
+
+if __name__ == "__main__":
+    args = sys.argv[1:]
+    if not len(args) or args[0] in ("-h", "--help"):
+        print(__doc__ + "\n")
+        sys.exit(1)
+
+    for path in sys.argv[1:]:
+        fname = os.path.basename(path)
+        artifact, version = os.path.splitext(fname)[0].split('-', 1)
+        url = make_url(artifact, version)
+
+        # Git-fat stores the sha1sum of a package. Sadly wheel creation is not
+        # bit for bit reproducable, so it's possible the same version is
+        # already on archiva but with a different sha1. In that case download
+        # the remote version rather than failing to replace the file already
+        # there.
+        #
+        # The wheel file creation itself is deterministic in wheel >= 0.26 when
+        # SOURCE_DATE_EPOCH is set, but any C compilation is non-deterministic.
+        if url_exists(url):
+            # TODO: Archiva only provides sha1 and md5, which are both unsafe.
+            # We could always download but that might be a few hundred MB.
+            repo_sha1 = fetch_url(url + '.sha1').decode('ascii').split(' ')[0]
+            local_sha1 = calc_sha1(path)
+            if repo_sha1 == local_sha1:
+                print("Already deployed to repo: %s" % (path))
+            else:
+                print("Remote wheel does not match local: %s (remote) != %s (local)" % (repo_sha1, local_sha1))
+                print("Downloading repo wheel from: %s" % (url))
+                if not DRY_RUN:
+                    urllib.urlretrieve(url, path)
+        else:
+            mvn_deploy_file(
+                repositoryId='wikimedia.python', url=REPO, file=path,
+                groupId=GROUP_ID, artifactId=artifact, version=version,
+                generatePom=False, packaging="whl")

-- 
To view, visit https://gerrit.wikimedia.org/r/388345
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I420c8942507fc06c516df09e74e17b90ebf7d2c8
Gerrit-PatchSet: 10
Gerrit-Project: search/MjoLniR/deploy
Gerrit-Branch: master
Gerrit-Owner: EBernhardson <[email protected]>
Gerrit-Reviewer: Awight <[email protected]>
Gerrit-Reviewer: DCausse <[email protected]>
Gerrit-Reviewer: Gehel <[email protected]>
Gerrit-Reviewer: Giuseppe Lavagetto <[email protected]>
Gerrit-Reviewer: Thcipriani <[email protected]>

_______________________________________________
MediaWiki-commits mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits
