Ottomata has uploaded a new change for review. ( https://gerrit.wikimedia.org/r/343753 )

Change subject: Load wiki project namespace map into HDFS weekly, sqoop mediawiki monthly
......................................................................

Load wiki project namespace map into HDFS weekly, sqoop mediawiki monthly

This starts the process of moving Hadoop crons off of analytics1027 and onto analytics1003 (T159527).
Bug: T160083

Change-Id: I08a39a5b68cb33dea5b60c6527dfa62d6f3a41e5
---
M manifests/site.pp
A modules/role/manifests/analytics_cluster/refinery/job/project_namespace_map.pp
A modules/role/manifests/analytics_cluster/refinery/job/sqoop_mediawiki.pp
3 files changed, 57 insertions(+), 1 deletion(-)


  git pull ssh://gerrit.wikimedia.org:29418/operations/puppet refs/changes/53/343753/1

diff --git a/manifests/site.pp b/manifests/site.pp
index 08861d7..405be0b 100644
--- a/manifests/site.pp
+++ b/manifests/site.pp
@@ -76,7 +76,12 @@
         analytics_cluster::oozie::server::database,
         analytics_cluster::hive::metastore,
         analytics_cluster::hive::server,
-        analytics_cluster::oozie::server)
+        analytics_cluster::oozie::server,
+
+        # analytics1003 also runs various crons that launch
+        # Hadoop jobs.
+        analytics_cluster::refinery::job::project_namespace_map,
+        analytics_cluster::refinery::job::sqoop_mediawiki)
 
     include ::standard
     include ::base::firewall
diff --git a/modules/role/manifests/analytics_cluster/refinery/job/project_namespace_map.pp b/modules/role/manifests/analytics_cluster/refinery/job/project_namespace_map.pp
new file mode 100644
index 0000000..b6397d3
--- /dev/null
+++ b/modules/role/manifests/analytics_cluster/refinery/job/project_namespace_map.pp
@@ -0,0 +1,22 @@
+# == Class role::analytics_cluster::refinery::job::project_namespace_map
+# Installs a weekly cron job to download the Wikimedia sitematrix project
+# namespace map file so that other refinery jobs know about what wiki projects
+# exist.
+#
+class role::analytics_cluster::refinery::job::project_namespace_map {
+    require ::role::analytics_cluster::refinery
+
+    # Shortcut var to DRY up cron commands.
+    $env = "export PYTHONPATH=\${PYTHONPATH}:${role::analytics_cluster::refinery::path}/python"
+
+    $output_directory = '/wmf/data/raw/mediawiki/project_namespace_map'
+
+    # This downloads the project namespace map for a 'labsdb' public import.
+    cron { 'refinery-download-project-namespace':
+        command => "${env} && ${role::analytics_cluster::refinery::path}/bin/download-project-namespace-map -x ${output_directory} -s \$(/bin/date '+%Y-%m')",
+        user    => 'hdfs',
+        minute  => '0',
+        hour    => '12',
+        weekday => '6', # Saturday
+    }
+}
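For reference, a sketch of what this weekly cron expands to at run time, assuming the date substitution behaves as written in the manifest (the binary path is abbreviated here for readability):

```shell
#!/bin/sh
# The cron's -s argument is the current year-month snapshot label,
# produced by the escaped \$(/bin/date '+%Y-%m') in the Puppet string.
snapshot=$(/bin/date '+%Y-%m')   # e.g. 2017-03

echo "download-project-namespace-map -x /wmf/data/raw/mediawiki/project_namespace_map -s ${snapshot}"
```

The `\$` escapes in the manifest matter: they keep Puppet from trying to interpolate the subshell, so `$(...)` is evaluated by the shell when cron runs the command, not at catalog compile time.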
diff --git a/modules/role/manifests/analytics_cluster/refinery/job/sqoop_mediawiki.pp b/modules/role/manifests/analytics_cluster/refinery/job/sqoop_mediawiki.pp
new file mode 100644
index 0000000..56e7582
--- /dev/null
+++ b/modules/role/manifests/analytics_cluster/refinery/job/sqoop_mediawiki.pp
@@ -0,0 +1,29 @@
+# == Class role::analytics_cluster::refinery::job::sqoop_mediawiki
+# Schedules sqoop to import MediaWiki databases into Hadoop monthly.
+# NOTE: This requires that role::analytics_cluster::mysql_password has
+# been included somewhere, so that /user/hdfs/mysql-analytics-research-client-pw.txt
+# exists in HDFS.  (We can't require it here, since it needs to only be
+# included once, on a different node.)
+#
+class role::analytics_cluster::refinery::job::sqoop_mediawiki {
+    require ::role::analytics_cluster::refinery
+
+    # Shortcut var to DRY up cron commands.
+    $env = "export PYTHONPATH=\${PYTHONPATH}:${role::analytics_cluster::refinery::path}/python"
+
+    $output_directory = '/wmf/data/raw/mediawiki/tables'
+    $wiki_file        = '/mnt/hdfs/wmf/refinery/current/static_data/mediawiki/grouped_wikis/labs_grouped_wikis.csv'
+    # We regularly sqoop out of labsdb so that data is pre-sanitized.
+    $db_host          = 'labsdb-analytics.eqiad.wmnet'
+    $db_user          = 'research'
+    $db_password_file = '/user/hdfs/mysql-analytics-research-client-pw.txt'
+
+    cron { 'refinery-sqoop-mediawiki':
+        command  => "${env} && /usr/bin/python3 ${role::analytics_cluster::refinery::path}/bin/sqoop-mediawiki-tables --job-name sqoop-mediawiki-monthly-\$(/bin/date '+%Y-%m') --labsdb --jdbc-host ${db_host} --output-dir ${output_directory} --wiki-file ${wiki_file} --user ${db_user} --password-file ${db_password_file} --timestamp \$(/bin/date '+%Y%m01000000') --snapshot \$(/bin/date '+%Y-%m')",
+        user     => 'hdfs',
+        minute   => '0',
+        hour     => '0',
+        # Start on the fifth day of every month.
+        monthday => '5',
+    }
+}
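The two date-derived arguments of the sqoop cron can be sketched the same way, assuming the cron's shell performs the substitutions: `--timestamp` is the first second of the current month in MediaWiki's 14-digit YYYYMMDDHHMMSS format, and `--snapshot` is the matching YYYY-MM label.

```shell
#!/bin/sh
# Both arguments come from the same run date, so a job launched any time
# during a month always imports data as of that month's first second.
timestamp=$(/bin/date '+%Y%m01000000')  # e.g. 20170301000000
snapshot=$(/bin/date '+%Y-%m')          # e.g. 2017-03

echo "--timestamp ${timestamp} --snapshot ${snapshot}"
```

Pinning the day-of-month to `01` and the time fields to `000000` in the format string (rather than using the actual run time) is what makes reruns of the job within the same month idempotent with respect to the cut-off point.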

-- 
To view, visit https://gerrit.wikimedia.org/r/343753
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I08a39a5b68cb33dea5b60c6527dfa62d6f3a41e5
Gerrit-PatchSet: 1
Gerrit-Project: operations/puppet
Gerrit-Branch: production
Gerrit-Owner: Ottomata <[email protected]>
