Ottomata has submitted this change and it was merged. (
https://gerrit.wikimedia.org/r/343753 )
Change subject: Load wiki project namespace map into HDFS weekly, sqoop
mediawiki monthly
......................................................................
Load wiki project namespace map into HDFS weekly, sqoop mediawiki monthly
This starts the process of moving Hadoop crons off of analytics1027 and onto
analytics1003 (T159527).
Bug: T160083
Change-Id: I08a39a5b68cb33dea5b60c6527dfa62d6f3a41e5
---
M manifests/site.pp
A modules/role/manifests/analytics_cluster/refinery/job/project_namespace_map.pp
A modules/role/manifests/analytics_cluster/refinery/job/sqoop_mediawiki.pp
3 files changed, 58 insertions(+), 1 deletion(-)
Approvals:
Ottomata: Verified; Looks good to me, approved
diff --git a/manifests/site.pp b/manifests/site.pp
index 72cb7ce..662bb91 100644
--- a/manifests/site.pp
+++ b/manifests/site.pp
@@ -76,7 +76,13 @@
analytics_cluster::oozie::server::database,
analytics_cluster::hive::metastore,
analytics_cluster::hive::server,
- analytics_cluster::oozie::server)
+ analytics_cluster::oozie::server,
+
+ # analytics1003 also runs various crons that launch
+ # Hadoop jobs.
+ analytics_cluster::refinery,
+ analytics_cluster::refinery::job::project_namespace_map,
+ analytics_cluster::refinery::job::sqoop_mediawiki)
include ::standard
include ::base::firewall
diff --git a/modules/role/manifests/analytics_cluster/refinery/job/project_namespace_map.pp b/modules/role/manifests/analytics_cluster/refinery/job/project_namespace_map.pp
new file mode 100644
index 0000000..9ffde94
--- /dev/null
+++ b/modules/role/manifests/analytics_cluster/refinery/job/project_namespace_map.pp
@@ -0,0 +1,22 @@
+# == Class role::analytics_cluster::refinery::job::project_namespace_map
+# Installs a weekly cron job to download the Wikimedia sitematrix project
+# namespace map file so that other refinery jobs know about what wiki projects
+# exist.
+#
+class role::analytics_cluster::refinery::job::project_namespace_map {
+ require ::role::analytics_cluster::refinery
+
+ # Shortcut var to DRY up cron commands.
+ $env = "export PYTHONPATH=\${PYTHONPATH}:${role::analytics_cluster::refinery::path}/python"
+
+ $output_directory = '/wmf/data/raw/mediawiki/project_namespace_map'
+
+ # This downloads the project namespace map for a 'labsdb' public import.
+ cron { 'refinery-download-project-namespace':
+ command => "${env} && ${role::analytics_cluster::refinery::path}/bin/download-project-namespace-map -x ${output_directory} -s \$(/bin/date '+%Y-%m')",
+ user => 'hdfs',
+ minute => '0',
+ hour => '12',
+ weekday => '6', # Saturday
+ }
+}
diff --git a/modules/role/manifests/analytics_cluster/refinery/job/sqoop_mediawiki.pp b/modules/role/manifests/analytics_cluster/refinery/job/sqoop_mediawiki.pp
new file mode 100644
index 0000000..fe52931
--- /dev/null
+++ b/modules/role/manifests/analytics_cluster/refinery/job/sqoop_mediawiki.pp
@@ -0,0 +1,29 @@
+# == Class role::analytics_cluster::refinery::job::sqoop_mediawiki
+# Schedules sqoop to import MediaWiki databases into Hadoop monthly.
+# NOTE: This requires that role::analytics_cluster::mysql_password has
+# been included somewhere, so that /user/hdfs/mysql-analytics-research-client-pw.txt
+# exists in HDFS. (We can't require it here, since it needs to only be included once
+# on a different node.)
+#
+class role::analytics_cluster::refinery::job::sqoop_mediawiki {
+ require ::role::analytics_cluster::refinery
+
+ # Shortcut var to DRY up cron commands.
+ $env = "export PYTHONPATH=\${PYTHONPATH}:${role::analytics_cluster::refinery::path}/python"
+
+ $output_directory = '/wmf/data/raw/mediawiki/tables'
+ $wiki_file = '/mnt/hdfs/wmf/refinery/current/static_data/mediawiki/grouped_wikis/labs_grouped_wikis.csv'
+ # We regularly sqoop out of labsdb so that data is pre-sanitized.
+ $db_host = 'labsdb-analytics.eqiad.wmnet'
+ $db_user = 's53272'
+ $db_password_file = '/user/hdfs/mysql-analytics-research-client-pw.txt'
+
+ cron { 'refinery-sqoop-mediawiki':
+ command => "${env} && /usr/bin/python3 ${role::analytics_cluster::refinery::path}/bin/sqoop-mediawiki-tables --job-name sqoop-mediawiki-monthly-$(/bin/date '+%Y-%m') --labs --jdbc-host ${db_host} --output-dir ${$output_directory} --wiki-file ${wiki_file} --user ${db_user} --password-file ${db_password_file} --timestamp \$(/bin/date '+%Y%m01000000') --snapshot \$(/bin/date '+%Y-%m')",
+ user => 'hdfs',
+ minute => '0',
+ hour => '0',
+ # Start on the second day of every month.
+ day => '2',
+ }
+}
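
The sqoop cron command above builds three of its arguments from /bin/date format strings. As a rough illustration only, the sketch below renders the same format strings in Python for a given run time. The helper name sqoop_date_args and the example date are assumptions made for this sketch and are not part of the change; it mirrors only the date formatting, not the behaviour of sqoop-mediawiki-tables.

```python
from datetime import datetime


def sqoop_date_args(now: datetime) -> dict:
    """Render the date-derived arguments using the same strftime formats
    that the cron command passes to /bin/date (hypothetical helper)."""
    month = now.strftime('%Y-%m')
    return {
        # --job-name sqoop-mediawiki-monthly-$(/bin/date '+%Y-%m')
        'job_name': f'sqoop-mediawiki-monthly-{month}',
        # --timestamp $(/bin/date '+%Y%m01000000'); the trailing digits are literal
        'timestamp': now.strftime('%Y%m01000000'),
        # --snapshot $(/bin/date '+%Y-%m')
        'snapshot': month,
    }


# The cron resource fires at 00:00 on day 2 of each month; the date is an example.
args = sqoop_date_args(datetime(2017, 4, 2, 0, 0))
for name, value in args.items():
    print(f'{name}: {value}')
```

Because the job runs on the second day of the month, all three values are derived from the month in which the job starts.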
--
To view, visit https://gerrit.wikimedia.org/r/343753
Gerrit-MessageType: merged
Gerrit-Change-Id: I08a39a5b68cb33dea5b60c6527dfa62d6f3a41e5
Gerrit-PatchSet: 5
Gerrit-Project: operations/puppet
Gerrit-Branch: production
Gerrit-Owner: Ottomata
Gerrit-Reviewer: Chad
Gerrit-Reviewer: Giuseppe Lavagetto
Gerrit-Reviewer: Joal
Gerrit-Reviewer: Ottomata
Gerrit-Reviewer: jenkins-bot