Milimetric has uploaded a new change for review.
https://gerrit.wikimedia.org/r/249207
Change subject: Publish the new pageviews dataset to dumps
......................................................................
Publish the new pageviews dataset to dumps
For now, I'm not adding this to the HTML page that describes what's
available. It will be usable by people who know about it, but we need
to think more about how to explain it to users. We already have 6
different sources of data available, and adding one more link might
be the straw that breaks the proverbial camel's back :)
Change-Id: Ie6ace1d3bb84e261e6407b9a9b4fc1c32e346bc5
---
M manifests/role/dataset.pp
A modules/dataset/manifests/cron/pageviews.pp
2 files changed, 58 insertions(+), 0 deletions(-)
git pull ssh://gerrit.wikimedia.org:29418/operations/puppet refs/changes/07/249207/1
diff --git a/manifests/role/dataset.pp b/manifests/role/dataset.pp
index 2a0b1ef..f6bf423 100644
--- a/manifests/role/dataset.pp
+++ b/manifests/role/dataset.pp
@@ -22,6 +22,21 @@
}
}
+# == Class role::dataset::pageviews
+#
+# NOTE: this requires that an rsync server
+# module named 'hdfs-archive' is configured on stat1002.
+#
+# This will make these files available at
+# http://dumps.wikimedia.org/other/pageviews/
+#
+class role::dataset::pageviews($enable = true) {
+    class { '::dataset::cron::pageviews':
+        source => 'stat1002.eqiad.wmnet::hdfs-archive/pageviews',
+        enable => $enable,
+    }
+}
+
# == Class role::dataset::mediacounts
#
# NOTE: this requires that an rsync server
@@ -65,6 +80,10 @@
         enable => true,
     }
+    class { 'role::dataset::pageviews':
+        enable => true,
+    }
+
     class { 'role::dataset::mediacounts':
         enable => true,
     }
diff --git a/modules/dataset/manifests/cron/pageviews.pp b/modules/dataset/manifests/cron/pageviews.pp
new file mode 100644
index 0000000..f54c8f9
--- /dev/null
+++ b/modules/dataset/manifests/cron/pageviews.pp
@@ -0,0 +1,39 @@
+# == Class dataset::cron::pageviews
+# Copies over files with pageview statistics per page and project,
+# using the current definition of pageviews, from an rsyncable location.
+#
+# These statistics are computed from the raw webrequest logs by the
+# pageview definition: https://meta.wikimedia.org/wiki/Research:Page_view
+#
+# See: https://github.com/wikimedia/analytics-refinery/tree/master/oozie/pageview
+# (docs on the jobs that create the table and archive the files)
+# https://wikitech.wikimedia.org/wiki/Analytics/Data/Pageview_hourly
+# (docs on the table from which these statistics are computed)
+#
+class dataset::cron::pageviews(
+    $source,
+    $enable = true,
+    $destination = '/data/xmldatadumps/public/other/pageviews',
+    $user = 'datasets',
+)
+{
+    $ensure = $enable ? {
+        true    => 'present',
+        default => 'absent',
+    }
+
+    file { $destination:
+        ensure => 'directory',
+        owner  => $user,
+        group  => 'root',
+    }
+
+    cron { 'pageviews':
+        ensure      => $ensure,
+        command     => "/usr/bin/rsync -rt --delete --chmod=go-w ${source}/ ${destination}/",
+        environment => '[email protected]',
+        user        => $user,
+        minute      => '51',
+        require     => User[$user],
+    }
+}
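
Reviewer note (not part of the patch): the $enable parameter is passed
from the role class down to dataset::cron::pageviews, where the selector
maps it to the cron resource's ensure value. As a minimal sketch, assuming
the existing declaration of role::dataset::pageviews added in
manifests/role/dataset.pp is the one being edited (this variant is
hypothetical and does not appear in the change), switching the sync off
would look roughly like this:

```puppet
# Hypothetical variant of the declaration added in manifests/role/dataset.pp.
# With enable => false, the selector in dataset::cron::pageviews sets
# $ensure to 'absent', so the 'pageviews' cron entry is removed.
# The file resource for $destination does not depend on $enable, so the
# directory stays managed and already-synced files are left in place.
class { 'role::dataset::pageviews':
    enable => false,
}
```

Because the class is already declared once with enable => true, this would
replace that declaration rather than sit alongside it.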
--
To view, visit https://gerrit.wikimedia.org/r/249207
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings
Gerrit-MessageType: newchange
Gerrit-Change-Id: Ie6ace1d3bb84e261e6407b9a9b4fc1c32e346bc5
Gerrit-PatchSet: 1
Gerrit-Project: operations/puppet
Gerrit-Branch: production
Gerrit-Owner: Milimetric <[email protected]>