Nuria has submitted this change and it was merged.

Change subject: Match title & ns using x_analytics header & get all agent_types
......................................................................
Match title & ns using x_analytics header & get all agent_types

The previous version of the job matched pageviews based on the page
title "Special:AboutTopic". This was a problem on sites where the
namespace name had been translated or an alias was used. The namespace
ID is now added to the X-Analytics header to avoid this issue, so -1 is
matched directly for the Special namespace.

The previous version would also have run into issues in the future if
aliases for the special page name were added. Thus the raw name of the
special page is now added to the X-Analytics header when the page is
viewed, and that value is what this patch matches against.

The previous version also only sent data about 'user' traffic to
graphite, but spider data has also been requested, so spiders are now
reported as well.

X-Analytics element docs were added to
https://wikitech.wikimedia.org/wiki/X-Analytics in
https://wikitech.wikimedia.org/w/index.php?diff=746636&oldid=710297

A sample run of this query for 1 day:
- Cumulative CPU: 66844.7 sec
- Total MapReduce CPU Time Spent: 18 hours 34 minutes 4 seconds
- Time taken: 442.308 seconds

Bug: T138500
Depends-On: Iaecfbae4c04c12472346b7f885b8a6e458bf47c4
Depends-On: I26c7b5ea6294b2403b89bd91869d2f18275c9917
Change-Id: I3848ee6136d2f7737d45071ee8528864c34299b4
---
M refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/WikidataArticlePlaceholderMetrics.scala
1 file changed, 17 insertions(+), 13 deletions(-)

Approvals:
  Nuria: Verified; Looks good to me, approved
  jenkins-bot: Verified

diff --git a/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/WikidataArticlePlaceholderMetrics.scala b/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/WikidataArticlePlaceholderMetrics.scala
index abd59a1..28d796e 100644
--- a/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/WikidataArticlePlaceholderMetrics.scala
+++ b/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/WikidataArticlePlaceholderMetrics.scala
@@ -22,7 +22,7 @@
     /**
      * Config class for CLI argument parser using scopt
      */
-    case class Params(pageviewTable: String = "wmf.pageview_hourly",
+    case class Params(webrequestTable: String = "wmf.webrequest",
                       graphiteHost: String = "localhost",
                       graphitePort: Int = 2003,
                       graphiteNamespace: String = "daily.wikidata.articleplaceholder",
@@ -36,9 +36,9 @@
       note("This job reports ArticlePlaceholder extension traffic to graphite daily")
       help("help") text ("Prints this usage text")

-      opt[String]('t', "pageview-table") optional() valueName ("<table>") action { (x, p) =>
-        p.copy(pageviewTable = x)
-      } text ("Hive pageview table to use. Defaults to wmf.pageview_hourly")
+      opt[String]('t', "webrequest-table") optional() valueName ("<table>") action { (x, p) =>
+        p.copy(webrequestTable = x)
+      } text ("Hive webrequest table to use. Defaults to wmf.webrequest")

       opt[String]('g', "graphite-host") optional() valueName ("<path>") action { (x, p) =>
         p.copy(graphiteHost = x)
@@ -77,25 +77,29 @@
     val sc = new SparkContext(conf)
     val hiveContext = new HiveContext(sc)

+    // Currently limited to wikipedia as ArticlePlaceholder is only deployed to wikipedias
     val sql = """
                 SELECT
-                  project,
-                  SUM(view_count)
+                  pageview_info["project"],
+                  agent_type,
+                  COUNT(*)
                 FROM %s
                 WHERE year = %d
                   AND month = %d
                   AND day = %d
-                  AND page_title LIKE 'Special:AboutTopic%%'
-                  AND agent_type = 'user'
-                GROUP BY project
-              """.format(params.pageviewTable, params.year, params.month, params.day)
+                  AND is_pageview = TRUE
+                  AND x_analytics_map["ns"] = '-1'
+                  AND x_analytics_map["special"] = 'AboutTopic'
+                  AND normalized_host.project_class = 'wikipedia'
+                GROUP BY pageview_info["project"], agent_type
+              """.format(params.webrequestTable, params.year, params.month, params.day)

-    val data = hiveContext.sql(sql).collect().map(r => (r.getString(0), r.getLong(1)))
+    val data = hiveContext.sql(sql).collect().map(r => (r.getString(0), r.getString(1), r.getLong(2)))

     val time = new DateTime(params.year, params.month, params.day, 0, 0)
     val graphite = new GraphiteClient(params.graphiteHost, params.graphitePort)

-    data.foreach{ case (project, count) => {
-      val metric = "%s.varnish_requests.abouttopic.user.%s".format(params.graphiteNamespace, project.replace('.','_'))
+    data.foreach{ case (project, agentType, count) => {
+      val metric = "%s.varnish_requests.abouttopic.%s.%s".format(params.graphiteNamespace, agentType, project.replace('.','_'))
       graphite.sendOnce(metric, count, time.getMillis / 1000)
     }}

--
To view, visit https://gerrit.wikimedia.org/r/298724
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I3848ee6136d2f7737d45071ee8528864c34299b4
Gerrit-PatchSet: 11
Gerrit-Project: analytics/refinery/source
Gerrit-Branch: master
Gerrit-Owner: Addshore <[email protected]>
Gerrit-Reviewer: Addshore <[email protected]>
Gerrit-Reviewer: Joal <[email protected]>
Gerrit-Reviewer: Nuria <[email protected]>
Gerrit-Reviewer: jenkins-bot <>

_______________________________________________
MediaWiki-commits mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits
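[Editor's illustration] The patch matches pageviews via `x_analytics_map["ns"]` and `x_analytics_map["special"]`, which Hive derives from the raw X-Analytics response header (semicolon-separated key=value pairs, per the wikitech page linked in the commit message), and then builds a per-agent-type graphite metric path. The following sketch illustrates both pieces in Python rather than the job's Scala; the header string and project value are made-up examples, not values from this change.

```python
def parse_x_analytics(header):
    """Parse an X-Analytics header ("k1=v1;k2=v2") into a dict,
    mirroring how Hive exposes it as x_analytics_map."""
    result = {}
    for pair in header.split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            result[key.strip()] = value.strip()
    return result


def is_abouttopic_view(x_analytics):
    """Mirror the Hive predicate: Special namespace (-1) and the raw
    special page name AboutTopic, independent of any translated
    namespace name or alias in the page title."""
    return (x_analytics.get("ns") == "-1"
            and x_analytics.get("special") == "AboutTopic")


def metric_name(namespace, agent_type, project):
    """Build the graphite metric path the job sends to; dots in the
    project name become underscores so they do not create extra
    levels in the graphite hierarchy."""
    return "%s.varnish_requests.abouttopic.%s.%s" % (
        namespace, agent_type, project.replace(".", "_"))


# Hypothetical header as a browser view of an AboutTopic page might send it:
header = parse_x_analytics("ns=-1;special=AboutTopic;https=1")
print(is_abouttopic_view(header))  # True
print(metric_name("daily.wikidata.articleplaceholder", "spider", "en.wikipedia"))
# daily.wikidata.articleplaceholder.varnish_requests.abouttopic.spider.en_wikipedia
```

Note how grouping by `agent_type` in the query means one metric per (project, agent type) pair, which is what lets spider traffic be reported alongside user traffic without a second query.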
