Nuria has submitted this change and it was merged.

Change subject: Match title & ns using x_analytics header & get all agent_types
......................................................................
Match title & ns using x_analytics header & get all agent_types

The previous version of the job matched pageviews based on the page
title "Special:AboutTopic". This was a problem on sites where the
namespace name had been translated or an alias was used. The namespace
ID is now added to the X-Analytics header to avoid this issue, so -1 is
matched directly for the Special namespace.

The previous version would also have run into issues in the future if
aliases for the special page name were added. Thus the raw name of the
special page is now added to the X-Analytics header when the page is
viewed, and that value is what this patch matches against.

The previous version also only sent data about 'user' traffic to
graphite, but spider data has also been requested, so spiders are now
reported as well.

X-Analytics element docs were added to
https://wikitech.wikimedia.org/wiki/X-Analytics in
https://wikitech.wikimedia.org/w/index.php?diff=746636&oldid=710297

A sample run of this query for 1 day:
- Cumulative CPU: 66844.7 sec
- Total MapReduce CPU Time Spent: 18 hours 34 minutes 4 seconds
- Time taken: 442.308 seconds

Bug: T138500
Depends-On: Iaecfbae4c04c12472346b7f885b8a6e458bf47c4
Depends-On: I26c7b5ea6294b2403b89bd91869d2f18275c9917
Change-Id: I3848ee6136d2f7737d45071ee8528864c34299b4
---
M refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/WikidataArticlePlaceholderMetrics.scala
1 file changed, 17 insertions(+), 13 deletions(-)

Approvals:
  Nuria: Verified; Looks good to me, approved
  jenkins-bot: Verified

diff --git a/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/WikidataArticlePlaceholderMetrics.scala b/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/WikidataArticlePlaceholderMetrics.scala
index abd59a1..28d796e 100644
--- a/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/WikidataArticlePlaceholderMetrics.scala
+++ b/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/WikidataArticlePlaceholderMetrics.scala
@@ -22,7 +22,7 @@
     /**
      * Config class for CLI argument parser using scopt
      */
-    case class Params(pageviewTable: String = "wmf.pageview_hourly",
+    case class Params(webrequestTable: String = "wmf.webrequest",
                       graphiteHost: String = "localhost",
                       graphitePort: Int = 2003,
                       graphiteNamespace: String = "daily.wikidata.articleplaceholder",
@@ -36,9 +36,9 @@
       note("This job reports ArticlePlaceholder extension traffic to graphite daily")
       help("help") text ("Prints this usage text")

-      opt[String]('t', "pageview-table") optional() valueName ("<table>") action { (x, p) =>
-        p.copy(pageviewTable = x)
-      } text ("Hive pageview table to use. Defaults to wmf.pageview_hourly")
+      opt[String]('t', "webrequest-table") optional() valueName ("<table>") action { (x, p) =>
+        p.copy(webrequestTable = x)
+      } text ("Hive webrequest table to use. Defaults to wmf.webrequest")

       opt[String]('g', "graphite-host") optional() valueName ("<path>") action { (x, p) =>
         p.copy(graphiteHost = x)
@@ -77,25 +77,29 @@
     val sc = new SparkContext(conf)
     val hiveContext = new HiveContext(sc)

+    // Currently limited to wikipedia as ArticlePlaceholder is only deployed to wikipedias
     val sql = """
                 SELECT
-                  project,
-                  SUM(view_count)
+                  pageview_info["project"],
+                  agent_type,
+                  COUNT(*)
                 FROM %s
                 WHERE year = %d
                   AND month = %d
                   AND day = %d
-                  AND page_title LIKE 'Special:AboutTopic%%'
-                  AND agent_type = 'user'
-                GROUP BY project
-              """.format(params.pageviewTable, params.year, params.month, params.day)
+                  AND is_pageview = TRUE
+                  AND x_analytics_map["ns"] = '-1'
+                  AND x_analytics_map["special"] = 'AboutTopic'
+                  AND normalized_host.project_class = 'wikipedia'
+                GROUP BY pageview_info["project"], agent_type
+              """.format(params.webrequestTable, params.year, params.month, params.day)

-    val data = hiveContext.sql(sql).collect().map(r => (r.getString(0), r.getLong(1)))
+    val data = hiveContext.sql(sql).collect().map(r => (r.getString(0), r.getString(1), r.getLong(2)))

     val time = new DateTime(params.year, params.month, params.day, 0, 0)
     val graphite = new GraphiteClient(params.graphiteHost, params.graphitePort)

-    data.foreach{ case (project, count) => {
-      val metric = "%s.varnish_requests.abouttopic.user.%s".format(params.graphiteNamespace, project.replace('.','_'))
+    data.foreach{ case (project, agentType, count) => {
+      val metric = "%s.varnish_requests.abouttopic.%s.%s".format(params.graphiteNamespace, agentType, project.replace('.','_'))
       graphite.sendOnce(metric, count, time.getMillis / 1000)
     }}

--
To view, visit https://gerrit.wikimedia.org/r/298724
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I3848ee6136d2f7737d45071ee8528864c34299b4
Gerrit-PatchSet: 11
Gerrit-Project: analytics/refinery/source
Gerrit-Branch: master
Gerrit-Owner: Addshore <[email protected]>
Gerrit-Reviewer: Addshore <[email protected]>
Gerrit-Reviewer: Joal <[email protected]>
Gerrit-Reviewer: Nuria <[email protected]>
Gerrit-Reviewer: jenkins-bot <>

_______________________________________________
MediaWiki-commits mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits
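[Editor's illustration] The patch matches pageviews via `x_analytics_map["ns"]` and `x_analytics_map["special"]`, which Hive derives from the raw X-Analytics response header (semicolon-separated key=value pairs, per the wikitech page linked in the commit message), and then builds a per-agent-type graphite metric path. The following sketch illustrates both pieces in Python rather than the job's Scala; the header string and project value are made-up examples, not values from this change.

```python
def parse_x_analytics(header):
    """Parse an X-Analytics header ("k1=v1;k2=v2") into a dict,
    mirroring how Hive exposes it as x_analytics_map."""
    result = {}
    for pair in header.split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            result[key.strip()] = value.strip()
    return result


def is_abouttopic_view(x_analytics):
    """Mirror the Hive predicate: Special namespace (-1) and the raw
    special page name AboutTopic, independent of any translated
    namespace name or alias in the page title."""
    return (x_analytics.get("ns") == "-1"
            and x_analytics.get("special") == "AboutTopic")


def metric_name(namespace, agent_type, project):
    """Build the graphite metric path the job sends to; dots in the
    project name become underscores so they do not create extra
    levels in the graphite hierarchy."""
    return "%s.varnish_requests.abouttopic.%s.%s" % (
        namespace, agent_type, project.replace(".", "_"))


# Hypothetical header as a browser view of an AboutTopic page might send it:
header = parse_x_analytics("ns=-1;special=AboutTopic;https=1")
print(is_abouttopic_view(header))  # True
print(metric_name("daily.wikidata.articleplaceholder", "spider", "en.wikipedia"))
# daily.wikidata.articleplaceholder.varnish_requests.abouttopic.spider.en_wikipedia
```

Note how grouping by `agent_type` in the query means one metric per (project, agent type) pair, which is what lets spider traffic be reported alongside user traffic without a second query.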
