Joal has uploaded a new change for review. ( https://gerrit.wikimedia.org/r/361500 )
Change subject: Update mediawiki history related tables ...................................................................... Update mediawiki history related tables Related to - https://gerrit.wikimedia.org/r/#/c/358916/ - https://gerrit.wikimedia.org/r/#/c/359019/ - Add lines for future timestamp change (when we get hive > 1.2.0) - Update jobs for timestamp format change (from mediawiki format YYYYMMDDHHMMSS to JDBC compliant fornat YYYY-MM-DD HH:MM:SS) - Add new fields in mediawiki_history table and update jobs accordingly - Correct bug in denormalize oozie job (wmf_raw.mediawiki_namespace_project_map table repair had been forgotten) Bug: T161147,T161150 Change-Id: I9f093e50e53b0669f09866f9409d3eb18c7552c9 --- M hive/mediawiki/history/create_mediawiki_history_table.hql M hive/mediawiki/history/create_mediawiki_page_history_table.hql M hive/mediawiki/history/create_mediawiki_user_history_table.hql M oozie/mediawiki/history/denormalize/coordinator.properties M oozie/mediawiki/history/denormalize/coordinator.xml M oozie/mediawiki/history/denormalize/workflow.xml M oozie/mediawiki/history/druid/generate_json_mediawiki_history.hql M oozie/mediawiki/history/druid/load_mediawiki_history.json.template M oozie/mediawiki/history/druid/workflow.xml M oozie/mediawiki/history/metrics/coordinator.properties M oozie/mediawiki/history/metrics/coordinator.xml M oozie/mediawiki/history/metrics/daily_edits.hql M oozie/mediawiki/history/metrics/daily_edits_by_anonymous_users.hql M oozie/mediawiki/history/metrics/daily_edits_by_bot_users.hql M oozie/mediawiki/history/metrics/daily_edits_by_registered_users.hql M oozie/mediawiki/history/metrics/daily_unique_anonymous_editors.hql M oozie/mediawiki/history/metrics/daily_unique_bot_editors.hql M oozie/mediawiki/history/metrics/daily_unique_editors.hql M oozie/mediawiki/history/metrics/daily_unique_page_creators.hql M oozie/mediawiki/history/metrics/daily_unique_registered_editors.hql M oozie/mediawiki/history/metrics/monthly_new_editors.hql M oozie/mediawiki/history/metrics/monthly_new_registered_users.hql M oozie/mediawiki/history/metrics/monthly_surviving_new_editors.hql M oozie/mediawiki/history/metrics/workflow.xml 24 files changed, 125 insertions(+), 146 deletions(-) git pull ssh://gerrit.wikimedia.org:29418/analytics/refinery refs/changes/00/361500/1 diff --git a/hive/mediawiki/history/create_mediawiki_history_table.hql b/hive/mediawiki/history/create_mediawiki_history_table.hql index da09be2..a04a494 100644 --- a/hive/mediawiki/history/create_mediawiki_history_table.hql +++ b/hive/mediawiki/history/create_mediawiki_history_table.hql @@ -12,7 +12,8 @@ `wiki_db` string COMMENT 'enwiki, dewiki, eswiktionary, etc.', `event_entity` string COMMENT 'revision, user or page', `event_type` string COMMENT 'create, move, delete, etc. Detailed explanation in the docs under #Event_types', - `event_timestamp` string COMMENT 'When this event ocurred, in YYYYMMDDHHmmss format', + `event_timestamp` string COMMENT 'When this event ocurred', + --`event_timestamp` timestamp COMMENT 'When this event ocurred', `event_comment` string COMMENT 'Comment related to this event, sourced from log_comment, rev_comment, etc.', `event_user_id` bigint COMMENT 'Id of the user that caused the event', `event_user_text` string COMMENT 'Historical text of the user that caused the event', @@ -27,6 +28,10 @@ `event_user_is_anonymous` boolean COMMENT 'Whether the event_user is not registered', `event_user_is_bot_by_name` boolean COMMENT 'Whether the event_user\'s name matches patterns we use to identify bots', `event_user_creation_timestamp` string COMMENT 'Registration timestamp of the user that caused the event', + --`event_user_creation_timestamp` timestamp COMMENT 'Registration timestamp of the user that caused the event', + `event_user_revision_count` bigint COMMENT 'Cumulative revision count per user for the current event_user_id (only available in revision-create events so far)', + `event_user_seconds_to_previous_revision` bigint COMMENT 'In revision events: seconds elapsed since the previous revision made by the current event_user_id (only available in revision-create events so far)', + `page_id` bigint COMMENT 'In revision/page events: id of the page', `page_title` string COMMENT 'In revision/page events: historical title of the page', @@ -37,6 +42,9 @@ `page_namespace_is_content_latest` boolean COMMENT 'In revision/page events: current namespace of the page is categorized as content', `page_is_redirect_latest` boolean COMMENT 'In revision/page events: whether the page is currently a redirect', `page_creation_timestamp` string COMMENT 'In revision/page events: creation timestamp of the page', + --`page_creation_timestamp` timestamp COMMENT 'In revision/page events: creation timestamp of the page', + `page_revision_count` bigint COMMENT 'In revision/page events: Cumulative revision count per page for the current page_id (only available in revision-create events so far)', + `page_seconds_to_previous_revision` bigint COMMENT 'In revision/page events: seconds elapsed since the previous revision made on the current page_id (only available in revision-create events so far)', `user_id` bigint COMMENT 'In user events: id of the user', `user_text` string COMMENT 'In user events: historical user text', @@ -51,6 +59,7 @@ `user_is_anonymous` boolean COMMENT 'In user events: whether the user is not registered', `user_is_bot_by_name` boolean COMMENT 'In user events: whether the user\'s name matches patterns we use to identify bots', `user_creation_timestamp` string COMMENT 'In user events: registration timestamp of the user.', + --`user_creation_timestamp` timestamp COMMENT 'In user events: registration timestamp of the user.', `revision_id` bigint COMMENT 'In revision events: id of the revision', `revision_parent_id` bigint COMMENT 'In revision events: id of the parent revision', @@ -62,9 +71,10 @@ `revision_content_format` string COMMENT 'In revision events: content format of revision', `revision_is_deleted` boolean COMMENT 'In revision events: whether this revision has been deleted (moved to archive table)', `revision_deleted_timestamp` string COMMENT 'In revision events: the timestamp when the revision was deleted', + --`revision_deleted_timestamp` timestamp COMMENT 'In revision events: the timestamp when the revision was deleted', `revision_is_identity_reverted` boolean COMMENT 'In revision events: whether this revision was reverted by another future revision', `revision_first_identity_reverting_revision_id` bigint COMMENT 'In revision events: id of the revision that reverted this revision', - `revision_seconds_to_identity_revert` bigint COMMENT 'In revision events: seconds elapsed between revision posting and its revert (if there was one)', + `revision_seconds_to_identity_revert` bigint COMMENT 'In revision events: seconds elapsed between revision posting and its revert (if there was one)', `revision_is_identity_revert` boolean COMMENT 'In revision events: whether this revision reverts other revisions' ) COMMENT diff --git a/hive/mediawiki/history/create_mediawiki_page_history_table.hql b/hive/mediawiki/history/create_mediawiki_page_history_table.hql index 5b3b5d5..d429a8c 100644 --- a/hive/mediawiki/history/create_mediawiki_page_history_table.hql +++ b/hive/mediawiki/history/create_mediawiki_page_history_table.hql @@ -13,6 +13,7 @@ page_id bigint COMMENT 'Id of the page, as in the page table.', page_id_artificial string COMMENT 'Generated Id for deleted pages without real Id.', page_creation_timestamp string COMMENT 'Timestamp of the page\'s first revision.', + --page_creation_timestamp timestamp COMMENT 'Timestamp of the page\'s first revision.', page_title string COMMENT 'Historical page title.', page_title_latest string COMMENT 'Page title as of today.', page_namespace int COMMENT 'Historical namespace.', @@ -22,6 +23,8 @@ page_is_redirect_latest boolean COMMENT 'In revision/page events: whether the page is currently a redirect', start_timestamp string COMMENT 'Timestamp from where this state applies (inclusive).', end_timestamp string COMMENT 'Timestamp to where this state applies (exclusive).', + --start_timestamp timestamp COMMENT 'Timestamp from where this state applies (inclusive).', + --end_timestamp timestamp COMMENT 'Timestamp to where this state applies (exclusive).', caused_by_event_type string COMMENT 'Event that caused this state (create, move, delete or restore).', caused_by_user_id bigint COMMENT 'ID from the user that caused this state.', inferred_from string COMMENT 'If non-NULL, some fields have been inferred from an inconsistency in the source data.' diff --git a/hive/mediawiki/history/create_mediawiki_user_history_table.hql b/hive/mediawiki/history/create_mediawiki_user_history_table.hql index 67240b6..802208e 100644 --- a/hive/mediawiki/history/create_mediawiki_user_history_table.hql +++ b/hive/mediawiki/history/create_mediawiki_user_history_table.hql @@ -17,7 +17,8 @@ user_groups_latest array<string> COMMENT 'User groups as of today.', user_blocks array<string> COMMENT 'Historical user blocks.', user_blocks_latest array<string> COMMENT 'User blocks as of today.', - user_registration_timestamp string COMMENT 'When the user accoung was registered, in YYYYMMDDHHmmss format.', + user_registration_timestamp string COMMENT 'When the user accoung was registered', + --user_registration_timestamp timestamp COMMENT 'When the user accoung was registered', created_by_self boolean COMMENT 'Whether the user created their own account', created_by_system boolean COMMENT 'Whether the user account was created by mediawiki (eg. centralauth)', created_by_peer boolean COMMENT 'Whether the user account was created by another user', @@ -25,9 +26,11 @@ is_bot_by_name boolean COMMENT 'Whether the user\'s name matches patterns we use to identify bots', start_timestamp string COMMENT 'Timestamp from where this state applies (inclusive).', end_timestamp string COMMENT 'Timestamp to where this state applies (exclusive).', + --start_timestamp timestamp COMMENT 'Timestamp from where this state applies (inclusive).', + --end_timestamp timestamp COMMENT 'Timestamp to where this state applies (exclusive).', caused_by_event_type string COMMENT 'Event that caused this state (create, move, delete or restore).', caused_by_user_id bigint COMMENT 'ID from the user that caused this state.', - caused_by_block_expiration string COMMENT 'Block expiration timestamp, if any.', + caused_by_block_expiration string COMMENT 'Block expiration value, if any.', inferred_from string COMMENT 'If non-NULL, indicates that some of this state\'s fields have been inferred after an inconsistency in the source data.' ) COMMENT diff --git a/oozie/mediawiki/history/denormalize/coordinator.properties b/oozie/mediawiki/history/denormalize/coordinator.properties index 9994159..b845116 100644 --- a/oozie/mediawiki/history/denormalize/coordinator.properties +++ b/oozie/mediawiki/history/denormalize/coordinator.properties @@ -44,6 +44,7 @@ mw_directory = ${name_node}/wmf/data/wmf/mediawiki # mw tables +mw_project_namespace_map_table = wmf_raw.mediawiki_project_namespace_map mw_archive_table = wmf_raw.mediawiki_archive mw_logging_table = wmf_raw.mediawiki_logging mw_page_table = wmf_raw.mediawiki_page diff --git a/oozie/mediawiki/history/denormalize/coordinator.xml b/oozie/mediawiki/history/denormalize/coordinator.xml index 0f1b610..4a7144f 100644 --- a/oozie/mediawiki/history/denormalize/coordinator.xml +++ b/oozie/mediawiki/history/denormalize/coordinator.xml @@ -21,6 +21,7 @@ <property><name>datasets_file</name></property> <property><name>mw_directory</name></property> + <property><name>mw_project_namespace_map_table</name></property> <property><name>mw_archive_table</name></property> <property><name>mw_logging_table</name></property> <property><name>mw_page_table</name></property> diff --git a/oozie/mediawiki/history/denormalize/workflow.xml b/oozie/mediawiki/history/denormalize/workflow.xml index f9b3d95..84ffe00 100644 --- a/oozie/mediawiki/history/denormalize/workflow.xml +++ b/oozie/mediawiki/history/denormalize/workflow.xml @@ -86,6 +86,10 @@ <!-- Hive tables--> <property> + <name>mw_project_namespace_map_table</name> + <description>Project_namespace_map table built from Mediawiki</description> + </property> + <property> <name>mw_archive_table</name> <description>Archive table imported from Mediawiki</description> </property> @@ -137,10 +141,22 @@ </property> </parameters> - <start to="repair_archive_table_partitions"/> + <start to="repair_project_namespace_map_table_partitions"/> <!-- Add MW tables partitions --> + <action name="repair_project_namespace_map_table_partitions"> + <sub-workflow> + <app-path>${repair_partitions_workflow_file}</app-path> + <propagate-configuration/> + <configuration> + <property><name>table</name><value>${mw_project_namespace_map_table}</value></property> + </configuration> + </sub-workflow> + <ok to="repair_archive_table_partitions"/> + <error to="send_error_email"/> + </action> + <action name="repair_archive_table_partitions"> <sub-workflow> <app-path>${repair_partitions_workflow_file}</app-path> diff --git a/oozie/mediawiki/history/druid/generate_json_mediawiki_history.hql b/oozie/mediawiki/history/druid/generate_json_mediawiki_history.hql index be07553..0334538 100644 --- a/oozie/mediawiki/history/druid/generate_json_mediawiki_history.hql +++ b/oozie/mediawiki/history/druid/generate_json_mediawiki_history.hql @@ -26,7 +26,8 @@ `wiki_db` string COMMENT 'enwiki, dewiki, eswiktionary, etc.', `event_entity` string COMMENT 'revision, user or page', `event_type` string COMMENT 'create, move, delete, etc. Detailed explanation in the docs under #Event_types', - `event_timestamp` string COMMENT 'When this event ocurred, in YYYYMMDDHHmmss format', + `event_timestamp` string COMMENT 'When this event ocurred', + --`event_timestamp` timestamp COMMENT 'When this event ocurred', `event_comment` string COMMENT 'Comment related to this event, sourced from log_comment, rev_comment, etc.', `event_user_id` bigint COMMENT 'Id of the user that caused the event', `event_user_text` string COMMENT 'Historical text of the user that caused the event', @@ -41,6 +42,9 @@ `event_user_is_anonymous` int COMMENT 'Whether the event_user is not registered', `event_user_is_bot_by_name` int COMMENT 'Whether the event_user\'s name matches patterns we use to identify bots', `event_user_creation_timestamp` string COMMENT 'Registration timestamp of the user that caused the event', + --`event_user_creation_timestamp` timestamp COMMENT 'Registration timestamp of the user that caused the event', + `event_user_revision_count` bigint COMMENT 'Cumulative revision count per user for the current event_user_id (only available in revision-create events so far)', + `event_user_seconds_to_previous_revision` bigint COMMENT 'In revision events: seconds elapsed since the previous revision made by the current event_user_id (only available in revision-create events so far)', `page_id` bigint COMMENT 'In revision/page events: id of the page', `page_title` string COMMENT 'In revision/page events: historical title of the page', @@ -51,6 +55,9 @@ `page_namespace_is_content_latest` int COMMENT 'In revision/page events: current namespace of the page is categorized as content', `page_is_redirect_latest` int COMMENT 'In revision/page events: whether the page is currently a redirect', `page_creation_timestamp` string COMMENT 'In revision/page events: creation timestamp of the page', + --`page_creation_timestamp` timestamp COMMENT 'In revision/page events: creation timestamp of the page', + `page_revision_count` bigint COMMENT 'In revision/page events: Cumulative revision count per page for the current page_id (only available in revision-create events so far)', + `page_seconds_to_previous_revision` bigint COMMENT 'In revision/page events: seconds elapsed since the previous revision made on the current page_id (only available in revision-create events so far)', `user_id` bigint COMMENT 'In user events: id of the user', `user_text` string COMMENT 'In user events: historical user text', @@ -65,6 +72,7 @@ `user_is_anonymous` int COMMENT 'In user events: whether the user is not registered', `user_is_bot_by_name` int COMMENT 'In user events: whether the user\'s name matches patterns we use to identify bots', `user_creation_timestamp` string COMMENT 'In user events: registration timestamp of the user.', + --`user_creation_timestamp` timestamp COMMENT 'In user events: registration timestamp of the user.', `revision_id` bigint COMMENT 'In revision events: id of the revision', `revision_parent_id` bigint COMMENT 'In revision events: id of the parent revision', @@ -76,9 +84,10 @@ `revision_content_format` string COMMENT 'In revision events: content format of revision', `revision_is_deleted` int COMMENT 'In revision events: whether this revision has been deleted (moved to archive table)', `revision_deleted_timestamp` string COMMENT 'In revision events: the timestamp when the revision was deleted', + --`revision_deleted_timestamp` timestamp COMMENT 'In revision events: the timestamp when the revision was deleted', `revision_is_identity_reverted` int COMMENT 'In revision events: whether this revision was reverted by another future revision', `revision_first_identity_reverting_revision_id` bigint COMMENT 'In revision events: id of the revision that reverted this revision', - `revision_seconds_to_identity_revert` bigint COMMENT 'In revision events: seconds elapsed between revision posting and its revert (if there was one)', + `revision_seconds_to_identity_revert` bigint COMMENT 'In revision events: seconds elapsed between revision posting and its revert (if there was one)', `revision_is_identity_revert` int COMMENT 'In revision events: whether this revision reverts other revisions' ) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' @@ -106,6 +115,8 @@ CASE WHEN event_user_is_anonymous THEN 1 ELSE 0 END AS event_user_is_anonymous, CASE WHEN event_user_is_bot_by_name THEN 1 ELSE 0 END AS event_user_is_bot_by_name, event_user_creation_timestamp, + event_user_revision_count, + event_user_seconds_to_previous_revision, page_id, page_title, @@ -116,6 +127,8 @@ CASE WHEN page_namespace_is_content_latest THEN 1 ELSE 0 END AS page_namespace_is_content_latest, CASE WHEN page_is_redirect_latest THEN 1 ELSE 0 END AS page_is_redirect_latest, page_creation_timestamp, + page_revision_count, + page_seconds_to_previous_revision, user_id, user_text, @@ -150,7 +163,6 @@ AND snapshot = '${snapshot}' -- Only export rows with valid timestamp format AND event_timestamp IS NOT NULL - AND LENGTH(event_timestamp) = 14 ; diff --git a/oozie/mediawiki/history/druid/load_mediawiki_history.json.template b/oozie/mediawiki/history/druid/load_mediawiki_history.json.template index a6035f4..03ad787 100644 --- a/oozie/mediawiki/history/druid/load_mediawiki_history.json.template +++ b/oozie/mediawiki/history/druid/load_mediawiki_history.json.template @@ -38,6 +38,8 @@ "event_user_is_anonymous", "event_user_is_bot_by_name", "event_user_creation_timestamp", + "event_user_revision_count", + "event_user_seconds_to_previous_revision", "page_id", "page_title", @@ -48,6 +50,8 @@ "page_namespace_is_content_latest", "page_is_redirect_latest", "page_creation_timestamp", + "page_revision_count", + "page_seconds_to_previous_revision", "user_id", "user_text", @@ -80,7 +84,7 @@ ] }, "timestampSpec" : { - "format" : "yyyyMMddHHmmss", + "format" : "auto", "column" : "event_timestamp" } } diff --git a/oozie/mediawiki/history/druid/workflow.xml b/oozie/mediawiki/history/druid/workflow.xml index b78dd23..6722126 100644 --- a/oozie/mediawiki/history/druid/workflow.xml +++ b/oozie/mediawiki/history/druid/workflow.xml @@ -119,7 +119,6 @@ <error to="send_error_email"/> </action> - <action name="index_druid"> <sub-workflow> <app-path>${load_druid_workflow_file}</app-path> diff --git a/oozie/mediawiki/history/metrics/coordinator.properties b/oozie/mediawiki/history/metrics/coordinator.properties index 24df418..afa9897 100644 --- a/oozie/mediawiki/history/metrics/coordinator.properties +++ b/oozie/mediawiki/history/metrics/coordinator.properties @@ -40,7 +40,7 @@ # Run metrics for all wikis, starting 2001 wiki_db = all -start_timestamp = 20010101000000 +start_timestamp = 2001-01-01 00:00:00 # Initial import time of the mediawiki dataset. start_time = 2017-03-01T00:00Z diff --git a/oozie/mediawiki/history/metrics/coordinator.xml b/oozie/mediawiki/history/metrics/coordinator.xml index 183c181..389f4d9 100644 --- a/oozie/mediawiki/history/metrics/coordinator.xml +++ b/oozie/mediawiki/history/metrics/coordinator.xml @@ -75,7 +75,7 @@ </property> <property> <name>end_timestamp</name> - <value>${coord:formatTime(coord:nominalTime(), "yyyy")}${coord:formatTime(coord:nominalTime(), "MM")}01000000</value> + <value>${coord:formatTime(coord:nominalTime(), "yyyy")}-${coord:formatTime(coord:nominalTime(), "MM")}-01 00:00:00</value> </property> <property> <name>metrics_location</name> diff --git a/oozie/mediawiki/history/metrics/daily_edits.hql b/oozie/mediawiki/history/metrics/daily_edits.hql index e3d5fd5..bc73cd0 100644 --- a/oozie/mediawiki/history/metrics/daily_edits.hql +++ b/oozie/mediawiki/history/metrics/daily_edits.hql @@ -11,8 +11,8 @@ -- hive -f daily_edits.hql \ -- -d source_table=wmf.mediawiki_history \ -- -d destination_table=wmf.mediawiki_metrics \ --- -d start_timestamp=20010101000000 \ --- -d end_timestamp=20170101000000 \ +-- -d start_timestamp=2001-01-01 00:00:00 \ +-- -d end_timestamp=2017-01-01 00:00:00 \ -- -d wiki_db=all \ -- -d snapshot=2017-03 @@ -28,11 +28,7 @@ -- dynamic partitions must be specified here insert overwrite table ${destination_table} partition (snapshot='${snapshot}', metric, wiki_db) -- dynamic partitions must be selected in order and at the end - select concat_ws('-', - substring(event_timestamp, 0, 4), - substring(event_timestamp, 5, 2), - substring(event_timestamp, 7, 2) - ) as dt, + select substring(event_timestamp, 0, 10) as dt, count(*) as value, 'daily_edits' as metric, wiki_db @@ -46,9 +42,7 @@ and snapshot = '${snapshot}' group by wiki_db, - substring(event_timestamp, 0, 4), - substring(event_timestamp, 5, 2), - substring(event_timestamp, 7, 2) + substring(event_timestamp, 0, 10) order by wiki_db, dt ; diff --git a/oozie/mediawiki/history/metrics/daily_edits_by_anonymous_users.hql b/oozie/mediawiki/history/metrics/daily_edits_by_anonymous_users.hql index bb68b58..ca1f4f6 100644 --- a/oozie/mediawiki/history/metrics/daily_edits_by_anonymous_users.hql +++ b/oozie/mediawiki/history/metrics/daily_edits_by_anonymous_users.hql @@ -11,8 +11,8 @@ -- hive -f daily_edits_by_anonymous_users.hql \ -- -d source_table=wmf.mediawiki_history \ -- -d destination_table=wmf.mediawiki_metrics \ --- -d start_timestamp=20010101000000 \ --- -d end_timestamp=20170101000000 \ +-- -d start_timestamp=2001-01-01 00:00:00 \ +-- -d end_timestamp=2017-01-01 00:00:00 \ -- -d wiki_db=all \ -- -d snapshot=2017-03 @@ -28,11 +28,7 @@ -- dynamic partitions must be specified here insert overwrite table ${destination_table} partition (snapshot='${snapshot}', metric, wiki_db) -- dynamic partitions must be selected in order and at the end - select concat_ws('-', - substring(event_timestamp, 0, 4), - substring(event_timestamp, 5, 2), - substring(event_timestamp, 7, 2) - ) as dt, + select substring(event_timestamp, 0, 10) as dt, count(*) as value, 'daily_edits_by_anonymous_users' as metric, wiki_db @@ -47,9 +43,7 @@ and snapshot = '${snapshot}' group by wiki_db, - substring(event_timestamp, 0, 4), - substring(event_timestamp, 5, 2), - substring(event_timestamp, 7, 2) + substring(event_timestamp, 0, 10) order by wiki_db, dt ; diff --git a/oozie/mediawiki/history/metrics/daily_edits_by_bot_users.hql b/oozie/mediawiki/history/metrics/daily_edits_by_bot_users.hql index 6ebc1f0..f22680b 100644 --- a/oozie/mediawiki/history/metrics/daily_edits_by_bot_users.hql +++ b/oozie/mediawiki/history/metrics/daily_edits_by_bot_users.hql @@ -11,8 +11,8 @@ -- hive -f daily_edits_by_bot_users.hql \ -- -d source_table=wmf.mediawiki_history \ -- -d destination_table=wmf.mediawiki_metrics \ --- -d start_timestamp=20010101000000 \ --- -d end_timestamp=20170101000000 \ +-- -d start_timestamp=2001-01-01 00:00:00 \ +-- -d end_timestamp=2017-01-01 00:00:00 \ -- -d wiki_db=all \ -- -d snapshot=2017-03 @@ -28,11 +28,7 @@ -- dynamic partitions must be specified here insert overwrite table ${destination_table} partition (snapshot='${snapshot}', metric, wiki_db) -- dynamic partitions must be selected in order and at the end - select concat_ws('-', - substring(event_timestamp, 0, 4), - substring(event_timestamp, 5, 2), - substring(event_timestamp, 7, 2) - ) as dt, + select substring(event_timestamp, 0, 10) as dt, count(*) as value, 'daily_edits_by_bot_users' as metric, wiki_db @@ -47,9 +43,7 @@ and snapshot = '${snapshot}' group by wiki_db, - substring(event_timestamp, 0, 4), - substring(event_timestamp, 5, 2), - substring(event_timestamp, 7, 2) + substring(event_timestamp, 0, 10) order by wiki_db, dt ; diff --git a/oozie/mediawiki/history/metrics/daily_edits_by_registered_users.hql b/oozie/mediawiki/history/metrics/daily_edits_by_registered_users.hql index 649fadd..dd444d5 100644 --- a/oozie/mediawiki/history/metrics/daily_edits_by_registered_users.hql +++ b/oozie/mediawiki/history/metrics/daily_edits_by_registered_users.hql @@ -11,8 +11,8 @@ -- hive -f daily_edits_by_registered_users.hql \ -- -d source_table=wmf.mediawiki_history \ -- -d destination_table=wmf.mediawiki_metrics \ --- -d start_timestamp=20010101000000 \ --- -d end_timestamp=20170101000000 \ +-- -d start_timestamp=2001-01-01 00:00:00 \ +-- -d end_timestamp=2017-01-01 00:00:00 \ -- -d wiki_db=all \ -- -d snapshot=2017-03 @@ -28,11 +28,7 @@ -- dynamic partitions must be specified here insert overwrite table ${destination_table} partition (snapshot='${snapshot}', metric, wiki_db) -- dynamic partitions must be selected in order and at the end - select concat_ws('-', - substring(event_timestamp, 0, 4), - substring(event_timestamp, 5, 2), - substring(event_timestamp, 7, 2) - ) as dt, + select substring(event_timestamp, 0, 10) as dt, count(*) as value, 'daily_edits_by_registered_users' as metric, wiki_db @@ -48,9 +44,7 @@ and snapshot = '${snapshot}' group by wiki_db, - substring(event_timestamp, 0, 4), - substring(event_timestamp, 5, 2), - substring(event_timestamp, 7, 2) + substring(event_timestamp, 0, 10) order by wiki_db, dt ; diff --git a/oozie/mediawiki/history/metrics/daily_unique_anonymous_editors.hql b/oozie/mediawiki/history/metrics/daily_unique_anonymous_editors.hql index cb236da..8efdf4a 100644 --- a/oozie/mediawiki/history/metrics/daily_unique_anonymous_editors.hql +++ b/oozie/mediawiki/history/metrics/daily_unique_anonymous_editors.hql @@ -11,8 +11,8 @@ -- hive -f daily_unique_anonymous_editors.hql \ -- -d source_table=wmf.mediawiki_history \ -- -d destination_table=wmf.mediawiki_metrics \ --- -d start_timestamp=20010101000000 \ --- -d end_timestamp=20170101000000 \ +-- -d start_timestamp=2001-01-01 00:00:00 \ +-- -d end_timestamp=2017-01-01 00:00:00 \ -- -d wiki_db=all \ -- -d snapshot=2017-03 @@ -28,11 +28,7 @@ -- dynamic partitions must be specified here insert overwrite table ${destination_table} partition (snapshot='${snapshot}', metric, wiki_db) -- dynamic partitions must be selected in order and at the end - select concat_ws('-', - substring(event_timestamp, 0, 4), - substring(event_timestamp, 5, 2), - substring(event_timestamp, 7, 2) - ) as dt, + select substring(event_timestamp, 0, 10) as dt, count(distinct event_user_text) as value, 'daily_unique_anonymous_editors' as metric, wiki_db @@ -47,9 +43,7 @@ and snapshot = '${snapshot}' group by wiki_db, - substring(event_timestamp, 0, 4), - substring(event_timestamp, 5, 2), - substring(event_timestamp, 7, 2) + substring(event_timestamp, 0, 10) order by wiki_db, dt ; diff --git a/oozie/mediawiki/history/metrics/daily_unique_bot_editors.hql b/oozie/mediawiki/history/metrics/daily_unique_bot_editors.hql index 3c5b13f..db24917 100644 --- a/oozie/mediawiki/history/metrics/daily_unique_bot_editors.hql +++ b/oozie/mediawiki/history/metrics/daily_unique_bot_editors.hql @@ -11,8 +11,8 @@ -- hive -f daily_unique_bot_editors.hql \ -- -d source_table=wmf.mediawiki_history \ -- -d destination_table=wmf.mediawiki_metrics \ --- -d start_timestamp=20010101000000 \ --- -d end_timestamp=20170101000000 \ +-- -d start_timestamp=2001-01-01 00:00:00 \ +-- -d end_timestamp=2017-01-01 00:00:00 \ -- -d wiki_db=all \ -- -d snapshot=2017-03 @@ -28,11 +28,7 @@ -- dynamic partitions must be specified here insert overwrite table ${destination_table} partition (snapshot='${snapshot}', metric, wiki_db) -- dynamic partitions must be selected in order and at the end - select concat_ws('-', - substring(event_timestamp, 0, 4), - substring(event_timestamp, 5, 2), - substring(event_timestamp, 7, 2) - ) as dt, + select substring(event_timestamp, 0, 10) as dt, count(distinct event_user_id) as value, 'daily_unique_bot_editors' as metric, wiki_db @@ -48,9 +44,7 @@ and snapshot = '${snapshot}' group by wiki_db, - substring(event_timestamp, 0, 4), - substring(event_timestamp, 5, 2), - substring(event_timestamp, 7, 2) + substring(event_timestamp, 0, 10) order by wiki_db, dt ; diff --git a/oozie/mediawiki/history/metrics/daily_unique_editors.hql b/oozie/mediawiki/history/metrics/daily_unique_editors.hql index 4a90023..01e16ad 100644 --- a/oozie/mediawiki/history/metrics/daily_unique_editors.hql +++ b/oozie/mediawiki/history/metrics/daily_unique_editors.hql @@ -11,8 +11,8 @@ -- hive -f daily_unique_editors.hql \ -- -d source_table=wmf.mediawiki_history \ -- -d destination_table=wmf.mediawiki_metrics \ --- -d start_timestamp=20010101000000 \ --- -d end_timestamp=20170101000000 \ +-- -d start_timestamp=2001-01-01 00:00:00 \ +-- -d end_timestamp=2017-01-01 00:00:00 \ -- -d wiki_db=all \ -- -d snapshot=2017-03 @@ -28,11 +28,7 @@ -- dynamic partitions must be specified here insert overwrite table ${destination_table} partition (snapshot='${snapshot}', metric, wiki_db) -- dynamic partitions must be selected in order and at the end - select concat_ws('-', - substring(event_timestamp, 0, 4), - substring(event_timestamp, 5, 2), - substring(event_timestamp, 7, 2) - ) as dt, + select substring(event_timestamp, 0, 10) as dt, count(distinct event_user_text) as value, 'daily_unique_editors' as metric, wiki_db @@ -46,9 +42,7 @@ and snapshot = '${snapshot}' group by wiki_db, - substring(event_timestamp, 0, 4), - substring(event_timestamp, 5, 2), - substring(event_timestamp, 7, 2) + substring(event_timestamp, 0, 10) order by wiki_db, dt ; diff --git a/oozie/mediawiki/history/metrics/daily_unique_page_creators.hql b/oozie/mediawiki/history/metrics/daily_unique_page_creators.hql index 9242bc2..70f5a7a 100644 --- a/oozie/mediawiki/history/metrics/daily_unique_page_creators.hql +++ b/oozie/mediawiki/history/metrics/daily_unique_page_creators.hql @@ -11,8 +11,8 @@ -- hive -f daily_unique_page_creators.hql \ -- -d source_table=wmf.mediawiki_history \ -- -d destination_table=wmf.mediawiki_metrics \ --- -d start_timestamp=20010101000000 \ --- -d end_timestamp=20170101000000 \ +-- -d start_timestamp=2001-01-01 00:00:00 \ +-- -d end_timestamp=2017-01-01 00:00:00 \ -- -d wiki_db=all \ -- -d snapshot=2017-03 @@ -28,17 +28,13 @@ -- dynamic partitions must be specified here insert overwrite table ${destination_table} partition (snapshot='${snapshot}', metric, wiki_db) -- dynamic partitions must be selected in order and at the end - select concat_ws('-', - substring(day, 0, 4), - substring(day, 5, 2), - substring(day, 7, 2) - ) as dt, + select day as dt, count(*) as value, 'daily_unique_page_creators' as metric, wiki_db from (select wiki_db, - substring(event_timestamp, 0, 8) day, + substring(event_timestamp, 0, 10) as day, coalesce(event_user_id, event_user_text) as any_user from ${source_table} where event_entity = 'page' @@ -50,13 +46,10 @@ group by wiki_db, - substring(event_timestamp, 0, 8), + substring(event_timestamp, 0, 10), coalesce(event_user_id, event_user_text) ) page_creators - group by wiki_db, - substring(day, 0, 4), - substring(day, 5, 2), - substring(day, 7, 2) + group by wiki_db, day order by wiki_db, dt ; diff --git a/oozie/mediawiki/history/metrics/daily_unique_registered_editors.hql b/oozie/mediawiki/history/metrics/daily_unique_registered_editors.hql index 3907067..5ee51be 100644 --- a/oozie/mediawiki/history/metrics/daily_unique_registered_editors.hql +++ b/oozie/mediawiki/history/metrics/daily_unique_registered_editors.hql @@ -11,8 +11,8 @@ -- hive -f daily_unique_registered_editors.hql \ -- -d source_table=wmf.mediawiki_history \ -- -d destination_table=wmf.mediawiki_metrics \ --- -d start_timestamp=20010101000000 \ --- -d end_timestamp=20170101000000 \ +-- -d start_timestamp=2001-01-01 00:00:00 \ +-- -d end_timestamp=2017-01-01 00:00:00 \ -- -d wiki_db=all \ -- -d snapshot=2017-03 @@ -28,11 +28,7 @@ -- dynamic partitions must be specified here insert overwrite table ${destination_table} partition (snapshot='${snapshot}', metric, wiki_db) -- dynamic partitions must be selected in order and at the end - select concat_ws('-', - substring(event_timestamp, 0, 4), - substring(event_timestamp, 5, 2), - substring(event_timestamp, 7, 2) - ) as dt, + select substring(event_timestamp, 0, 10) as dt, count(distinct event_user_id) as value, 'daily_unique_registered_editors' as metric, wiki_db @@ -48,9 +44,7 @@ and snapshot = '${snapshot}' group by wiki_db, - substring(event_timestamp, 0, 4), - substring(event_timestamp, 5, 2), - substring(event_timestamp, 7, 2) + substring(event_timestamp, 0, 10) order by wiki_db, dt ; diff --git a/oozie/mediawiki/history/metrics/monthly_new_editors.hql b/oozie/mediawiki/history/metrics/monthly_new_editors.hql index 40491a0..c284ee8 100644 --- a/oozie/mediawiki/history/metrics/monthly_new_editors.hql +++ b/oozie/mediawiki/history/metrics/monthly_new_editors.hql @@ -11,8 +11,8 @@ -- hive -f monthly_new_editors.hql \ -- -d source_table=wmf.mediawiki_history \ -- -d destination_table=wmf.mediawiki_metrics \ --- -d start_timestamp=20010101000000 \ --- -d end_timestamp=20170101000000 \ +-- -d start_timestamp=2001-01-01 00:00:00 \ +-- -d end_timestamp=2017-01-01 00:00:00 \ -- -d wiki_db=all \ -- -d snapshot=2017-03 @@ -28,25 +28,21 @@ -- dynamic partitions must be specified here insert overwrite table ${destination_table} partition (snapshot='${snapshot}', metric, wiki_db) -- dynamic partitions must be selected in order and at the end - select concat_ws('-', - substring(month, 0, 4), - substring(month, 5, 2), - '01' - ) as dt, + select concat(month, '01') as dt, count(*) as value, 'monthly_new_editors' as metric, wiki_db from (select wiki_db, - substring(event_user_creation_timestamp, 0, 6) month, + substring(event_user_creation_timestamp, 0, 8) month, event_user_id from ${source_table} where event_entity = 'revision' and event_type = 'create' and event_user_id is not null and event_user_is_created_by_self - and unix_timestamp(event_timestamp, 'yyyyMMddHHmmss') - - unix_timestamp(event_user_creation_timestamp, 'yyyyMMddHHmmss') + and unix_timestamp(event_timestamp) - + unix_timestamp(event_user_creation_timestamp) <= 86400 and ('${wiki_db}' = 'all' or wiki_db = '${wiki_db}') and event_user_creation_timestamp >= '${start_timestamp}' @@ -54,14 +50,13 @@ and snapshot = '${snapshot}' group by wiki_db, - substring(event_user_creation_timestamp, 0, 6), + substring(event_user_creation_timestamp, 0, 8), event_user_id ) new_editors group by wiki_db, - substring(month, 0, 4), - substring(month, 5, 2) + substring(event_timestamp, 0, 8) order by wiki_db, dt ; diff --git a/oozie/mediawiki/history/metrics/monthly_new_registered_users.hql b/oozie/mediawiki/history/metrics/monthly_new_registered_users.hql index 4e930e3..e0eaa59 100644 --- a/oozie/mediawiki/history/metrics/monthly_new_registered_users.hql +++ b/oozie/mediawiki/history/metrics/monthly_new_registered_users.hql @@ -11,8 +11,8 @@ -- hive -f monthly_new_registered_users.hql \ -- -d source_table=wmf.mediawiki_history \ -- -d destination_table=wmf.mediawiki_metrics \ --- -d start_timestamp=20010101000000 \ --- -d end_timestamp=20170101000000 \ +-- -d start_timestamp=2001-01-01 00:00:00 \ +-- -d end_timestamp=2017-01-01 00:00:00 \ -- -d wiki_db=all \ -- -d snapshot=2017-03 @@ -28,11 +28,7 @@ -- dynamic partitions must be specified here insert overwrite table ${destination_table} partition (snapshot='${snapshot}', metric, wiki_db) -- dynamic partitions must be selected in order and at the end - select concat_ws('-', - substring(event_timestamp, 0, 4), - substring(event_timestamp, 5, 2), - '01' - ) as dt, + select concat(substring(event_user_creation_timestamp, 0, 8), '01') as dt, count(*) as value, 'monthly_new_registered_users' as metric, wiki_db @@ -47,8 +43,7 @@ and snapshot = '${snapshot}' group by wiki_db, - substring(event_timestamp, 0, 4), - substring(event_timestamp, 5, 2) + substring(event_user_creation_timestamp, 0, 8) order by wiki_db, dt ; diff --git a/oozie/mediawiki/history/metrics/monthly_surviving_new_editors.hql b/oozie/mediawiki/history/metrics/monthly_surviving_new_editors.hql index 0195027..e4591cb 100644 --- a/oozie/mediawiki/history/metrics/monthly_surviving_new_editors.hql +++ b/oozie/mediawiki/history/metrics/monthly_surviving_new_editors.hql @@ -11,8 +11,8 @@ -- hive -f monthly_surviving_new_editors.hql \ -- -d source_table=wmf.mediawiki_history \ -- -d destination_table=wmf.mediawiki_metrics \ --- -d start_timestamp=20010101000000 \ --- -d end_timestamp=20170101000000 \ +-- -d start_timestamp=2001-01-01 00:00:00 \ +-- -d end_timestamp=2017-01-01 00:00:00 \ -- -d wiki_db=all \ -- -d snapshot=2017-03 @@ -28,25 +28,21 @@ -- dynamic partitions must be specified here insert overwrite table ${destination_table} partition (snapshot='${snapshot}', metric, wiki_db) -- dynamic partitions must be selected in order and at the end - select concat_ws('-', - substring(surviving_editors.month, 0, 4), - substring(surviving_editors.month, 5, 2), - '01' - ) as dt, + select concat(surviving_editors.month, '01') as dt, count(*) as value, 'monthly_surviving_new_editors' as metric, surviving_editors.wiki_db from (select wiki_db, - substring(event_user_creation_timestamp, 0, 6) month, + substring(event_user_creation_timestamp, 0, 8) as month, event_user_id from ${source_table} where event_entity = 'revision' and event_type = 'create' and event_user_id is not null and event_user_is_created_by_self - and unix_timestamp(event_timestamp, 'yyyyMMddHHmmss') - - unix_timestamp(event_user_creation_timestamp, 'yyyyMMddHHmmss') + and unix_timestamp(event_timestamp) - + unix_timestamp(event_user_creation_timestamp) <= 86400 and ('${wiki_db}' = 'all' or wiki_db = '${wiki_db}') and event_user_creation_timestamp >= '${start_timestamp}' @@ -54,7 +50,7 @@ and snapshot = '${snapshot}' group by wiki_db, - substring(event_user_creation_timestamp, 0, 6), + substring(event_user_creation_timestamp, 0, 8), event_user_id ) new_editors @@ -62,13 +58,13 @@ inner join (select wiki_db, - substring(event_user_creation_timestamp, 0, 6) month, + substring(event_user_creation_timestamp, 0, 8) as month, event_user_id from ${source_table} where event_entity = 'revision' and event_type = 'create' - and unix_timestamp(event_timestamp, 'yyyyMMddHHmmss') - - unix_timestamp(event_user_creation_timestamp, 'yyyyMMddHHmmss') + and unix_timestamp(event_timestamp) - + unix_timestamp(event_user_creation_timestamp) between 2592000 and 5184000 and ('${wiki_db}' = 'all' or wiki_db = '${wiki_db}') and event_user_creation_timestamp >= '${start_timestamp}' @@ -76,7 +72,7 @@ and snapshot = '${snapshot}' group by wiki_db, - substring(event_user_creation_timestamp, 0, 6), + substring(event_user_creation_timestamp, 0, 8), event_user_id ) surviving_editors on surviving_editors.event_user_id = new_editors.event_user_id @@ -84,8 +80,7 @@ and surviving_editors.wiki_db = new_editors.wiki_db group by surviving_editors.wiki_db, - substring(surviving_editors.month, 0, 4), - substring(surviving_editors.month, 5, 2) + substring(event_user_creation_timestamp, 0, 8) order by wiki_db, dt ; diff --git a/oozie/mediawiki/history/metrics/workflow.xml b/oozie/mediawiki/history/metrics/workflow.xml index 181587d..c0a60bd 100644 --- a/oozie/mediawiki/history/metrics/workflow.xml +++ b/oozie/mediawiki/history/metrics/workflow.xml @@ -36,11 +36,11 @@ <property> <name>start_timestamp</name> - <description>Start timestamp (YYYYMMDDHHMMSS format) to restrict metrics computation</description> + <description>Start timestamp (YYYY-MM-DD HH:MM:SS format) to restrict metrics computation</description> </property> <property> <name>end_timestamp</name> - <description>End timestamp (YYYYMMDDHHMMSS format) to restrict metrics computation</description> + <description>End timestamp (YYYY-MM-DD HH:MM:SS format) to restrict metrics computation</description> </property> <property> <name>wiki_db</name> -- To view, visit https://gerrit.wikimedia.org/r/361500 To unsubscribe, visit https://gerrit.wikimedia.org/r/settings Gerrit-MessageType: newchange Gerrit-Change-Id: I9f093e50e53b0669f09866f9409d3eb18c7552c9 Gerrit-PatchSet: 1 Gerrit-Project: analytics/refinery Gerrit-Branch: master Gerrit-Owner: Joal <[email protected]> _______________________________________________ MediaWiki-commits mailing list [email protected] https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits
