[ 
https://issues.apache.org/jira/browse/BEAM-5836?focusedWorklogId=160933&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-160933
 ]

ASF GitHub Bot logged work on BEAM-5836:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 31/Oct/18 00:01
            Start Date: 31/Oct/18 00:01
    Worklog Time Spent: 10m 
      Work Description: swegner closed pull request #6880: [BEAM-5836] Add script to sync data for Beam GitHub community metrics.
URL: https://github.com/apache/beam/pull/6880
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

diff --git a/.test-infra/metrics/beamgrafana-deploy.yaml b/.test-infra/metrics/beamgrafana-deploy.yaml
index 34941284172..18e7ff9c3fe 100644
--- a/.test-infra/metrics/beamgrafana-deploy.yaml
+++ b/.test-infra/metrics/beamgrafana-deploy.yaml
@@ -118,6 +118,30 @@ spec:
               secretKeyRef:
                 name: beammetrics-psql-db-credentials
                 key: password
+      - name: beammetricssyncgithub
+        image: gcr.io/apache-beam-testing/beammetricssyncgithub:v20181029
+        env:
+          - name: DB_HOST
+            value: 127.0.0.1
+          - name: DB_PORT
+            value: "5432"
+          - name: DB_DBNAME
+            value: beammetrics
+          - name: DB_DBUSERNAME
+            valueFrom:
+              secretKeyRef:
+                name: beammetrics-psql-db-credentials
+                key: username
+          - name: DB_DBPWD
+            valueFrom:
+              secretKeyRef:
+                name: beammetrics-psql-db-credentials
+                key: password
+          - name: GH_ACCESS_TOKEN
+            valueFrom:
+              secretKeyRef:
+                name: beamcrawler-gh-access-token
+                key: token
       volumes:
       - name: beammetrics-psql-credentials
         secret:
diff --git a/.test-infra/metrics/docker-compose.yml b/.test-infra/metrics/docker-compose.yml
index 95d70c0cad8..546830a7901 100644
--- a/.test-infra/metrics/docker-compose.yml
+++ b/.test-infra/metrics/docker-compose.yml
@@ -43,6 +43,19 @@ services:
       - GF_AUTH_ANONYMOUS_ENABLED=true
       - GF_AUTH_ANONYMOUS_ORG_NAME=Beam
       - GF_INSTALL_PLUGINS=vonage-status-panel
+  syncgithub:
+    image: syncgithub
+    container_name: beamsyncgithub
+    build:
+      context: ./sync/github
+      dockerfile: Dockerfile
+    environment:
+      - DB_HOST=beampostgresql
+      - DB_PORT=5432
+      - DB_DBNAME=beam_metrics
+      - DB_DBUSERNAME=admin
+      - DB_DBPWD=<PGPasswordHere>
+      - GH_ACCESS_TOKEN=<GitHubAccessToken>
   syncjenkins:
     image: syncjenkins
     container_name: beamsyncjenkins
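
(For local testing, the new service can be brought up on its own with something like `docker-compose up --build syncgithub` from `.test-infra/metrics/`; the service and image names are taken from the compose entry above.)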
diff --git a/.test-infra/metrics/sync/github/.pylintrc b/.test-infra/metrics/sync/github/.pylintrc
new file mode 100644
index 00000000000..8c44369ebbf
--- /dev/null
+++ b/.test-infra/metrics/sync/github/.pylintrc
@@ -0,0 +1,174 @@
+#
+#    Licensed to the Apache Software Foundation (ASF) under one or more
+#    contributor license agreements.  See the NOTICE file distributed with
+#    this work for additional information regarding copyright ownership.
+#    The ASF licenses this file to You under the Apache License, Version 2.0
+#    (the "License"); you may not use this file except in compliance with
+#    the License.  You may obtain a copy of the License at
+#
+#       http://www.apache.org/licenses/LICENSE-2.0
+#
+#    Unless required by applicable law or agreed to in writing, software
+#    distributed under the License is distributed on an "AS IS" BASIS,
+#    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#    See the License for the specific language governing permissions and
+#    limitations under the License.
+#
+
+# TODO [BEAM-5923] Find a way to utilize common .pylintrc with python sdk
+
+[MASTER]
+# Ignore auto-generated files.
+ignore=clients
+
+[BASIC]
+# Regular expression which should only match the name
+# of functions or classes which do not require a docstring.
+no-docstring-rgx=(__.*__|main)
+
+# Min length in lines of a function that requires a docstring.
+docstring-min-length=10
+
+# Regular expression which should only match correct module names. The
+# leading underscore is sanctioned for private modules by Google's style
+# guide.
+#
+# There are exceptions to the basic rule (_?[a-z][a-z0-9_]*) to cover
+# requirements of Python's module system and of the presubmit framework.
+module-rgx=^(_?[a-z][a-z0-9_]*)|__init__|PRESUBMIT|PRESUBMIT_unittest$
+
+# Regular expression which should only match correct module level names
+const-rgx=^(_?[A-Z][A-Z0-9_]*|__[a-z0-9_]+__|_?[a-z][a-z0-9_]*)$
+
+# Regular expression which should only match correct class attribute
+class-attribute-rgx=^(_?[A-Z][A-Z0-9_]*|__[a-z0-9_]+__|_?[a-z][a-z0-9_]*)$
+
+# Regular expression which should only match correct class names
+class-rgx=^_?[A-Z][a-zA-Z0-9]*$
+
+# Regular expression which should only match correct function names.
+# 'camel_case' and 'snake_case' group names are used for consistency of naming
+# styles across functions and methods.
+function-rgx=^(?:(?P<camel_case>_?[A-Z][a-zA-Z0-9]*)|(?P<snake_case>_?[a-z][a-z0-9_]*))$
+
+# Regular expression which should only match correct method names.
+# 'camel_case' and 'snake_case' group names are used for consistency of naming
+# styles across functions and methods. 'exempt' indicates a name which is
+# consistent with all naming styles.
+method-rgx=^(?:(?P<exempt>__[a-z0-9_]+__|next)|(?P<camel_case>_{0,2}[A-Z][a-zA-Z0-9]*)|(?P<snake_case>_{0,2}[a-z][a-z0-9_]*))$
+
+# Regular expression which should only match correct instance attribute names
+attr-rgx=^_{0,2}[a-z][a-z0-9_]*$
+
+# Regular expression which should only match correct argument names
+argument-rgx=^[a-z][a-z0-9_]*$
+
+# Regular expression which should only match correct variable names
+variable-rgx=^[a-z][a-z0-9_]*$
+
+# Regular expression which should only match correct list comprehension /
+# generator expression variable names
+inlinevar-rgx=^[a-z][a-z0-9_]*$
+
+# List of builtins function names that should not be used, separated by a comma
+bad-functions=input,apply,reduce
+
+# Good variable names which should always be accepted, separated by a comma
+good-names=main,_
+
+# Bad variable names which should always be refused, separated by a comma
+bad-names=
+
+[MESSAGES CONTROL]
+disable =
+  abstract-method,
+  arguments-differ,
+  attribute-defined-outside-init,
+  bad-builtin,
+  bad-super-call,
+  broad-except,
+  consider-using-enumerate,
+  cyclic-import,
+  design,
+  fixme,
+  function-redefined,
+  global-statement,
+  import-error,
+  import-self,
+  inconsistent-return-statements,
+  invalid-name,
+  invalid-unary-operand-type,
+  keyword-arg-before-vararg,
+  len-as-condition,
+  locally-disabled,
+  locally-enabled,
+  logging-not-lazy,
+  misplaced-bare-raise,
+  missing-docstring,
+  multiple-statements,
+  no-else-return,
+  no-member,
+  no-name-in-module,
+  no-self-use,
+  no-value-for-parameter,
+  not-callable,
+  pointless-statement,
+  protected-access,
+  raising-format-tuple,
+  raising-non-exception,
+  redefined-builtin,
+  redefined-outer-name,
+  redefined-variable-type,
+  redundant-keyword-arg,
+  relative-import,
+  similarities,
+  simplifiable-if-statement,
+  super-init-not-called,
+  super-on-old-class,
+  undefined-variable,
+  unexpected-keyword-arg,
+  unidiomatic-typecheck,
+  unnecessary-lambda,
+  unneeded-not,
+  unused-argument,
+  unused-wildcard-import,
+  wildcard-import,
+  wrong-import-order,
+
+[REPORTS]
+# Tells whether to display a full report or only the messages
+reports=no
+
+[CLASSES]
+# List of method names used to declare (i.e. assign) instance attributes.
+defining-attr-methods=__init__,__new__,setUp
+
+# "class_" is also a valid for the first argument to a class method.
+valid-classmethod-first-arg=cls,class_
+
+[FORMAT]
+# Maximum number of characters on a single line.
+max-line-length=80
+
+# Maximum number of lines in a module
+max-module-lines=99999
+
+# String used as indentation unit. (2 spaces.)
+indent-string='  '
+
+# Number of spaces of indent required.
+indent-after-paren=4
+
+# Regexp for a line that is allowed to be longer than the limit.
+# Long import lines or URLs in comments or pydocs. 
+ignore-long-lines=(?x)
+  (^\s*(import|from)\s
+   |^\s*(\#\ )?<?(https?|ftp):\/\/[^\s\/$.?#].[^\s]*>?$
+   )
+
+[VARIABLES]
+# Tells whether we should check for unused import in __init__ files.
+init-import=no
+
+# A regular expression matching names used for dummy variables (i.e. not used).
+dummy-variables-rgx=^\*{0,2}(_$|unused_|dummy_|args|kwargs)
diff --git a/.test-infra/metrics/sync/github/Dockerfile b/.test-infra/metrics/sync/github/Dockerfile
new file mode 100644
index 00000000000..0ee43c54801
--- /dev/null
+++ b/.test-infra/metrics/sync/github/Dockerfile
@@ -0,0 +1,29 @@
+################################################################################
+#  Licensed to the Apache Software Foundation (ASF) under one
+#  or more contributor license agreements.  See the NOTICE file
+#  distributed with this work for additional information
+#  regarding copyright ownership.  The ASF licenses this file
+#  to you under the Apache License, Version 2.0 (the
+#  "License"); you may not use this file except in compliance
+#  with the License.  You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+#  Unless required by applicable law or agreed to in writing, software
+#  distributed under the License is distributed on an "AS IS" BASIS,
+#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#  See the License for the specific language governing permissions and
+#  limitations under the License.
+################################################################################
+
+FROM python:3
+
+WORKDIR /usr/src/app
+
+COPY . .
+
+RUN pip install pylint yapf nose
+RUN pip install --no-cache-dir -r requirements.txt
+
+
+CMD python ./sync.py
diff --git a/.test-infra/metrics/sync/github/README.md b/.test-infra/metrics/sync/github/README.md
new file mode 100644
index 00000000000..ef6a6d3312f
--- /dev/null
+++ b/.test-infra/metrics/sync/github/README.md
@@ -0,0 +1,24 @@
+<!--
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+-->
+# Running script locally
+1. Build container
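+   (e.g. `docker build -t syncgithub .` from this directory; the tag matches
+   the `syncgithub` image name in docker-compose.yml)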
+2. `docker run -it --rm --name sync -v "$PWD":/usr/src/myapp -w /usr/src/myapp -e "DB_HOST=<dbhost>" -e "DB_PORT=5432" -e "DB_DBNAME=beam_metrics" -e "DB_DBUSERNAME=admin" -e "DB_DBPWD=aaa" -e "GH_ACCESS_TOKEN=<githubaccesstoken>" syncgithub python sync.py`
+
+# Running linter
+`docker run -it --rm --name sync -v "$PWD":/usr/src/myapp -w /usr/src/myapp syncgithub pylint sync.py`
diff --git a/.test-infra/metrics/sync/github/ghutilities.py b/.test-infra/metrics/sync/github/ghutilities.py
new file mode 100644
index 00000000000..865475787ae
--- /dev/null
+++ b/.test-infra/metrics/sync/github/ghutilities.py
@@ -0,0 +1,33 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+
+'''This file contains a list of utilities for working with GitHub data.'''
+
+from datetime import datetime
+import re
+
+GITHUB_DATETIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"
+
+
+def datetimeFromGHTimeStr(text):
+  '''Parse GitHub time format into datetime structure.'''
+  return datetime.strptime(text, GITHUB_DATETIME_FORMAT)
+
+
+def datetimeToGHTimeStr(timestamp):
+  '''Convert datetime to GitHub datetime string'''
+  return timestamp.strftime(GITHUB_DATETIME_FORMAT)
+
+
+def findMentions(text):
+  '''Returns all mentions in text. Skips "username".'''
+  matches = re.findall("@(\\w+)", text)
+  return list(filter(lambda x: (x != "username" and x != ""), matches))
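
(A quick illustration of the helpers above, with illustrative inputs:

    import ghutilities

    ts = ghutilities.datetimeFromGHTimeStr('2018-10-26T20:00:00Z')
    assert ghutilities.datetimeToGHTimeStr(ts) == '2018-10-26T20:00:00Z'
    # 'username' is filtered out by findMentions.
    assert ghutilities.findMentions('@username @first and @second') == [
        'first', 'second']
)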
diff --git a/.test-infra/metrics/sync/github/queries.py b/.test-infra/metrics/sync/github/queries.py
new file mode 100644
index 00000000000..f734a3c8c61
--- /dev/null
+++ b/.test-infra/metrics/sync/github/queries.py
@@ -0,0 +1,116 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+
+
+'''
+This query is used to fetch PR data from GitHub via GitHub API v4 (GraphQL).
+Returns PRs on the apache/beam repo updated after the provided timestamp.
+Time format: "2017-10-26T20:00:00Z"
+'''
+MAIN_PR_QUERY = '''
+query {
+  search(query: "type:pr repo:apache/beam updated:><TimestampSubstituteLocation> sort:updated-asc", type: ISSUE, first: 100) {
+    issueCount
+    pageInfo {
+      endCursor
+      startCursor
+      hasNextPage
+      hasPreviousPage
+    }
+    edges {
+      cursor
+      node {
+        ... on PullRequest {
+          number
+          createdAt
+          updatedAt
+          closedAt
+          comments(first: 100) {
+            pageInfo {
+              endCursor
+              startCursor
+              hasNextPage
+              hasPreviousPage
+            }
+            edges {
+              node {
+                author {
+                  login
+                }
+                body
+                createdAt
+              }
+            }
+          }
+          reviewRequests(first: 50) {
+            pageInfo {
+              startCursor
+              endCursor
+              hasNextPage
+              hasPreviousPage
+            }
+            edges {
+              node {
+                requestedReviewer {
+                  ... on User {
+                    login
+                  }
+                }
+              }
+            }
+          }
+          assignees(first: 50) {
+            pageInfo {
+              startCursor
+              endCursor
+              hasNextPage
+              hasPreviousPage
+            }
+            edges {
+              node {
+                login
+              }
+            }
+          }
+          reviews (first:50) {
+            pageInfo {
+              startCursor
+              endCursor
+              hasNextPage
+              hasPreviousPage
+            }
+            edges {
+              node {
+                author {
+                  login
+                }
+                body
+                createdAt
+                state
+              }
+            }
+          }
+          author {
+            login
+          }
+          url
+          body
+          merged
+          mergedAt
+          mergedBy {
+            login
+          }
+        }
+      }
+    }
+  }
+}
+'''
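
(sync.py splices a timestamp into the placeholder before posting the query; a minimal sketch of that substitution, with an illustrative timestamp:

    import queries

    # GitHub search returns at most `first: 100` PRs per request; the sync
    # loop advances by `updatedAt` (oldest first) rather than GraphQL cursors.
    query = queries.MAIN_PR_QUERY.replace(
        '<TimestampSubstituteLocation>', '2018-10-26T20:00:00Z')
)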
diff --git a/.test-infra/metrics/sync/github/requirements.txt b/.test-infra/metrics/sync/github/requirements.txt
new file mode 100644
index 00000000000..14a64686e55
--- /dev/null
+++ b/.test-infra/metrics/sync/github/requirements.txt
@@ -0,0 +1,19 @@
+#    Licensed to the Apache Software Foundation (ASF) under one or more
+#    contributor license agreements.  See the NOTICE file distributed with
+#    this work for additional information regarding copyright ownership.
+#    The ASF licenses this file to You under the Apache License, Version 2.0
+#    (the "License"); you may not use this file except in compliance with
+#    the License.  You may obtain a copy of the License at
+#
+#       http://www.apache.org/licenses/LICENSE-2.0
+#
+#    Unless required by applicable law or agreed to in writing, software
+#    distributed under the License is distributed on an "AS IS" BASIS,
+#    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#    See the License for the specific language governing permissions and
+#    limitations under the License.
+
+requests
+psycopg2-binary
+ddt
+
diff --git a/.test-infra/metrics/sync/github/sync.py b/.test-infra/metrics/sync/github/sync.py
new file mode 100644
index 00000000000..1b67d98f302
--- /dev/null
+++ b/.test-infra/metrics/sync/github/sync.py
@@ -0,0 +1,421 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+'''
+This module queries GitHub to collect Beam-related metrics and put them in
+PostgreSQL.
+'''
+import itertools
+import os
+import socket
+import sys
+import time
+import traceback
+import re
+from datetime import datetime
+
+import requests
+import psycopg2
+
+import queries
+import ghutilities
+
+
+# Keeping this as reference for localhost debug:
+# fetches the docker host machine IP for testing purposes.
+def findDockerNetworkIP():
+  '''Utilizes ip tool to find docker network IP.'''
+  import subprocess
+  cmd_out = subprocess.check_output(["ip", "route", "show"]).decode("utf-8")
+  return cmd_out.split(" ")[2]
+
+
+# Use the configured host in production; swap in findDockerNetworkIP() when
+# debugging against a local docker network.
+DB_HOST = os.environ['DB_HOST']
+DB_PORT = os.environ['DB_PORT']
+DB_NAME = os.environ['DB_DBNAME']
+DB_USER_NAME = os.environ['DB_DBUSERNAME']
+DB_PASSWORD = os.environ['DB_DBPWD']
+
+GH_ACCESS_TOKEN = os.environ['GH_ACCESS_TOKEN']
+
+GH_PRS_TABLE_NAME = 'gh_pull_requests'
+
+GH_PRS_CREATE_TABLE_QUERY = f"""
+  create table {GH_PRS_TABLE_NAME} (
+  pr_id integer NOT NULL PRIMARY KEY,
+  author varchar NOT NULL,
+  created_ts timestamp NOT NULL,
+  first_non_author_activity_ts timestamp NULL,
+  first_non_author_activity_author varchar NULL,
+  closed_ts timestamp NULL,
+  updated_ts timestamp NOT NULL,
+  is_merged boolean NOT NULL,
+  requested_reviewers varchar[] NOT NULL,
+  beam_reviewers varchar[] NOT NULL,
+  mentioned varchar[] NOT NULL,
+  reviewed_by varchar[] NOT NULL
+  )
+  """
+
+GH_SYNC_METADATA_TABLE_NAME = 'gh_sync_metadata'
+GH_SYNC_METADATA_TABLE_CREATE_QUERY = f"""
+  create table {GH_SYNC_METADATA_TABLE_NAME} (
+  name varchar NOT NULL PRIMARY KEY,
+  timestamp timestamp NOT NULL
+  )
+  """
+
+
+def initDBConnection():
+  '''Opens connection to postgresql DB, as configured via global variables.'''
+  conn = None
+  while not conn:
+    try:
+      conn = psycopg2.connect(
+          f"dbname='{DB_NAME}' user='{DB_USER_NAME}' host='{DB_HOST}'"
+          f" port='{DB_PORT}' password='{DB_PASSWORD}'")
+    except:
+      print('Failed to connect to DB; retrying in 1 minute')
+      sys.stdout.flush()
+      time.sleep(60)
+  return conn
+
+
+def tableExists(cursor, tableName):
+  '''Checks the existence of a table.'''
+  cursor.execute(f"select * from information_schema.tables"
+                 f" where table_name='{tableName}';")
+  return bool(cursor.rowcount)
+
+
+def initDbTablesIfNeeded():
+  '''Creates and initializes DB tables required for script to work.'''
+  connection = initDBConnection()
+  cursor = connection.cursor()
+
+  buildsTableExists = tableExists(cursor, GH_PRS_TABLE_NAME)
+  print('PRs table exists', buildsTableExists)
+  if not buildsTableExists:
+    cursor.execute(GH_PRS_CREATE_TABLE_QUERY)
+    if not bool(cursor.rowcount):
+      raise Exception(f"Failed to create table {GH_PRS_TABLE_NAME}")
+
+  metadataTableExists = tableExists(cursor, GH_SYNC_METADATA_TABLE_NAME)
+  print('Metadata table exists', metadataTableExists)
+  if not metadataTableExists:
+    cursor.execute(GH_SYNC_METADATA_TABLE_CREATE_QUERY)
+    if not bool(cursor.rowcount):
+      raise Exception(f"Failed to create table {GH_SYNC_METADATA_TABLE_NAME}")
+
+  cursor.close()
+  connection.commit()
+
+  connection.close()
+
+
+def fetchLastSyncTimestamp(cursor):
+  '''Fetches last sync timestamp from metadata DB table.'''
+  fetchQuery = f'''
+  SELECT timestamp
+  FROM {GH_SYNC_METADATA_TABLE_NAME}
+  WHERE name LIKE 'gh_sync'
+  '''
+
+  cursor.execute(fetchQuery)
+  queryResult = cursor.fetchone()
+
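+  # No previous sync recorded: default far enough in the past that the first
+  # run backfills the full PR history.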
+  defaultResult = datetime(year=1980, month=1, day=1)
+  return defaultResult if queryResult is None else queryResult[0]
+
+
+def updateLastSyncTimestamp(timestamp):
+  '''Updates last sync timestamp in metadata DB table.'''
+  connection = initDBConnection()
+  cursor = connection.cursor()
+
+  insertTimestampSqlQuery = f'''INSERT INTO {GH_SYNC_METADATA_TABLE_NAME}
+                                  (name, timestamp)
+                                VALUES ('gh_sync', %s) 
+                                ON CONFLICT (name) DO UPDATE
+                                  SET timestamp = excluded.timestamp
+                                '''
+  cursor.execute(insertTimestampSqlQuery, [timestamp])
+
+  cursor.close()
+  connection.commit()
+  connection.close()
+
+
+def executeGHGraphqlQuery(query):
+  '''Runs graphql query on GitHub.'''
+  url = 'https://api.github.com/graphql'
+  headers = {'Authorization': f'Bearer {GH_ACCESS_TOKEN}'}
+  r = requests.post(url=url, json={'query': query}, headers=headers)
+  return r.json()
+
+
+def fetchGHData(timestamp):
+  '''Fetches GitHub data required for reporting Beam metrics'''
+  tsString = ghutilities.datetimeToGHTimeStr(timestamp)
+  query = queries.MAIN_PR_QUERY.replace(
+      '<TimestampSubstituteLocation>', tsString)
+  return executeGHGraphqlQuery(query)
+
+
+def extractRequestedReviewers(pr):
+  reviewEdges = pr["reviewRequests"]["edges"]
+  return list(
+      map(lambda x: x["node"]["requestedReviewer"]["login"], reviewEdges))
+
+
+def extractMentions(pr):
+  body = pr["body"]
+  commentEdges = pr["comments"]["edges"]
+  reviewEdges = pr["reviews"]["edges"]
+
+  bodyMentions = ghutilities.findMentions(body)
+  commentMentionsLists = map(
+      lambda x: ghutilities.findMentions(x["node"]["body"]), commentEdges)
+  reviewMentionsLists = map(
+      lambda x: ghutilities.findMentions(x["node"]["body"]), reviewEdges)
+  commentMentions = [
+      item for sublist in commentMentionsLists for item in sublist
+  ]
+  reviewMentions = [
+      item for sublist in reviewMentionsLists for item in sublist
+  ]
+
+  mentionsSet = set(bodyMentions) | set(commentMentions) | set(reviewMentions)
+  return list(mentionsSet)
+
+
+def extractFirstNAActivity(pr):
+  '''
+  Returns the timestamp and author login of the first activity on a pull
+  request performed by someone other than the PR author.
+  '''
+  author = pr["author"]["login"]
+  commentEdges = [
+      edge for edge in pr["comments"]["edges"]
+      if edge["node"]["author"]["login"] != author
+  ]
+  reviewEdges = [
+      edge for edge in pr["reviews"]["edges"]
+      if edge["node"]["author"]["login"] != author
+  ]
+  merged = pr["merged"]
+  mergedAt = pr["mergedAt"]
+  mergedBy = None if not merged else pr["mergedBy"]["login"]
+  commentTimestamps = list(
+      map(lambda x: (x["node"]["createdAt"], x["node"]["author"]["login"]),
+          commentEdges))
+  reviewTimestamps = list(
+      map(lambda x: (x["node"]["createdAt"], x["node"]["author"]["login"]),
+          reviewEdges))
+  allTimestamps = commentTimestamps + reviewTimestamps
+  if merged:
+    allTimestamps.append((mergedAt, mergedBy))
+  return (None, None) if not allTimestamps else min(
+      allTimestamps, key=lambda t: t[0])
+
+
+def extractBeamReviewers(pr):
+  '''Extract logins of users defined by Beam as reviewers.'''
+  author = pr['author']['login']
+
+  # All the direct GitHub indicators of reviewers
+  reviewers = []
+  for r in pr['assignees']['edges']:
+    reviewers.append(r['node']['login'])
+  for r in pr['reviewRequests']['edges']:
+    reviewers.append(r['node']['requestedReviewer']['login'])
+
+  # GitHub users that have performed reviews.
+  for r in pr['reviews']['edges']:
+    reviewers.append(r['node']['author']['login'])
+
+  # @r1, @r2 ... look/PTAL/ptal?
+  beam_reviewer_regex = r'(@\w+).*?(?:PTAL|ptal|look)'
+  # R= @r1 @r2 @R3
+  contrib_reviewer_regex = r'(?:^|\W)[Rr]\s*[=:.]((?:[\s,;.]*-?@\w+)+)'
+  username_regex = r'(-?)(@\w+)'
+  for m in [pr['body']] + [c['node']['body'] for c in pr['comments']['edges']]:
+    if m is None:
+      continue
+    for match in itertools.chain(
+        re.finditer(contrib_reviewer_regex, m),
+        re.finditer(beam_reviewer_regex, m)):
+      for user in re.finditer(username_regex, match.groups()[0]):
+        # First group decides if it is additive or subtractive
+        remove = user.groups()[0] == '-'
+        # [1:] to drop the @
+        r = user.groups()[1][1:]
+        if remove and r in reviewers:
+          reviewers.remove(r)
+        elif r not in reviewers:
+          reviewers.append(r)
+  return [r for r in set(reviewers) if r != author]
+
+
+def extractReviewers(pr):
+  '''Extracts reviewer logins from a PR.'''
+  return [edge["node"]["author"]["login"] for edge in pr["reviews"]["edges"]]
+
+
+def extractRowValuesFromPr(pr):
+  '''
+  Extracts row values required to fill Beam metrics table from PullRequest
+  GraphQL response.
+  '''
+  requestedReviewers = extractRequestedReviewers(pr)
+  mentions = extractMentions(pr)
+  firstNAActivity, firstNAAuthor = extractFirstNAActivity(pr)
+  beamReviewers = extractBeamReviewers(pr)
+  reviewedBy = extractReviewers(pr)
+
+  result = [
+      pr["number"], pr["author"]["login"], pr["createdAt"], pr["updatedAt"],
+      pr["closedAt"], pr["merged"], firstNAActivity, firstNAAuthor,
+      requestedReviewers, mentions, beamReviewers, reviewedBy
+  ]
+
+  return result
+
+
+def upsertIntoPRsTable(cursor, values):
+  upsertPRRowQuery = f'''INSERT INTO {GH_PRS_TABLE_NAME}
+                            (pr_id,
+                            author,
+                            created_ts,
+                            updated_ts,
+                            closed_ts,
+                            is_merged,
+                            first_non_author_activity_ts,
+                            first_non_author_activity_author,
+                            requested_reviewers,
+                            mentioned,
+                            beam_reviewers,
+                            reviewed_by)
+                          VALUES
+                            (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
+                          ON CONFLICT (pr_id) DO UPDATE
+                            SET
+                            pr_id=excluded.pr_id,
+                            author=excluded.author,
+                            created_ts=excluded.created_ts,
+                            updated_ts=excluded.updated_ts,
+                            closed_ts=excluded.closed_ts,
+                            is_merged=excluded.is_merged,
+                            first_non_author_activity_ts=excluded.first_non_author_activity_ts,
+                            first_non_author_activity_author=excluded.first_non_author_activity_author,
+                            requested_reviewers=excluded.requested_reviewers,
+                            mentioned=excluded.mentioned,
+                            beam_reviewers=excluded.beam_reviewers,
+                            reviewed_by=excluded.reviewed_by
+                          '''
+  cursor.execute(upsertPRRowQuery, values)
+
+
+def fetchNewData():
+  '''
+  Main workhorse method. Fetches data from GitHub and puts it in metrics table.
+  '''
+  connection = initDBConnection()
+  cursor = connection.cursor()
+  lastSyncTimestamp = fetchLastSyncTimestamp(cursor)
+  cursor.close()
+  connection.close()
+
+  currTS = lastSyncTimestamp
+
+  resultsPresent = True
+  while resultsPresent:
+    print("Syncing data for: ", currTS)
+    jsonData = fetchGHData(currTS)
+
+    connection = initDBConnection()
+    cursor = connection.cursor()
+
+    if "errors" in jsonData:
+      print("Failed to fetch data, error:", jsonData)
+      return
+
+    prs = None
+    try:
+      prs = jsonData["data"]["search"]["edges"]
+    except:
+      # TODO: this means the API returned an error.
+      # We might want to bring this to stderr or utilize other means of
+      # logging. Examples: we hit throttling, etc.
+      print("Got bad json format: ", jsonData)
+      return
+
+    if not prs:
+      resultsPresent = False
+
+    for edge in prs:
+      pr = edge["node"]
+      try:
+        rowValues = extractRowValuesFromPr(pr)
+      except Exception as e:
+        print("Failed to extract data. Exception: ", e, " PR: ", edge)
+        traceback.print_tb(e.__traceback__)
+        return
+
+      upsertIntoPRsTable(cursor, rowValues)
+
+      prUpdateTime = ghutilities.datetimeFromGHTimeStr(pr["updatedAt"])
+
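+      # Advance the sync cursor to the newest PR update seen so far; this is
+      # persisted below and becomes the next query's lower bound.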
+      currTS = max(currTS, prUpdateTime)
+
+    cursor.close()
+    connection.commit()
+    connection.close()
+
+    updateLastSyncTimestamp(currTS)
+
+
+def probeGitHubIsUp():
+  '''
+  Returns True if GitHub responds to simple queries. Else returns False.
+  '''
+  sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
+  result = sock.connect_ex(('github.com', 443))
+  sock.close()
+  return result == 0
+
+
+################################################################################
+if __name__ == '__main__':
+  '''
+  This script is supposed to be invoked directly. However, to allow importing
+  it (e.g. from tests), the work is wrapped in a __main__ check.
+  '''
+  print("Started.")
+
+  print("Checking if DB needs to be initialized.")
+  sys.stdout.flush()
+  initDbTablesIfNeeded()
+
+  while True:
+    print("Start PR fetching.")
+    sys.stdout.flush()
+
+    if not probeGitHubIsUp():
+      print("GitHub is unavailable, skipping fetching data.")
+    else:
+      print("GitHub is available, start fetching data.")
+      fetchNewData()
+      print("Fetched data.")
+    print("Sleeping for 5 minutes.")
+    sys.stdout.flush()
+    time.sleep(5 * 60)
+
+  print('Done.')
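
(For intuition, a minimal sketch of what sync.py's extractFirstNAActivity computes on a hand-built PR fragment, not real API output:

    pr = {
        'author': {'login': 'alice'},
        'comments': {'edges': [
            {'node': {'author': {'login': 'bob'},
                      'body': 'LGTM', 'createdAt': '2018-10-26T20:00:00Z'}},
        ]},
        'reviews': {'edges': []},
        'merged': False,
        'mergedAt': None,
    }
    # First activity by someone other than the author:
    # extractFirstNAActivity(pr) -> ('2018-10-26T20:00:00Z', 'bob')
)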
diff --git a/.test-infra/metrics/sync/github/sync_test.py b/.test-infra/metrics/sync/github/sync_test.py
new file mode 100644
index 00000000000..5f4fa6cbcfe
--- /dev/null
+++ b/.test-infra/metrics/sync/github/sync_test.py
@@ -0,0 +1,33 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+'''Tests for syncing metrics related data from GitHub.'''
+
+import unittest
+from ddt import ddt, data
+import ghutilities
+
+
+@ddt
+class GhutilitiesTestCase(unittest.TestCase):
+
+  @data(("sample text with mention @mention", ["mention"]),
+        ("Data without mention", []),
+        ("sample text with several mentions @first, @second @third",
+         ["first", "second", "third"]))
+  def test_findMentions_finds_mentions_by_pattern(self, params):
+    text, expectedResult = params
+    result = ghutilities.findMentions(text)
+    self.assertEqual(expectedResult, result)
+
+  # TODO: add a test that covers reviewer extraction from comment bodies
+  # (see extractBeamReviewers in sync.py).
+
+if __name__ == '__main__':
+  unittest.main()
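
(Since the Dockerfile installs nose, these tests can presumably be run in the same container used for linting, e.g. `docker run -it --rm -v "$PWD":/usr/src/myapp -w /usr/src/myapp syncgithub nosetests sync_test.py`, by analogy with the README's linter invocation.)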
diff --git a/.test-infra/metrics/sync/jenkins/syncjenkins.py b/.test-infra/metrics/sync/jenkins/syncjenkins.py
index 0f198cc132d..c3e82dd1efa 100644
--- a/.test-infra/metrics/sync/jenkins/syncjenkins.py
+++ b/.test-infra/metrics/sync/jenkins/syncjenkins.py
@@ -70,8 +70,14 @@ def fetchJobs():
   return result
 
 def initConnection():
-  conn = psycopg2.connect(f"dbname='{dbname}' user='{dbusername}' host='{host}'"
-                          f" port='{port}' password='{dbpassword}'")
+  conn = None
+  while not conn:
+    try:
+      conn = psycopg2.connect(f"dbname='{dbname}' user='{dbusername}' host='{host}'"
+                              f" port='{port}' password='{dbpassword}'")
+    except:
+      print('Failed to connect to DB; retrying in 1 minute')
+      time.sleep(60)
   return conn
 
 def tableExists(cursor, tableName):
@@ -96,7 +102,7 @@ def initDbTablesIfNeeded():
 
   connection.close()
 
-
+# TODO rename to fetchLastSyncJobIds
 def fetchLastSyncTimestamp(cursor):
   fetchQuery = f'''
   select job_name, max(build_id)
diff --git a/.test-infra/metrics/sync/jira/syncjira.py b/.test-infra/metrics/sync/jira/syncjira.py
index 4ce3cac75c1..95c3f935ae9 100644
--- a/.test-infra/metrics/sync/jira/syncjira.py
+++ b/.test-infra/metrics/sync/jira/syncjira.py
@@ -73,8 +73,14 @@ def fetchIssues(startTime, startAt = 0):
 
 
 def initDBConnection():
-  conn = psycopg2.connect(f"dbname='{dbname}' user='{dbusername}' host='{host}'"
-                          f" port='{port}' password='{dbpassword}'")
+  conn = None
+  while not conn:
+    try:
+      conn = psycopg2.connect(f"dbname='{dbname}' user='{dbusername}' host='{host}'"
+                              f" port='{port}' password='{dbpassword}'")
+    except:
+      print('Failed to connect to DB; retrying in 1 minute')
+      time.sleep(60)
   return conn
 
 
@@ -157,7 +163,6 @@ def buildRowValuesArray(issue):
 
 
 def insertRow(cursor, rowValues):
-  print(len(rowValues))
   insertClause = (f'''insert into {jiraIssuesTableName}
                    values (%s, %s, %s, %s, %s, %s, %s, %s, %s)
                    ON CONFLICT (id) DO UPDATE


 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 160933)
    Time Spent: 3h 10m  (was: 3h)

> Build Code Velocity metrics pipeline
> ------------------------------------
>
>                 Key: BEAM-5836
>                 URL: https://issues.apache.org/jira/browse/BEAM-5836
>             Project: Beam
>          Issue Type: Sub-task
>          Components: project-management
>            Reporter: Scott Wegner
>            Assignee: Mikhail Gryzykhin
>            Priority: Major
>          Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> Based on discussion from [Improving Beam Code 
> Review|https://s.apache.org/beam-code-review], we want to build two new 
> metric dashboards:
>  * Open Pull Requests (over time, aggregate and per-reviewer), to track 
> reviewer load
>  * Pull Request Time-to-first-comment (over time), to track review latency



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
