This is an automated email from the ASF dual-hosted git repository.
gerlowskija pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/solr-sandbox.git
The following commit(s) were added to refs/heads/main by this push:
new 8171c95 Add scripts to fetch+prepare solr-community data (#111)
8171c95 is described below
commit 8171c954d7fe479eb841c13dabbc72e5605eda13
Author: Jason Gerlowski <[email protected]>
AuthorDate: Tue Sep 24 07:41:40 2024 -0700
Add scripts to fetch+prepare solr-community data (#111)
One common barrier to running performance tests is the lack of a dataset
that's ready to be indexed into Solr. Wikipedia dumps and other public
datasets are common, but work is needed to massage them into a format
that's ready for Solr to index.
This commit partially addresses this problem by introducing scripts that
download a particular dataset and prepare it for indexing into Solr.
Hopefully this eliminates at least one of the hurdles in getting a quick
perf test running. Selfishly, it's also useful for gathering stats often
needed by ASF Quarterly Board Reports.
The dataset comprises the public interactions in our community.
Right now this covers interactions on our mailing lists, and data pulled
from Solr's git-log. Hopefully in the future it can be supplemented
with data pulled from JIRA as well. See
`scripts/community-dataset/README.md` for more details on the dataset
and how to use the provided scripts.
---
scripts/community-dataset/.gitignore | 1 +
scripts/community-dataset/README.md | 85 ++++++++++++++++
.../convert-git-repositories-to-solr-docs.sh | 33 +++++++
.../convert-mailing-lists-to-solr-docs.sh | 36 +++++++
.../community-dataset/convert-mbox-to-solr-docs.py | 110 +++++++++++++++++++++
.../community-dataset/download-git-repositories.sh | 30 ++++++
.../community-dataset/download-mailing-lists.sh | 55 +++++++++++
scripts/community-dataset/export-git-data.sh | 48 +++++++++
8 files changed, 398 insertions(+)
diff --git a/scripts/community-dataset/.gitignore b/scripts/community-dataset/.gitignore
new file mode 100644
index 0000000..ea1472e
--- /dev/null
+++ b/scripts/community-dataset/.gitignore
@@ -0,0 +1 @@
+output/
diff --git a/scripts/community-dataset/README.md b/scripts/community-dataset/README.md
new file mode 100644
index 0000000..d800e72
--- /dev/null
+++ b/scripts/community-dataset/README.md
@@ -0,0 +1,85 @@
+# Solr-Community-Datasets
+
+Utility scripts for fetching datasets related to the Solr community, and preparing them for ingestion into Solr.
+All created documents rely on dynamic field suffixes, and should work with Solr's `_default` configset.
+
+## Mailing List Data
+
+Run the following to download and prepare mailing list data for Solr ingestion:
+
+```
+./download-mailing-lists.sh
+./convert-mailing-lists-to-solr-docs.sh
+```
+
+This invocation will create a series of JSON files in the `output/solr-data` directory, ready to be indexed with `bin/solr post`.
+Currently, the created documents reflect email metadata only.
+Email content itself isn't captured for search, though nothing precludes that if users wish to make the requisite changes to `convert-mbox-to-solr-docs.py`.
+
+### Example Mailing List Queries
+
+Assuming mailing list traffic has been ingested into a collection `maildata`, the following example queries are supported:
+
+**Human List Traffic by Month**
+
+```
+export COLLECTION="maildata"
+curl -sk "http://localhost:8983/solr/$COLLECTION/select?facet.field=date_bucket_month_s&\
+facet.sort=index&\
+facet=true&\
+indent=true&\
+q=list_s:dev+OR+list_s:users&\
+rows=0"
+```
+
+**Human List Traffic by Fiscal Quarter** (useful for board-reports)
+
+```
+export COLLECTION="maildata"
+curl -sk "http://localhost:8983/solr/$COLLECTION/select?facet.field=date_bucket_quarter_s&\
+facet.sort=index&\
+facet=true&\
+indent=true&\
+q=list_s:dev+OR+list_s:users&\
+rows=0"
+```
+
+## Git Commit Data
+
+Run the snippet below to download and prepare git-commit data for Solr ingestion.
+Preparing git data can take a good bit longer than the other sources described here, so consider grabbing a coffee while it runs.
+
+```
+./download-git-repositories.sh
+./convert-git-repositories-to-solr-docs.sh
+```
+
+This invocation will create a series of JSON files in the `output/solr-data` directory, ready to be indexed with `bin/solr post`.
+
+### Example Git Data Queries
+
+Assuming git data has been ingested into a collection `gitdata`, the following example queries are supported:
+
+**Compare Commit Volume b/w Two Fiscal Quarters**
+
+Fiscal "quarters" aren't currently computed at index time, as they are for the mailing list data above, but users can still achieve a similar effect by specifying the quarters of interest.
+The query below compares Q1 FY2025 (May-July 2024) with Q1 FY2024 (May-July 2023):
+
+```
+export COLLECTION="gitdata"
+curl -sk "http://localhost:8983/solr/$COLLECTION/select" -d '
+{
+ "query": "*:*",
+ "limit": 0,
+ "facet": {
+ "q1_fy2025": {
+ "type": "query",
+ "q": "date_dt:[2024-05-01T00:00:00Z TO 2024-08-01T00:00:00Z]"
+ },
+ "q1_fy2024": {
+ "type": "query",
+ "q": "date_dt:[2023-05-01T00:00:00Z TO 2023-08-01T00:00:00Z]"
+ }
+ }
+}'
+```
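The quarter boundaries used in the ranges above can also be computed programmatically. Below is a small sketch (the helper name is illustrative, not part of these scripts) following the fiscal-year convention described for the mailing-list data:

```python
# ASF fiscal years run May through April: Q1 of FY2025 covers May-July 2024.
QUARTER_START_MONTH = {"Q1": 5, "Q2": 8, "Q3": 11, "Q4": 2}


def fiscal_quarter_range(fiscal_year, quarter):
    """Return (start, end) ISO instants bounding the given ASF fiscal quarter,
    suitable for plugging into a date_dt range query."""
    start_month = QUARTER_START_MONTH[quarter]
    # Q1-Q3 begin in the calendar year *before* the fiscal year; Q4 begins in it
    start_year = fiscal_year - 1 if start_month >= 5 else fiscal_year
    end_month = start_month + 3
    end_year = start_year
    if end_month > 12:  # Q3 wraps from November into the next calendar year
        end_month -= 12
        end_year += 1
    fmt = "%d-%02d-01T00:00:00Z"
    return (fmt % (start_year, start_month), fmt % (end_year, end_month))
```

For example, `fiscal_quarter_range(2025, "Q1")` yields the same bounds used in the `q1_fy2025` facet above.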
diff --git a/scripts/community-dataset/convert-git-repositories-to-solr-docs.sh b/scripts/community-dataset/convert-git-repositories-to-solr-docs.sh
new file mode 100755
index 0000000..89baa3b
--- /dev/null
+++ b/scripts/community-dataset/convert-git-repositories-to-solr-docs.sh
@@ -0,0 +1,33 @@
+#!/bin/bash
+
+set -eu
+
+# Usage: ./convert-git-repositories-to-solr-docs.sh [<git-repo-directory>] [<solr-doc-output-dir>]
+
+# Determine git dir (must already exist and contain git checkouts)
+DEFAULT_GIT_LOCATION="output/git-data"
+GIT_DIRECTORY="${1:-}"
+if [[ -z "$GIT_DIRECTORY" ]]; then
+ GIT_DIRECTORY="$DEFAULT_GIT_LOCATION"
+fi
+
+# Determine output dir (may not exist)
+DEFAULT_DOC_OUTPUT_DIR="output/solr-data"
+DOC_OUTPUT_DIR="${2:-}"
+if [[ -z ${DOC_OUTPUT_DIR} ]]; then
+ DOC_OUTPUT_DIR=$DEFAULT_DOC_OUTPUT_DIR
+fi
+
+# Ensure doc output dir exists
+if [[ -d "$DOC_OUTPUT_DIR" ]]; then
+  echo "Output directory [$DOC_OUTPUT_DIR] already exists; clearing it out and continuing..."
+  rm -rf "$DOC_OUTPUT_DIR"
+fi
+mkdir -p "$DOC_OUTPUT_DIR"
+
+# This repo list should always remain in sync with the value in 'download-git-repositories.sh'
+GIT_REPOS=("solr" "solr-site" "solr-sandbox" "solr-operator")
+
+for repo in "${GIT_REPOS[@]}"; do
+  ./export-git-data.sh "${GIT_DIRECTORY}/${repo}" "$DOC_OUTPUT_DIR"
+done
diff --git a/scripts/community-dataset/convert-mailing-lists-to-solr-docs.sh b/scripts/community-dataset/convert-mailing-lists-to-solr-docs.sh
new file mode 100755
index 0000000..3e16f4e
--- /dev/null
+++ b/scripts/community-dataset/convert-mailing-lists-to-solr-docs.sh
@@ -0,0 +1,36 @@
+#!/bin/bash
+
+set -eu
+
+# Usage: ./convert-mailing-lists-to-solr-docs.sh [<mbox-data-directory>] [<solr-doc-output-dir>]
+# <mbox-data-directory> and <solr-doc-output-dir> are both optional, but
+# <mbox-data-directory> must be specified if <solr-doc-output-dir> is.
+
+# Determine mbox dir (must already exist and contain mbox data)
+DEFAULT_MBOX_LOCATION="output/mbox-data"
+MBOX_DIRECTORY="${1:-}"
+if [[ -z "$MBOX_DIRECTORY" ]]; then
+ MBOX_DIRECTORY="$DEFAULT_MBOX_LOCATION"
+fi
+
+# Determine doc output dir (may not exist)
+DEFAULT_DOC_OUTPUT_DIR="output/solr-data"
+DOC_OUTPUT_DIR="${2:-}"
+if [[ -z ${DOC_OUTPUT_DIR} ]]; then
+ DOC_OUTPUT_DIR=$DEFAULT_DOC_OUTPUT_DIR
+fi
+
+# Ensure doc output dir exists
+if [[ -d "$DOC_OUTPUT_DIR" ]]; then
+  echo "Output directory [$DOC_OUTPUT_DIR] already exists; clearing it out and continuing..."
+  rm -rf "$DOC_OUTPUT_DIR"
+fi
+mkdir -p "$DOC_OUTPUT_DIR"
+
+# Iterate over mbox files and convert each to a JSON file of Solr docs
+find "$MBOX_DIRECTORY" -name "*.mbox" | while read -r filepath
+do
+    python3 convert-mbox-to-solr-docs.py "$filepath" "$DOC_OUTPUT_DIR"
+done
+
+echo "Solr documents now available in $DOC_OUTPUT_DIR; use 'bin/post' (or 'bin/solr post' depending on your Solr version) to upload!"
diff --git a/scripts/community-dataset/convert-mbox-to-solr-docs.py b/scripts/community-dataset/convert-mbox-to-solr-docs.py
new file mode 100755
index 0000000..2448c0d
--- /dev/null
+++ b/scripts/community-dataset/convert-mbox-to-solr-docs.py
@@ -0,0 +1,110 @@
+#!/usr/bin/env python3
+
+import mailbox
+import sys
+import os
+import uuid
+import json
+from datetime import datetime
+
+# Potential Improvements:
+# - more cleaning for 'subject' and other free text fields
+# - some regex parsing to separate sender name/email in 'from' fields
+# - capture other fields
+def convert_message_to_solr_doc(message, source_list):
+ solr_doc = dict()
+ solr_doc["id"] = str(uuid.uuid4())
+ solr_doc["from_s"] = message.get("From")
+ solr_doc["list_s"] = source_list
+
+    # List-Id, for whatever reason, is always 'dev.solr.apache.org', so omitting this for now
+ #if "List-Id" in message:
+ # solr_doc["mailing_list_s"] = message["List-Id"]
+
+ # 'To' might contain multiple addresses, separated by commas
+    sender_unsplit = message.get("To", "")
+ senders = [line.strip() for line in sender_unsplit.split(",")]
+ solr_doc["to_s"] = sender_unsplit
+ solr_doc["to_ss"] = senders
+
+ # Solr requires dates in a particular format
+    date_str_raw = message.get("Date").replace("(MST)", "").replace("(UTC)", "").replace("(CST)", "").replace("(EST)", "").strip()
+ try:
+ date_obj = datetime.strptime(date_str_raw, "%a, %d %b %Y %H:%M:%S %z")
+ except ValueError:
+ date_obj = datetime.strptime(date_str_raw, "%d %b %Y %H:%M:%S %z")
+ solr_doc["sent_dt"] = date_obj.strftime("%Y-%m-%dT%H:%M:%SZ")
+ solr_doc["date_bucket_month_s"] = to_monthly_bucket(date_obj)
+ solr_doc["date_bucket_quarter_s"] = to_quarterly_bucket(date_obj)
+
+ subject_raw = message.get("Subject")
+ subject_cleaned = subject_raw.lower()
+ if subject_cleaned.startswith("re: "):
+ subject_cleaned = subject_cleaned.replace("re: ", "", 1)
+ solr_doc["subject_raw_s"] = subject_raw
+ solr_doc["subject_raw_txt"] = subject_raw
+ solr_doc["subject_clean_s"] = subject_cleaned
+ solr_doc["subject_clean_txt"] = subject_cleaned
+
+ return solr_doc
+
+def to_monthly_bucket(date_obj):
+    # Zero-pad the month, e.g. "2024-09"
+    return "%d-%02d" % (date_obj.year, date_obj.month)
+
+# Returns a string representing the ASF fiscal quarter this email was sent in. (Useful for compiling quarterly reports!)
+# ASF Fiscal quarters are a bit odd. I don't understand them. But the logic appears to be, taking FY2020 as an example:
+# - Q1 of FY2020 is May, June, and July of 2019
+# - Q2 of FY2020 is August, September, October of 2019
+# - Q3 of FY2020 is November and December of 2019, and January of 2020
+# - Q4 of FY2020 is February, March, and April of 2020
+# Why would "Q1" start in May? Why would the FY and the calendar year be offset in this manner? :shrug:
+def to_quarterly_bucket(date_obj):
+ month = date_obj.month
+ year = date_obj.year
+
+ if month >= 2 and month <= 4:
+ quarter = "Q4"
+ fiscal_year = year
+ elif month >= 5 and month <= 7:
+ quarter = "Q1"
+ fiscal_year = year + 1
+ elif month >= 8 and month <= 10:
+ quarter = "Q2"
+ fiscal_year = year + 1
+ else: # month = 11, 12, 1
+ quarter = "Q3"
+ if month == 1:
+ fiscal_year = year
+ else:
+ fiscal_year = year + 1
+    return str(fiscal_year) + "-" + quarter
+
+if __name__ == "__main__":
+ if len(sys.argv) != 3:
+ print("Incorrect arguments provided")
+        print(" Usage: convert-mbox-to-solr-docs.py <mbox-file> <output-directory>")
+ sys.exit(1)
+
+ mbox_filepath = sys.argv[1]
+ output_directory = sys.argv[2]
+
+    # Filename is assumed to be in the form sourceList-YYYY-MM.mbox (e.g. builds-2023-5.mbox)
+ mbox_filename = os.path.basename(mbox_filepath)
+ source_list = mbox_filename.split("-")[0]
+ solr_doc_filename = mbox_filename.replace(".mbox", ".json")
+ solr_doc_filepath = os.path.join(output_directory, solr_doc_filename)
+
+ with open(solr_doc_filepath, 'w') as solr_doc_writer:
+ solr_doc_writer.write("[")
+ first_doc = True
+ for message in mailbox.mbox(mbox_filepath):
+ if not first_doc:
+ solr_doc_writer.write(",")
+ first_doc = False
+ solr_doc_writer.write("\n")
+ solr_doc = convert_message_to_solr_doc(message, source_list)
+ json.dump(solr_doc, solr_doc_writer)
+ solr_doc_writer.write("]")
diff --git a/scripts/community-dataset/download-git-repositories.sh b/scripts/community-dataset/download-git-repositories.sh
new file mode 100755
index 0000000..598e074
--- /dev/null
+++ b/scripts/community-dataset/download-git-repositories.sh
@@ -0,0 +1,30 @@
+#!/bin/bash
+
+set -eu
+
+# Usage: ./download-git-repositories.sh [<git-output-dir>]
+
+# Determine output dir
+DEFAULT_GIT_LOCATION="output/git-data"
+GIT_OUTPUT_DIR="${1:-}"
+if [[ -z "$GIT_OUTPUT_DIR" ]]; then
+ GIT_OUTPUT_DIR="$DEFAULT_GIT_LOCATION"
+fi
+
+if [[ -d "$GIT_OUTPUT_DIR" ]]; then
+  echo "Output directory [$GIT_OUTPUT_DIR] already exists; clearing it out and continuing..."
+  rm -rf "$GIT_OUTPUT_DIR"
+fi
+mkdir -p "$GIT_OUTPUT_DIR"
+
+
+# This repo list should always remain in sync with the value in 'convert-git-repositories-to-solr-docs.sh'
+GIT_REPOS=("solr" "solr-site" "solr-sandbox" "solr-operator")
+
+pushd "$GIT_OUTPUT_DIR"
+  for repo in "${GIT_REPOS[@]}"; do
+    REPO_URL="[email protected]:apache/${repo}.git"
+    git clone "$REPO_URL"
+  done
+popd
diff --git a/scripts/community-dataset/download-mailing-lists.sh b/scripts/community-dataset/download-mailing-lists.sh
new file mode 100755
index 0000000..d6bdf33
--- /dev/null
+++ b/scripts/community-dataset/download-mailing-lists.sh
@@ -0,0 +1,55 @@
+#!/bin/bash
+
+set -eu
+
+# Usage: ./download-mailing-lists.sh [/some/output/dir]
+# TODO Potential Improvements
+# - Arg to pull data for non-solr projects
+# - Arg to only pull data for date-range
+# - Arg to only pull some lists
+# - Arg to customize or omit the 'sleep'
+
+# Determine output dir
+DEFAULT_MBOX_LOCATION="output/mbox-data"
+MBOX_OUTPUT_DIR="${1:-}"
+if [[ -z "$MBOX_OUTPUT_DIR" ]]; then
+ MBOX_OUTPUT_DIR="$DEFAULT_MBOX_LOCATION"
+fi
+
+# Ensure output dir exists
+if [[ -d "$MBOX_OUTPUT_DIR" ]]; then
+  echo "Output directory [$MBOX_OUTPUT_DIR] already exists; clearing it out and continuing..."
+  rm -rf "$MBOX_OUTPUT_DIR"
+fi
+mkdir -p "$MBOX_OUTPUT_DIR"
+
+CURRENT_YEAR="$(date +%Y)"
+CURRENT_MONTH="$(date +%m)"
+
+# Solr's been around forever, but the mailing lists are only around post-project-split
+STARTING_YEAR="2021"
+STARTING_MONTH="1"
+
+MAILING_LISTS=("dev" "issues" "builds" "commits" "users")
+
+pushd "$MBOX_OUTPUT_DIR"
+  for list in "${MAILING_LISTS[@]}"; do
+    mkdir -p "$list"
+
+    # Download all data for the mailing list
+    pushd "$list"
+ for year in $(seq $STARTING_YEAR $CURRENT_YEAR)
+ do
+ for month in $(seq 1 12)
+ do
+        # Iterate through all months, even those that haven't happened yet. This is technically wrong, but ASF's mbox.lua tool handles it gracefully without a 404, etc.
+        # In the case of a month/year with no data, the curl command gets a 200 status and an empty response body
+        curl -sk "https://lists.apache.org/api/mbox.lua?list=${list}&domain=solr.apache.org&d=${year}-${month}&q=" > "${list}-${year}-${month}.mbox"
+
+        # Some small sleep to avoid hitting any rate-limiting or causing any problems for the ASF servers
+ sleep 2
+ done
+ done
+ popd
+ done
+popd
diff --git a/scripts/community-dataset/export-git-data.sh b/scripts/community-dataset/export-git-data.sh
new file mode 100755
index 0000000..6058c78
--- /dev/null
+++ b/scripts/community-dataset/export-git-data.sh
@@ -0,0 +1,48 @@
+#!/bin/bash
+
+# Usage: ./export-git-data.sh <repository-path> <solr-doc-output-dir>
+
+if [[ -z ${1:-} ]]; then
+ echo "'repository-path' argument is required but was not provided; exiting"
+ exit 1
+fi
+if [[ -z ${2:-} ]]; then
+  echo "'solr-doc-output-dir' argument is required but was not provided; exiting"
+ exit 1
+fi
+
+REPOSITORY_PATH=$1
+REPO=$(basename $REPOSITORY_PATH)
+SOLR_DOC_OUTPUT_DIRECTORY="$2"
+
+INT_SOLR_DOC_FILE="${REPO}-int-commit-data.json"
+FIELD_SEPARATOR="|||"
+MY_CWD=$(pwd)
+TMP_FILE="${MY_CWD}/raw-git-data.txt"
+# Uncomment to restrict commits to a particular directory
+#IN_REPO_PATH="solr/"
+
+pushd "$REPOSITORY_PATH"
+  git log --format=format:"%h$FIELD_SEPARATOR%an$FIELD_SEPARATOR%ad$FIELD_SEPARATOR%s" --date=iso8601-strict ${IN_REPO_PATH:-} > "$TMP_FILE"
+popd
+
+while IFS= read -r line
+do
+ echo "Reading line $line"
+ hash_field=$(echo "$line" | awk -F'\\|\\|\\|' '{print $1}')
+ name_field=$(echo "$line" | awk -F'\\|\\|\\|' '{print $2}')
+ date_field=$(echo "$line" | awk -F'\\|\\|\\|' '{print $3}')
+  subject_field=$(echo "$line" | awk -F'\\|\\|\\|' '{print $4}' | tr -d "[:cntrl:]" | sed 's/\\//g' | sed 's/\"/\\\"/g')
+
+  echo "{\"id\": \"$hash_field\", \"name_s\": \"$name_field\", \"date_dt\": \"$date_field\", \"subject_s\": \"$subject_field\", \"subject_txt\": \"$subject_field\", \"repo_s\": \"$REPO\"}," >> "$INT_SOLR_DOC_FILE"
+done < "$TMP_FILE"
+rm "$TMP_FILE"
+
+# All the lines exist now; they just need to be formatted into an array
+FINAL_SOLR_DOC_FILE="${SOLR_DOC_OUTPUT_DIRECTORY}/${REPO}-commit-data.json"
+echo "[" > "$FINAL_SOLR_DOC_FILE"
+# Strip the trailing comma from the final intermediate line
+sed '$ s/.$//' "$INT_SOLR_DOC_FILE" >> "$FINAL_SOLR_DOC_FILE"
+echo "]" >> "$FINAL_SOLR_DOC_FILE"
+rm "$INT_SOLR_DOC_FILE"
+
+echo "Git data exported to $FINAL_SOLR_DOC_FILE as JSON docs; ready for indexing and analysis"