klesh opened a new pull request, #7268:
URL: https://github.com/apache/incubator-devlake/pull/7268
### Summary
Currently, `gitextractor` collects commit stats (mainly `added lines` and
`deleted lines`) by default. This relies on the `diff` algorithm, which is
extremely slow on both `libgit2` and `go-git`.
Here are the tests I ran:
6 months of commits, with diff (commit stat) collection: **~22 mins**
```json
{
"id": 285,
"createdAt": "2024-03-29T21:27:00.728+08:00",
"updatedAt": "2024-03-29T21:48:42.162+08:00",
"taskId": 61,
"name": "Collect Commits",
"number": 2,
"beganAt": "2024-03-29T21:27:00.749+08:00",
"finishedAt": "2024-03-29T21:48:42.146+08:00",
"spentSeconds": 1302,
"finishedRecords": 10699,
"sequence": 2,
"isCollector": true,
"isFailed": false,
"message": ""
}
```
6 months of commits, without diff (commit stat) collection: **~2 mins**
```json
{
"id": 289,
"createdAt": "2024-03-29T22:02:23.596+08:00",
"updatedAt": "2024-03-29T22:04:42.052+08:00",
"taskId": 62,
"name": "Collect Commits",
"number": 2,
"beganAt": "2024-03-29T22:02:23.622+08:00",
"finishedAt": "2024-03-29T22:04:42.029+08:00",
"spentSeconds": 139,
"finishedRecords": 10699,
"sequence": 2,
"isCollector": true,
"isFailed": false,
"message": ""
}
```
Skipping the diff algorithm makes collection roughly **9x** faster
(1302 s vs. 139 s for the same 10699 records).
However, one might still need the data for analysis. After discussion with
others, we agreed that it is worth skipping commit stat collection by default
while offering an override through `TaskOptions`, falling back to an
environment variable. Set the environment variable `SKIP_COMMIT_STAT=false`
to restore the previous behavior.
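For reviewers, the intended precedence can be sketched roughly as follows. This is only an illustration, not the code in this PR: the `TaskOptions` struct shape, the `SkipCommitStat` field, and both function names are hypothetical. Only the `SKIP_COMMIT_STAT` variable name and the skip-by-default behavior come from the description above.

```go
package main

import (
	"os"
	"strconv"
)

// TaskOptions is a hypothetical stand-in for the plugin's task options.
// A nil SkipCommitStat means the caller did not specify the option.
type TaskOptions struct {
	SkipCommitStat *bool
}

// resolveSkipCommitStat applies the precedence described above:
//  1. an explicit task option wins,
//  2. otherwise the SKIP_COMMIT_STAT environment variable is consulted,
//  3. otherwise commit stat collection is skipped (the new default).
//
// lookupEnv is injected so the logic can be exercised without touching
// the real process environment.
func resolveSkipCommitStat(opts TaskOptions, lookupEnv func(string) (string, bool)) bool {
	if opts.SkipCommitStat != nil {
		return *opts.SkipCommitStat
	}
	if raw, ok := lookupEnv("SKIP_COMMIT_STAT"); ok {
		// Unparseable values are ignored and the default applies
		// (an assumption of this sketch).
		if v, err := strconv.ParseBool(raw); err == nil {
			return v
		}
	}
	return true
}

// skipCommitStat is the convenience wrapper that reads the real environment.
func skipCommitStat(opts TaskOptions) bool {
	return resolveSkipCommitStat(opts, os.LookupEnv)
}
```

With this shape, a task that leaves the option unset and runs with `SKIP_COMMIT_STAT=false` collects commit stats as before, while an explicit task option always takes priority over the environment.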
### Does this close any open issues?
Part of #1407
### Screenshots
<img width="1572" alt="Snipaste_2024-04-01_17-06-27"
src="https://github.com/apache/incubator-devlake/assets/61080/7c07268c-8d38-4f13-83cb-4024bcfba781">