hezyin opened a new issue, #1771: URL: https://github.com/apache/incubator-devlake/issues/1771
Alexey from ClickHouse recently showed me a git import tool they developed. It extracts the line-level data from git into a table called `line_changes`, which can be used to compute interesting metrics like code churn, line age, etc. The tool also runs very fast. We should consider incorporating line-level data in the future as well.

The source code can be found here: https://github.com/ClickHouse/ClickHouse/tree/master/programs/git-import

The questions it can answer from its doc:

```
Allows to answer questions like:
- list files with maximum number of authors;
- show me the oldest lines of code in the repository;
- show me the files with longest history;
- list favorite files for author;
- list largest files with lowest number of authors;
- at what weekday the code has highest chance to stay in repository;
- the distribution of code age across repository;
- files sorted by average code age;
- quickly show file with blame info (rough);
- commits and lines of code distribution by time; by weekday, by author; for specific subdirectories;
- show history for every subdirectory, file, line of file, the number of changes (lines and commits) across time; how the number of contributors was changed across time;
- list files with most modifications;
- list files that were rewritten most number of time or by most of authors;
- what is percentage of code removal by other authors, across authors;
- the matrix of authors that shows what authors tends to rewrite another authors code;
- what is the worst time to write code in sense that the code has highest chance to be rewritten;
- the average time before code will be rewritten and the median (half-life of code decay);
- comments/code percentage change in time / by author / by location;
- who tend to write more tests / cpp code / comments.
```

Below are the instructions for how to use the tool:

```
You can get it like this:
curl https://clickhouse.com/ | sh - downloads ClickHouse
./clickhouse git-import --help - will show the documentation and the usage of the tool.

Then the tool can be run directly inside the git repository.
It will collect data like commits, file changes and changes of every line in every file for further analysis.
It works well even on largest repositories like Linux or Chromium.

Example of a trivial query:

SELECT author AS k, count() AS c
FROM line_changes
WHERE file_extension IN ('h', 'cpp')
GROUP BY k
ORDER BY c DESC
LIMIT 20

Example of some non-trivial query - a matrix of authors, how much code of one author is removed by another:

SELECT k, written_code.c, removed_code.c,
    round(removed_code.c * 100 / written_code.c) AS remove_ratio
FROM
(
    SELECT author AS k, count() AS c
    FROM line_changes
    WHERE sign = 1 AND file_extension IN ('h', 'cpp') AND line_type NOT IN ('Punct', 'Empty')
    GROUP BY k
) AS written_code
INNER JOIN
(
    SELECT prev_author AS k, count() AS c
    FROM line_changes
    WHERE sign = -1 AND file_extension IN ('h', 'cpp') AND line_type NOT IN ('Punct', 'Empty') AND author != prev_author
    GROUP BY k
) AS removed_code
USING (k)
WHERE written_code.c > 1000
ORDER BY c DESC
LIMIT 500
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
