hezyin opened a new issue, #1771:
URL: https://github.com/apache/incubator-devlake/issues/1771

   Alexey from ClickHouse recently showed me a git import tool they developed. 
It extracts line-level data from git into a table called `line_changes`, 
which can be used to compute metrics such as code churn and line age. 
The tool is also fast; its documentation says it works well even on the 
largest repositories, such as Linux or Chromium. We should consider 
incorporating line-level data into DevLake in the future as well.
   
   The source code can be found here: 
https://github.com/ClickHouse/ClickHouse/tree/master/programs/git-import
   
   According to its documentation, it can answer questions such as:
   
   ```
   Allows to answer questions like:
   - list files with maximum number of authors;
   - show me the oldest lines of code in the repository;
   - show me the files with longest history;
   - list favorite files for author;
   - list largest files with lowest number of authors;
   - at what weekday the code has highest chance to stay in repository;
   - the distribution of code age across repository;
   - files sorted by average code age;
   - quickly show file with blame info (rough);
   - commits and lines of code distribution by time; by weekday, by author; for 
specific subdirectories;
   - show history for every subdirectory, file, line of file, the number of 
changes (lines and commits) across time; how the number of contributors was 
changed across time;
   - list files with most modifications;
   - list files that were rewritten most number of time or by most of authors;
   - what is percentage of code removal by other authors, across authors;
   - the matrix of authors that shows what authors tends to rewrite another 
authors code;
   - what is the worst time to write code in sense that the code has highest 
chance to be rewritten;
   - the average time before code will be rewritten and the median (half-life 
of code decay);
   - comments/code percentage change in time / by author / by location;
   - who tend to write more tests / cpp code / comments.
   ```
   
   Below are the instructions for using the tool: 
   
   ```
   You can get it like this:
   
   curl https://clickhouse.com/ | sh
   - downloads ClickHouse
   
   ./clickhouse git-import --help
   - will show the documentation and the usage of the tool.
   
   Then the tool can be run directly inside the git repository.
   It will collect data like commits, file changes and changes of every
   line in every file for further analysis.
   It works well even on largest repositories like Linux or Chromium.
   
   Example of a trivial query:
   
   SELECT author AS k, count() AS c FROM line_changes WHERE
   file_extension IN ('h', 'cpp') GROUP BY k ORDER BY c DESC LIMIT 20
   
   Example of some non-trivial query - a matrix of authors, how much code
   of one author is removed by another:
   
   SELECT k, written_code.c, removed_code.c,
       round(removed_code.c * 100 / written_code.c) AS remove_ratio
   FROM (
       SELECT author AS k, count() AS c
       FROM line_changes
       WHERE sign = 1 AND file_extension IN ('h', 'cpp')
           AND line_type NOT IN ('Punct', 'Empty')
       GROUP BY k
   ) AS written_code
   INNER JOIN (
       SELECT prev_author AS k, count() AS c
       FROM line_changes
       WHERE sign = -1 AND file_extension IN ('h', 'cpp')
           AND line_type NOT IN ('Punct', 'Empty')
           AND author != prev_author
       GROUP BY k
   ) AS removed_code USING (k)
   WHERE written_code.c > 1000
   ORDER BY c DESC LIMIT 500
   ```
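
To make the shape of the second query easier to follow, here is a minimal sketch that runs the same logic against a tiny in-memory table. It rests on several assumptions: SQLite (via Python's standard `sqlite3` module) stands in for ClickHouse, so `count()` becomes `COUNT(*)` and `USING (k)` becomes an explicit `ON`; the table contains only the five columns the query touches; and the sample rows, the `remove_ratios` helper, and the `min_written` parameter are all hypothetical and not part of the tool.

```python
import sqlite3

# Hypothetical sample rows. Only the columns referenced by the query above
# are modeled: (author, prev_author, sign, file_extension, line_type).
ROWS = [
    ("alice", "", 1, "cpp", "Code"),
    ("alice", "", 1, "cpp", "Code"),
    ("alice", "", 1, "h", "Code"),
    ("alice", "", 1, "cpp", "Empty"),       # excluded: empty line
    ("bob", "", 1, "cpp", "Code"),
    ("bob", "", 1, "cpp", "Code"),
    ("bob", "alice", -1, "cpp", "Code"),    # bob removes a line written by alice
    ("alice", "alice", -1, "cpp", "Code"),  # excluded: author removes own line
    ("alice", "bob", -1, "cpp", "Code"),    # alice removes a line written by bob
    ("carol", "", 1, "py", "Code"),         # excluded: extension not in ('h', 'cpp')
]

QUERY = """
SELECT w.k, w.c, r.c, ROUND(r.c * 100.0 / w.c) AS remove_ratio
FROM (
    SELECT author AS k, COUNT(*) AS c
    FROM line_changes
    WHERE sign = 1 AND file_extension IN ('h', 'cpp')
        AND line_type NOT IN ('Punct', 'Empty')
    GROUP BY author
) AS w
INNER JOIN (
    SELECT prev_author AS k, COUNT(*) AS c
    FROM line_changes
    WHERE sign = -1 AND file_extension IN ('h', 'cpp')
        AND line_type NOT IN ('Punct', 'Empty')
        AND author != prev_author
    GROUP BY prev_author
) AS r ON w.k = r.k
WHERE w.c > ?
ORDER BY w.c DESC, w.k
"""


def remove_ratios(rows, min_written=0):
    """Return (author, lines_written, lines_removed_by_others, percent) tuples."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE line_changes ("
            "author TEXT, prev_author TEXT, sign INTEGER, "
            "file_extension TEXT, line_type TEXT)"
        )
        conn.executemany("INSERT INTO line_changes VALUES (?, ?, ?, ?, ?)", rows)
        return conn.execute(QUERY, (min_written,)).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in remove_ratios(ROWS):
        print(row)
```

The first subquery counts lines each author added (`sign = 1`), the second counts lines whose previous author differs from the author who removed them (`sign = -1`), and the join yields one row per author rather than a full author-by-author matrix. The original query's `written_code.c > 1000` threshold is replaced here by the hypothetical `min_written` parameter so that the toy data returns rows.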

