---

**[tickets:#5733] Improve performance of Commit._diffs_copied**

**Status:** closed
**Milestone:** unreleased
**Labels:** performance scm
**Created:** Fri Feb 01, 2013 07:29 PM UTC by Cory Johns
**Last Updated:** Mon Dec 29, 2014 08:15 AM UTC
**Owner:** nobody

`Commit._diffs_copied()` is used to determine whether a removed blob was actually moved or renamed, possibly with some changes. However, it is called every time a commit is viewed and checks every file removed in that commit, and it is slow enough to be a problem.

Some ideas for optimizing it:

* Short-circuit identical blob comparisons by comparing the blob hash first, as is already done with trees
* Use `SequenceMatcher.real_quick_ratio()` to get an upper bound on the ratio and exclude obvious non-matches quickly, probably followed by `quick_ratio()` and/or `ratio()` to confirm a match
* Raise the `DIFF_SIMILARITY_THRESHOLD` and break after a single match instead of continuing to test all files (though this could give false matches, so maybe not do this one)
* Exclude binary or particularly large blobs

Finally, we should almost certainly move this computation to `compute_diffs()` instead of doing it every time the commit's diffs are used.

Also, children of removed trees (or of the removed side of moved/renamed trees) are currently not included in the diff, to avoid hitting this performance issue too often. As a result, the added portion of a moved/renamed tree looks like brand new files. Once the performance of `_diffs_copied()` is more reasonable and/or the result is pre-computed, the removed-trees short-circuit in `compute_diffs()` needs to be removed.
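The ideas listed in the ticket could be combined roughly as in the sketch below. This is not Allura's actual code: `blob_id`, `is_comparable`, `similarity`, `best_match`, and `MAX_BLOB_SIZE` are hypothetical names, the threshold value is a placeholder (the ticket does not state the real one), and the NUL-byte test is only one possible heuristic for detecting binary content. It relies on the fact that `real_quick_ratio()` and `quick_ratio()` are both upper bounds on `ratio()`.

```python
import hashlib
from difflib import SequenceMatcher

# Assumed placeholder values; the ticket does not state the real ones.
DIFF_SIMILARITY_THRESHOLD = 0.5
MAX_BLOB_SIZE = 1_000_000  # bytes, hypothetical cutoff


def blob_id(data: bytes) -> str:
    """Stand-in for the SCM blob hash, which a real repo already stores."""
    return hashlib.sha1(data).hexdigest()


def is_comparable(data: bytes) -> bool:
    """Skip large blobs and blobs that look binary (NUL-byte heuristic)."""
    return len(data) <= MAX_BLOB_SIZE and b"\0" not in data


def similarity(removed: bytes, added: bytes) -> float:
    """Return a similarity ratio, or 0.0 as soon as a match is ruled out."""
    if blob_id(removed) == blob_id(added):
        return 1.0  # identical content: no sequence comparison needed
    if not (is_comparable(removed) and is_comparable(added)):
        return 0.0
    matcher = SequenceMatcher(None, removed, added)
    # Both quick variants are upper bounds on ratio(), so failing
    # either one rules out a match without the expensive computation.
    if matcher.real_quick_ratio() < DIFF_SIMILARITY_THRESHOLD:
        return 0.0
    if matcher.quick_ratio() < DIFF_SIMILARITY_THRESHOLD:
        return 0.0
    ratio = matcher.ratio()
    return ratio if ratio >= DIFF_SIMILARITY_THRESHOLD else 0.0


def best_match(removed: bytes, added_blobs: dict) -> tuple:
    """Return (path, ratio) of the most similar added blob, or (None, 0.0)."""
    best_path, best_ratio = None, 0.0
    for path, data in added_blobs.items():
        ratio = similarity(removed, data)
        if ratio > best_ratio:
            best_path, best_ratio = path, ratio
    return best_path, best_ratio
```

This sketch keeps testing all added blobs and returns the best one rather than breaking on the first match, which sidesteps the false-match concern raised in the third idea at the cost of more comparisons.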