---

**[tickets:#5733] Improve performance of Commit._diffs_copied**

**Status:** closed
**Milestone:** unreleased
**Labels:** performance scm 
**Created:** Fri Feb 01, 2013 07:29 PM UTC by Cory Johns
**Last Updated:** Mon Dec 29, 2014 08:15 AM UTC
**Owner:** nobody


`Commit._diffs_copied()` determines whether a removed blob was actually
moved or renamed, possibly with some changes.  However, it is called every time
a commit is viewed and runs against every file removed in that commit, and it
is slow enough to be a problem.

Some ideas for optimizing it:

* Short-circuit identical blob comparisons by comparing the blob hash first, as
is done with trees
* Use `SequenceMatcher.real_quick_ratio()` to get an upper bound on the ratio
and exclude obvious non-matches quickly, probably followed by
`quick_ratio()` and/or `ratio()` to confirm a match
* Raise the `DIFF_SIMILARITY_THRESHOLD` and break after a single match instead
of continuing to test all files (though this could give false matches, so it
may not be worth doing)
* Exclude binary or particularly large blobs
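
As a rough illustration of how the first, second, and fourth ideas could fit together, here is a minimal sketch using `difflib.SequenceMatcher`. The `Blob` class, `looks_binary()`, `similarity()`, `best_match()`, and the two constants are hypothetical names and values chosen for the example; none of them come from the Allura code base, and the real threshold and blob model may differ.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Iterable, Optional, Tuple

# Assumed values for illustration only; not taken from Allura.
SIMILARITY_THRESHOLD = 0.5
MAX_BLOB_SIZE = 1_000_000  # bytes


@dataclass(frozen=True)
class Blob:
    """Hypothetical stand-in for an SCM blob: object id plus raw content."""
    oid: str
    data: bytes


def looks_binary(data: bytes) -> bool:
    """Crude heuristic: a NUL byte near the start suggests binary content."""
    return b"\0" in data[:8192]


def similarity(removed: Blob, added: Blob,
               threshold: float = SIMILARITY_THRESHOLD) -> float:
    """Return the similarity ratio, or 0.0 if the pair is ruled out early."""
    # Identical object ids mean identical content, so no diffing is needed.
    if removed.oid == added.oid:
        return 1.0
    # Skip blobs that are too large or look binary.
    if max(len(removed.data), len(added.data)) > MAX_BLOB_SIZE:
        return 0.0
    if looks_binary(removed.data) or looks_binary(added.data):
        return 0.0
    matcher = SequenceMatcher(
        None,
        removed.data.decode("utf-8", errors="replace"),
        added.data.decode("utf-8", errors="replace"),
    )
    # Both quick ratios are upper bounds on ratio(), so a value below the
    # threshold rules the pair out without running the expensive comparison.
    if matcher.real_quick_ratio() < threshold:
        return 0.0
    if matcher.quick_ratio() < threshold:
        return 0.0
    ratio = matcher.ratio()
    return ratio if ratio >= threshold else 0.0


def best_match(removed: Blob,
               candidates: Iterable[Blob]) -> Tuple[Optional[Blob], float]:
    """Return the most similar added blob and its ratio, or (None, 0.0)."""
    best, best_ratio = None, 0.0
    for candidate in candidates:
        ratio = similarity(removed, candidate)
        if ratio > best_ratio:
            best, best_ratio = candidate, ratio
            if ratio == 1.0:  # an exact match cannot be beaten
                break
    return best, best_ratio
```

This sketch still tests every candidate unless it finds an exact match, so it does not take the riskier "break after a single match" shortcut from the third idea.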

Finally, we should almost certainly move this computation to `compute_diffs()` 
instead of doing it every time the commit's diffs are used.
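
A minimal sketch of that idea, assuming the copy detection can be passed in as a function and its result stored alongside the other diff data. `DiffInfo`, the `detect_copies` callback, and this `compute_diffs()` signature are hypothetical and only illustrate "compute once, read many times"; they do not reflect Allura's actual data model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical callback type: (added_paths, removed_paths) -> [(old, new), ...]
CopyDetector = Callable[[List[str], List[str]], List[Tuple[str, str]]]


@dataclass
class DiffInfo:
    """Hypothetical stored diff record; 'copied' is filled in exactly once."""
    added: List[str]
    removed: List[str]
    copied: List[Tuple[str, str]]  # (old_path, new_path) pairs


def compute_diffs(added: List[str], removed: List[str],
                  detect_copies: CopyDetector) -> DiffInfo:
    """Run the expensive copy detection once and store its result."""
    copied = detect_copies(added, removed)
    moved_from = {old for old, _ in copied}
    moved_to = {new for _, new in copied}
    return DiffInfo(
        # Paths that turned out to be moves are no longer plain adds/removes.
        added=[path for path in added if path not in moved_to],
        removed=[path for path in removed if path not in moved_from],
        copied=copied,
    )
```

Viewing the commit would then only read the stored `copied` list rather than re-running the similarity comparison.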

Also, children of removed trees (or of the removed side of moved/renamed
trees) are currently not included in the diff, to avoid hitting this
performance issue too often.  As a result, the added portion of a moved/renamed
tree looks like brand new files.  Once `_diffs_copied()` performs reasonably
and/or its result is pre-computed, the removed-trees short-circuit in
`compute_diffs()` needs to be removed.


---

