I investigated the rsync code and found the reason why. For every file in the source, it searches the entire filter-list looking to see if that filename is on the exclude/include list. Most aren't, so it compares (350K - 72K) * 72K names (the non-listed files) plus (72K * 72K/2) names (the ones that are listed), for a total of about 22,608,000,000 strcmp's. That's 22 BILLION comparisons. (I may have left off a zero there, it might be 220 B).
I'm working on a fix to improve this. The first phase was to just improve the existing code without changing the methodology. The set I've been testing with is local-local machine, dry-run, 216K files in the source directory, 25,000 files in the exclude-from list. The original rsync takes 488 seconds. The improved code takes 300 seconds. The next phase was to improve the algorithm of handling large filter_lists. Change the unsorted linear search to a sorted binary search (skiplist). This improved code takes 2 seconds. The original code does 4,492,304,682 strcmp's. The fully improved code does 6,472,564. 98.5% fewer. I am cleaning up the code and will submit a patchfile soon.
-- Please use reply-all for most replies to avoid omitting the mailing list. To unsubscribe or change options: https://lists.samba.org/mailman/listinfo/rsync Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html