RE: [ANNOUNCE] Git v2.5.0-rc0
Jeff / Duy,

> Fixes since v2.4
>
>  * Access to objects in repositories that borrow from another one on a
>    slow NFS server unnecessarily got more expensive due to recent code
>    becoming more cautious in a naive way not to lose objects to pruning.
>    (merge ee1c6c3 jk/prune-mtime later to maint).
>
>  * "Have we lost a race with competing repack?" check was too expensive,
>    especially while receiving a huge object transfer that runs
>    index-pack (e.g. clone or fetch).
>    (merge 0eeb077 jk/index-pack-reduce-recheck later to maint).

I was reading through the 2.5 release notes and spotted the above two
entries, and I think they are probably fixes related to the NFS slowness
I reported last month.

Did some tests this morning and things are much improved:

Version 2.4.5
 - Local clone of https://github.com/git/git - 8s
 - Clone to NFS                              - 1m 3s

Version 2.5.0.rc0
 - Local - 7s
 - NFS   - 13s

Thanks for the effort in addressing this, and sorry I missed the patch to
check it out earlier.

Steve
RE: Troubleshoot clone issue to NFS.
On Sunday, May 24, 2015 @ 10:01 AM Duy Nguyen [mailto:pclo...@gmail.com] did scribble:

> In case you want to back away from option 2 because it starts to leak
> raciness, which your old commit tried to fix in the first place: I
> think the only other place that tests for lots of non-existent loose
> objects is write_sha1_file (e.g. tar -xf bigtarball.tar.gz; cd
> bigtarball; git init; git add .). But the number of calls should be
> much smaller compared to index-pack, and it does not use
> has_sha1_file(); it uses check_and_freshen_file() instead. There are
> other places where has_sha1_file() may return 0, but I think the number
> of calls is even smaller to bother (shallow.c, fetch-pack.c, apply.c,
> bulk-checkin.c).

Any updates / further thoughts on this?

Thanks,
Steve
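(As an aside for readers following the code: the write path Duy refers to
looks roughly like the following. This is a paraphrase from memory of the
sha1_file.c of that era, not the exact upstream code, but it shows why a
miss on that path never re-reads the pack directory.)

  /*
   * Paraphrase, not exact: write_sha1_file() does ask "do we already
   * have this object?", but via the freshen helpers (which on the loose
   * path boil down to check_and_freshen_file(), i.e. an access()/utime()
   * on the object file), not via has_sha1_file(), so a miss does not
   * trigger a re-scan of objects/pack.
   */
  int write_sha1_file(const void *buf, unsigned long len,
                      const char *type, unsigned char *sha1)
  {
          char hdr[32];
          int hdrlen;

          write_sha1_file_prepare(buf, len, type, sha1, hdr, &hdrlen);
          if (freshen_packed_object(sha1) || freshen_loose_object(sha1))
                  return 0;       /* already present; mtime freshened */
          return write_loose_object(sha1, hdr, hdrlen, buf, len, 0);
  }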
RE: Troubleshoot clone issue to NFS.
On Friday, June 05, 2015 @ 1:18 PM Jeff King [mailto:p...@peff.net] did scribble:

> Sorry, I haven't had a chance to look at it further. It's still on my
> todo list. My plan is:
>
>   1. Devise some torture tests to see whether my patch series is in
>      fact racy on Linux.
>
>   2. Assuming it is, scrap it and make a has_sha1_file_quick() which
>      might sometimes return a false negative (if somebody else is
>      repacking). Use that in index-pack (and possibly other places,
>      but we can start with index-pack).
>
> If we skip step 1 out of pessimism (which I think is a reasonable thing
> to do), then step 2 should not be all that much work. I'm going to be
> offline for a few days, though, so I won't get to it until next week at
> the earliest. If you (or someone else) wants to take a stab at it,
> please feel free.

I've been off myself and wanted to make sure I hadn't missed anything in
the email threads while I was away, as there were rather a lot of them.

I'm not feeling confident enough to make the changes myself at the
moment. I think what you are saying is that has_sha1_file_quick() would
be the version from before your change in 45e8a74, and index-pack would
then use that, but I could be barking up the wrong tree completely.

Thanks,
Steve
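(For reference, here is a rough sketch of what a has_sha1_file_quick()
along those lines could look like, assuming it lives in sha1_file.c next
to the existing lookup helpers. It is an illustration of the idea, not a
tested patch.)

  /*
   * Illustration only: an existence check that accepts a rare false
   * negative (e.g. while somebody else is repacking) instead of
   * re-reading the pack directory on a miss.
   */
  int has_sha1_file_quick(const unsigned char *sha1)
  {
          struct pack_entry e;

          if (find_pack_entry(sha1, &e))
                  return 1;               /* in an already-known pack */
          if (has_loose_object(sha1))
                  return 1;               /* present as a loose object */
          /*
           * The full has_sha1_file() would call reprepare_packed_git()
           * here and retry find_pack_entry(); skipping that retry is
           * what makes this "quick", and what allows the rare false
           * negative during a concurrent repack.
           */
          return 0;
  }

Index-pack would then call this instead of has_sha1_file() for its
collision test, the one place where a false negative only means skipping
a collision check.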
RE: Troubleshoot clone issue to NFS.
On Friday, May 22, 2015 @ 8:12 AM Jeff King did scribble:

> > In builtin/index-pack.c, replace the line
> >
> >     collision_test_needed = has_sha1_file(sha1);
> >
> > with
> >
> >     collision_test_needed = 0;
> >
> > Security is compromised but for this test it should be ok. Then clone
> > again. I hope the new number gets down close to v1.8.4.1.
>
> Yeah, I think that is a good starting point. I timed a local clone
> before and after 45e8a74; there is a small difference on my system
> (about 5%), but it goes away with your suggestion.

Tested this change on a couple of versions, first of all on the revision
where things go wrong for me:

~ $ bin/git --version
git version 1.8.4.1.g45e8a74.dirty

~ $ time bin/git clone https://github.com/git/git test
[snip]
real    0m7.105s
user    0m9.566s
sys     0m0.989s

~ $ time bin/git clone https://github.com/git/git /sami/test
[snip]
real    0m14.411s
user    0m9.703s
sys     0m1.374s

This is more in line with what I see normally. Also tested on master:

~ $ bin/git --version
git version 2.4.1.217.g6c1249c.dirty

~ $ time bin/git clone https://github.com/git/git test
[snip]
real    0m5.946s
user    0m9.111s
sys     0m1.332s

~ $ time bin/git clone https://github.com/git/git /sami/test
[snip]
real    0m12.344s
user    0m9.187s
sys     0m1.579s

So similar on the latest as well.

> The problem with has_sha1_file() prior to v1.8.4.2 is that it is racy
> with respect to simultaneous operations; we might claim we do not have
> an object, when in fact we do. As you noted, usually has_sha1_file()
> returns true (i.e., we look up objects that we expect to have), and the
> performance impact is minimal. But for code paths where _not_ having
> the object is normal, the impact is much greater.
>
> So I think there are two possibilities for improving this:
>
>   1. Find places where we expect the object will not exist (like the
>      collision_test check you pointed out) and use a has_sha1_file_fast
>      that accepts that it may very occasionally erroneously return
>      false. In this case it would mean potentially skipping a collision
>      check, but I think that is OK. That could have security
>      implications, but only if an attacker:
>
>        a. has broken sha1 to generate a colliding object
>        b. can manipulate the victim into repacking in a loop
>        c. can manipulate the victim into fetching (or receiving a push)
>           simultaneously with (b)
>
>      at which point they can try to race the repack procedure to add
>      their colliding object to the repository. It seems rather unlikely
>      (especially part a).
>
>   2. Make reprepare_packed_git() cheaper in the common case that
>      nothing has changed. It would probably be enough to
>      stat(objects/pack). We know that packfiles themselves do not
>      change; we may only add or delete them. And since the hierarchy of
>      objects/pack is flat, we know that the mtime on that directory
>      will change if any packs are added or removed.
>
>      Of course, we are still doing an extra stat() for each
>      has_sha1_file call. Whether that helps for the NFS case depends on
>      whether stat() is significantly cheaper than
>      opendir/readdir/closedir.
>
> On my local disk, the hacky patch below did seem to give me back the 5%
> lost by 45e8a74 (I did it directly on master, since that old commit
> does not have the stat_validity infrastructure).

Also tested master with the patch provided:

~ $ bin/git --version
git version 2.4.1.217.g6c1249c.dirty

~ $ time git clone https://github.com/git/git test
real    0m8.263s
user    0m10.550s
sys     0m3.763s

~ $ time git clone https://github.com/git/git /sami/test
real    1m3.286s
user    0m12.149s
sys     0m9.192s

So the patch isn't reducing the time taken when cloning to NAS.
Here are the top calls from strace:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 90.68   19.946171          46    437398     45436 futex
  4.63    1.017637          22     46141         5 read
  2.37    0.521161           4    141429           pread
  0.73    0.161442           3     47130         9 write
  0.42    0.093146           0    188645    188621 access
  0.38    0.083033          26      3209       181 open
  0.32    0.069587           0    188613      1146 stat
  0.23    0.050855          12      4082      3925 lstat
  0.11    0.023317           8      2979         1 fstat
  0.04    0.009134           0     35696         3 recvfrom
  0.03    0.007666        1917         4           wait4
  0.02    0.004478           1      3923           madvise
  0.01    0.002291           0     17858           poll
  0.01    0.002155           0     17851           select

Thanks for looking into this.

Steve
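(As an illustration of option 2 only: Jeff's actual hacky patch is not
reproduced here, so the function name and placement below are guesses. It
assumes the code sits in sha1_file.c next to reprepare_packed_git().)

  /*
   * Sketch of option 2: remember the stat data of objects/pack and only
   * do the full re-scan when it changes.  Packs are only ever added or
   * removed, never modified in place, and objects/pack is flat, so its
   * mtime moves whenever the set of packs changes.
   */
  static struct stat pack_dir_stat;
  static int pack_dir_stat_valid;

  static void reprepare_packed_git_if_changed(void)
  {
          struct stat st;
          const char *pack_dir = mkpath("%s/pack", get_object_directory());

          if (stat(pack_dir, &st)) {
                  pack_dir_stat_valid = 0;   /* be conservative on error */
                  reprepare_packed_git();
                  return;
          }
          if (pack_dir_stat_valid && st.st_mtime == pack_dir_stat.st_mtime)
                  return;                    /* nothing changed; skip the re-scan */

          pack_dir_stat = st;
          pack_dir_stat_valid = 1;
          reprepare_packed_git();            /* full opendir/readdir re-scan */
  }

Whether the extra stat() per lookup is actually cheaper than the
opendir/readdir re-scan over NFS is exactly what the timings above are
probing.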
RE: Troubleshoot clone issue to NFS.
On Friday, May 22, 2015 @ 11:06 AM Duy Nguyen did write:

> Strange. Maybe there is something else... Anyway some numbers from me.
> This is nfs3 hosted by Raspberry Pi, accessed over wireless. I just run
> index-pack on the git.git pack instead of a full clone.
>
>  - v1.8.4.1                                          34s
>  - v1.8.4.2                                          519s (oh.. wireless..)
>  - v1.8.4.2 without has_sha1_file() in index-pack.c  33s
>  - v1.8.4.2 + Jeff's mtime patch                     36s

Just ran the test again and it was 12 seconds. Too many versions of git
available on the machine, and I think I might have run the wrong one:

~ $ time bin/git clone https://github.com/git/git test
Cloning into 'test'...
remote: Counting objects: 186015, done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 186015 (delta 8), reused 7 (delta 7), pack-reused 186000
Receiving objects: 100% (186015/186015), 61.33 MiB | 30.35 MiB/s, done.
Resolving deltas: 100% (135512/135512), done.
Checking connectivity... done.

real    0m6.931s
user    0m10.011s
sys     0m2.685s

~ $ time bin/git clone https://github.com/git/git /sami/test
Cloning into '/sami/test'...
remote: Counting objects: 186015, done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 186015 (delta 8), reused 7 (delta 7), pack-reused 186000
Receiving objects: 100% (186015/186015), 61.33 MiB | 29.91 MiB/s, done.
Resolving deltas: 100% (135512/135512), done.
Checking connectivity... done.
Checking out files: 100% (2795/2795), done.

real    0m13.169s
user    0m9.961s
sys     0m3.480s

So I would say that was in line with the timings for the other versions I
have tested. And looking back at the email from this morning, I did run
the wrong bin/git. Sorry, totally my fault on that.

If I can test anything else then let me know.

Thanks,
Steve
Troubleshoot clone issue to NFS.
In setting up some new git servers I was trying to test the performance
of some NFS-mounted volumes, and when compared to local disk (although
this is a virtualized server so not truly local) cloning to NFS was
taking a long time. Here are some timings:

~ $ time bin/git clone https://github.com/git/git test
Cloning into 'test'...
remote: Counting objects: 185964, done.
remote: Compressing objects: 100% (276/276), done.
remote: Total 185964 (delta 203), reused 32 (delta 32), pack-reused 185656
Receiving objects: 100% (185964/185964), 61.42 MiB | 26.16 MiB/s, done.
Resolving deltas: 100% (135454/135454), done.
Checking connectivity... done.

real    0m8.156s
user    0m10.569s
sys     0m3.857s

~ $ time bin/git clone https://github.com/git/git /sami/test
Cloning into '/sami/test'...
remote: Counting objects: 185964, done.
remote: Compressing objects: 100% (276/276), done.
remote: Total 185964 (delta 203), reused 32 (delta 32), pack-reused 185656
Receiving objects: 100% (185964/185964), 61.42 MiB | 10.15 MiB/s, done.
Resolving deltas: 100% (135454/135454), done.
Checking connectivity... done.
Checking out files: 100% (2795/2795), done.

real    0m58.685s
user    0m12.153s
sys     0m7.619s

So cloning to NFS is 50 seconds slower on average (I've run this a lot
over the last few days and it does appear to be consistent and not a
network / github connectivity issue). Tests creating files on the NFS
mount with dd didn't show that much difference to the local disk (and
were sometimes quicker).

Volume mount differences are:

/dev/mapper/rootvg-homelv on /home type ext4 (rw,nodev)
nfsserver:/vol/sami/repos on /sami type nfs (rw,bg,nfsvers=3,tcp,hard,nointr,timeo=600,rsize=32768,wsize=32768,addr=10.1.1.1)

And there doesn't appear to be any issue with NFS retransmissions:

/sami $ nfsstat
Client rpc stats:
calls        retrans      authrefrsh
11719847     0            11720190

This morning I did some more digging, as when I tried this on a newly
built server the NFS times were slower than local disk, but only by
around 6-10 seconds. The new server had git 1.7.1 installed on it,
compared to 2.4.0 on the machine I've been testing on. So I built a
number of versions of git and tested each one to try and find the point
where it changed:

v1.8.0      11.363 s
v1.8.3      13.597 s
v1.8.4      13.958 s
v1.8.4.1    14.563 s
v1.8.4.2    1m 0s
v1.8.4.3    1m 9s
v1.8.4.5    1m 1s
v1.8.5      1m 0s
v1.8.5.6    1m 0s
v1.9.0      1m 38s
v2.4.0      58s
v2.4.1      58s

So there appears to be a change in 1.8.4.2 that made this issue appear
for me. Looking at the release notes, the only thing I can see that
might be related is:

 * When an object is not found after checking the packfiles and then
   loose object directory, read_sha1_file() re-checks the packfiles to
   prevent racing with a concurrent repacker; teach the same logic to
   has_sha1_file().

So the questions are:

1) Should I expect a clone to NFS to be that much slower?
2) Is there anything I could do to speed it up? (I've tried --bare, as
   that is what the repositories will be when stored on NFS, and there
   wasn't really a difference.)
3) What else can I use in git to compare the performance of local disk
   to NFS, to see if it is just clones that are affected?
4) I assume I could bisect between 1.8.4.1 and 1.8.4.2 to find exactly
   where things get worse for me?

Thanks for any help,

Steve
RE: Troubleshoot clone issue to NFS.
On Thu, May 21, 2015 @ 9:31 PM Duy Nguyen [mailto:pclo...@gmail.com] did scribble:

> In case an object is not found, the pack directory is re-read again,
> which might cause some increased load on nfs. has_sha1_file() not
> finding the object should not happen often..
>
> That last statement is probably very wrong, but I have no time to test
> this now. In index-pack, there is a has_sha1_file() for the file
> collision test. That call on a fresh clone would fail for _every_
> object in the (new) pack and the cost of reprepare pack files could be
> sky high...

Confirmed with bisect that it is that commit:

~/git $ git bisect bad
45e8a7487339c0f0ea28244ef06851308d07387c is the first bad commit
commit 45e8a7487339c0f0ea28244ef06851308d07387c
Author: Jeff King <p...@peff.net>
Date:   Fri Aug 30 15:14:13 2013 -0400

I have an strace for that build, but it is 153 megabytes so I probably
shouldn't attach it; the call summary is below:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 93.71   39.670084         103    386835      2171 futex
  3.16    1.338572           7    190550       181 open
  1.56    0.658786          18     37450         3 read
  0.62    0.262740           2    141390           pread
  0.41    0.171526           5     37313         9 write
  0.18    0.076166           0    188859    188835 access
  0.11    0.047941           0    374712           getdents
  0.11    0.045174          11      4067      3910 lstat
  0.06    0.023630           0    190425           close
  0.04    0.017995           6      2975         1 fstat
  0.02    0.007668        1917         4           wait4
  0.01    0.004150           1      5065           madvise
  0.01    0.003548           0     16090         8 recvfrom
  0.00    0.001872           0      8048           select
  0.00    0.001870          11       173         1 mkdir
  0.00    0.000872           0      8055           poll
  0.00    0.000262          22        12        12 readlink
  0.00    0.000185           0      1217      1146 stat
  0.00    0.000158           0       457           mprotect
  0.00    0.000074           0       298           mmap
  0.00    0.000069           1       109         8 rt_sigreturn
  0.00    0.000047           0       159           brk
  0.00    0.000021           1        17           getcwd
  0.00    0.000000           0        42         3 lseek
  0.00    0.000000           0        92           munmap
  0.00    0.000000           0        35           rt_sigaction
  0.00    0.000000           0         9           rt_sigprocmask
  0.00    0.000000           0         8         3 ioctl
  0.00    0.000000           0        11           pipe
  0.00    0.000000           0         3           dup
  0.00    0.000000           0         8           dup2
  0.00    0.000000           0         6           setitimer
  0.00    0.000000           0        11         1 socket
  0.00    0.000000           0         8         7 connect
  0.00    0.000000           0         8           sendto
  0.00    0.000000           0         2           recvmsg
  0.00    0.000000           0         1           bind
  0.00    0.000000           0         1           getsockname
  0.00    0.000000           0         3         1 getpeername
  0.00    0.000000           0         2           setsockopt
  0.00    0.000000           0         2           getsockopt
  0.00    0.000000           0         8           clone
  0.00    0.000000           0         5           execve
  0.00    0.000000           0         3           uname
  0.00    0.000000           0       100           fcntl
  0.00    0.000000           0         2           fsync
  0.00    0.000000           0        13           chdir
  0.00    0.000000           0        14           rename
  0.00    0.000000           0         2           link
  0.00    0.000000           0         5           unlink
  0.00    0.000000           0         2           symlink
  0.00    0.000000           0         9           chmod
  0.00    0.000000           0         6           getrlimit
  0.00    0.000000           0         2           sysinfo
  0.00    0.000000           0         8           getuid
  0.00    0.000000           0         1           statfs
  0.00    0.000000           0         5           arch_prctl
  0.00    0.000000           0         1           gettid
  0.00    0.000000           0         5           set_tid_address
  0.00    0.000000           0        13           set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00   42.333410               1594736    196300 total

Is there anything else I can provide or test?

Thanks for the help,
Steve
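(To make the mechanism behind those access/getdents counts concrete, the
hot path looks roughly like this. This is paraphrased from memory of
sha1_file.c and builtin/index-pack.c around that commit, not the exact
upstream code.)

  /* sha1_file.c: the lookup after 45e8a74, roughly */
  int has_sha1_file(const unsigned char *sha1)
  {
          struct pack_entry e;

          if (find_pack_entry(sha1, &e))
                  return 1;
          if (has_loose_object(sha1))       /* access() probe of the loose path */
                  return 1;
          reprepare_packed_git();           /* re-read objects/pack on every miss */
          return find_pack_entry(sha1, &e);
  }

  /* builtin/index-pack.c: run once for every object in the incoming pack */
  collision_test_needed = has_sha1_file(sha1);

On a fresh clone none of the incoming objects exist yet, so every call
takes the miss path: an access() probe of the loose object plus a re-read
of objects/pack per object, which is cheap on local disk but very
expensive over NFS.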