(Cc:ing this to the APR Development List in hopes that someone there reads my problem description and it triggers some distant memory of a long-ago pain, or that perhaps they can otherwise shed some light on the situation. For reference, I am running Red Hat 7.2, and the APR from HTTP 2.0.44.)
[EMAIL PROTECTED] writes: > Hm. Well debugging led me to believe that 5202 didn't actually break > something, but perhaps shed light on a previously existing other > problem. I went over that change with a fine-toothed comb, and could > find absolutely no fault with it. After spending all of Friday night and Saturday morning hunting this puppy, I was all set to put it out of mind until Monday morning. But you know how that goes. The wife wanted to do a little reading before going to sleep tonight, so I figured I'd use that time to take a fresh look at this nasty bug. Wow, does mental clarity help when debugging. I think I've finally narrowed the bug down to a small enough problem area, and let me tell you -- it ain't where I expected it. Let me recite the facts I had as of 7:00am Saturday: - Subversion as of revision 5202 was not passing 'make check' over DAV. The initial import that happens at the beginning of the test run was failing because the directory "A" (in the greek tree) was being imported more than once. In case you don't know, 5202 was the revision in which I changed 'svn import' back to an "infix textdelta" commit process instead of a "postfix textdelta" one. This means that the contents of imported files are now transmitted as the files are created in the repository instead of having all the files created as part of the directory walk, and then all the content for all the files being transmitted at the very end of the import. - The bug seemed to only occur when 'svn import' was run inside the Python test suite, but never when run on the cmdline. I could trigger the bug by running 'basic_tests.py 1 --url http://localhost' or by calling actions.guarantee_greek_repository() from inside of a Python interpreter. - svn_io_dir_read() was definitely, without-a-doubt, returning the following ordered list of dirents: { '.', '..', 'iota', 'A', 'A' } (And the second 'A' triggered the failure.) - Reverting the 5202 change caused the bug to go away. - Keeping the 5202 change, but commenting out the call to editor->apply_textdelta() that happens during the import would also cause the bug to go away (though of course, now the tests would fail because no file contents were being transmitted). I began my search tonight with a focus on that messed-up svn_io_dir_read() situation. Recall that svn_io_dir_read() is just a wrapper around apr_dir_read(). First, I modified the import_dir() function to simply do the loop over dirents, but nothing else. I gdb'd the client during its 'svn import' from a test run and examined the apr_dir_t structure used in that call. The 'dirstruct' member of that structure was particularly interesting to me, and looked like this after each call (the paths after the '###' below are the current dirent just read by svn_io_dir_read()): ### "." (gdb) p *dir->dirstruct $4 = {fd = 4, data = 0x80896a0 "C^\b", allocation = 4096, size = 64, offset = 16, filepos = 1, lock = {__m_reserved = 0, __m_count = 0, __m_owner = 0x0, __m_kind = 0, __m_lock = {__status = 0, __spinlock = 0}}} ### ".." (gdb) p *dir->dirstruct $6 = {fd = 4, data = 0x80896a0 "C^\b", allocation = 4096, size = 64, offset = 32, filepos = 2, lock = {__m_reserved = 0, __m_count = 0, __m_owner = 0x0, __m_kind = 0, __m_lock = {__status = 0, __spinlock = 0}}} ### "iota" (gdb) p *dir->dirstruct $8 = {fd = 4, data = 0x80896a0 "C^\b", allocation = 4096, size = 64, offset = 48, filepos = 3, lock = {__m_reserved = 0, __m_count = 0, __m_owner = 0x0, __m_kind = 0, __m_lock = {__status = 0, __spinlock = 0}}} ### "A" (gdb) p *dir->dirstruct $10 = {fd = 4, data = 0x80896a0 "C^\b", allocation = 4096, size = 64, offset = 64, filepos = 4, lock = {__m_reserved = 0, __m_count = 0, __m_owner = 0x0, __m_kind = 0, __m_lock = {__status = 0, __spinlock = 0}}} Then, I reverted that temporary change and debugged my client again to see what happened when my weird bug showed up: ### "." (gdb) p *dir->dirstruct $1 = {fd = 4, data = 0x80896a0 "çh\b", allocation = 4096, size = 64, offset = 16, filepos = 1, lock = {__m_reserved = 0, __m_count = 0, __m_owner = 0x0, __m_kind = 0, __m_lock = {__status = 0, __spinlock = 0}}} ### ".." (gdb) p *dir->dirstruct $3 = {fd = 4, data = 0x80896a0 "çh\b", allocation = 4096, size = 64, offset = 32, filepos = 2, lock = {__m_reserved = 0, __m_count = 0, __m_owner = 0x0, __m_kind = 0, __m_lock = {__status = 0, __spinlock = 0}}} ### "iota" (gdb) p *dir->dirstruct $5 = {fd = 4, data = 0x80896a0 "çh\b", allocation = 4096, size = 64, offset = 48, filepos = 3, lock = {__m_reserved = 0, __m_count = 0, __m_owner = 0x0, __m_kind = 0, __m_lock = {__status = 0, __spinlock = 0}}} ### "A" (gdb) p *dir->dirstruct $7 = {fd = 4, data = 0x80896a0 "çh\b", allocation = 4096, size = 64, offset = 64, filepos = 4, lock = {__m_reserved = 0, __m_count = 0, __m_owner = 0x0, __m_kind = 0, __m_lock = {__status = 0, __spinlock = 0}}} ### "A" (again) (gdb) p *dir->dirstruct $10 = {fd = 4, data = 0x80896a0 "èh\b", allocation = 4096, size = 16, offset = 16, filepos = 5, lock = {__m_reserved = 0, __m_count = 0, __m_owner = 0x0, __m_kind = 0, __m_lock = {__status = 0, __spinlock = 0}}} Whoa! I noticed almost immediately that on that second "A" that is returned, the *dir->dirstruct->data member had suddenly changed (and the offset member was no longer incrementing by 16 like it had been). This got me thinking that something down in APR was causing illness. So I set a watchpoint on *dir->dirstruct->data, and ran the test again. The watchpoint was first triggered when the field was first initialized with directory data, and then *POW* it was triggered again on a call to readdir_r() (from apr_dir_read()). So, readdir_r() was actually fouling up somehow, losing track of where it was in its directory reading or something. At this point, I mis-read the documentation for readdir_r(). I recalled that we were creating a temporary file in the top-level import directory each time file contents were sent (in the editor->apply_textdelta() call, which is commit_apply_txdelta in ra_dav). My misunderstanding of the readdir_r() manpage caused me to hack a change in ra.c so that instead of making the temporary file in the top-level import directory, it always made the file in "/tmp". *POOF*. The bug disappeared. At this point, I called Karl. We talked through the situation and by the end had decided on a couple of things: - First, that I had misread the readdir_r() manpage, but that regardless, something seemed to be quite wrong with how that function (or APR's use of that function) was behaving on my system. - Secondly, the reason that 5202 appeared the cause of the bug was because when we were doing postfix textdeltas, we had finished doing all the directory reading for the whole import before we created our first temporary file, thus our tmpfile creation wasn't screwing up dir reads like it is for me now. Now that I've read the readdir_r() manpage correctly, it would appear that it should be completely safe to have this kind of processing: opendir(DIR) while (readdir(DIR)) { create_file(DIR/tmpfile) do_stuff() delete_file(DIR/tmpfile) } But apparently this isn't working out too well on my machine. Does anyone have any reason to believe that this would not be safe, and have any ideas why running Subversion via a Python popen3() call would cause some problem to occur which does not run Subversion is run directly from the shell? Inquiring (and admittedly boggled) minds wanna know.