Hello. I am new to the list. I needed to make some adjustments to the mod_dav code, and I'm hoping someone can confirm that what I have done makes sense.
Some Info:
We have anywhere from 250k to 1 million PUTs a night. Of those, about 50 usually return a 204 status even though the file does not actually exist once the upload is finished. This is obviously a huge problem.
The Setup:
We have 4 web servers and 6 file servers. This problem happens on every web and file server. The file servers are attached using NFS with the sync option. The web servers are running Apache 2.0.55 on FC3. The file servers are mostly FC3, but two are RH9. Keep-Alive is turned on in Apache. The client software that uploads the files will retry the upload up to 3 times, stopping as soon as it gets a status >= 200 and < 300.
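To make the client behaviour concrete, here is a minimal sketch of the retry rule described above. It is not the real client code: the names upload_fn, upload_with_retry, scripted_server and scripted_attempt are made up for illustration, and treating "3 times" as a cap on total attempts is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: performs one PUT attempt and returns the HTTP status. */
typedef int (*upload_fn)(void *ctx);

/* Retry until a 2xx status arrives or max_attempts is used up.
 * Returns the last status seen. */
int upload_with_retry(upload_fn attempt, void *ctx, int max_attempts)
{
    int status = 0;
    int i;

    assert(attempt != NULL && max_attempts > 0);
    for (i = 0; i < max_attempts; i++) {
        status = attempt(ctx);
        if (status >= 200 && status < 300) {
            break;  /* any 2xx, including a 204, is treated as success */
        }
    }
    return status;
}

/* Stand-in for the server: replays a fixed list of statuses, one per call. */
struct scripted_server {
    const int *statuses;
    int count;
    int calls;
};

int scripted_attempt(void *ctx)
{
    struct scripted_server *s = ctx;
    int idx = s->calls < s->count ? s->calls : s->count - 1;

    s->calls++;
    return s->statuses[idx];
}
```

With a scripted sequence of 500 followed by 204, upload_with_retry stops after the second attempt and reports 204, so from the client's point of view the upload succeeded.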
More Info:
I was able to narrow down the problem. It seems to happen only with requests that first return a 500 (the "Could not get next bucket brigade" error). The client gets the 500 and then starts the transfer again. Apache then responds with a 204, and the client thinks the upload worked, even though it really didn't. Every uploaded filename is unique, so a PUT should always create a new file and return a 201. Seeing a 204 is therefore odd: it must mean that the first request (the 500) had not finished when the second request (the 204) started.
An Example For One File:
The Apache logs show that both requests (the 500 and the 204) had the same request time, such as "2005-12-27 23:01:43", even though there's no way they could have happened at the same time, since the 500 request took 423s and the 204 took 133s. Also, the 500 reported (using mod_logio) an input of 1677285 bytes, yet the input for the 204 was 1996163 bytes.
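For reference, here is a small sketch of the arithmetic if the logged values are taken at face value. It assumes the logged request time is the moment each request was received, which has not been verified for this log format. The names logged_request, end_time and overlap_seconds are made up for illustration.

```c
#include <assert.h>

/* One request as it appears in the log: start time in seconds relative to
 * an arbitrary reference point, and duration in seconds. */
struct logged_request {
    long start;
    long duration;
};

/* Time at which the request finished. */
long end_time(struct logged_request r)
{
    assert(r.duration >= 0);
    return r.start + r.duration;
}

/* Seconds during which both requests were in flight; 0 if they never overlap. */
long overlap_seconds(struct logged_request a, struct logged_request b)
{
    long start = a.start > b.start ? a.start : b.start;
    long end_a = end_time(a);
    long end_b = end_time(b);
    long end = end_a < end_b ? end_a : end_b;

    return end > start ? end - start : 0;
}
```

Feeding in a start of 0 for both requests with durations of 423 and 133 gives an overlap of 133 seconds, with the 500 finishing 290 seconds after the 204. That only holds under the stated assumption about what the timestamp means.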
My Solution:
At the end of the dav_method_put function, I added code that actually checks the existence of the file that was uploaded, and also checks the size if it does exist.
So right before the return, I added:
    struct stat statinfo;

    /* Verify that the uploaded file is really on disk before reporting success. */
    if (stat(r->filename, &statinfo) != 0) {
        err = dav_new_error(r->pool, HTTP_NOT_FOUND, 0,
                            apr_psprintf(r->pool,
                                         "File Not Found After PUT: %s",
                                         r->filename));
        return dav_handle_err(r, err, NULL);
    } else {
        /* THIS SECTION HAS NEVER BEEN NEEDED */
        if (statinfo.st_size != total_written) {
            /* Pass the format string directly so a '%' in the filename is
             * never treated as a conversion; cast the sizes to match it. */
            ap_log_rerror(APLOG_MARK, APLOG_ERR, 0, r,
                          "Invalid PUT: %s (WRITTEN: %" APR_INT64_T_FMT
                          ", SIZE: %" APR_INT64_T_FMT ")",
                          r->filename, (apr_int64_t)total_written,
                          (apr_int64_t)statinfo.st_size);
        } else {
            ap_log_rerror(APLOG_MARK, APLOG_NOTICE, 0, r,
                          "Successful PUT: %s (WRITTEN: %" APR_INT64_T_FMT ")",
                          r->filename, (apr_int64_t)total_written);
        }
    }
Unfortunately, this has only reduced the number of lost files; it has not eliminated them.
The only explanation I can think of is that the process handling the 500 somehow stays alive until the 204 request finishes and then deletes the file. Also, the fact that a 204 is returned means the second request was writing over the version of the file created by the 500 request, so the 500 request had not finished when the 204 request started.
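To illustrate the suspected sequence, here is a small standalone sketch using plain POSIX calls on a local file. It is a simulation of the hypothesis only, not mod_dav code: whether the failed request really removes the file this late is exactly the open question, and NFS may behave differently from a local filesystem. The names race_result and simulate_overlapping_puts are made up for illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

struct race_result {
    int first_status;       /* status reported for the first request */
    int second_status;      /* status reported for the retry */
    int file_exists_after;  /* 1 if the file is on disk at the end */
};

/* Returns 1 if path exists, 0 otherwise. */
static int path_exists(const char *path)
{
    struct stat st;
    return stat(path, &st) == 0;
}

/* Replays the hypothesised ordering; statuses stay 0 if an I/O call fails. */
struct race_result simulate_overlapping_puts(const char *path)
{
    struct race_result res = {0, 0, 0};
    int fd_a, fd_b, existed_b;

    assert(path != NULL);
    unlink(path);  /* start from a clean state; an error here is harmless */

    /* Request A creates the file, then fails while reading the body. */
    fd_a = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd_a < 0) {
        return res;
    }
    if (write(fd_a, "partial", 7) != 7) {
        close(fd_a);
        unlink(path);
        return res;
    }
    res.first_status = 500;  /* client is told 500; cleanup has not run yet */

    /* Request B (the retry) starts while A's file is still on disk. */
    existed_b = path_exists(path);
    fd_b = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd_b < 0) {
        close(fd_a);
        unlink(path);
        return res;
    }
    if (write(fd_b, "complete body", 13) != 13) {
        close(fd_b);
        close(fd_a);
        unlink(path);
        return res;
    }
    close(fd_b);
    res.second_status = existed_b ? 204 : 201;  /* replaced vs. created */

    /* Request A's delayed cleanup removes the path B just finished writing. */
    close(fd_a);
    unlink(path);

    res.file_exists_after = path_exists(path);
    return res;
}
```

Run against a scratch path, this yields a 204 for the retry and no file on disk afterwards, which matches the symptoms. It also suggests why the stat() check only helps when the delayed cleanup happens before the check runs.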
Thanks for your help!
-Steve
