Looks like I found a fix for the bug, though I'm not 100% sure what
the actual bug is. The fix was to change mochiweb to send the HTTP
chunk in a single gen_tcp:send/2 call. Previously it sent the length
in one call, then the data followed by the trailing CRLF in another call.
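Roughly, the change looks like this (just a sketch of the idea, with
made-up function names, not the exact mochiweb code):

    %% Before: the chunk header and the chunk body went out in two
    %% separate gen_tcp:send/2 calls, so they could land in two packets.
    send_chunk_old(Socket, Data) ->
        Size = erlang:integer_to_list(iolist_size(Data), 16),
        gen_tcp:send(Socket, [Size, "\r\n"]),
        gen_tcp:send(Socket, [Data, "\r\n"]).

    %% After: the whole chunk is built as one iolist and sent with a
    %% single gen_tcp:send/2 call.
    send_chunk_new(Socket, Data) ->
        Size = erlang:integer_to_list(iolist_size(Data), 16),
        gen_tcp:send(Socket, [Size, "\r\n", Data, "\r\n"]).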
My theory of the bug is that the Safari HTTP client is getting the
chunked end marker in 2 packets. It gets the 0 length + CRLF line in
one packet, and when it asks for the trailing CRLF in the next packet
it's not there yet, so it just skips it (for reasons yet unknown). But
the CRLF is still coming, and when the client goes ahead and makes the
next request and tries to read the next response, it instead gets that
previous CRLF it had skipped. Because it gets a weird, unexpected
response, it retries the request.
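For reference, the chunked-encoding terminator at the end of a response
is just these bytes:

    0\r\n      <- zero-length chunk line (arrives in the first packet)
    \r\n       <- final CRLF ending the message (arrives later)

If the client doesn't wait for that final CRLF, it's left sitting in
the socket buffer at the front of whatever the client reads next.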
The fix is to put the whole chunk in one gen_tcp:send/2 call, which I
think forces it into a single TCP packet, so the trailing CRLF is
always available immediately. The fix is simple, and I think it will
also be more efficient for most use cases. However, I think there
might still be a flaw in Safari here that could bite us. I also think
the idempotence work for document creation is still necessary.
I want to take this fix along with the recent replication and
compaction bug fixes and create a 0.8.1.
-Damien
On Jul 23, 2008, at 3:01 PM, Damien Katz wrote:
Right now we are having a major problem with HTTP requests being
retried. This problem is responsible for the test suite failures
seen constantly in Safari (though others report similar failures in
Firefox, I've not seen them myself). And it's not just test suite
failures; some are seeing the same behavior in production.
The major symptoms of this problem:
1. Mysterious conflict - You get a conflict error saving a document
to the db. When you examine the existing db document, it's already
got your changes.
2. Duplicate document - When creating a new document via POST, you
occasionally get 2 new documents created instead of one.
#1 is annoying but not too serious; no data is lost or corrupted. #2
is a bit more dangerous, because you could consider the database
corrupted by having the duplicate document (depending on what
problems it would cause for your app).
What is happening in both these cases is that the HTTP requests are
getting sent and processed twice. The first request is given to
CouchDB and is handled, but when CouchDB attempts to send the
response, the connection is reset (apparently). Then another
identical HTTP request comes in and the request is processed again.
I am not a TCP expert, but by viewing the network traffic via
tcpdump, it is obvious that the request packets, 1 header packet and
1 body packet, are getting resent from the client to the server. I do
not know if the packets are being resent at the TCP level, or if the
HTTP client in Safari is retrying the request after getting a TCP
error.
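For anyone who wants to look at the same traffic, something like this
should show it (just an example invocation; adjust the loopback
interface name for your platform):

    tcpdump -i lo0 -s 0 -A 'tcp port 5984'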
I do not know why the network error or subsequent resend is
happening. I can only confirm that it *is* happening. If this is at
the TCP level, then it means we definitely need to do away with the
non-idempotent POST to create new documents.
I think we need to do that anyway, though. While this network error
should not be happening, it did expose an interesting problem with our
use of POST for document creation. The problem is that the document's
id is a UUID generated server side, so the server has no way to
distinguish a new request from a resend of an already processed
request, and so it generates another UUID and creates another new
document. But if the UUID is generated by the client, then the resend
will cause a conflict error, because that UUID already exists in the
DB, thus eliminating the duplicate data.
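To illustrate the difference (the document id below is just made up):

    POST /somedb                      -> server generates a new UUID for
                                         each request, so a resent request
                                         quietly creates a second document

    PUT /somedb/0f1c9a3b7d2e4f5a8b6c  -> the id comes from the client, so
                                         a resent request hits the same id
                                         and gets a conflict error instead
                                         of creating a duplicate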
However, we still need to figure out why this is happening in the
first place. Why is the connection being reset and why is the
request being retried?
If anyone wants to try to debug this, here is what I've been doing:
1. Run a packet sniffer for local port 5984 and start couchdb
2. Go to http://127.0.0.1:5984/_utils/ and click the "Test Suite" link
3. Run the "basics" test manually until you see a "conflict error"
exception in test result. (This exception stops the test executing.
I don't try to debug other test failures, since the test keeps on
running after the failure)
4. The last few requests will be the duplicated requests. There is
information about the packets, but I don't know how to interpret it.
Any help and input appreciated.
-Damien