I searched the logs for any signs of error. The operations performed on the prod-folder database in the two hours before the first crash were :
https://gist.github.com/VictorNicollet/878d0176960cc71d9ac1

The compact at 10:54:08 finished without a hitch. The compact at 11:54:07 finished with :

https://gist.github.com/VictorNicollet/4d6ccd60bec2ae922a32


On 19 April 2013 00:17, Victor Nicollet <[email protected]> wrote:

> It had happened once on a critical production database (the user
> database...) so I wrote some code to repair it. And I never throw away any
> code.
>
> If you're interested (but I doubt it : it's pretty useless), I could share
> the repair code.
>
> More info on the logs : apparently, the first compact-related crash
> happened Wed, 17 Apr 2013 11:54:08 GMT : since I have hourly compacts, it
> means the corruption happened Wed, 17 Apr 2013 10:54:08 GMT at the
> earliest. Sifting through that period right now...
>
>
> On 19 April 2013 00:13, Robert Newson <[email protected]> wrote:
>
>> You say this happens often? Clearly often enough that you have a
>> routine to repair it.
>>
>> B.
>>
>> On 18 April 2013 23:12, Robert Newson <[email protected]> wrote:
>> > Hi Victor,
>> >
>> > Thanks for the information, we appreciate it.
>> >
>> > B.
>> >
>> > On 18 April 2013 23:07, Victor Nicollet <[email protected]> wrote:
>> >> Replying to my own mail, hoping it will end up in the same thread (I was
>> >> not fully subscribed when I posted this, but I still read the archives).
>> >>
>> >> Answers to the questions you asked :
>> >>
>> >> - I have no idea when the issue happened. I will try to track it down in
>> >> the logs. I'm afraid I don't have time to filter out all customer
>> >> information from the logs and provide them to you, though I can certainly
>> >> grep for error dumps if you want me to. I have never seen disk-related
>> >> errors in the log.
>> >> - I am running Debian x86_64 GNU/Linux, with erlang 1:15.b.1-d
>> >> - There are no unusual CouchDB configuration options ; the only change I
>> >> performed was to disable reduce_limit. A perhaps notable usage aspect : all
>> >> the databases are compacted hourly.
>> >> - It's not NFS. From /etc/fstab :
>> >>
>> >> /dev/sda1 / ext4 errors=remount-ro 0 1
>> >> /dev/sda2 /home ext4 defaults 0 2
>> >>
>> >> The dual-partition setup is a silly default from OVH (my dedicated server
>> >> host), so I have /var/lib/couchdb as a symlink to /home/couchdb/lib, from
>> >> sda1 to sda2.
>> >>
>> >> - I can't rule out a disk issue, because I don't have a lot of experience
>> >> with those... any obvious diagnosis command you would like me to run ? I am
>> >> certain that I have not run out of disk space, though (still around 1TB
>> >> free on that drive).
>> >>
>> >> Thank you for your patience.
>> >>
>> >> On 18 April 2013 14:17, Victor Nicollet <[email protected]> wrote:
>> >>
>> >>> Hello,
>> >>>
>> >>> The @CouchDB twitter account thought you might find this information
>> >>> helpful.
>> >>>
>> >>> My SaaS start-up uses CouchDB as its primary database. Lately, I have been
>> >>> having database corruption issues with version 1.2.0 : every few weeks, one
>> >>> of our databases becomes corrupted, which has several negative consequences
>> >>> (among others) :
>> >>>
>> >>> - Replication of that database fails (it does not even start).
>> >>> - Compaction of that database fails and *freezes* the server.
>> >>> - Several documents in the database become inaccessible through either
>> >>> direct access or through _all_docs.
>> >>>
>> >>> The latest affected database does not contain any information about our
>> >>> customers, so I am allowed to release it publicly :
>> >>>
>> >>> http://nicollet.net/public/2013-04-18.couchdb/prod-folder.couch
>> >>>
>> >>> This database contains 325 irretrievable documents between identifiers
>> >>> 2xFEY0pU2Eb and 3Fn6l04G6Oa.
>> >>>
>> >>> I hope this helps,
>> >>>
>> >>> --
>> >>> Victor Nicollet, CTO, www.runorg.com
>> >>>
>> >>
>> >>
>> >>
>> >> --
>> >> Victor Nicollet, Directeur Technique, www.runorg.com
>> >
>
>
>
> --
> Victor Nicollet, Directeur Technique, www.runorg.com
>

--
Victor Nicollet, Directeur Technique, www.runorg.com
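
The reasoning in the thread (hourly compactions, first failed compaction at Wed, 17 Apr 2013 11:54:08 GMT, hence corruption no earlier than 10:54:08 GMT) can be sketched as a small log filter. This is only an illustration, not the code used in this thread. It assumes each log line starts with a bracketed RFC 1123 timestamp such as "[Wed, 17 Apr 2013 10:54:08 GMT]", and the names corruption_window and lines_in_window are hypothetical helpers introduced here.

```python
import re
from datetime import timedelta
from email.utils import parsedate_to_datetime

# Assumption: every log line of interest begins with "[<RFC 1123 date>]".
TIMESTAMP = re.compile(r"^\[([^\]]+)\]")


def corruption_window(first_failed_compact, interval=timedelta(hours=1)):
    """Return (earliest, latest) times at which the corruption can have occurred.

    The previous compaction succeeded, so the damage must have happened
    after it, i.e. within one compaction interval before the first failure.
    """
    end = parsedate_to_datetime(first_failed_compact)
    return end - interval, end


def lines_in_window(lines, start, end):
    """Keep only the log lines whose leading timestamp falls in [start, end]."""
    kept = []
    for line in lines:
        match = TIMESTAMP.match(line)
        if not match:
            continue  # continuation line or unknown format
        try:
            stamp = parsedate_to_datetime(match.group(1))
            inside = start <= stamp <= end
        except (TypeError, ValueError):
            continue  # bracket did not hold a usable, timezone-aware date
        if inside:
            kept.append(line)
    return kept


if __name__ == "__main__":
    start, end = corruption_window("Wed, 17 Apr 2013 11:54:08 GMT")
    print("look for errors between", start.isoformat(), "and", end.isoformat())
```

To use it, pass the lines of the CouchDB log file (for example from open(path), where path is wherever the log lives on the server) to lines_in_window together with the two bounds, then grep the result for the prod-folder database.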
