Victor,

I finally remembered to ask a few of the ops guys I work with, while they were online, about things to run to check for faulty hardware. Their general suggestions for detecting disk errors are to first check dmesg and /var/log/messages for anything that looks amiss, and then to run fsck and smartctl. fsck checks the filesystem integrity, and smartctl will let you know if the disk thinks it's broken.
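If reading through dmesg and /var/log/messages by eye gets tedious, the log check could be sketched roughly like this in Python. Treat it as a starting point only: the patterns are my assumption of what typical disk-error lines look like in Linux kernel logs, they are not exhaustive, and find_disk_errors is a made-up helper name.

```python
import re

# Assumed signatures of disk trouble in Linux kernel logs.
# This list is illustrative, not exhaustive; extend it for your setup.
DISK_ERROR_PATTERNS = [
    r"I/O error",
    r"EXT4-fs error",
    r"\bata\d+(?:\.\d+)?: .*(?:error|failed|exception)",
    r"Medium Error",
    r"remounting filesystem read-only",
]

# One case-insensitive regex that matches if any pattern matches.
_COMBINED = re.compile(
    "|".join(f"(?:{p})" for p in DISK_ERROR_PATTERNS),
    re.IGNORECASE,
)


def find_disk_errors(lines):
    """Return the log lines that match any of the assumed patterns."""
    return [line.rstrip("\n") for line in lines if _COMBINED.search(line)]
```

You could feed it an open log file, for example find_disk_errors(open("/var/log/messages", errors="replace")), and look at whatever comes back. An empty result does not prove the disk is healthy, so it is still worth running smartctl.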

You may also want to run a RAM test on the machine. I'm told that most BIOSes should have a utility for doing that these days. Otherwise there's memtest86+, which is available as a downloadable ISO. They say that, if you can, you should just let it run overnight; if the machine is frozen in the morning, you've found the issue.

HTH,
Paul Davis

On Thu, Apr 18, 2013 at 5:07 PM, Victor Nicollet <[email protected]> wrote:

> Replying to my own mail, hoping it will end up in the same thread (I was
> not fully subscribed when I posted this, but I still read the archives).
>
> Answers to the questions you asked :
>
> - I have no idea when the issue happened. I will try to track it down in
> the logs. I'm afraid I don't have time to filter out all customer
> information from the logs and provide them to you, though I can certainly
> grep for error dumps if you want me to. I have never seen disk-related
> errors in the log.
> - I am running Debian x86_64 GNU/Linux, with erlang 1:15.b.1-d
> - There are no unusual CouchDB configuration options ; the only change I
> performed was to disable reduce_limit. A perhaps notable usage aspect : all
> the databases are compacted hourly.
> - It's not NFS. From /etc/fstab :
>
> /dev/sda1 / ext4 errors=remount-ro 0 1
> /dev/sda2 /home ext4 defaults 0 2
>
> The dual-partition setup is a silly default from OVH (my dedicated server
> host), so I have /var/lib/couchdb as a symlink to /home/couchdb/lib, from
> sda1 to sda2.
>
> - I can't rule out a disk issue, because I don't have a lot of experience
> with those... any obvious diagnosis command you would like me to run ? I am
> certain that I have not run out of disk space, though (still around 1TB
> free on that drive).
>
> Thank you for your patience.
>
> On 18 April 2013 14:17, Victor Nicollet <[email protected]> wrote:
>
>> Hello,
>>
>> The @CouchDB twitter account thought you might find this information
>> helpful.
>>
>> My SaaS start-up uses CouchDB as its primary database. Lately, I have been
>> having database corruption issues with version 1.2.0 : every few weeks, one
>> of our databases becomes corrupted, which has several negative consequences
>> (among others) :
>>
>> - Replication of that database fails (it does not even start).
>> - Compaction of that database fails and *freezes* the server.
>> - Several documents in the database become inaccessible through either
>> direct access or through _all_docs.
>>
>> The latest affected database does not contain any information about our
>> customers, so I am allowed to release it publicly :
>>
>> http://nicollet.net/public/2013-04-18.couchdb/prod-folder.couch
>>
>> This database contains 325 irretrievable documents between identifiers
>> 2xFEY0pU2Eb and 3Fn6l04G6Oa.
>> I hope this helps,
>>
>> --
>> Victor Nicollet, CTO, www.runorg.com
>>
>
> --
> Victor Nicollet, Directeur Technique, www.runorg.com
