Hi,

Thanks for mentioning the FD limit -- it was at the default of 1024. I set it
to a higher value for mogstored, fsd, and lighttpd, but that didn't seem to
fix the problem.
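For anyone comparing notes, here is a quick sketch of how to verify the limit actually took effect (assumes Linux; the daemon name and the limits.conf values are just examples, not from my config):

```shell
# Print the soft and hard open-file limits for the current shell.
soft=$(ulimit -Sn)
hard=$(ulimit -Hn)
echo "soft=$soft hard=$hard"

# The shell's limit isn't what matters for a daemon started from an init
# script, so check the running process itself, e.g.:
#   grep 'open files' /proc/$(pgrep -o mogstored)/limits
# and make the raise persistent in /etc/security/limits.conf, e.g.:
#   mogile  soft  nofile  65536
#   mogile  hard  nofile  65536
```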
I too suspect the DB and the storage nodes, because they are centralized
resources that all fsds connect to. Still, I couldn't find anything wrong with
either the DB or the storage nodes; mogadm check always reports "writable" for
all nodes.

> Do you have a huge replication queue?

Yes, I have about half a million files to replicate. More evidence: when I ran
that many replicators, the colo2/colo4 trackers (which have no replicator or
delete jobs running, just queryworker/listener) showed output like this:

[monitor(22673)] Monitor running; scanning usage files
[reaper(22674)] Reaper running; looking for dead devices
Watchdog killing worker 22658 (queryworker)
Child 22824 (queryworker) died: 9 (UNEXPECTED)
Job queryworker has only 34, wants 35, making 1.
[monitor(22673)] dev10: used = 245859688, total = 304629952, writeable = 1
Child 22658 (queryworker) died: 9 (UNEXPECTED)
Job queryworker has only 34, wants 35, making 1.
Watchdog killing worker 22652 (queryworker)
Child 22829 (queryworker) died: 9 (UNEXPECTED)
Job queryworker has only 34, wants 35, making 1.
[monitor(22673)] Monitor running; scanning usage files
[monitor(22673)] dev17: used = 37421628, total = 305723656, writeable = 1
Watchdog killing worker 22651 (queryworker)
Watchdog killing worker 22652 (queryworker)
Watchdog killing worker 22647 (queryworker)
Child 22652 (queryworker) died: 9 (UNEXPECTED)
Job queryworker has only 34, wants 35, making 1.
Watchdog killing worker 22654 (queryworker)

2008/6/26 dormando <[EMAIL PROTECTED]>:

> I don't have any obvious answers...
>
> Feels like I'd have to go look to be honest. Something's not being
> monitored... Something's running out of FDs/etc.
>
> One tracker replicating can affect other trackers, because:
>
> 1) The tracker queries are a little DB intensive. Is your DB hitting max
> conns or any other weirdness?
>
> 2) One replicator will hit all storage nodes. Since you're pairing storage
> nodes with trackers, those are going to get hit regardless.
>
> Can you monitor reading/writing on your storage nodes? Or watch mogadm more
> closely to see if they wander from the 'writable' status while you have many
> replicators running?
>
> Do you have a huge replication queue? It's weird why starting 10
> replicators would cause everything to die all of a sudden, unless you have a
> pretty massive backlog of crap.
>
> Do you have the same issue if you run one replicator and one delete per
> tracker?
>
> Is lighttpd misconfigured anywhere? Running out of FDs, or failing reads
> during this time? Are you running 2.17 from tarballs, or SVN trunk? Please
> try SVN trunk if so, we've fixed quite a bit of crap since then.
>
> Anyone else have easy ideas to check? :\ A create_close makes a handful of
> DB queries then hits the storage node to verify file size.
>
> -Dormando
>
>> I just tested it in the morning non-busy time: I set N to 10 and let it
>> run for 5 minutes.
>>
>> Then I tried to inject a file, and got:
>>
>> colo4:/home/www# perl script/inject.pl "tempf" "tempfile" t15 /tmp/aaa 1
>> MogileFS::Backend: tracker socket never became readable (local2:7001) when
>> sending command: [create_close domain=tempf&fid=3462585&devid=16&path=
>> http://192.168.11.4:7500/dev16/0/003/462/0003462585.fid&size=10000&key=t15
>> ] at /usr/local/share/perl/5.8.8/MogileFS/NewHTTPFile.pm line 335
>>
>> colo4:/home/www# perl script/inject.pl "tempf" "tempfile" t15 /tmp/aaa 1
>> MogileFS::Backend: tracker socket never became readable (local4:7001) when
>> sending command: [create_close domain=tempf&fid=3462595&devid=18&path=
>> http://192.168.11.6:7500/dev18/0/003/462/0003462595.fid&size=10000&key=t15
>> ] at /usr/local/share/perl/5.8.8/MogileFS/NewHTTPFile.pm line 335
>>
>> By this time, the load average of colo2 and colo4 was about 0.5, and colo3
>> was about 1.3.
>> None of them had hit swap yet.
>>
>> The database hadn't hit swap either; its load average was about 1.0, and I
>> ran some test queries on it -- the database was still fast.
>>
>> There was no problem with get_paths; it succeeded 100% of the time.
>>
>> So, what I'm trying to find out is why replication running on only colo3
>> affected injection on the other trackers while the DB did not seem to be
>> overloaded.
>>
>> I also noticed that deletion shows a similar symptom. And during busy
>> times, if I enable even one replication or deletion job, I usually see
>> some failures on inject.
>>
>> In the past, I tried setting up another tracker on a separate machine with
>> a setup similar to colo4 and colo2, but it didn't help.
>>
>> By the way, these trackers each run on their own physical machine, and the
>> database is dedicated. Each tracker machine also runs a mogstored for the
>> tracker to talk to and a lighttpd for serving customers.
>>
>> -thank you
>> kem
>>
>> 2008/6/25 dormando <[EMAIL PROTECTED]>:
>>
>> Exactly how many jobs of each are you running?
>>
>> Some jobs don't scale as well as others. Running more of them would
>> increase database load...
>>
>> Are you properly monitoring your database server? Does it end up in
>> swap? Are you high on IO usage already?
>>
>> The mogilefs clients don't have a very forgiving timeout, so if a
>> tracker is in swap it'll be unlikely to ever finish its work. However,
>> even if IO is loaded on a machine, the trackers usually respond in
>> decent time...
>>
>> Are you running the trackers on the database? Out of CPU? Is your
>> database actually under any load?
>>
>> There are no transactions in use in mogilefs... so your theory isn't
>> likely. The amount of stuff in the queue also has no relation to
>> load, at least not in any mogilefs service I maintain... I can insert
>> 20+ million fids into file_to_replicate and it'll work fine.
>> It'll be
>> annoyed and that's a very stupid thing to do, but it won't overload
>> anything by nature of there being work to do.
>>
>> What version of mogilefs are you running? 2.17? Latest trunk?
>>
>> -Dormando
>>
>> Komtanoo Pinpimai wrote:
>> > Hello,
>> >
>> > My env is:
>> > 2 trackers running plenty of queryworker, listener, monitor, and reaper
>> > jobs.
>> > 1 tracker running only a few delete and replicate jobs.
>> >
>> > I don't think I had this problem in the old version of
>> > MogileFS (version 1??).
>> > What's happening is that when a hard drive goes bad and I mark it as
>> > dead, it creates tons of replication jobs. If I have 5 simultaneous
>> > replication jobs on a tracker,
>> > it becomes really hard to inject a file into the system (with the
>> > tracker busy). Even with only 1 replication job,
>> > if it's customer busy time, injecting a file usually fails.
>> >
>> > What's annoying is that no matter how many trackers or
>> > queryworker/listener jobs I add,
>> > if there are a few simultaneous replicate jobs with tons of work in
>> > their queues,
>> > all the trackers seem to be busy all the time. I also have retry code
>> > that tries to inject and sleeps, up to 15 times,
>> > and it still does not work very well in this situation.
>> >
>> > Since these trackers share the same database, my guess is it
>> > must have something to do with transactions --
>> > for example, the replication or deletion jobs might create
>> > transactions that lock some tables, preventing
>> > other jobs from injecting files. And when there are tons of
>> > replication/deletion jobs in line, the whole system
>> > effectively turns read-only.
>> >
>> > Have you ever experienced this, and how do you deal with it? Is there a
>> > way to tell mogilefs not to use transactions but still work _pretty
>> > correctly_?
>> >
>> > --
>> > I'm going to stop checking email.
>> > Let's talk in my Hi5.
>>
>> --
>> I'm going to stop checking email.
>> Let's talk in my Hi5.

--
I'm going to stop checking email.
Let's talk in my Hi5.
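P.S. The "inject and sleep up to 15 times" retry code mentioned above is basically this shape -- a minimal sketch, where the helper name, attempt count, and delay are made up for illustration:

```shell
# Retry a command up to N times, sleeping briefly between attempts.
# (Sketch only; the real wrapper lives around script/inject.pl.)
retry_inject() {
    tries=$1; shift
    i=1
    while [ "$i" -le "$tries" ]; do
        "$@" && return 0   # success: stop retrying
        sleep 1            # back off before the next attempt
        i=$((i + 1))
    done
    return 1               # every attempt failed
}

# In practice the command would be the inject script, e.g.:
#   retry_inject 15 perl script/inject.pl "tempf" "tempfile" t15 /tmp/aaa 1
retry_inject 3 true && echo "injected"
```

As noted in the thread, retries alone don't get around the trackers being busy while replication is running.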
