Hi,

Thanks for mentioning the FD limit; it was at the default, 1024. I raised it
to a high value for mogstored, fsd, and lighttpd.
However, that didn't seem to fix the problem.
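
To double-check that the running daemons really picked up the new limit (a raised ulimit typically applies only to processes started after the change), something like the sketch below could be used. It assumes a Linux /proc filesystem that exposes /proc/<pid>/limits and /proc/<pid>/fd. The function names and the sample text are made up for illustration.

```python
import os

def parse_max_open_files(limits_text):
    """Return (soft, hard) from the 'Max open files' row, or None if absent."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            fields = line.split()
            # Columns: Max open files <soft> <hard> <units>
            return fields[3], fields[4]
    return None

def fd_status(pid):
    """Return ((soft, hard), open_fd_count) for a running process.

    Assumes Linux; needs permission to read the target's /proc entries.
    """
    with open("/proc/%d/limits" % pid) as fh:
        limits = parse_max_open_files(fh.read())
    open_fds = len(os.listdir("/proc/%d/fd" % pid))
    return limits, open_fds

# Made-up sample in the layout of /proc/<pid>/limits, for illustration only.
SAMPLE = (
    "Limit                     Soft Limit           Hard Limit           Units\n"
    "Max open files            1024                 4096                 files\n"
)
```

Calling fd_status() with the pid of each mogstored, fsd, and lighttpd process would show whether the soft limit is still 1024 and how close the process is to it.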

I too suspect the DB and the storage nodes, because they are the centralized
resources that all fsds connect to.

Still, I couldn't find anything wrong with the DB or storage nodes. The
status in mogadm check is always "writable" for all nodes.

> Do you have a huge replication queue?

Yes, I have about half a million to replicate.

Another piece of evidence from when I ran that many replicators comes from
the colo2/colo4 trackers (they don't run replicate or delete jobs, just
queryworker/listener jobs).
They showed something like this:

[monitor(22673)] Monitor running; scanning usage files
[reaper(22674)] Reaper running; looking for dead devices
Watchdog killing worker 22658 (queryworker)
Child 22824 (queryworker) died: 9 (UNEXPECTED)
Job queryworker has only 34, wants 35, making 1.
[monitor(22673)] dev10: used = 245859688, total = 304629952, writeable = 1
Child 22658 (queryworker) died: 9 (UNEXPECTED)
Job queryworker has only 34, wants 35, making 1.
Watchdog killing worker 22652 (queryworker)
Child 22829 (queryworker) died: 9 (UNEXPECTED)
Job queryworker has only 34, wants 35, making 1.
[monitor(22673)] Monitor running; scanning usage files
[monitor(22673)] dev17: used = 37421628, total = 305723656, writeable = 1
Watchdog killing worker 22651 (queryworker)
Watchdog killing worker 22652 (queryworker)
Watchdog killing worker 22647 (queryworker)
Child 22652 (queryworker) died: 9 (UNEXPECTED)
Job queryworker has only 34, wants 35, making 1.
Watchdog killing worker 22654 (queryworker)
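
To see how often this happens, the output could be tallied with a small script. This is only a sketch: the regular expressions are written against the lines pasted above, and the function name and sample are my own. The "died: 9" status looks like signal 9 (SIGKILL), which would match the watchdog doing the killing, but that reading is an assumption.

```python
import re
from collections import Counter

# Patterns written against the tracker output pasted in this mail.
KILL_RE = re.compile(r"^Watchdog killing worker (\d+) \((\w+)\)")
DIED_RE = re.compile(r"^Child (\d+) \((\w+)\) died: (\d+)")

def tally(lines):
    """Count watchdog kills per job and child deaths per (job, status)."""
    kills = Counter()
    deaths = Counter()
    for line in lines:
        line = line.strip()
        m = KILL_RE.match(line)
        if m:
            kills[m.group(2)] += 1
            continue
        m = DIED_RE.match(line)
        if m:
            deaths[(m.group(2), int(m.group(3)))] += 1
    return kills, deaths

# A few lines copied from the output above.
SAMPLE = """\
[monitor(22673)] Monitor running; scanning usage files
Watchdog killing worker 22658 (queryworker)
Child 22824 (queryworker) died: 9 (UNEXPECTED)
Job queryworker has only 34, wants 35, making 1.
Child 22658 (queryworker) died: 9 (UNEXPECTED)
Watchdog killing worker 22652 (queryworker)
"""

kills, deaths = tally(SAMPLE.splitlines())
```

Running it over the output of each tracker would show whether the queryworker kills only appear while the replicators are running on colo3.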


2008/6/26 dormando <[EMAIL PROTECTED]>:

> I don't have any obvious answers...
>
> Feels like I'd have to go look to be honest. Something's not being
> monitored... Something's running out of FD's/etc.
>
> One tracker replicating can affect other trackers, because:
>
> 1) The tracker queries are a little DB intensive. Is your DB hitting max
> conns or any other weirdness?
>
> 2) One replicator will hit all storage nodes. Since you're pairing storage
> nodes with trackers, those are going to get hit regardless.
>
> Can you monitor reading/writing on your storage nodes? Or watch mogadm more
> closely to see if they wander from the 'writable' status while you have many
> replicators running?
>
> Do you have a huge replication queue? It's weird why starting 10
> replicators would cause everything to die all of a sudden, unless you have a
> pretty massive backlog of crap.
>
> Do you have the same issue if you run one replicator and one delete per
> tracker?
>
> Is lighttpd misconfigured anywhere? Running out of FD's, or failing reads
> during this time? Are you running 2.17 from tarballs, or SVN trunk? If it's
> 2.17, please try SVN trunk; we've fixed quite a bit of crap since then.
>
> Anyone else have easy ideas to check? :\ A create_close makes a handful of
> DB queries then hits the storage node to verify file size.
>
> -Dormando
>
>> I've just tested it in the morning, a non-busy time. I set N to 10 and let
>> it run for 5 minutes.
>>
>> Then tried to inject file, I got:
>>
>> colo4:/home/www# perl script/inject.pl "tempf" "tempfile" t15 /tmp/aaa 1
>> MogileFS::Backend: tracker socket never became readable (local2:7001) when
>> sending command: [create_close domain=tempf&fid=3462585&devid=16&path=
>> http://192.168.11.4:7500/dev16/0/003/462/0003462585.fid&size=10000&key=t15
>> ] at /usr/local/share/perl/5.8.8/MogileFS/NewHTTPFile.pm line 335
>>
>> 0colo4:/home/www# perl script/inject.pl "tempf" "tempfile" t15 /tmp/aaa 1
>> MogileFS::Backend: tracker socket never became readable (local4:7001) when
>> sending command: [create_close domain=tempf&fid=3462595&devid=18&path=
>> http://192.168.11.6:7500/dev18/0/003/462/0003462595.fid&size=10000&key=t15
>> ] at /usr/local/share/perl/5.8.8/MogileFS/NewHTTPFile.pm line 335
>>
>> At that time, the load average of colo2 and colo4 was about 0.5, and
>> colo3's was 1.3. None of them had started swapping yet.
>>
>> The database hadn't started swapping either; its load average was about
>> 1.0, and when I tested some queries on it, the database was still fast.
>>
>> There was no problem with get_paths; it succeeded 100% of the time.
>>
>>
>> So, what I'm trying to find out is why replication running only on colo3
>> affected injection on the other trackers while the DB did not seem to be
>> overloaded.
>>
>> I also noticed that deletion has a similar symptom. And during busy times,
>> if I enable just one replication or deletion job, I usually see some
>> failures on inject.
>>
>> In the past, I tried setting up more trackers on other machines with
>> settings similar to colo4 and colo2, but it didn't help.
>>
>> btw, each of these trackers runs on its own physical machine, and the
>> database is dedicated. Each tracker machine also runs mogstored for talking
>> with the tracker and a lighttpd for serving customers.
>>
>> -thank you
>> kem
>>
>> 2008/6/25 dormando <[EMAIL PROTECTED] <mailto:[EMAIL PROTECTED]>>:
>>
>>
>>    Exactly how many jobs of each are you running?
>>
>>    Some jobs don't scale as well as others. Running more of them would
>>    increase database load...
>>
>>    Are you properly monitoring your database server? Does it end up in
>>    swap, are you high on IO usage already?
>>
>>    The mogilefs clients don't have a very forgiving timeout, so if a
>>    tracker is in swap it'll be unlikely to ever finish its work. However
>>    even if IO is loaded on a machine, the trackers usually respond in
>>    decent time...
>>
>>    Are you running the trackers on the database? Out of CPU? Is your
>>    database actually under any load?
>>
>>    There are no transactions in use in mogilefs... So your theory isn't
>>    likely. The amount of stuff in the queue also doesn't have relation to
>>    load. At least not in any mogilefs service I maintain... I can insert
>>    20+ million fids into file_to_replicate and it'll work fine. It'll be
>>    annoyed and that's a very stupid thing to do, but it won't overload
>>    anything by nature of there being work to do.
>>
>>    What version of mogilefs are you running? 2.17? Latest trunk?
>>
>>    -Dormando
>>
>>    Komtanoo Pinpimai wrote:
>>     > Hello,
>>     >
>>     > My env is:
>>     > 2 trackers running plenty of queryworker, listener, monitor, and
>>     > reaper jobs.
>>     > 1 tracker running only a few delete and replicate jobs.
>>     >
>>     > I guess I didn't have this problem in the old version of
>>     > MogileFS (version 1??).
>>     > What's happening is that when a hard drive goes bad and I mark it
>>     > as dead, it creates tons of replication jobs. If I have 5
>>     > simultaneous replication jobs in a tracker,
>>     > it is really hard to inject even one file into the system (with
>>     > the tracker busy). Or, if I have only 1 replication job
>>     > and it's the customer busy time, injecting a file usually fails.
>>     >
>>     > What's annoying is that no matter how many trackers or
>>     > queryworker/listener jobs I add,
>>     > if there are a few simultaneous replicate jobs with tons of work in
>>     > their queue,
>>     > all the trackers seem to be busy all the time. I also have retry
>>     > code that tries to inject and then sleeps, up to 15 times;
>>     > it still does not work very well in this situation.
>>     >
>>     > Since these trackers share the same database, my guess is that it
>>     > must have something to do with transactions.
>>     > For example, the replication or deletion jobs might create some
>>     > transactions that lock some tables, preventing
>>     > other jobs from injecting files. And when there are tons of
>>     > replication/deletion jobs in line, the whole system
>>     > turns into read-only mode.
>>     >
>>     > Have you ever experienced this, and how do you deal with it? Is
>>     > there a way to tell mogilefs not to use transactions but still
>>     > work _pretty correct_?
>>     >
>>     > --
>>     > I'm going to stop checking email.
>>     > Let's talk in my Hi5.
>>
>>
>>
>>
>> --
>> I'm going to stop checking email.
>> Let's talk in my Hi5.
>>
>
>


-- 
I'm going to stop checking email.
Let's talk in my Hi5.
