Ahh, progress... I updated the configuration file on our tracker hosts to include the following line:
old_repl_compat 0

At a glance, we appear to be getting fewer timeouts for the replicate, query, and monitor workers. The row count in file_to_replicate is going down (despite the continuing FSCK). We're keeping an eye on progress for now.

mysql> select count(*) from file_to_replicate;
+----------+
| count(*) |
+----------+
| 58604833 |
+----------+
1 row in set (14.81 sec)

mysql> select count(*) from file_to_replicate;
+----------+
| count(*) |
+----------+
| 58604490 |
+----------+
1 row in set (14.96 sec)

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Brian Lynch
Sent: Wednesday, April 30, 2008 3:36 PM
To: dormando
Cc: [email protected]
Subject: RE: Replication Oddities

I figured something out. Old replication compatibility is enabled for some reason (it is not in the configuration). This explains the large number of entries in file_to_replicate, and why our database has been periodically hanging on the following query:

SELECT fid FROM file WHERE dmid='1' AND classid='1' AND devcount = '1' AND length IS NOT NULL LIMIT 1000

Here is the configuration file on all servers:

db_dsn DBI:mysql:mogilefs:hsqlmog00
db_user mogile
db_pass mogile
conf_port 7001
listener_jobs 5

Here is the output from mogadm settings list:

hsv4s22cen03 /usr/lib/perl5/site_perl/5.8.8/MogileFS
blynch $ mogadm settings list
enable_rebalance = 1
schema_version = 9

Looking into the code to figure out why this is the case. Note that this is still the latest version (2.17) from CPAN.

- Brian

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Brian Lynch
Sent: Wednesday, April 30, 2008 12:37 AM
To: dormando
Cc: [email protected]
Subject: RE: Replication Oddities

Dormando,

There are roughly 58 million entries in file_to_replicate out of a total of 71 million files. It seems the replication worker is for some reason not deleting completed rows (though the code path exists).
Note that only 570K entries in file_to_replicate have failcount > 0, and only 9 entries have nexttry = ENDOFTIME.

mysql> select count(*) from file_to_replicate;
+----------+
| count(*) |
+----------+
| 58395828 |
+----------+
1 row in set (2 min 6.26 sec)

Best,
Brian

-----Original Message-----
From: dormando [mailto:[EMAIL PROTECTED]
Sent: Monday, April 28, 2008 12:50 AM
To: Brian Lynch
Cc: [email protected]
Subject: Re: Replication Oddities

> Would it be possible to purge portions of the file_to_replicate
> table? I'm currently pulling out known good replications to identify
> bogus entries.

You should sample rows out of file_to_replicate and see whether nexttry is set to 2147483647 - and whether all of the paths are invalid. I've never outright removed rows from file_to_replicate _unless_ I have verified that the fid is gone, i.e. one of:

- It has no matching 'file' entry.
- It has no matching 'file_on' rows (odd bug, haven't fixed yet).
- It has a file row and file_on row(s), but all paths are dead (404s).

If at least one of those conditions is met, the fid can be removed from file_to_replicate, and you might want to find out why it disappeared to begin with. Otherwise, do not remove the row.

If the nexttry is off in the future but not equal to ENDOFTIME (2147483647), you can try UPDATE'ing those rows to UNIX_TIMESTAMP() and see if they get chewed through. If not, you should find out exactly what's going on. Odds are one of the three conditions listed above has happened. If otherwise, you should definitely give a best effort in figuring out what it was.

Yeah, this should be way more automatic. We'll get to it someday, and we also accept patches ;)

-Dormando
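[List-archive note: for anyone following along, dormando's advice above can be sketched as a toy script. This uses SQLite in place of MySQL purely to illustrate the logic, the table contents are made-up examples (not real MogileFS data), and safe_to_remove() is a hypothetical helper restating his three removal conditions; against a live tracker you would run the equivalent UPDATE in the mysql client instead.]

```python
import sqlite3
import time

ENDOFTIME = 2147483647  # MogileFS sentinel meaning "never retry this fid"

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE file_to_replicate "
    "(fid INTEGER PRIMARY KEY, nexttry INTEGER, failcount INTEGER)"
)
now = int(time.time())
conn.executemany(
    "INSERT INTO file_to_replicate VALUES (?, ?, ?)",
    [
        (1, now - 60, 0),     # due now: the replicate worker should pick it up
        (2, now + 86400, 3),  # deferred into the future by failed attempts
        (3, ENDOFTIME, 5),    # abandoned: nexttry == ENDOFTIME
    ],
)

# Re-queue rows deferred into the future but not abandoned, per the advice:
# set nexttry back to the current unix timestamp so workers retry them.
conn.execute(
    "UPDATE file_to_replicate SET nexttry = ? "
    "WHERE nexttry > ? AND nexttry != ?",
    (now, now, ENDOFTIME),
)

eligible = [
    fid
    for fid, nexttry in conn.execute("SELECT fid, nexttry FROM file_to_replicate")
    if nexttry <= now
]
print(eligible)  # fids 1 and 2 are now eligible; fid 3 stays abandoned


def safe_to_remove(has_file_row, file_on_count, live_path_count):
    """Hypothetical restatement of the three conditions under which a fid
    may be deleted from file_to_replicate. Any one of them suffices."""
    if not has_file_row:        # no matching 'file' entry
        return True
    if file_on_count == 0:      # no matching 'file_on' rows
        return True
    if live_path_count == 0:    # rows exist, but every path 404s
        return True
    return False
```

The point of excluding ENDOFTIME in the UPDATE is that those nine abandoned rows were parked there deliberately; resetting them would just make the workers churn on fids that need the manual investigation described above.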
