Ahh, progress... I updated the configuration file on our tracker hosts to include the following line:
old_repl_compat 0

At a glance, we appear to be getting fewer timeouts for the replicate, query, and monitor workers. The row count in file_to_replicate is going down (despite the continuing FSCK). We're keeping an eye on progress for now.

mysql> select count(*) from file_to_replicate;
+----------+
| count(*) |
+----------+
| 58604833 |
+----------+
1 row in set (14.81 sec)

mysql> select count(*) from file_to_replicate;
+----------+
| count(*) |
+----------+
| 58604490 |
+----------+
1 row in set (14.96 sec)

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Brian Lynch
Sent: Wednesday, April 30, 2008 3:36 PM
To: dormando
Cc: [email protected]
Subject: RE: Replication Oddities

I figured something out. Old replication compatibility is enabled for some reason (it is not in the configuration). This explains the large number of entries in file_to_replicate, and why our database has been periodically hanging on the following query:

SELECT fid FROM file WHERE dmid='1' AND classid='1' AND devcount = '1' AND length IS NOT NULL LIMIT 1000

Here is the configuration file on all servers:

db_dsn DBI:mysql:mogilefs:hsqlmog00
db_user mogile
db_pass mogile
conf_port 7001
listener_jobs 5

Here is the output from mogadm settings list:

hsv4s22cen03 /usr/lib/perl5/site_perl/5.8.8/MogileFS
blynch $ mogadm settings list
enable_rebalance = 1
schema_version = 9

Looking into the code to figure out why this is the case. Note that this is still the latest version (2.17) from CPAN.

- Brian

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Brian Lynch
Sent: Wednesday, April 30, 2008 12:37 AM
To: dormando
Cc: [email protected]
Subject: RE: Replication Oddities

Dormando,

There are roughly 58 million entries in file_to_replicate out of a total of 71 million files. It seems the replication worker is for some reason not deleting completed rows (though the code path exists).
Note that only 570K entries in file_to_replicate have failcount > 0, and only 9 entries have nexttry = ENDOFTIME.

mysql> select count(*) from file_to_replicate;
+----------+
| count(*) |
+----------+
| 58395828 |
+----------+
1 row in set (2 min 6.26 sec)

Best,
Brian

-----Original Message-----
From: dormando [mailto:[EMAIL PROTECTED]
Sent: Monday, April 28, 2008 12:50 AM
To: Brian Lynch
Cc: [email protected]
Subject: Re: Replication Oddities

> Would it be possible to purge portions of the file_to_replicate
> table? I'm currently pulling out known good replications to identify
> bogus entries.

You should sample rows out of file_to_replicate and see whether nexttry is set to 2147483647 - and whether all of the paths are invalid. I've never outright removed rows from file_to_replicate _unless_ I have verified that the fid is gone, i.e. one of:

- It has no matching 'file' entry.
- It has no matching 'file_on' rows (odd bug, haven't fixed yet).
- It has a file row and file_on row(s), but all paths are dead (404s).

If at least one of those conditions is met, the fid can be removed from file_to_replicate, and you might want to find out why it disappeared to begin with. Otherwise, do not remove the row.

If the nexttry is off in the future but not equal to ENDOFTIME (2147483647), you can try UPDATE'ing those rows to UNIX_TIMESTAMP() and see if they get chewed through. If not, you should find out exactly what's going on. Odds are one of the three conditions listed above has happened. If otherwise, you should definitely give a best effort in figuring out what it was.

Yeah, this should be way more automatic. We'll get to it someday, and we also accept patches ;)

-Dormando
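[List-archive note: for anyone following along, dormando's advice above can be sketched as a toy script. This uses SQLite in place of MySQL purely to illustrate the logic, the table contents are made-up examples (not real MogileFS data), and safe_to_remove() is a hypothetical helper restating his three removal conditions; against a live tracker you would run the equivalent UPDATE in the mysql client instead.]

```python
import sqlite3
import time

ENDOFTIME = 2147483647  # MogileFS sentinel meaning "never retry this fid"

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE file_to_replicate "
    "(fid INTEGER PRIMARY KEY, nexttry INTEGER, failcount INTEGER)"
)
now = int(time.time())
conn.executemany(
    "INSERT INTO file_to_replicate VALUES (?, ?, ?)",
    [
        (1, now - 60, 0),     # due now: the replicate worker should pick it up
        (2, now + 86400, 3),  # deferred into the future by failed attempts
        (3, ENDOFTIME, 5),    # abandoned: nexttry == ENDOFTIME
    ],
)

# Re-queue rows deferred into the future but not abandoned, per the advice:
# set nexttry back to the current unix timestamp so workers retry them.
conn.execute(
    "UPDATE file_to_replicate SET nexttry = ? "
    "WHERE nexttry > ? AND nexttry != ?",
    (now, now, ENDOFTIME),
)

eligible = [
    fid
    for fid, nexttry in conn.execute("SELECT fid, nexttry FROM file_to_replicate")
    if nexttry <= now
]
print(eligible)  # fids 1 and 2 are now eligible; fid 3 stays abandoned


def safe_to_remove(has_file_row, file_on_count, live_path_count):
    """Hypothetical restatement of the three conditions under which a fid
    may be deleted from file_to_replicate. Any one of them suffices."""
    if not has_file_row:        # no matching 'file' entry
        return True
    if file_on_count == 0:      # no matching 'file_on' rows
        return True
    if live_path_count == 0:    # rows exist, but every path 404s
        return True
    return False
```

The point of excluding ENDOFTIME in the UPDATE is that those nine abandoned rows were parked there deliberately; resetting them would just make the workers churn on fids that need the manual investigation described above.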
