Hey,

> - Over 7 million files are exhibiting a policy violation on the
>   current FSCK run (out of 71.8 million total)
>
> - Approximately 6.2 million of these files are perfectly fine,
>   but have leftover entries in file_to_replicate (also generating
>   policy_no_suggestions messages)

Is the nexttry value for these fids equal to 'ENDOFTIME', or are they
being constantly retried? What do the file and file_on rows look like
for some of these guys? Does devcount agree with the count of rows in
file_on? Are both of their copies on the same host (but different
devices)?

> - Approximately 450 thousand files are replicated on more than
>   2 devices (the policy limit)

Can you get a breakdown of how far off they are? Is devcount == 3 for
most of them, or is it higher?

> - Approximately 4.6 thousand files are replicated on 47 devices

You might want to test out or upgrade to the latest trunk. If you ended
up with fewer than two _writable_ hosts (i.e., with enough disk space to
take a few files), there was a replication bug which would end up
putting a copy on every device on that one writable host.

> - single master/slave database
>
> - 3 trackers running off the database
>
> - 7 mogstored nodes with 46 devices each and lighttpd set up as
>   the getport

On a mogadm check, how many hosts are writable? Did you set any to
readonly/etc? What version of mogilefs?

I'd be curious to see why those rows are sticking around in
file_to_replicate.

I'm willing to bet you only have a single writable host. If not, we can
work out what it is. Your best bet would be to even out files / add
another host / etc.

If you really wanted to fix those "bumpkus" fids easily, you can either
drain each device down to zero fids one by one (which will remove one
copy but not create a new one), or generate a list of all of the
domain / key combos and re-upload each file. That will create a new file
with the proper devcount and delete the old fid and all of the redundant
copies.

-Dormando
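
For reference, the checks asked about above (nexttry state, devcount vs. file_on rows, copies sharing one host, and the over-replication breakdown) could be sketched roughly like this. It is only an illustration under assumptions: it runs against an in-memory SQLite stand-in with cut-down tables rather than a real MogileFS database, the function names are made up, and the ENDOFTIME value of 2147483647 is assumed and should be checked against your version.

```python
import sqlite3

# Assumed sentinel for "never retry"; verify against your MogileFS version.
ENDOFTIME = 2147483647


def make_demo_db():
    """Build a tiny stand-in database (simplified columns, invented rows)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE file (fid INTEGER PRIMARY KEY, devcount INTEGER);
        CREATE TABLE file_on (fid INTEGER, devid INTEGER);
        CREATE TABLE device (devid INTEGER PRIMARY KEY, hostid INTEGER);
        CREATE TABLE file_to_replicate (fid INTEGER PRIMARY KEY, nexttry INTEGER);
    """)
    conn.executemany("INSERT INTO device VALUES (?, ?)",
                     [(1, 1), (2, 1), (3, 2), (4, 2)])
    conn.executemany("INSERT INTO file VALUES (?, ?)",
                     [(10, 2), (11, 2), (12, 3), (13, 2)])
    conn.executemany("INSERT INTO file_on VALUES (?, ?)", [
        (10, 1), (10, 3),            # 2 copies on 2 hosts
        (11, 1), (11, 2),            # 2 copies, both on host 1
        (12, 1), (12, 2), (12, 3),   # 3 copies (over a limit of 2)
        (13, 1),                     # devcount says 2, only 1 row
    ])
    conn.executemany("INSERT INTO file_to_replicate VALUES (?, ?)",
                     [(10, ENDOFTIME), (11, 0)])
    return conn


def retry_state(conn):
    """Return (parked at ENDOFTIME, still being retried) row counts."""
    row = conn.execute(
        "SELECT COALESCE(SUM(nexttry = ?), 0), COALESCE(SUM(nexttry <> ?), 0) "
        "FROM file_to_replicate", (ENDOFTIME, ENDOFTIME)).fetchone()
    return tuple(row)


def devcount_mismatches(conn):
    """Return (fid, devcount, file_on rows) where the two disagree."""
    return conn.execute("""
        SELECT f.fid, f.devcount, COUNT(fo.devid)
        FROM file f LEFT JOIN file_on fo ON fo.fid = f.fid
        GROUP BY f.fid, f.devcount
        HAVING f.devcount <> COUNT(fo.devid)
        ORDER BY f.fid
    """).fetchall()


def single_host_fids(conn):
    """Return fids with more than one copy, all on the same host."""
    rows = conn.execute("""
        SELECT fo.fid
        FROM file_on fo JOIN device d ON d.devid = fo.devid
        GROUP BY fo.fid
        HAVING COUNT(*) > 1 AND COUNT(DISTINCT d.hostid) = 1
        ORDER BY fo.fid
    """).fetchall()
    return [r[0] for r in rows]


def over_replicated_breakdown(conn, limit=2):
    """Return (copies, number of files) for files above the policy limit."""
    return conn.execute("""
        SELECT n, COUNT(*) FROM (
            SELECT fid, COUNT(*) AS n FROM file_on GROUP BY fid
        ) AS t
        WHERE n > ?
        GROUP BY n ORDER BY n
    """, (limit,)).fetchall()


if __name__ == "__main__":
    db = make_demo_db()
    print("parked / retried:", retry_state(db))
    print("devcount mismatches:", devcount_mismatches(db))
    print("all copies on one host:", single_host_fids(db))
    print("over-replicated breakdown:", over_replicated_breakdown(db))
```

The same SELECTs should translate to the real database with little change, but try them on the slave first; grouping 70+ million file_on rows is not cheap.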
