Hmm :) Maybe my next documentation spree should be a mogilefs FAQ :)

Main question is: do we go with more hosts and fewer disks each, or more disks per host?

I think the tradeoff here is pretty easy to spot:

As you spread out hosts:

- More local cache. mogstored relies on the OS's file cache to speed up hot files, though you mention the CDN should take care of that...
- More bandwidth to the devices.
- Lessens the impact of losing a host (you should have enough mogilefs hosts/devices that losing any one or two is something you don't have to care about!).
- More CPU, I guess. It's rare but possible to load up mogstored on CPU.

As you add more devices:

- Fewer hosts to manage
- Losing an individual disk in a machine shouldn't hurt anything. In my own setup I never bothered replacing dead disks in a host with multiple drives. I just marked them as dead and got more disks on the next server order. Since you're somewhat more likely to lose a device than a whole host, this isn't so bad.

You have to keep in mind:

- How full are your devices actually going to get before they become too active to hold more files? 750G drives are nice, but usually I can't even fill a 250G drive before it gets hosed with IO.
- The impact of losing a whole host with many 750G drives with many (millions of?) files. It could take a long time for the reaper and replicators to deal with this as they work in small batches of files. Then again, it won't matter as much as you grow (and especially if you can quickly deal with dead hosts).
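
To put a rough number on the dead-host case, here's a back-of-envelope sketch. The function name, the file counts and the 50 files/sec rate are all made up for illustration; measure what your own replicators actually sustain before trusting any of it.

```python
def rereplication_hours(files_on_host, files_per_second):
    """Hours needed to re-replicate every file that lived on a dead host.

    files_per_second is the combined rate your replicators sustain
    (an assumption you have to measure, not a mogilefs constant).
    """
    if files_per_second <= 0:
        raise ValueError("files_per_second must be positive")
    return files_on_host / files_per_second / 3600.0


# Hypothetical host: 8 devices with 2 million files each,
# replicators managing 50 files/sec in total.
files = 8 * 2_000_000
print(round(rereplication_hours(files, 50), 1))  # 88.9
```

That's most of four days with reduced redundancy for those files, which is why fewer files per host (or faster handling of dead hosts) matters.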

So on a really busy service, I'd have tons of 64-bit hosts with extra RAM. On something with more streaming involved, you have to understand your dataset well to understand which way to go. Think about the average size/access type of your files, as well as how often they're added or replaced in the system.

Just remember to think of spindles more than disk size. Unless your dataset is very idle you won't end up filling the disk, and the more devices you have the more you can parallelize your batch operations :)
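
A quick way to sanity check the spindles-versus-size point is to estimate how full a disk can get before reads eat its IO budget. This is only a sketch: it assumes one random IO per read, no help from cache, and the file size, read rate and 100 IOPS figure are invented numbers, not measurements.

```python
def fill_fraction(disk_gb, avg_file_mb, reads_per_file_per_day, disk_iops):
    """Fraction of a disk you can fill before reads use up its IOPS.

    Assumes every read costs one random IO and nothing is served
    from cache, so treat the result as a pessimistic estimate.
    """
    if reads_per_file_per_day <= 0:
        return 1.0  # idle dataset: capacity is the only limit
    reads_per_file_per_sec = reads_per_file_per_day / 86400.0
    max_files = disk_iops / reads_per_file_per_sec
    max_gb = max_files * avg_file_mb / 1024.0
    return min(1.0, max_gb / disk_gb)


# Hypothetical workload: 0.1 MB files, each read 5 times a day,
# on spindles good for about 100 random IOPS.
print(round(fill_fraction(750, 0.1, 5, 100), 3))  # 0.225
print(round(fill_fraction(250, 0.1, 5, 100), 3))  # 0.675
```

With those made-up numbers neither drive fills up, and the extra 500G on the bigger one buys nothing. Three 250G spindles would carry three times the traffic of one 750G drive.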

As a side note, any real reason not to run the trackers on the storage
nodes?

I did it. Worked okay. Most of my storage nodes didn't have trackers, but some did. The only issue is the trackers can get CPU heavy, which could interact with other things on your box.

also, anyone have any pros/cons on running mysql master/slave
with InnoDB on DRBD versus running, let's say, mysql cluster?


MySQL Cluster's probably not the greatest fit for the mogilefs database. The dataset can be relatively small, but I don't think it's quite small enough. Although honestly I only say that because I have limited experience with cluster. My mogilefs DBs have been happy if they have enough RAM for InnoDB to properly cache things...

DRBD should work okay. I've also done master:master with auto_increment_offset, but that might scare the bejesus out of some folks on the list. I like being able to optimize my tables though :)
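
For anyone who hasn't seen the master:master trick: MySQL hands out auto-increment values of the form auto_increment_offset + N * auto_increment_increment, so giving each master a different offset keeps their generated IDs from colliding. The sketch below just simulates that arithmetic in Python. The two-master layout and the increment of 2 are my assumptions for the example (I'm assuming auto_increment_increment is set alongside the offset), not a description of any particular setup.

```python
from itertools import islice


def auto_increment_ids(offset, increment):
    """Yield the IDs one master would generate: offset + N * increment."""
    n = 0
    while True:
        yield offset + n * increment
        n += 1


# Hypothetical pair of masters, both with auto_increment_increment = 2.
master_a = list(islice(auto_increment_ids(offset=1, increment=2), 5))
master_b = list(islice(auto_increment_ids(offset=2, increment=2), 5))

print(master_a)  # [1, 3, 5, 7, 9]
print(master_b)  # [2, 4, 6, 8, 10]
```

This only covers generated keys. Conflicting writes to the same row on both masters are a separate problem, which is part of why it scares people.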

-Dormando
