I'm starting to use Hadoop as a simple "storage pool" to store backups of large things (currently Oracle database backups). My Hadoop usage is at a pretty primitive level so far and I am really only scratching the surface of what it can do. I haven't used map/reduce at all--so far it's just been "big file goes in, big file comes out".
My first concern is to increase the throughput so that I can put files in faster. Right now I'm doing the equivalent of tar -cf - /volume1/data | hadoop fs -put - /backups/vol1.tar and I've got 4 of those running in parallel. Throughput is acceptable, but not great... I am able to utilize about 0.5Gbits of my 10G network connection. Here's what I've observed so far... * hadoop fs -put processes have a large virtual size (1000m) but only use a small amount of resident/chip ram (like 50M). This is smaller than the configured chunk size (64m) * The tar processes have also larger virtual size (50m) than their resident size (800k). Not sure what that means... maybe I could force tar's output and hadoop's input to have larger buffers? * I don't think CPU is constrained... there are 4 cpus and they are 60% idle * Network is not really breaking a sweat, I have a 10G connection on the sender and 16 hadoop nodes have 1G connections for receiving. * I don't think disk IO is limiting me either; a quick test sending "cat /dev/zero" to hadoop fs put instead of tar gives about the same throughput I haven't changed much over the fresh-install defaults so I'm hoping some tuning will help boost the throughput. I'll probably be trying various things but some pointers on where to look first would be welcome. As I get further down this road, I can imagine I'd like to do some cooler things with my cluster than just tar in, tar out. Eventually I would like to do things like: * dump uncompressed data in and have hadoop compress the stream later using its nodes, maybe depending on age * record checksum/md5/sha1 on the original files, then later ask map/reduce to give me a file list with checksums from the saved file * regularly read the files back again to make sure the checksums are still matching If I'm understanding correctly, these tasks are probably difficult/impossible using "tar" and "gzip" but would probably be quite easy using a padded/splittable archive format (e.g. zip?) and chunkable bzip compression. Has this been done before, and/or is anyone working on something similar? Right now I'm only using "tar" because the directory I'm reading from has symlinks in it--otherwise I would probably just crawl the directory and put one file in hadoop for each 1 file on the source. This would probably take longer to write, though maybe not much. Maybe I'll switch later to putting files in as-is (along with a tarfile of symlinks, device special files, or whatever else "tar" handles and hadoop doesn't). At least then I can do stuff like compress and checksum in distributed fashion. In this case the files aren't small, but I might have another application later to store backups of JPG files and in that case I would definitely need some sort of many-to-one sequencing many input files into one output file. Though this second part of the message is less important right now, I'd be interested in any comments/feedback here, so I can start to read up and get an idea of what's ahead of me as I head down this path. Thanks gregc
