I'm starting to use Hadoop as a simple "storage pool" to store backups of large 
things (currently Oracle database backups).  My Hadoop usage is at a pretty 
primitive level so far and I am really only scratching the surface of what it 
can do.  I haven't used map/reduce at all--so far it's just been "big file goes 
in, big file comes out".

My first concern is to increase the throughput so that I can put files in 
faster.  Right now I'm doing the equivalent of
  tar -cf - /volume1/data | hadoop fs -put   - /backups/vol1.tar
and I've got 4 of those running in parallel.  Throughput is acceptable, but not 
great... I am able to utilize about 0.5Gbits of my 10G network connection.  
Here's what I've observed so far...
* hadoop fs -put processes have a large virtual size (1000m) but only use a 
small amount of resident/chip ram (like 50M).  This is smaller than the 
configured chunk size (64m)
* The tar processes have also larger virtual size (50m) than their resident 
size (800k).  Not sure what that means... maybe I could force tar's output and 
hadoop's input to have larger buffers?
* I don't think CPU is constrained... there are 4 cpus and they are 60% idle
* Network is not really breaking a sweat, I have a 10G connection on the sender 
and 16 hadoop nodes have 1G connections for receiving.
* I don't think disk IO is limiting me either; a quick test sending "cat 
/dev/zero" to hadoop fs put instead of tar gives about the same throughput

I haven't changed much over the fresh-install defaults so I'm hoping some 
tuning will help boost the throughput.  I'll probably be trying various things 
but some pointers on where to look first would be welcome.

As I get further down this road, I can imagine I'd like to do some cooler 
things with my cluster than just tar in, tar out.  Eventually I would like to 
do things like:
* dump uncompressed data in and have hadoop compress the stream later using its 
nodes, maybe depending on age
* record checksum/md5/sha1 on the original files, then later ask map/reduce to 
give me a file list with checksums from the saved file
* regularly read the files back again to make sure the checksums are still 
matching
If I'm understanding correctly, these tasks are probably difficult/impossible 
using "tar" and "gzip" but would probably be quite easy using a 
padded/splittable archive format (e.g. zip?) and chunkable bzip compression.  
Has this been done before, and/or is anyone working on something similar?

Right now I'm only using "tar" because the directory I'm reading from has 
symlinks in it--otherwise I would probably just crawl the directory and put one 
file in hadoop for each 1 file on the source.  This would probably take longer 
to write, though maybe not much.  Maybe I'll switch later to putting files in 
as-is (along with a tarfile of symlinks, device special files, or whatever else 
"tar" handles and hadoop doesn't).  At least then I can do stuff like compress 
and checksum in distributed fashion.  In this case the files aren't small, but 
I might have another application later to store backups of JPG files and in 
that case I would definitely need some sort of many-to-one sequencing many 
input files into one output file.

Though this second part of the message is less important right now, I'd be 
interested in any comments/feedback here, so I can start to read up and get an 
idea of what's ahead of me as I head down this path.

Thanks
gregc

Reply via email to