That size and number of files is fine for storage in HDFS.  Your situation is 
comparable to mine except that I collect 2-5 files per hour and these files are 
slightly larger.  My files are in a compressed and encrypted format, but 
keeping them in a SequenceFile compressed format would make map/reduce 
noticeably more efficient.  This is because the file splits can be arranged by 
the JobTracker to coincide with the disk blocks in HDFS.  That can result in a 
significantly higher percentage of tasks working against local inputs. 
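
If it helps, here is a rough, untested sketch of rolling incoming records into 
a block-compressed SequenceFile.  The output path, key/value types, and codec 
are just placeholders for whatever your records actually look like:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;

public class SequenceFileSink {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path(args[0]);   // e.g. one output file per collection interval

    // BLOCK compression packs many records into each compressed block, so the
    // file stays splittable along HDFS block boundaries.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out, LongWritable.class, Text.class,
        SequenceFile.CompressionType.BLOCK, new DefaultCodec());

    java.io.BufferedReader in =
        new java.io.BufferedReader(new java.io.InputStreamReader(System.in));
    long recno = 0;
    String line;
    while ((line = in.readLine()) != null) {
      // Record number as key, raw record as value -- adjust to taste.
      writer.append(new LongWritable(recno++), new Text(line));
    }
    writer.close();
  }
}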

Note that at these accumulation rates, you are really only talking about < 100K 
files over a year.  That still counts as a small number of files.  A large 
number of files is 10M or more.

You might also be well served if you were to keep your data in a (block 
compressed) tab-delimited form even at the cost of some grotesqueness.  That 
storage format would allow you to use Pig.  Pig is, unfortunately, still quite 
limited in that input data must be fielded and in a simple format. 
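
For what it's worth, Pig's default PigStorage loader splits records on tabs, so 
the main requirement on the producer side is one record per line with 
tab-separated fields.  Something as trivial as this is enough (the field names 
here are hypothetical, just to show the shape):

// Hypothetical record layout, purely for illustration.
public final class TabRecord {
  public static String format(long timestamp, String source, String payload) {
    // Embedded tabs or newlines would break the fielding, so strip them.
    return timestamp + "\t"
        + source.replaceAll("[\\t\\n]", " ") + "\t"
        + payload.replaceAll("[\\t\\n]", " ");
  }
}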

-----Original Message-----
From: C G [mailto:[EMAIL PROTECTED]
Sent: Thu 9/6/2007 6:30 PM
To: [email protected]
Subject: RE: Use HDFS as a long term storage solution?
 
Right, my preference would be to use HDFS exclusively...except that there are 
potential issues with many small files in HDFS and a suggestion that perhaps 
MogileFS might handle them better.  My strong preference is to store everything 
in HDFS, then do map/reduce with the small files to produce results.  Since 
there is a concern about storing a lot of small files in HDFS, I now wonder if 
I should collect small files into MogileFS, then periodically merge them 
together to create large files, store those in HDFS, and then run my 
map/reduce jobs.  Ick, that sounds complex/time-consuming just writing 
about it :-(.
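
(Roughly, I imagine the merge step would be something like the following 
untested sketch, with placeholder paths: decompress each small gzip file and 
re-compress everything into one large file in HDFS.)

import java.io.InputStream;
import java.util.zip.GZIPInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.DefaultCodec;

public class HourlyMerge {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem local = FileSystem.getLocal(conf);
    FileSystem hdfs = FileSystem.get(conf);

    Path srcDir = new Path(args[0]);   // directory of small gzip files
    Path dest = new Path(args[1]);     // single large compressed file in HDFS

    DefaultCodec codec = new DefaultCodec();
    codec.setConf(conf);
    CompressionOutputStream out = codec.createOutputStream(hdfs.create(dest));

    for (FileStatus stat : local.listStatus(srcDir)) {
      if (!stat.getPath().getName().endsWith(".gz")) continue;
      // Decompress each small file and re-compress it into the merged stream.
      InputStream in = new GZIPInputStream(local.open(stat.getPath()));
      IOUtils.copyBytes(in, out, conf, false);
      in.close();
    }
    out.close();
  }
}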
   
  The files I anticipate processing are all compressed (gzip), and are on the 
order of 80-200M compressed.  I expect to collect 4-8 of these files per hour 
for most hours in the day.
