Could someone explain what Pig is?  (i.e., Ted said: 'That storage format
would allow you to use Pig.  Pig is, unfortunately, still quite limited
in that input data must be fielded and in a simple format.')

Thanks!

-----Original Message-----
From: Ted Dunning [mailto:[EMAIL PROTECTED] 
Sent: Thursday, September 06, 2007 9:26 PM
To: [email protected]; [email protected]
Subject: RE: Use HDFS as a long term storage solution?


That size and number of files is fine for storage in HDFS.  Your
situation is comparable to mine except that I collect 2-5 files per hour
and these files are slightly larger.  My files are in a compressed and
encrypted format, but keeping them in a SequenceFile compressed format
would make map/reduce noticeably more efficient.  This is because the
file splits can be arranged by the JobTracker to coincide with the disk
blocks in HDFS.  That can result in a significantly higher percentage of
tasks working against local inputs.
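
(For concreteness, a minimal sketch of writing records into a
block-compressed SequenceFile with the Hadoop Java API follows; the output
path and the key/value types are placeholders, not anything from this
thread, and the exact API can differ between Hadoop versions.)

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;

  public class SequenceFileArchiver {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          FileSystem fs = FileSystem.get(conf);
          // Hypothetical archive location for one hour's worth of data.
          Path out = new Path("/archive/2007/09/06/hour-21.seq");

          // BLOCK compression groups many records per compressed block,
          // which is what lets splits line up with HDFS blocks.
          SequenceFile.Writer writer = SequenceFile.createWriter(
                  fs, conf, out,
                  LongWritable.class, Text.class,
                  SequenceFile.CompressionType.BLOCK);
          try {
              // In practice the records would come from the hourly input
              // files; these two appends are just placeholders.
              writer.append(new LongWritable(1L), new Text("first record"));
              writer.append(new LongWritable(2L), new Text("second record"));
          } finally {
              writer.close();
          }
      }
  }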

Note that at these accumulation rates, you are really only talking about
< 100K files over a year.  That still counts as a small number of files.
A large number of files is 10M or more.

You might also be well served if you were to keep your data in a (block
compressed) tab-delimited form even at the cost of some grotesqueness.
That storage format would allow you to use Pig.  Pig is, unfortunately,
still quite limited in that input data must be fielded and in a simple
format. 
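
(As a rough illustration of what a "tab-delimited form" could look like,
each record might be flattened into a single tab-separated line; the field
names below are invented for the example.)

  import org.apache.hadoop.io.Text;

  public class TabDelimitedRecord {
      // Join the fields of one record with tabs so tools that expect
      // fielded, simple-format input can split them back apart.
      public static Text toLine(String timestamp, String userId, long bytes) {
          return new Text(timestamp + "\t" + userId + "\t" + bytes);
      }
  }

Lines like these could then be written as the Text values of a
block-compressed SequenceFile such as the one sketched above, or as plain
text run through a compression codec.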

-----Original Message-----
From: C G [mailto:[EMAIL PROTECTED]
Sent: Thu 9/6/2007 6:30 PM
To: [email protected]
Subject: RE: Use HDFS as a long term storage solution?
 
Right, my preference would be to use HDFS exclusively...except that
there are potential issues with many small files in HDFS and a
suggestion that perhaps MogileFS might be better for many small files.
My strong preference is to store everything in HDFS, then do map/reduce
over the small files to produce results. Since there is a concern about
storing a lot of small files in HDFS, I now wonder if I should collect
the small files in MogileFS, then periodically merge them into large
files, store those in HDFS, and then issue my map/reduce jobs. Ick, that
sounds complex/time-consuming just writing about it :-(.
   
  The files I anticipate processing are all compressed (gzip), and are
on the order of 80-200M compressed.  I expect to collect 4-8 of these
files per hour for most hours in the day.
