Performance hit by not splitting .bz2?

Erik Forsberg Fri, 26 Jun 2009 01:25:27 -0700

Hi!

I have a case where we need to analyse log files. They are currently
compressed using bzip2, and a typical log file is roughly 105 MB
compressed, 720 MB uncompressed.


I'm considering using a Hadoop version with .bz2 support, probably
Cloudera's 0.18.3 distribution, but if I understand correctly, .bz2
files are not split, so each file is processed by a single map task.

I expect that for most jobs, the number of log files will exceed the
number of cores in my Hadoop cluster.

Is it possible to estimate whether, and by roughly how much, I'll take
a performance hit from the lack of splitting under these circumstances?
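
To make the question concrete, one rough way to frame the estimate is
to count map "waves". The sketch below is only a back-of-envelope
model, not a measurement: it assumes all files are the same size, that
processing time is proportional to bytes read, that task startup and
data locality costs can be ignored, that the HDFS block size is 64 MB,
and that the cluster has 8 map slots. The slot count and the function
name are hypothetical and chosen purely for illustration.

```python
import math


def makespan_in_file_units(num_files, slots, splits_per_file=1):
    """Length of the map phase, in units of 'time to process one whole file'.

    Each file yields `splits_per_file` equal map tasks, and tasks run in
    waves of at most `slots` at a time.
    """
    tasks = num_files * splits_per_file
    waves = math.ceil(tasks / slots)
    # Each task handles 1/splits_per_file of a file, so a wave is that long.
    return waves / splits_per_file


compressed_mb = 105   # from the log file size quoted above
block_mb = 64         # assumption: 64 MB HDFS block size
slots = 8             # assumption: hypothetical number of map slots

# If .bz2 were splittable, splits would follow the compressed blocks.
splits = math.ceil(compressed_mb / block_mb)

for num_files in (10, 20, 100):
    unsplit = makespan_in_file_units(num_files, slots)
    split = makespan_in_file_units(num_files, slots, splits)
    print(num_files, "files:", unsplit, "unsplit vs", split, "split")
```

Under these assumptions the gap between the two cases shrinks as the
number of files grows relative to the number of slots, since the cost
of not splitting is mostly the partly idle last wave.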

Thanks,
\EF
-- 
Erik Forsberg <[email protected]>
Developer, Opera Mini - http://www.opera.com/mini/
