Hi!

I have a case where we need to analyse log files. They are currently
compressed using bzip2, and a typical log file is roughly 105 MB
compressed and 720 MB uncompressed.

I'm considering using a Hadoop version with .bz2 support, probably
Cloudera's 0.18.3 distribution, but if I understand correctly, .bz2
files are not split, so each file would be handled by a single map task.

I expect that for most jobs, the number of log files will exceed the
number of cores in my Hadoop cluster.

Is it possible to estimate whether I'll take a performance hit from
the lack of splitting under these circumstances?
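
To make the question concrete, here is a rough back-of-envelope sketch
of how I would compare the two cases. It is only a model under stated
assumptions, not a measurement: the 64 MB split size, the 100 files and
the 16 map slots are hypothetical numbers, every split of a file is
assumed to take an equal share of the file's processing time, and
per-task startup overhead and data locality are ignored.

```python
import math

def estimate(num_files, file_mb, slots, split_mb=None):
    """Return (tasks, waves, time) for a map-only pass over the files.

    time is measured in units of "time to process one whole file".
    split_mb=None models unsplittable input (one map task per file).
    All figures are assumptions for illustration, not Hadoop output.
    """
    # Unsplittable input: exactly one task per file.
    splits_per_file = 1 if split_mb is None else math.ceil(file_mb / split_mb)
    tasks = num_files * splits_per_file
    # Tasks run in "waves" of at most `slots` tasks at a time.
    waves = math.ceil(tasks / slots)
    # Assume equal-sized splits, so each task costs 1/splits_per_file.
    time = waves / splits_per_file
    return tasks, waves, time

# Hypothetical cluster: 100 files of 105 MB compressed, 16 map slots.
unsplit = estimate(100, 105, 16)              # one task per file
split = estimate(100, 105, 16, split_mb=64)   # assumed 64 MB splits
print("unsplit:", unsplit)
print("split:  ", split)
```

If this model is anywhere near right, then with many more files than
slots the cost of not splitting is mostly a partly idle last wave, but I
don't know how well that holds in practice.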

Thanks,
\EF
-- 
Erik Forsberg <forsb...@opera.com>
Developer, Opera Mini - http://www.opera.com/mini/
