Hi! I have a case where we need to analyse log files. They are currently compressed with bzip2; a typical log file is roughly 105 MB compressed and 720 MB uncompressed.
I'm considering using a Hadoop version with .bz2 support, probably Cloudera's 18.3 distribution, but if I understand correctly, .bz2 files are not split, so each file would be handled by a single map task. I expect that for most jobs, the number of log files will exceed the number of cores in my Hadoop cluster. Is it possible to estimate whether I'll take a performance hit from the lack of splitting under these circumstances?

Thanks,
\EF
--
Erik Forsberg <forsb...@opera.com>
Developer, Opera Mini - http://www.opera.com/mini/
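
To make the question concrete, here is a rough back-of-envelope sketch of the arithmetic involved. It is only an illustration under stated assumptions, not a measurement: it assumes every log file is about the same size, that each unsplit .bz2 file becomes exactly one map task, that each core runs one map task at a time, and that the HDFS block size is 64 MB. The function names and the example numbers (100 files, 16 cores) are hypothetical.

```python
import math


def map_waves(num_files: int, map_slots: int) -> int:
    """Sequential rounds of map tasks when each unsplit file is one task."""
    return math.ceil(num_files / map_slots)


def slot_utilisation(num_files: int, map_slots: int) -> float:
    """Fraction of slot-time doing work, assuming equal-sized files."""
    return num_files / (map_waves(num_files, map_slots) * map_slots)


def blocks_per_file(compressed_mb: float, block_mb: float = 64) -> int:
    """HDFS blocks a single map task must read for one unsplit file.

    block_mb=64 is an assumed block size, not taken from the post.
    """
    return math.ceil(compressed_mb / block_mb)


if __name__ == "__main__":
    files, slots = 100, 16  # hypothetical job and cluster size
    print("map waves:", map_waves(files, slots))
    print("slot utilisation:", round(slot_utilisation(files, slots), 3))
    print("blocks read per task:", blocks_per_file(105))
```

Under these assumptions, the cost of not splitting shows up in two places: the last wave may leave some cores idle, and a file that spans more than one block means its single map task probably cannot read all of its input from local disk. The sketch says nothing about bzip2 decompression speed itself, which would have to be measured.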