I loaded five bzip2-compressed files into a table in Hive. Three are
small test files containing 10,000 records. Two are large, ~8 GB
compressed.
When I run a query against the table, I see three tasks that complete
almost immediately and two tasks that run for a very long time. It
appears to me that Hive/Hadoop is not splitting the *.bz2 input. I have
seen some old mails about this, but could not find any resolution for
the problem. I compressed the files using the Apache bz2 jar, and the
files are named *.bz2. I am using Hadoop 0.19.1 r745977.
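
To make the symptom concrete, here is a rough back-of-the-envelope sketch
in Python of how many map tasks one would expect per file if the input
were being split, compared with the single task per file that I seem to
be getting. The 64 MB block size is an assumption (I believe it is the
stock dfs.block.size default, but a cluster may be configured
differently), and expected_map_tasks is a made-up helper for
illustration, not a Hadoop API. The real split calculation has some slop,
so treat the numbers as approximate.

```python
import math

# Assumed HDFS block size (64 MB); check dfs.block.size on the real cluster.
BLOCK_SIZE = 64 * 1024 * 1024


def expected_map_tasks(file_size_bytes, splittable, block_size=BLOCK_SIZE):
    """Rough number of map tasks for a single input file.

    A non-splittable file is read by one task no matter how large it is;
    a splittable file gets roughly one task per block.
    """
    if not splittable:
        return 1
    return max(1, math.ceil(file_size_bytes / block_size))


large_file = 8 * 1024 ** 3  # ~8 GB compressed, as described above

print(expected_map_tasks(large_file, splittable=True))
print(expected_map_tasks(large_file, splittable=False))
```

Under these assumptions each large file would be spread over on the order
of a hundred map tasks if it were split, which is a long way from the two
long-running tasks I actually see.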