[ https://issues.apache.org/jira/browse/MAPREDUCE-469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854201#action_12854201 ]
David Ciemiewicz commented on MAPREDUCE-469:
--------------------------------------------

The bzip2 compression format also supports concatenation of individual bzip2 compressed files into a single file.

bzcat has absolutely no problem reading all of the data in one of these concatenated files.

Unfortunately, both Hadoop Streaming and Pig see only about 2% of the data from the original file in my case. That is effectively a 98% data loss.

> Support concatenated gzip and bzip2 files
> -----------------------------------------
>
>                 Key: MAPREDUCE-469
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-469
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>            Reporter: Tom White
>            Assignee: Ravi Gummadi
>
> When running MapReduce with concatenated gzip files as input only the first
> part is read, which is confusing, to say the least. Concatenated gzip is
> described in http://www.gnu.org/software/gzip/manual/gzip.html#Advanced-usage
> and in http://www.ietf.org/rfc/rfc1952.txt. (See original report at
> http://www.nabble.com/Problem-with-Hadoop-and-concatenated-gzip-files-to21383097.html)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
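
The symptom described in the comment above (only the first part of a concatenated bzip2 file being read) can be reproduced outside Hadoop. The following is a minimal sketch using Python's standard bz2 module, not Hadoop's codec code. The helper names read_first_stream_only and read_all_streams and the sample data are hypothetical and chosen only for illustration. It assumes the whole file fits in memory.

```python
import bz2


def read_first_stream_only(data: bytes) -> bytes:
    """Decode only the first bzip2 stream, the way a reader that is
    unaware of concatenation would. Later streams are silently ignored."""
    decomp = bz2.BZ2Decompressor()
    return decomp.decompress(data)


def read_all_streams(data: bytes) -> bytes:
    """Decode every concatenated bzip2 stream, as bzcat does."""
    out = []
    while data:
        # A BZ2Decompressor handles exactly one stream; bytes that follow
        # the end of that stream are exposed through unused_data.
        decomp = bz2.BZ2Decompressor()
        out.append(decomp.decompress(data))
        data = decomp.unused_data
    return b"".join(out)


# Hypothetical sample: three independently compressed pieces joined
# byte-for-byte, equivalent to `cat a.bz2 b.bz2 c.bz2 > all.bz2`.
parts = [b"first part\n", b"second part\n", b"third part\n"]
concatenated = b"".join(bz2.compress(p) for p in parts)

first_only = read_first_stream_only(concatenated)
everything = read_all_streams(concatenated)
```

Here first_only holds only the first piece, while everything holds all three. Python's own bz2.decompress also decodes all concatenated streams, which matches the bzcat behaviour mentioned in the comment.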