https://issues.apache.org/bugzilla/show_bug.cgi?id=45718

           Summary: Enhance bzip2 for Hadoop
           Product: Ant
           Version: unspecified
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P1
         Component: Other
        AssignedTo: [email protected]
        ReportedBy: [EMAIL PROTECTED]


This enhancement request concerns the bzip2 component of Ant.

Hadoop is Apache's open-source implementation of the map-reduce framework.  One
of the cornerstones of this framework is splitting input files into chunks,
which are fed to different machines for parallel processing.  When it comes to
compressed files, most codecs need the whole file in order to decode the data
successfully.  This limitation forces Hadoop to send one compressed input file
to a single process, which reduces parallelism.

bzip2 compresses data in blocks, so during decompression these blocks can be
processed independently of each other.  This property of bzip2 makes it
possible to split a bzip2-compressed file for parallel processing in Hadoop.
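To make the idea concrete, here is a minimal sketch of locating candidate
split points by scanning for the 48-bit bzip2 block-header magic
0x314159265359.  Note that real bzip2 blocks are bit-aligned, not
byte-aligned, so a production splitter would have to scan at the bit level;
this byte-level scan, and the class and method names, are only illustrative.

```java
import java.util.ArrayList;
import java.util.List;

public class BlockMagicScan {
    // 48-bit bzip2 block-header magic (first digits of pi in BCD).
    static final long BLOCK_MAGIC = 0x314159265359L;

    /** Returns byte offsets where the 6-byte block magic starts. */
    public static List<Integer> findByteAlignedMagics(byte[] data) {
        List<Integer> offsets = new ArrayList<>();
        long window = 0;
        for (int i = 0; i < data.length; i++) {
            // Slide a 48-bit window over the input, one byte at a time.
            window = ((window << 8) | (data[i] & 0xFF)) & 0xFFFFFFFFFFFFL;
            if (i >= 5 && window == BLOCK_MAGIC) {
                offsets.add(i - 5);
            }
        }
        return offsets;
    }
}
```

Because the magic can also occur by accident in compressed data, any split
point found this way must still be validated, which is what the CRC check
proposed in item (3) below is for.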

The current bzip2 code in Ant does not provide such "by block" processing; it
treats the input as a single continuous stream.  We suggest the following
enhancements to make it usable with Hadoop out of the box.

(1) CBZip2InputStream should provide two reading modes, e.g. continuous and by
block.  In continuous mode its behavior stays the same as it is today.  In "by
block" mode the code notifies its client whenever it reaches a bzip2 block
delimiter, so it effectively processes one block at a time and reports an
event at each block boundary.
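The two-mode contract could look like the following sketch, which simulates
the stream over pre-split blocks rather than real bzip2 data.  In by-block
mode read() returns -1 at each block boundary and the client advances with
nextBlock(); in continuous mode the boundary is invisible.  ReadMode,
nextBlock(), and the rest of the names are assumptions, not existing Ant API.

```java
import java.util.List;

public class ByBlockReader {
    public enum ReadMode { CONTINUOUS, BYBLOCK }

    private final List<byte[]> blocks;  // stands in for decoded bzip2 blocks
    private final ReadMode mode;
    private int block = 0, pos = 0;

    public ByBlockReader(List<byte[]> blocks, ReadMode mode) {
        this.blocks = blocks;
        this.mode = mode;
    }

    /** Returns -1 at a block boundary (BYBLOCK) or at end of stream. */
    public int read() {
        while (block < blocks.size() && pos >= blocks.get(block).length) {
            if (mode == ReadMode.BYBLOCK) return -1; // boundary event
            block++; pos = 0;                        // continuous: cross it silently
        }
        if (block >= blocks.size()) return -1;       // true end of stream
        return blocks.get(block)[pos++] & 0xFF;
    }

    /** In BYBLOCK mode, step past the boundary to the next block. */
    public boolean nextBlock() {
        if (block >= blocks.size()) return false;
        block++; pos = 0;
        return block < blocks.size();
    }
}
```

With this shape Hadoop can hand each worker a block's worth of data and stop
cleanly at the boundary instead of reading through it.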

(2) CBZip2InputStream should tell the client how much compressed data it has
processed at any point in time.  For example, if CBZip2InputStream reports
that it has processed 500 bytes of compressed data, the client knows that the
uncompressed data it has read so far was generated from those 500 bytes of
compressed data.  The Ant code should also provide a setter for this
statistic.  This is needed because, for example, if the compressed stream
starts with the "BZ" magic and the client has already stripped it off the
stream, the client must be able to tell the bzip2 code about it.
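A minimal sketch of that accounting, with the getter/setter pair the request
asks for, might look as follows.  The method names are illustrative
assumptions, not existing CBZip2InputStream API.

```java
public class CompressedByteCounter {
    private long processed;

    /** Compressed bytes consumed so far. */
    public long getProcessedByteCount() { return processed; }

    /**
     * Pre-seed the counter, e.g. setProcessedByteCount(2) after the client
     * has stripped the 2-byte "BZ" magic off the stream itself.
     */
    public void setProcessedByteCount(long n) { processed = n; }

    /** Called internally each time compressed input is consumed. */
    void consumed(int nBytes) { processed += nBytes; }
}
```

Hadoop would use this count to decide when a worker has read past the end of
its assigned split of the compressed file.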

(3) Hadoop input files are huge (terabytes).  For such big files, the
probability that the bzip2 block delimiter pattern accidentally occurs
somewhere in the stream is not negligible.  bzip2 should therefore verify each
block's integrity (by comparing the computed CRC with the recorded CRC)
before emitting the block's data.  If a block is found to be spurious, it
should be skipped, the client should be told about the event, and processing
should move on to the next block.  Another reason for this functionality is
that in a huge file, a few bad blocks should not prevent the rest of the file
from being processed.
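The skip-spurious-blocks behavior can be sketched as below.  For illustration
this uses java.util.zip.CRC32; real bzip2 records a different (MSB-first)
CRC-32 variant, and the class and method names here are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

public class BlockVerifier {
    /** CRC of a block's decompressed data (stand-in for bzip2's block CRC). */
    public static long crcOf(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    /** Returns only blocks whose recorded CRC matches; bad blocks are skipped. */
    public static List<byte[]> verify(List<byte[]> blocks,
                                      List<Long> recordedCrcs) {
        List<byte[]> good = new ArrayList<>();
        for (int i = 0; i < blocks.size(); i++) {
            if (crcOf(blocks.get(i)) == recordedCrcs.get(i)) {
                good.add(blocks.get(i));
            }
            // else: report the event to the client and move on to the
            // next block instead of aborting the whole file.
        }
        return good;
    }
}
```

This is also how an accidental occurrence of the block magic is caught: a
"block" found at a spurious delimiter will fail its CRC check and be skipped.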


-- 
Configure bugmail: https://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.
