[ https://issues.apache.org/jira/browse/HADOOP-4640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Johan Oskarsson updated HADOOP-4640:
------------------------------------

    Attachment: HADOOP-4640.patch

Updated patch with most of the suggestions incorporated.

* Will continue if the index is missing with the whole file as one split
* Will only skip verifying the checksums in the close method if we haven't decompressed the whole block. That block will be verified by another split later anyway.
* Removed lzop from the codecs list in the config
* The indexer method is now aware of the number of checksum algorithms used so it seeks to the next block properly
* Changed the unit test to write a lzop compressed file, index and read it back again
* As suggested the RecordReaders don't have to read the index, it's done when getting the splits instead

I haven't done any work on an output format, I'd rather leave that for another ticket since it will require more extensive modifications of the compression classes. The option I'm leaning towards is to register a class that implements an Indexer interface in the stream classes (LzopOutputStream and BlockCompressorStream).

As before this will give one findbugs error.

> Add ability to split text files compressed with lzo
> ---------------------------------------------------
>
>                 Key: HADOOP-4640
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4640
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: io, mapred
>            Reporter: Johan Oskarsson
>            Assignee: Johan Oskarsson
>            Priority: Trivial
>             Fix For: 0.20.0
>
>         Attachments: HADOOP-4640.patch, HADOOP-4640.patch
>
>
> Right now any file compressed with lzop will be processed by one mapper. This
> is a shame since the lzo algorithm would be very suitable for large log files
> and similar common hadoop data sets. The compression rate is not the best out
> there but the decompression speed is amazing. Since lzo writes compressed
> data in blocks it would be possible to make an input format that can split
> the files.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
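
For readers following the ticket, here is a minimal sketch of how index-based splitting could work. It is not the code in the attached patch. It assumes the index is simply a sorted list of byte offsets at which compressed blocks start, and the names LzoSplitSketch, Range and computeSplits are hypothetical. Splits are cut only at block starts, and a missing index falls back to the whole file as one split, as described in the comment above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of the HADOOP-4640 patch. */
final class LzoSplitSketch {

    /** Half-open byte range [start, end) within the compressed file. */
    static final class Range {
        final long start;
        final long end;

        Range(long start, long end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + start + ", " + end + ")";
        }
    }

    /**
     * Computes split ranges that begin only at compressed block boundaries.
     *
     * @param blockOffsets    sorted start offsets of the compressed blocks
     *                        (assumed index format), or null/empty if no index exists
     * @param fileLength      total length of the compressed file in bytes
     * @param targetSplitSize minimum number of bytes to collect before cutting a split
     */
    static List<Range> computeSplits(long[] blockOffsets, long fileLength, long targetSplitSize) {
        List<Range> splits = new ArrayList<>();
        if (fileLength <= 0) {
            return splits;
        }
        // No index available: process the whole file as a single split.
        if (blockOffsets == null || blockOffsets.length == 0) {
            splits.add(new Range(0, fileLength));
            return splits;
        }
        // The first split starts at byte 0 so the file header belongs to some split.
        long start = 0;
        for (long offset : blockOffsets) {
            boolean insideFile = offset > start && offset < fileLength;
            // Cut at this block start once the current split is large enough.
            if (insideFile && offset - start >= targetSplitSize) {
                splits.add(new Range(start, offset));
                start = offset;
            }
        }
        // Whatever remains after the last cut forms the final split.
        splits.add(new Range(start, fileLength));
        return splits;
    }
}
```

With block offsets {100, 1100, 2100, 3100}, a 4000 byte file and a target split size of 2000, this yields the two ranges [0, 2100) and [2100, 4000). A real record reader would also have to deal with text lines that straddle a split boundary, which this sketch leaves out.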