[jira] Updated: (HADOOP-4640) Add ability to split text files compressed with lzo

Johan Oskarsson (JIRA) Fri, 14 Nov 2008 06:25:45 -0800

     [ 
https://issues.apache.org/jira/browse/HADOOP-4640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Johan Oskarsson updated HADOOP-4640:
------------------------------------

    Attachment: HADOOP-4640.patch

Updated patch with most of the suggestions incorporated:
* If the index is missing, processing continues with the whole file as one 
split
* Checksum verification in the close method is skipped only if we haven't 
decompressed the whole block. That block will be verified by another split 
later anyway.
* Removed lzop from the codecs list in the config
* The indexer method is now aware of the number of checksum algorithms used, so 
it seeks to the next block properly
* Changed the unit test to write an lzop-compressed file, index it, and read it 
back again
* As suggested, the RecordReaders no longer have to read the index; it is read 
when getting the splits instead
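
To illustrate the split logic described in the list above, here is a minimal 
sketch. It is not the code in the attached patch. The class LzoSplitSketch, 
the Split type, the computeSplits method and the idea that the index is a 
sorted array of block start offsets are all assumptions made for the example. 
It shows split boundaries being placed only on block starts, and the fallback 
to a single split when no index is available.

```java
import java.util.ArrayList;
import java.util.List;

public class LzoSplitSketch {

    /** A byte range [start, start + length) of the compressed file. */
    public static final class Split {
        public final long start;
        public final long length;

        public Split(long start, long length) {
            this.start = start;
            this.length = length;
        }
    }

    /**
     * Hypothetical helper: blockOffsets holds the sorted start offsets of the
     * compressed blocks, as an index might store them (assumed layout).
     */
    public static List<Split> computeSplits(long[] blockOffsets,
                                            long fileLength,
                                            long targetSize) {
        List<Split> splits = new ArrayList<>();

        // No index: fall back to treating the whole file as one split.
        if (blockOffsets == null || blockOffsets.length == 0) {
            splits.add(new Split(0, fileLength));
            return splits;
        }

        // The first split starts at 0 so that it covers the file header.
        long start = 0;
        for (long offset : blockOffsets) {
            // Cut only at a block start, once the split is large enough.
            if (offset > start && offset - start >= targetSize) {
                splits.add(new Split(start, offset - start));
                start = offset;
            }
        }

        // Whatever remains becomes the final split.
        if (start < fileLength) {
            splits.add(new Split(start, fileLength - start));
        }
        return splits;
    }
}
```

Because the index is consulted only in this step, a reader handed one of these 
ranges can start decompressing at its start offset without looking at the 
index itself.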

I haven't done any work on an output format; I'd rather leave that for another 
ticket, since it will require more extensive modifications of the compression 
classes. The option I'm leaning towards is to register a class that implements 
an Indexer interface in the stream classes (LzopOutputStream and 
BlockCompressorStream).
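
As a rough picture of what such a registration could look like, here is a 
sketch under stated assumptions. Nothing in it exists in the patch. The 
Indexer interface shape, its blockStarted method, the BlockStream stand-in 
(which does no compression) and ListIndexer are all hypothetical names chosen 
for the example. The only idea taken from the paragraph above is that the 
stream notifies a registered indexer as it writes blocks.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

public class IndexerSketch {

    /** Hypothetical callback: told the offset at which each block begins. */
    public interface Indexer {
        void blockStarted(long offset);
    }

    /**
     * Stand-in for a block-oriented output stream. It writes blocks as-is
     * (no compression) and reports each block start to the indexer.
     */
    public static class BlockStream {
        private final ByteArrayOutputStream out;
        private final Indexer indexer;
        private long position = 0;

        public BlockStream(ByteArrayOutputStream out, Indexer indexer) {
            this.out = out;
            this.indexer = indexer;
        }

        public void writeBlock(byte[] block) {
            // Notify before writing, so the offset is the block's first byte.
            if (indexer != null) {
                indexer.blockStarted(position);
            }
            out.write(block, 0, block.length);
            position += block.length;
        }
    }

    /** Indexer that simply collects the offsets in memory. */
    public static class ListIndexer implements Indexer {
        public final List<Long> offsets = new ArrayList<>();

        @Override
        public void blockStarted(long offset) {
            offsets.add(offset);
        }
    }
}
```

Keeping the indexer behind an interface would let the stream classes stay 
unaware of how or where the index is stored.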

As before, this will give one findbugs error.

> Add ability to split text files compressed with lzo
> ---------------------------------------------------
>
>                 Key: HADOOP-4640
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4640
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: io, mapred
>            Reporter: Johan Oskarsson
>            Assignee: Johan Oskarsson
>            Priority: Trivial
>             Fix For: 0.20.0
>
>         Attachments: HADOOP-4640.patch, HADOOP-4640.patch
>
>
> Right now any file compressed with lzop will be processed by one mapper. This 
> is a shame, since the lzo algorithm would be very suitable for large log files 
> and similar common Hadoop data sets. The compression ratio is not the best out 
> there, but the decompression speed is amazing. Since lzo writes compressed 
> data in blocks, it would be possible to make an input format that can split 
> the files.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
