[ https://issues.apache.org/jira/browse/HADOOP-4640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12648887#action_12648887 ]

Chris Douglas commented on HADOOP-4640:
---------------------------------------

bq. As for the close() I did as suggested, although it rubs me the wrong way to read all those bytes without needing to. I guess the practical performance impact will be minimal though.

It's only calculating a checksum of the remaining bytes from a direct buffer. For the default 64k block, I'd guess it adds somewhere between 20 and 50ms in the close. If it had to make another trip to the native code, I agree that would be improper, but this should be a trivial cost.

I'm not sure I follow LzoIndex::findIndexPosition. Given {{\{0, 5, 10, 15\}}} as block positions, findIndexPosition(1) will return 10, but findIndexPosition(5) returns 5. Should the former case also return 5? findIndexPosition(11) returns -1, which also seems contrary to its javadoc explanation.

> Add ability to split text files compressed with lzo
> ---------------------------------------------------
>
>                 Key: HADOOP-4640
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4640
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: io, mapred
>            Reporter: Johan Oskarsson
>            Assignee: Johan Oskarsson
>            Priority: Trivial
>             Fix For: 0.20.0
>
>         Attachments: HADOOP-4640.patch, HADOOP-4640.patch, HADOOP-4640.patch
>
>
> Right now any file compressed with lzop will be processed by one mapper. This
> is a shame since the lzo algorithm would be very suitable for large log files
> and similar common hadoop data sets. The compression rate is not the best out
> there but the decompression speed is amazing. Since lzo writes compressed
> data in blocks it would be possible to make an input format that can split
> the files.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
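
To make the findIndexPosition question concrete, here is a minimal sketch of the behavior the comment seems to expect. It assumes the method is meant to return the first block position at or after the given offset, and -1 only when no such block exists; the javadoc is not quoted in this thread, so that reading is an assumption. The class name LzoIndexSketch and the sorted long[] representation are hypothetical and are not taken from the attached patch.

```java
import java.util.Arrays;

/** Hypothetical stand-in for the index lookup discussed above; not the patch's code. */
final class LzoIndexSketch {
    private LzoIndexSketch() {}

    /**
     * Returns the first block position that is >= pos, or -1 if pos lies
     * past the last block start. Assumes blockPositions is sorted ascending.
     */
    static long findIndexPosition(long[] blockPositions, long pos) {
        int idx = Arrays.binarySearch(blockPositions, pos);
        if (idx < 0) {
            // Not an exact match: binarySearch returns -(insertionPoint) - 1,
            // so recover the insertion point, i.e. the next block start.
            idx = -idx - 1;
        }
        return idx < blockPositions.length ? blockPositions[idx] : -1L;
    }

    public static void main(String[] args) {
        long[] blocks = {0L, 5L, 10L, 15L};
        for (long pos : new long[] {0L, 1L, 5L, 11L, 16L}) {
            System.out.println(pos + " -> " + findIndexPosition(blocks, pos));
        }
    }
}
```

Under this assumption, offsets 1 and 5 both map to block position 5, offset 11 maps to 15, and only offsets past the last block start (for example 16) yield -1.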