[ https://issues.apache.org/jira/browse/CRUNCH-632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15820377#comment-15820377 ]
Gabriel Reid commented on CRUNCH-632:
-------------------------------------

Yep, you're right. Looking into things in a bit more detail, it appears that only one compression codec (BZip2Codec) allows splits and reading from an arbitrary point in a file. Given the extra effort required to make this work (CompressedSplitLineReader), and particularly considering that bzip2 doesn't seem to be used all that much, it doesn't seem worth the extra work.

> Add compression support for CSVFileSource
> -----------------------------------------
>
>                 Key: CRUNCH-632
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-632
>             Project: Crunch
>          Issue Type: Improvement
>            Reporter: Jim McStanton
>            Assignee: Micah Whitacre
>            Priority: Minor
>         Attachments: CRUNCH-632.patch, CRUNCH-632b.patch
>
>
> Currently CSVFileSource does not support decompressing files before reading
> them, and simply opens the file and starts reading the contents:
> https://github.com/apache/crunch/blob/6280983179e9c690af69c2bf0e296b054122d724/crunch-core/src/main/java/org/apache/crunch/io/text/csv/CSVRecordReader.java#L127.
>
> This source would more closely match TextFileSource if this support was
> added. The {{LineRecordReader}} supports this behavior
> [here|http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/hadoop-mapreduce-client-core/2.7.1/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java?av=f#87].
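To illustrate the trade-off discussed above, here is a minimal sketch of "decompress before reading, and treat the compressed file as unsplittable". It is not the attached patch and it does not use Hadoop's codec classes: it swaps in java.util.zip.GZIPInputStream from the JDK so that it runs on its own. The class name CompressedCsvSketch and the helpers isSplittable, open and readLines are hypothetical, and picking the codec from a ".gz" suffix is an assumption made for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

// Hypothetical stand-in for a record reader; not part of Crunch or Hadoop.
class CompressedCsvSketch {

    // Assumption for this sketch: compression is detected from the file name.
    static boolean isGzip(Path file) {
        return file.getFileName().toString().endsWith(".gz");
    }

    // A gzip stream can only be decoded from its start, so a gzip file
    // has to be read as a single split. Plain files can be split freely.
    static boolean isSplittable(Path file) {
        return !isGzip(file);
    }

    // Open the file, wrapping it in a decompressor when needed.
    static InputStream open(Path file) throws IOException {
        InputStream raw = Files.newInputStream(file);
        try {
            return isGzip(file) ? new GZIPInputStream(raw) : raw;
        } catch (IOException e) {
            raw.close(); // do not leak the handle if the gzip header is invalid
            throw e;
        }
    }

    // Read every line of the (possibly compressed) file from the beginning.
    static List<String> readLines(Path file) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(open(file), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }
}
```

Under these assumptions a caller checks isSplittable(file) before dividing the input, and reads a gzip file in one pass from offset 0. Supporting a splittable codec such as bzip2 would additionally require finding a valid starting point inside the compressed stream, which is the extra work the comment decides against.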