[jira] [Commented] (CRUNCH-632) Add compression support for CSVFileSource

Gabriel Reid (JIRA) Wed, 11 Jan 2017 23:32:04 -0800

    [ https://issues.apache.org/jira/browse/CRUNCH-632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15820377#comment-15820377 ]


Gabriel Reid commented on CRUNCH-632:
-------------------------------------

Yep, you're right.

Looking into this in a bit more detail, it appears that only one 
compression codec (BZip2Codec) allows splits and reading from an 
arbitrary point in a file. Given the extra effort required to make that 
work (CompressedSplitLineReader), and considering that bzip2 doesn't seem 
to be used all that much, split support for it doesn't seem worth the 
extra work.
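
To make the trade-off concrete, here is a minimal sketch of the simpler 
approach: pick a decompressor before reading, and treat any compressed file 
as a single unsplittable input. It uses only java.util.zip rather than 
Hadoop's codec classes, so everything in it (the class name 
CsvCompressionSketch, the method names, and detecting gzip from a ".gz" 
extension) is a hypothetical illustration and not the code in the attached 
patches.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for illustration only; not part of Crunch or Hadoop.
final class CsvCompressionSketch {

    // Assumption: compression is inferred from the file extension alone.
    static boolean isCompressed(String fileName) {
        return fileName.endsWith(".gz");
    }

    // A gzip stream has to be decoded from its first byte, so a compressed
    // file is reported as unsplittable and read as one unit.
    static boolean isSplittable(String fileName) {
        return !isCompressed(fileName);
    }

    // Wrap the raw stream in a decompressor when needed, otherwise pass it through.
    static InputStream open(String fileName, InputStream raw) throws IOException {
        return isCompressed(fileName) ? new GZIPInputStream(raw) : raw;
    }

    // Read the first line of the (possibly compressed) input as UTF-8 text.
    static String readFirstLine(String fileName, InputStream raw) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(open(fileName, raw), StandardCharsets.UTF_8))) {
            return reader.readLine();
        }
    }

    // Test helper: gzip a string in memory.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] compressed = gzip("id,name\n1,crunch\n");
        // prints "id,name"
        System.out.println(readFirstLine("people.csv.gz", new ByteArrayInputStream(compressed)));
        // prints "false"
        System.out.println(isSplittable("people.csv.gz"));
    }
}
```

Reporting compressed inputs as unsplittable keeps the reader simple: it 
always starts at byte zero of the compressed stream and never has to locate 
a record boundary partway through compressed data.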

> Add compression support for CSVFileSource
> -----------------------------------------
>
>                 Key: CRUNCH-632
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-632
>             Project: Crunch
>          Issue Type: Improvement
>            Reporter: Jim McStanton
>            Assignee: Micah Whitacre
>            Priority: Minor
>         Attachments: CRUNCH-632.patch, CRUNCH-632b.patch
>
>
> Currently CSVFileSource does not support decompressing files before reading 
> them; it simply opens the file and starts reading the contents: 
> https://github.com/apache/crunch/blob/6280983179e9c690af69c2bf0e296b054122d724/crunch-core/src/main/java/org/apache/crunch/io/text/csv/CSVRecordReader.java#L127.
>  
> This source would more closely match TextFileSource if this support were 
> added. The {{LineRecordReader}} supports this behavior 
> [here|http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/hadoop-mapreduce-client-core/2.7.1/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java?av=f#87].
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
