[ 
https://issues.apache.org/jira/browse/CRUNCH-632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15817513#comment-15817513
 ] 

Gabriel Reid commented on CRUNCH-632:
-------------------------------------

Just to clarify the combination of compression and text files: you're right 
that they typically aren't splittable (assuming gzip compression is used), but 
some codecs, bzip2 for example, do support input splits.

Overriding the 
[o.a.h.mapreduce.lib.input.TextInputFormat#isSplitable|http://grepcode.com/file/repo1.maven.org/maven2/com.ning/metrics.action/0.2.7/org/apache/hadoop/mapreduce/lib/input/TextInputFormat.java#45]
 method would let us decide, per file, whether the input should be split.
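
To make the per-file decision concrete, here is a rough stand-alone sketch. It deliberately uses no Hadoop classes: Hadoop's own TextInputFormat decides by looking up the file's codec and checking whether it implements SplittableCompressionCodec, while this model only looks at the file extension. The class name SplitDecision and both extension sets are assumptions made up for illustration, not anything in Crunch.

```java
import java.util.Locale;
import java.util.Set;

public class SplitDecision {
    // Assumption for illustration: extensions that mark a compressed file.
    private static final Set<String> COMPRESSED_EXTENSIONS = Set.of("gz", "bz2");

    // Assumption for illustration: compressed extensions whose codec can be
    // read starting from the middle of the file.
    private static final Set<String> SPLITTABLE_EXTENSIONS = Set.of("bz2");

    /** Returns true if a file with this name may be divided into input splits. */
    static boolean isSplitable(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        if (!COMPRESSED_EXTENSIONS.contains(ext)) {
            return true; // not compressed, so any byte offset is a valid start
        }
        return SPLITTABLE_EXTENSIONS.contains(ext);
    }

    public static void main(String[] args) {
        for (String name : new String[] {"data.csv", "data.csv.gz", "data.csv.bz2"}) {
            System.out.println(name + " splittable: " + isSplitable(name));
        }
    }
}
```

A real override would take the JobContext and Path that isSplitable receives and consult the configured codecs rather than a hard-coded list.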

> Add compression support for CSVFileSource
> -----------------------------------------
>
>                 Key: CRUNCH-632
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-632
>             Project: Crunch
>          Issue Type: Improvement
>            Reporter: Jim McStanton
>            Priority: Minor
>
> Currently CSVFileSource does not support decompressing files before reading 
> them, and simply opens the file and starts reading the contents: 
> https://github.com/apache/crunch/blob/6280983179e9c690af69c2bf0e296b054122d724/crunch-core/src/main/java/org/apache/crunch/io/text/csv/CSVRecordReader.java#L127.
>  
> This source would more closely match TextFileSource if this support were 
> added. The {{LineRecordReader}} supports this behavior 
> [here|http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/hadoop-mapreduce-client-core/2.7.1/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java?av=f#87].
>  
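
The behaviour requested in the description above amounts to wrapping the raw stream in a decompressing stream before any records are read. The sketch below shows that idea with the JDK only. It is not the Crunch or Hadoop code: LineRecordReader picks the codec through Hadoop's CompressionCodecFactory, whereas this example assumes gzip and a ".gz" suffix, and the class and method names (CompressedLines, maybeDecompress, readLines, gzip) are hypothetical.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class CompressedLines {
    // Wraps the raw stream in a gzip decoder when the name ends in ".gz"
    // (assumed convention); otherwise returns the stream unchanged.
    static InputStream maybeDecompress(String fileName, InputStream raw) throws IOException {
        return fileName.endsWith(".gz") ? new GZIPInputStream(raw) : raw;
    }

    // Reads every line from the (possibly compressed) stream.
    static List<String> readLines(String fileName, InputStream raw) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                maybeDecompress(fileName, raw), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    // Helper for the demo: gzip a string in memory.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] compressed = gzip("a,b\n1,2\n");
        System.out.println(readLines("sample.csv.gz", new ByteArrayInputStream(compressed)));
    }
}
```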



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
