[ https://issues.apache.org/jira/browse/HADOOP-788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sanjay Dahiya updated HADOOP-788:
---------------------------------
Status: Patch Available (was: Open)
> Streaming should use a subclass of TextInputFormat for reading text inputs.
> ---------------------------------------------------------------------------
>
> Key: HADOOP-788
> URL: https://issues.apache.org/jira/browse/HADOOP-788
> Project: Hadoop
> Issue Type: Improvement
> Components: contrib/streaming
> Reporter: Owen O'Malley
> Assigned To: Sanjay Dahiya
> Attachments: Hadoop-788.patch
>
>
> Currently, streaming uses a lot of custom code for processing text inputs.
> I propose:
> 1. Move class LineRecordReader out of TextInputFormat.
> 2. Make class StreamLineRecordReader extend LineRecordReader.
> 3. StreamLineRecordReader uses LineRecordReader.next to read the lines and
> splits them on tab to generate a Text/Text key/value pair.
> This will remove a lot of code from streaming and give it automatic support
> for the compression codecs that the "base" part of Hadoop enjoys. In
> particular, if the native zlib code is used, it will remove the 2 GB limit on
> compressed files.
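
For illustration only, step 3 of the proposal (splitting each line on tab into a key/value pair) might look roughly like the sketch below. It is not taken from Hadoop-788.patch: it uses plain Java strings instead of Text so it runs without Hadoop on the classpath, the names TabSplitSketch and splitOnTab are hypothetical, and treating a line with no tab as a key with an empty value is an assumption rather than something the issue specifies.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;

// Hypothetical helper; the real StreamLineRecordReader would work on Text objects.
class TabSplitSketch {

    // Splits a line at the FIRST tab: text before it is the key, text after it is the value.
    // Assumption: a line without any tab becomes the key, paired with an empty value.
    static Map.Entry<String, String> splitOnTab(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return new SimpleImmutableEntry<>(line, "");
        }
        return new SimpleImmutableEntry<>(line.substring(0, tab), line.substring(tab + 1));
    }

    public static void main(String[] args) {
        String[] lines = {"k1\tv1", "k2\tv2\tmore", "no-tab-here"};
        for (String line : lines) {
            Map.Entry<String, String> kv = splitOnTab(line);
            System.out.println("key=[" + kv.getKey() + "] value=[" + kv.getValue() + "]");
        }
    }
}
```

Because the line itself would come from LineRecordReader, a reader built this way would only need the split logic; decompression and line boundaries would stay in the shared base class, which is the point of the proposal.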