[ https://issues.apache.org/jira/browse/HADOOP-3227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12587688#action_12587688 ]
Runping Qi commented on HADOOP-3227:
------------------------------------
+1 overall.
Some details still need to be worked out.
How does the input format define split boundaries?
You could still use "\n" as the separator, but then it is not a true binary
input format.
What does the binary output format do?
Is it a pure BytesWritable record writer?
I think the streaming framework needs to change
so that it does not parse the streaming output data at all.
Instead, it should pass the output data directly to the BytesWritable output
record writer.
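
To make the split-boundary question concrete, here is a minimal Python sketch contrasting the two options. It is illustrative only and is not Hadoop code: the function names `fixed_size_splits` and `newline_aligned_splits` are hypothetical, and the newline-aligned variant is a simplified stand-in rather than the logic Hadoop's text input formats actually use. A "true binary" format would hand out plain byte ranges; a "\n"-based format has to look at the data to place each boundary.

```python
def fixed_size_splits(file_length, split_size):
    # Binary-style splits: plain (offset, length) byte ranges.
    # No byte of the file is inspected, so records may straddle splits.
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    splits = []
    offset = 0
    while offset < file_length:
        length = min(split_size, file_length - offset)
        splits.append((offset, length))
        offset += length
    return splits


def newline_aligned_splits(data, split_size):
    # Text-style splits (simplified assumption): extend each split so that
    # it ends just after the next newline byte, keeping lines whole.
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    splits = []
    offset = 0
    while offset < len(data):
        end = min(offset + split_size, len(data))
        if end < len(data):
            # Search from the last byte of the tentative split onwards.
            newline = data.find(b"\n", end - 1)
            end = len(data) if newline == -1 else newline + 1
        splits.append((offset, end - offset))
        offset = end
    return splits
```

With the first function the consumer (the streaming script) must cope with records cut at arbitrary offsets; with the second the framework is parsing again, which is the overhead the issue wants to avoid.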
> Implement a binary input/output format for Streaming
> ----------------------------------------------------
>
> Key: HADOOP-3227
> URL: https://issues.apache.org/jira/browse/HADOOP-3227
> Project: Hadoop Core
> Issue Type: Improvement
> Components: contrib/streaming
> Reporter: Arun C Murthy
> Assignee: Arun C Murthy
> Fix For: 0.18.0
>
>
> Lots of streaming applications process textual data with 1 record per line
> and fields separated by a delimiter. It turns out that there is no point in
> using any of Hadoop's input/output formats since the streaming script/binary
> itself will parse the input and break it into records and fields. In such cases
> we should provide users with a binary input/output format which just sends
> 64k (or so) blocks of data directly from HDFS to the streaming application.
> I did something very similar for Pig-Streaming (PIG-94 - BinaryStorage) which
> resulted in 300%+ speedup for scanning (identity mapper & map-only jobs)
> data... the parsing done by input/output formats in these cases was
> pure overhead.
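
The idea in the description, handing the application raw blocks instead of parsed records, can be sketched in a few lines of Python. This is a hedged illustration and not the proposed implementation: `pump_blocks` and `BLOCK_SIZE` are hypothetical names, in-memory streams stand in for HDFS and for the streaming process's stdin, and the 64 KiB block size is simply the "64k (or so)" figure quoted above.

```python
import io

BLOCK_SIZE = 64 * 1024  # assumed from "64k (or so)" in the description


def pump_blocks(src, dst, block_size=BLOCK_SIZE):
    # Copy src to dst in fixed-size blocks without inspecting the bytes:
    # no record splitting, no field parsing, no key/value construction.
    # Returns the number of blocks written.
    blocks = 0
    while True:
        block = src.read(block_size)
        if not block:  # empty read means end of stream
            break
        dst.write(block)
        blocks += 1
    return blocks


# Stand-ins for the real endpoints (assumptions for this sketch):
# source plays the HDFS file, sink plays the streaming application's stdin.
source = io.BytesIO(b"field1\tfield2\n" * 10000)
sink = io.BytesIO()
blocks_written = pump_blocks(source, sink)
```

All record and field parsing is left to the streaming script or binary on the receiving end, which is where the description says it happens anyway.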