Dmitry Sivachenko created MAPREDUCE-6085:
--------------------------------------------
Summary: Facilitate processing of text files without key/value split
Key: MAPREDUCE-6085
URL: https://issues.apache.org/jira/browse/MAPREDUCE-6085
Project: Hadoop Map/Reduce
Issue Type: Improvement
Affects Versions: 2.4.1
Reporter: Dmitry Sivachenko
A rather common type of task is processing text files line by line in
streaming mode, without splitting each line into a key/value pair (for
example, UNIX commands such as grep and awk, or any filter script).
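As an illustration of the kind of task meant here, the following is a minimal
sketch of a grep-like line filter. The function names (filter_lines, run) and
the default pattern are hypothetical and chosen only for this example; the
point is that each line is treated as an opaque string, with no notion of key
or value.

```python
import io
import re


def filter_lines(lines, pattern):
    """Yield only the lines that match the regular expression, unmodified."""
    regex = re.compile(pattern)
    for line in lines:
        if regex.search(line):
            yield line


def run(stdin, stdout, pattern="ERROR"):
    """Copy matching lines from stdin to stdout without touching their content."""
    stdout.writelines(filter_lines(stdin, pattern))


# Small in-memory demonstration; a real streaming script would instead call
# run(sys.stdin, sys.stdout) from its entry point.
demo_in = io.StringIO("first ERROR line\nall good\nERROR\twith a tab\n")
demo_out = io.StringIO()
run(demo_in, demo_out)
```

A script like this expects its output lines to reach the final output exactly
as written, whether or not they happen to contain a tab.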
By default, the Hadoop streaming interface uses TextInputFormat, which suits
this task well: it passes the input line itself to the streaming job's stdin.
The TextOutputReader class, which receives the streaming job's output, splits
each line into a key/value pair, and TextOutputFormat then tries to join this
pair back together with a separator.
In some cases this results in an extra separator appearing in the output.
KeyOnlyTextOutputReader solves this problem: it passes the whole line as a key
with a null value, and TextOutputFormat correctly writes it without inserting
any separator.
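The difference between the two readers can be sketched with a simplified
model. This is not the Hadoop code: the function names below are hypothetical,
the separator is assumed to be a tab, and the model assumes that a line
without a separator is split into the whole line as the key plus an empty
(but non-null) value, while the writer inserts the separator only when both
key and value are non-null.

```python
SEPARATOR = "\t"  # assumed key/value separator for this sketch


def text_output_read(line, sep=SEPARATOR):
    """Model of a key/value reader: split at the first separator.

    Assumption: with no separator present, the value is "" rather than None.
    """
    key, _, value = line.partition(sep)
    return key, value


def key_only_read(line):
    """Model of a key-only reader: the whole line is the key, value is null."""
    return line, None


def text_output_write(key, value, sep=SEPARATOR):
    """Model of the writer: join with the separator only if both are non-null."""
    parts = [part for part in (key, value) if part is not None]
    return sep.join(parts)


line = "no separator here"
split_result = text_output_write(*text_output_read(line))
key_only_result = text_output_write(*key_only_read(line))

print(repr(split_result))     # 'no separator here\t'
print(repr(key_only_result))  # 'no separator here'
```

Under these assumptions, the split-and-join path appends a trailing separator
to a line that never contained one, while the key-only path returns the line
unchanged.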
I propose to add another IdentifierResolver, "keyonlytextoutput", which uses
the standard TextInputWriter but replaces TextOutputReader with
KeyOnlyTextOutputReader.
As a result, lines of text are never split into key/value pairs and never
joined back together, so they appear in the output unmodified.