Dmitry Sivachenko created MAPREDUCE-6085:
--------------------------------------------

             Summary: Facilitate processing of text files without key/value split
                 Key: MAPREDUCE-6085
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6085
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
    Affects Versions: 2.4.1
            Reporter: Dmitry Sivachenko


A rather common type of task is processing text files line by line in 
streaming mode without splitting each line into a key/value pair (UNIX commands 
like grep and awk, or any filter script).

By default, the Hadoop streaming interface uses TextInputFormat, which suits 
this task well: it passes the input line itself to the streaming job's stdin.

The TextOutputReader class, which receives the streaming job's output, splits 
each line into a key/value pair, and TextOutputFormat then joins the pair back 
together with a separator.
In some cases this results in an extra separator appearing in the output.

KeyOnlyTextOutputReader solves this problem: it passes the whole line as the 
key with a null value, and TextOutputFormat writes it correctly, without 
inserting any separator.
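
The difference between the two output paths can be illustrated with a small 
stand-alone Java model. It does not use the Hadoop classes; the class and 
method names (SeparatorSketch, writeRecord, viaTextOutputReader, 
viaKeyOnlyReader) are hypothetical. It also assumes that a line containing no 
separator is split into the whole line as the key plus an empty, non-null 
value, which is one way the extra separator could arise.

```java
// Stand-alone model of the behaviour described in this issue.
// It does NOT use Hadoop classes; all names here are hypothetical.
class SeparatorSketch {
    static final String SEPARATOR = "\t";

    // Models the output format: the separator is written only when
    // both the key and the value are non-null.
    static String writeRecord(String key, String value) {
        StringBuilder out = new StringBuilder();
        if (key != null) {
            out.append(key);
        }
        if (key != null && value != null) {
            out.append(SEPARATOR);
        }
        if (value != null) {
            out.append(value);
        }
        return out.toString();
    }

    // Models the split-then-join path: split at the first separator.
    // ASSUMPTION: a line without a separator yields an empty, non-null value.
    static String viaTextOutputReader(String line) {
        int pos = line.indexOf(SEPARATOR);
        String key = pos < 0 ? line : line.substring(0, pos);
        String value = pos < 0 ? "" : line.substring(pos + SEPARATOR.length());
        return writeRecord(key, value);
    }

    // Models the key-only path: whole line as the key, null value.
    static String viaKeyOnlyReader(String line) {
        return writeRecord(line, null);
    }

    public static void main(String[] args) {
        String line = "GET /index.html 200";
        // Brackets make a trailing separator visible.
        System.out.println("[" + viaTextOutputReader(line) + "]");
        System.out.println("[" + viaKeyOnlyReader(line) + "]");
    }
}
```

Under that assumption, the split-then-join path appends a trailing tab to a 
line that has no tab of its own, while the key-only path returns the line 
unchanged.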

I propose adding another IdentifierResolver, "keyonlytextoutput", which uses 
the standard TextInputWriter but replaces TextOutputReader with 
KeyOnlyTextOutputReader.

As a result, lines of text are never split into key/value pairs and never 
joined back together, so they appear in the output unmodified.
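
The proposed mapping from identifier to input writer and output reader can be 
modelled roughly as follows. This is a stand-alone sketch, not the Hadoop 
implementation: ResolverSketch, IoPair and resolve are hypothetical names, the 
"text" identifier for the default pair is an assumption, and so is the 
fallback to that default for unknown identifiers. Only "keyonlytextoutput" 
and the writer/reader class names come from the proposal.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Stand-alone model of the proposed identifier mapping.
// It does NOT use Hadoop classes; ResolverSketch and IoPair are hypothetical.
class ResolverSketch {
    // The writer/reader pair selected by an identifier.
    static final class IoPair {
        final String inputWriter;
        final String outputReader;

        IoPair(String inputWriter, String outputReader) {
            this.inputWriter = inputWriter;
            this.outputReader = outputReader;
        }
    }

    private static final Map<String, IoPair> TABLE = new HashMap<>();
    static {
        // ASSUMPTION: "text" names the default writer/reader pair.
        TABLE.put("text", new IoPair("TextInputWriter", "TextOutputReader"));
        // Proposed: same input writer, key-only output reader.
        TABLE.put("keyonlytextoutput",
                new IoPair("TextInputWriter", "KeyOnlyTextOutputReader"));
    }

    // ASSUMPTION: unknown identifiers fall back to the default pair.
    static IoPair resolve(String identifier) {
        IoPair pair = TABLE.get(identifier.toLowerCase(Locale.ROOT));
        return pair != null ? pair : TABLE.get("text");
    }

    public static void main(String[] args) {
        IoPair pair = resolve("keyonlytextoutput");
        System.out.println(pair.inputWriter + " / " + pair.outputReader);
    }
}
```

The point of the sketch is that only the output side changes: the input 
writer stays the same, so the streaming job still receives each input line on 
stdin as before.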



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
