[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allen Wittenauer updated MAPREDUCE-6085:
----------------------------------------
    Labels: BB2015-05-TBR  (was: )

> Facilitate processing of text files without key/value split
> -----------------------------------------------------------
>
>                 Key: MAPREDUCE-6085
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6085
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 2.4.1
>            Reporter: Dmitry Sivachenko
>              Labels: BB2015-05-TBR
>         Attachments: IdentifierResolver1.java.patch
>
>
> A rather common type of task is processing text files line by line in 
> streaming mode without splitting each line into a key/value pair (UNIX 
> commands like grep, awk, etc., or any filter script).
> By default, the Hadoop streaming interface uses TextInputFormat, which suits 
> this task well: it passes the input line itself to the streaming job's stdin.
> The TextOutputReader class, which receives the streaming job's output, splits 
> it into a key/value pair, and TextOutputFormat then tries to join this pair 
> with a separator.
> This results in an extra separator appearing in the output in some cases.
> KeyOnlyTextOutputReader solves this problem: it passes the whole line as a 
> key with a null value, and TextOutputFormat correctly writes it without 
> inserting any separators.
> I propose to add another IdentifierResolver: "keyonlytextoutput", which uses 
> the standard TextInputWriter but replaces TextOutputReader with 
> KeyOnlyTextOutputReader.
> As a result, lines of text are never split into a key/value pair and never 
> joined back, so lines appear in the output unmodified.
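
The split-then-join behaviour described above can be illustrated with a small stand-alone model. This is only a sketch under stated assumptions: it uses no Hadoop classes, the names SeparatorDemo, splitAtFirstSeparator, keyOnly and write are hypothetical, and it assumes a tab separator and that a line without a separator is read as the whole line plus an empty (non-null) value.

```java
// Simplified model of the read/write round trip; not Hadoop code.
// All names here are hypothetical and exist only for illustration.
class SeparatorDemo {
    static final String SEP = "\t"; // assumed default key/value separator

    // Models a splitting reader: split at the first separator.
    // Assumption: a line with no separator yields an empty, non-null value.
    static String[] splitAtFirstSeparator(String line) {
        int pos = line.indexOf(SEP);
        if (pos < 0) {
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, pos), line.substring(pos + SEP.length()) };
    }

    // Models a key-only reader: the whole line is the key, the value is null.
    static String[] keyOnly(String line) {
        return new String[] { line, null };
    }

    // Models a text writer: the separator is emitted only when
    // both key and value are non-null.
    static String write(String[] kv) {
        String key = kv[0];
        String value = kv[1];
        StringBuilder sb = new StringBuilder();
        if (key != null) {
            sb.append(key);
        }
        if (key != null && value != null) {
            sb.append(SEP);
        }
        if (value != null) {
            sb.append(value);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String line = "a line without any tab";
        // Show tabs as a visible "\t" so a trailing separator is easy to spot.
        System.out.println("[" + write(splitAtFirstSeparator(line)).replace("\t", "\\t") + "]");
        System.out.println("[" + write(keyOnly(line)).replace("\t", "\\t") + "]");
    }
}
```

In this model the splitting path appends a trailing separator to a line that never contained one, while the key-only path returns the line unchanged.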



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
