[
https://issues.apache.org/jira/browse/MAPREDUCE-6085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Allen Wittenauer updated MAPREDUCE-6085:
----------------------------------------
Labels: BB2015-05-TBR (was: )
> Facilitate processing of text files without key/value split
> -----------------------------------------------------------
>
> Key: MAPREDUCE-6085
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6085
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Affects Versions: 2.4.1
> Reporter: Dmitry Sivachenko
> Labels: BB2015-05-TBR
> Attachments: IdentifierResolver1.java.patch
>
>
> There is a rather popular type of task: processing text files line by line
> in streaming mode without splitting each line into a key/value pair (UNIX
> commands like grep and awk, or any filter script).
> By default, the Hadoop streaming interface uses TextInputFormat, which suits
> this task well: it passes the input line itself to the streaming job's stdin.
> The TextOutputReader class, which receives the streaming job's output,
> splits each line into a key and a value, and TextOutputFormat then joins
> this pair back together with a separator.
> In some cases this results in an extra separator appearing in the output.
> KeyOnlyTextOutputReader solves this problem: it passes the whole line as the
> key with a null value, and TextOutputFormat correctly writes it without
> inserting any separator.
> I propose adding another identifier to IdentifierResolver,
> "keyonlytextoutput", which uses the standard TextInputWriter but replaces
> TextOutputReader with KeyOnlyTextOutputReader.
> As a result, lines of text are never split into a key/value pair and never
> joined back, so they appear in the output unmodified.
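
The behaviour described above can be illustrated with a small standalone
sketch. It does not use the Hadoop classes; KeyOnlySketch and its methods are
hypothetical names that only model the split-and-join path against the
key-only path. It assumes a tab separator, and it assumes that the extra
separator arises when a line contains no separator and the reader yields an
empty (non-null) value.

```java
public class KeyOnlySketch {
    // Assumed separator for this illustration.
    static final String SEP = "\t";

    // Models the writer side: the separator is emitted only when both
    // the key and the value are non-null.
    static String join(String key, String value) {
        StringBuilder sb = new StringBuilder();
        if (key != null) sb.append(key);
        if (key != null && value != null) sb.append(SEP);
        if (value != null) sb.append(value);
        return sb.toString();
    }

    // Models the split-then-join path: the line is split at the first
    // separator; a line without one becomes (line, "") in this model.
    static String viaKeyValueSplit(String line) {
        int pos = line.indexOf(SEP);
        String key = pos < 0 ? line : line.substring(0, pos);
        String value = pos < 0 ? "" : line.substring(pos + SEP.length());
        return join(key, value);
    }

    // Models the key-only path: the whole line is the key, the value is null.
    static String viaKeyOnly(String line) {
        return join(line, null);
    }

    public static void main(String[] args) {
        String[] lines = {"no separator here", "left\tright"};
        for (String line : lines) {
            // Brackets make a trailing separator visible in the printout.
            System.out.println("split+join: [" + viaKeyValueSplit(line) + "]");
            System.out.println("key only:   [" + viaKeyOnly(line) + "]");
        }
    }
}
```

In this model only the line without a separator differs between the two
paths: the split-and-join path appends a trailing tab, while the key-only
path returns the line unchanged.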