[ https://issues.apache.org/jira/browse/HADOOP-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Runping Qi updated HADOOP-1284:
-------------------------------
Description:
Right now, the protocol between the stream mapper/reducer and the
framework is very inflexible.
The mapper/reducer generates line-oriented output. The framework
reads it line by line and splits each line into a key/value pair.
By default, the substring up to the first tab char is the key, and
the substring after the first tab char is the value.
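
To make the default rule concrete, here is a minimal Python sketch of
the split described above. The function name split_default is made up
for illustration; it is not part of Hadoop. Returning an empty value
for a line with no tab is an assumption, since the description does
not say what happens in that case.

```python
def split_default(line):
    """Split one output line at the FIRST tab only.

    Everything before the first tab is the key; everything after it
    (including any further tabs) is the value.
    Assumption: a line with no tab becomes (whole line, "").
    """
    key, _sep, value = line.rstrip("\n").partition("\t")
    return key, value


if __name__ == "__main__":
    print(split_default("user42\tclick\t2007-04-25\n"))
```

Note that later tabs stay inside the value, which is why the
application has no say in where the key ends.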
However, in many cases, the application wants some control over how
the pair is split.
Here, I'd like to introduce the following configuration variables for
that:
1. "streaming.output.field.separator": the field separator. The
default is the tab char, but the user can specify a different one
(e.g. ':', or ', ', etc.)
A map output line can be considered as a list of fields separated by
the separator.
2. "streaming.num.fields.for.mapout.key": the number of leading
fields that will be used as the map output key
(and for sorting on the reduce side).
The default value is 1.
The rest of the fields will be used as the value. For example, I can
specify the first 5 fields as my mapout key.
3. "streaming.num.fields.for.partitioning": Sometimes, I want to use
fewer fields for partitioning to achieve the "primary/secondary"
composite key effect as proposed in HADOOP-485. The default value
is 1.
For example, I can set "streaming.num.fields.for.partitioning" to 3
and "streaming.num.fields.for.mapout.key" to 5.
This effectively amounts to saying that fields 4 and 5 are my
secondary key.
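
As a rough illustration of how the three settings would interact,
here is a Python sketch. The function and parameter names are
hypothetical and only mirror the proposed variables. The hash (CRC32)
is an arbitrary stand-in, not what the framework uses.

```python
import zlib


def split_key_value(line, separator="\t", num_key_fields=1):
    """Treat the line as separator-delimited fields; the first
    num_key_fields fields form the key, the rest form the value."""
    fields = line.rstrip("\n").split(separator)
    key = separator.join(fields[:num_key_fields])
    value = separator.join(fields[num_key_fields:])
    return key, value


def partition_for(key, num_partitions, separator="\t",
                  num_partition_fields=1):
    """Pick a partition from only the leading fields of the key.

    Assumption: CRC32 stands in for the real hash function.
    """
    prefix = separator.join(key.split(separator)[:num_partition_fields])
    return zlib.crc32(prefix.encode("utf-8")) % num_partitions


if __name__ == "__main__":
    # 5 key fields, 3 partitioning fields, ':' as the separator.
    k1, v1 = split_key_value("a:b:c:d:e:v1:v2", ":", 5)
    k2, v2 = split_key_value("a:b:c:x:y:v3", ":", 5)
    print(k1, v1)
    print(k2, v2)
    # Same first 3 fields, so both keys land in the same partition.
    print(partition_for(k1, 4, ":", 3) == partition_for(k2, 4, ":", 3))
```

Both keys share the primary part "a:b:c", so they reach the same
reducer, where the full 5-field key determines the sort order. Fields
4 and 5 act as the secondary key.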
With the above default values, it is compatible with the current
behavior while introducing a new desirable feature in a clean way.
Thoughts?
was:
Right now, the protocol between the stream mapper/reducer and the
framework is very inflexible.
The mapper/reducer generates line-oriented output. The framework
reads it line by line and splits each line into a key/value pair.
By default, the substring up to the first tab char is the key, and
the substring after the first tab char is the value.
However, in many cases, the application wants some control over how
the pair is split.
Here, I'd like to introduce the following configuration variables for
that:
1. "streaming.output.field.separator": the field separator. The
default is the tab char, but the user can specify a different one
(e.g. '|', or ' ', etc.)
A map output line can be considered as a list of fields separated by
the separator.
2. "streaming.num.fields.for.mapout.key": the number of leading
fields that will be used as the map output key (and for sorting on
the reduce side).
The default value is 1.
The rest of the fields will be used as the value. For example, I can
specify the first 5 fields as my mapout key.
3. "streaming.num.fields.for.partitioning": Sometimes, I want to use
fewer fields for partitioning to achieve the "primary/secondary"
composite key effect as proposed in HADOOP-485. The default value
is 1. For example, I can set "streaming.num.fields.for.partitioning"
to 3 and "streaming.num.fields.for.mapout.key" to 5. This effectively
amounts to saying that fields 4 and 5 are my secondary key.
With the above default values, it is compatible with the current
behavior while introducing a new desirable feature in a clean way.
Thoughts?
Runping Qi [25/Apr/07 11:07 AM]
This patch implements the proposed protocol.
With this patch, the streaming user can specify a field separator for
the mapper's output and/or a field separator for the reducer's
output. The default is the tab char.
The user can also specify how many fields in the output constitute
the key. The default is 1.
The rest of the line will be the value.
A partitioner class, KeyFieldBasedPartitioner in mapred.lib, is also
implemented.
The user can specify how many fields of the map output key will be
used for partitioning.
A utility class, FieldSelectionMapReduce in mapred.lib, is also
added.
This class allows the user to create map/reduce jobs that manipulate
text data like the Unix cut utility.
The user can specify the field separator (the delimiter for cut),
which fields to select, and which fields to partition/sort by.
Two unit tests are introduced.
All the unit tests passed.
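
For readers unfamiliar with cut, the kind of field selection meant
here can be sketched in a few lines of Python. This is only an
illustration of the idea. The function select_fields and its
parameters are hypothetical and do not reflect the actual interface
of FieldSelectionMapReduce.

```python
def select_fields(line, delimiter="\t", field_indexes=(0,)):
    """Keep only the chosen fields of a delimited line, cut-style.

    field_indexes are 0-based here (cut itself counts from 1).
    Assumption: indexes past the end of the line are skipped.
    """
    fields = line.rstrip("\n").split(delimiter)
    return delimiter.join(fields[i] for i in field_indexes
                          if i < len(fields))


if __name__ == "__main__":
    # Roughly what `cut -d: -f1,3` does to the same line.
    print(select_fields("a:b:c:d", ":", (0, 2)))
```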
clean up the protocol between stream mapper/reducer and the framework
---------------------------------------------------------------------
Key: HADOOP-1284
URL: https://issues.apache.org/jira/browse/HADOOP-1284
Project: Hadoop
Issue Type: Improvement
Reporter: Runping Qi
Assigned To: Runping Qi
Attachments: patch-1284.txt
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.