David Mollitor created MAPREDUCE-7194:
-----------------------------------------
Summary: New Method For CombineFile
Key: MAPREDUCE-7194
URL: https://issues.apache.org/jira/browse/MAPREDUCE-7194
Project: Hadoop Map/Reduce
Issue Type: Improvement
Components: mrv2
Affects Versions: 3.2.0
Reporter: David Mollitor
Assignee: David Mollitor
The {{CombineFileInputFormat}} class is responsible for grouping blocks
together to form larger splits. The current implementation is very naive. It
iterates over the list of available blocks and, as long as the current group of
blocks is smaller than the maximum split size, it keeps adding blocks. The
check for whether a split has reached its maximum size happens *after* each
block is added. For example, given a certain maximum "M" and two blocks which
are both 7/8 M, they will be grouped together to create a split which is
14/8 M. If M is a large number, this split will be very large and not what the
operator would expect.
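
To make the example concrete, here is a minimal sketch of the check-after-add pattern described above. It is a simplified model and not the actual {{CombineFileInputFormat}} code: the class name, the method name, and the use of plain block lengths are all assumptions made for illustration, and node/rack locality is ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the grouping loop; not Hadoop code.
public class CheckAfterAddSketch {

    /** Groups block lengths into splits, testing the limit only after each add. */
    static List<Long> group(long[] blockLengths, long maxSize) {
        List<Long> splitSizes = new ArrayList<>();
        long current = 0;
        for (long len : blockLengths) {
            current += len;               // the block is added unconditionally
            if (current >= maxSize) {     // the limit is only checked afterwards
                splitSizes.add(current);
                current = 0;
            }
        }
        if (current > 0) {
            splitSizes.add(current);      // leftover blocks form a final split
        }
        return splitSizes;
    }

    public static void main(String[] args) {
        long m = 8L * 1024 * 1024;        // assumed maximum split size M
        long block = m / 8 * 7;           // each block is 7/8 of M
        // Both blocks end up in one split of 14/8 M.
        System.out.println(group(new long[] {block, block}, m));
    }
}
```

With two blocks of 7/8 M, the first add leaves the group below M, so the second block is added as well and the resulting single split is 14/8 M.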
I propose a general cleanup and also enforcing that, unless a file cannot
be split, its splits will not be larger than the configured maximum size.
This will give operators a much more straightforward way of calculating the
expected number of splits.
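
One possible way to enforce the limit is to check the size *before* adding a block rather than after. The sketch below is only an illustration of that idea under stated assumptions, not the proposed patch: the class and method names are hypothetical, blocks are modelled as plain lengths, locality is ignored, and the maximum size is assumed to be greater than zero.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of a check-before-add grouping loop; not Hadoop code.
public class CheckBeforeAddSketch {

    /** Groups block lengths into splits that stay within maxSize where possible. */
    static List<Long> group(long[] blockLengths, long maxSize) {
        List<Long> splitSizes = new ArrayList<>();
        long current = 0;
        for (long len : blockLengths) {
            // Close the current split first if this block would exceed the limit.
            if (current > 0 && current + len > maxSize) {
                splitSizes.add(current);
                current = 0;
            }
            // A single block larger than maxSize still gets a split of its own.
            current += len;
        }
        if (current > 0) {
            splitSizes.add(current);      // leftover blocks form a final split
        }
        return splitSizes;
    }

    public static void main(String[] args) {
        long m = 8L * 1024 * 1024;        // assumed maximum split size M
        long block = m / 8 * 7;           // each block is 7/8 of M
        // The two blocks now land in two separate splits of 7/8 M each.
        System.out.println(group(new long[] {block, block}, m));
    }
}
```

In this model, a split only exceeds the maximum when a single block on its own is larger than the maximum, which corresponds to the "cannot be split" exception above.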