You can also get the input file name with conf.get("map.input.file") and
reuse the last part of the filename (e.g. part-00000) with the
OutputCommitter.
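
As a rough sketch of the "last part of the filename" step, the snippet below
pulls the final path component out of a string shaped like the value of
conf.get("map.input.file"). It is plain Java with no Hadoop dependency; the
class name, method name and sample path are made up for illustration, and how
you hand the result to an OutputCommitter is not shown.

```java
// Sketch only: plain Java, no Hadoop classes involved.
final class InputPartName {
    // Returns the final path component of the input file, e.g. "part-00000".
    static String fromInputFile(String mapInputFile) {
        int slash = mapInputFile.lastIndexOf('/');
        return slash < 0 ? mapInputFile : mapInputFile.substring(slash + 1);
    }

    public static void main(String[] args) {
        // Hypothetical value; in a map task it would come from
        // conf.get("map.input.file").
        String inputFile = "hdfs://namenode:8020/data/left/part-00000";
        System.out.println(fromInputFile(inputFile));
    }
}
```

The returned name could then be used as the target name when the task output
is renamed.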

-----Original Message-----
From: jason hadoop [mailto:jason.had...@gmail.com] 
Sent: 14 May 2009 16:25
To: core-user@hadoop.apache.org
Subject: Re: Map-side join: Sort order preserved?

Sort order is preserved if your Mapper doesn't change the key ordering in
output. Partition name is not preserved.

What I have done is to manually work out what the partition number of the
output file should be for each map task, by calling the partitioner on an
input key, and then renaming the output in the close method.
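
A minimal sketch of that calculation, in plain Java without any Hadoop
classes: partitionFor is a stand-in for whatever partitioner produced the
input files (here a binary search over sorted split points, in the spirit of
a total-order partitioner), and outputName formats the result as a part-NNNNN
file name. The class name, method names and split points are assumptions for
illustration; the actual rename in close() is not shown.

```java
import java.util.Arrays;
import java.util.Locale;

// Sketch only: derives the output file name a map task should end up with.
final class PartitionFileName {
    // Stand-in partitioner: splitPoints must be sorted, and there are
    // splitPoints.length + 1 partitions. A key equal to a split point
    // falls into the partition above it.
    static int partitionFor(String key, String[] splitPoints) {
        int pos = Arrays.binarySearch(splitPoints, key) + 1;
        return pos < 0 ? -pos : pos;
    }

    // Formats a partition number as part-NNNNN (five digits, zero padded).
    static String outputName(int partition) {
        return String.format(Locale.ROOT, "part-%05d", partition);
    }

    public static void main(String[] args) {
        String[] splitPoints = {"g", "p"}; // hypothetical boundaries, 3 partitions
        // Any key read by this map task will do, since every key in one
        // input partition maps to the same partition number.
        int partition = partitionFor("kiwi", splitPoints);
        System.out.println(outputName(partition));
    }
}
```

Because all keys in one input partition map to the same number, calling the
partitioner once on the first key the task sees is enough.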

Conceptually the place for this dance is in the OutputCommitter, but I
haven't used them in production code, and my mapside join examples come from
before they were available.

The Hadoop join framework handles setting the split size to Long.MAX_VALUE
for you.

If you put up a discussion question on www.prohadoopbook.com, I will fill in
the example on how to do this.

On Thu, May 14, 2009 at 8:04 AM, Stuart White <stuart.whi...@gmail.com> wrote:

> I'm implementing a map-side join as described in chapter 8 of "Pro
> Hadoop".  I have two files that have been partitioned using the
> TotalOrderPartitioner on the same key into the same number of
> partitions.  I've set mapred.min.split.size to Long.MAX_VALUE so that
> one Mapper will handle an entire partition.
>
> I want the output to be written in the same partitioned, total sort
> order.  If possible, I want to accomplish this by setting my
> NumReducers to 0 and having the output of my Mappers written directly
> to HDFS, thereby skipping the partition/sort step.
>
> My question is this: Am I guaranteed that the Mapper that processes
> part-00000 will have its output written to the output file named
> part-00000, the Mapper that processes part-00001 will have its output
> written to part-00001, etc... ?
>
> If so, then I can preserve the partitioning/sort order of my input
> files without re-partitioning and re-sorting.
>
> Thanks.
>



-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


