Re: Partitioning reduce output by date

Otis Gospodnetic Thu, 20 Mar 2008 16:00:53 -0700

Thank you, Doug and Ted, this pointed me in the right direction, which led to 
a custom OutputFormat and a RecordWriter that opens and closes the 
DataOutputStream based on the current key (if the current key differs from the 
previous key, close the previous output and open a new one, then write).
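
A minimal sketch of that close-and-reopen logic, in case it helps. It is plain 
Java and does not implement Hadoop's RecordWriter interface; the class name 
DateRollingWriter, the ".txt" suffix, and the use of local files are all 
assumptions made for illustration.

```java
import java.io.BufferedWriter;
import java.io.Closeable;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical stand-in for a RecordWriter; it does not use Hadoop's interfaces. */
class DateRollingWriter implements Closeable {
    private final Path outputDir;
    private String currentDate;   // date of the file currently open, or null
    private BufferedWriter out;   // writer for currentDate, or null

    DateRollingWriter(Path outputDir) {
        this.outputDir = outputDir;
    }

    /** Writes one record, switching files when the date differs from the previous one. */
    void write(String date, String value) throws IOException {
        if (!date.equals(currentDate)) {
            close();  // close the previous date's file, if one is open
            out = Files.newBufferedWriter(
                    outputDir.resolve(date + ".txt"),
                    StandardCharsets.UTF_8,
                    StandardOpenOption.CREATE,
                    StandardOpenOption.APPEND);
            currentDate = date;
        }
        out.write(value);
        out.newLine();
    }

    /** Closes the open file, if any; safe to call more than once. */
    @Override
    public void close() throws IOException {
        if (out != null) {
            out.close();
            out = null;
            currentDate = null;
        }
    }
}
```

This relies on reduce input arriving sorted by key, so all records for one 
date are contiguous and each file is opened only once per reduce task.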

As for partitioning, that worked, too.  My getPartition method now has:

            // Clear the sign bit rather than negating: -Integer.MIN_VALUE is still negative.
            int dateHash = startDate.hashCode() & Integer.MAX_VALUE;
            int partitionID = dateHash % numPartitions;
            return partitionID;
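
One caveat when turning a hash into a partition number: negating a negative 
hash does not work for Integer.MIN_VALUE, because -Integer.MIN_VALUE overflows 
back to Integer.MIN_VALUE and the modulo then comes out negative. Masking off 
the sign bit avoids this. A small sketch comparing the two, in plain Java with 
hypothetical names (DatePartitioning, partitionFor, negatingPartitionFor) that 
are not part of Hadoop:

```java
/** Hypothetical helper; plain Java, not tied to Hadoop's Partitioner interface. */
final class DatePartitioning {
    private DatePartitioning() {}

    /** Maps a hash code to a partition in [0, numPartitions). */
    static int partitionFor(int hash, int numPartitions) {
        // Clearing the sign bit always yields a non-negative int.
        return (hash & Integer.MAX_VALUE) % numPartitions;
    }

    /** The negate-if-negative variant, shown only for comparison. */
    static int negatingPartitionFor(int hash, int numPartitions) {
        // -Integer.MIN_VALUE overflows and stays Integer.MIN_VALUE.
        int h = hash < 0 ? -hash : hash;
        return h % numPartitions;
    }
}
```

For example, DatePartitioning.partitionFor("2008-03-20".hashCode(), 4) always 
returns the same value in the range 0 to 3, so every record for that date goes 
to the same reduce.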

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Doug Cutting <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Wednesday, March 19, 2008 4:39:04 PM
Subject: Re: Partitioning reduce output by date

Otis Gospodnetic wrote:
> That "numPartitions" corresponds to the number of reduce tasks.  What I need 
> is partitioning that corresponds to the number of unique dates (yyyy-mm-dd) 
> processed by the Mapper and not the number of reduce tasks.  I don't know the 
> number of distinct dates in the input ahead of time, though, so I cannot just 
> specify the same number of reduces.
> 
> I *can* get the number of unique dates by keeping track of dates in map().  I 
> was going to take this approach and use this number in the getPartition(....) 
> method, but apparently getPartition(...) is called as each input row is 
> processed by map() call.  This causes a problem for me, as I know the total 
> number of unique dates only after *all* of the input is processed by map().

The number of partitions is indeed the number of reduces.  If you were 
to compute it during map, then each map might generate a different 
number.  Each map must partition into the same space, so that all 
partition 0 data can go to one reduce, partition 1 to another, and so on.

I think Ted pointed you in the right direction: your Partitioner should 
partition by the hash of the date, then your OutputFormat should start 
writing a new file each time the date changes.  That will give you a 
unique file per date.

Doug
