Thank you, Doug and Ted, this pointed me in the right direction, which lead to a custom OutputFormat and a RecordWriter that opens and closes the DataOutputStream based on the current key (if current key diff from previous key, close previous output and open a new one, then write....)
As for partitioning, that worked, too. My getPartition method now has: int dateHash = startDate.hashCode(); if (dateHash < 0) dateHash = -dateHash; int partitionID = dateHash % numPartitions; return partitionID; Thanks, Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch ----- Original Message ---- From: Doug Cutting <[EMAIL PROTECTED]> To: core-user@hadoop.apache.org Sent: Wednesday, March 19, 2008 4:39:04 PM Subject: Re: Partitioning reduce output by date Otis Gospodnetic wrote: > That "numPartitions" corresponds to the number of reduce tasks. What I need > is partitioning that corresponds to the number of unique dates (yyyy-mm-dd) > processed by the Mapper and not the number of reduce tasks. I don't know the > number of distinct dates in the input ahead of time, though, so I cannot just > specify the same number of reduces. > > I *can* get the number of unique dates by keeping track of dates in map(). I > was going to take this approach and use this number in the getPartition(....) > method, but apparently getPartition(...) is called as each input row is > processed by map() call. This causes a problem for me, as I know the total > number of unique dates only after *all* of the input is processed by map(). The number of partitions is indeed the number of reduces. If you were to compute it during map, then each map might generate a different number. Each map must partition into the same space, so that all partition 0 data can go to one reduce, partition 1 to another, and so on. I think Ted pointed you in the right direction: your Partitioner should partition by the hash of the date, then your OutputFormat should start writing a new file each time the date changes. That will give you a unique file per date. Doug