My end goal is to have all the records sorted chronologically, regardless of 
the source file. To present it formally:

Let there be X servers.
Let each server produce one chronological log file that records who operated on 
the server and when.
Let there be Y users.
Assume a given user can operate on any number of servers simultaneously.
Assume a given user can perform any number of operations a second.

My goal would be to have Y output files, each containing the records for only 
that user, sorted chronologically.
So, working backwards from the output:

In order for records to be written chronologically to the file:
- All records for a given user must arrive at the same reducer (or the file IO 
will mess with the order)
- All records arriving at a given reducer must arrive in chronological order 
with respect to a given user
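
To make the output side concrete, here is a rough sketch of the kind of reducer 
I have in mind: it writes each record to that user's file the instant the 
record is read, so the arrival order is exactly the order on disk. The Text 
user-id key, the record format, and the MultipleOutputs wiring are assumptions 
for illustration, not my actual job:

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Sketch: key = user id, values = that user's records (already in time order).
// Each record is written the moment it is read, so nothing is held in memory
// and the arrival order is preserved in the per-user output file.
public class UserFileReducer extends Reducer<Text, Text, NullWritable, Text> {

    private MultipleOutputs<NullWritable, Text> out;

    @Override
    protected void setup(Context context) {
        out = new MultipleOutputs<NullWritable, Text>(context);
    }

    @Override
    protected void reduce(Text userId, Iterable<Text> records, Context context)
            throws IOException, InterruptedException {
        for (Text record : records) {
            // "user-<id>" becomes the base name of that user's output file.
            out.write(NullWritable.get(), record, "user-" + userId.toString());
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        out.close();
    }
}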

In order for records to arrive at a reducer in chronological order with respect 
to a given user:
- The sorter must be set to sort by time and operate over all records for a user
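
As a sketch of what I mean, assuming a hypothetical composite key class 
UserTimeKey that holds {user number, time} and exposes getUserId() and 
getTimestamp() accessors (the key itself is sketched further down), wired up 
with job.setSortComparatorClass:

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Sort comparator: orders keys by user first, then by time, so every reducer
// receives a given user's records in chronological order.
public class UserTimeSortComparator extends WritableComparator {

    protected UserTimeSortComparator() {
        super(UserTimeKey.class, true);   // true: deserialize keys before comparing
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        UserTimeKey left = (UserTimeKey) a;
        UserTimeKey right = (UserTimeKey) b;
        if (left.getUserId() != right.getUserId()) {
            return left.getUserId() < right.getUserId() ? -1 : 1;
        }
        if (left.getTimestamp() != right.getTimestamp()) {
            return left.getTimestamp() < right.getTimestamp() ? -1 : 1;
        }
        return 0;
    }
}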

In order for the sorter to operate over all records for a user:
- The grouper must be set to group by user, or not group at all (each record is 
a group)
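
The grouping side, sketched the same way (same hypothetical UserTimeKey, wired 
up with job.setGroupingComparatorClass):

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Grouping comparator: two keys fall into the same reduce group whenever the
// user ids match, so the time part never splits a user's records apart.
public class UserGroupingComparator extends WritableComparator {

    protected UserGroupingComparator() {
        super(UserTimeKey.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        long leftUser = ((UserTimeKey) a).getUserId();
        long rightUser = ((UserTimeKey) b).getUserId();
        if (leftUser == rightUser) {
            return 0;
        }
        return leftUser < rightUser ? -1 : 1;
    }
}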

In order for all records for a given user to arrive at the same reducer:
- The partitioner must be set to partition by user (i.e., user number mod 
number of partitions)
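
Which, as a rough sketch (same hypothetical UserTimeKey and a Text value, set 
with job.setPartitionerClass), is just:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Partitioner: every record for a given user lands in the same partition,
// i.e. user number mod number of partitions.
public class UserPartitioner extends Partitioner<UserTimeKey, Text> {

    @Override
    public int getPartition(UserTimeKey key, Text value, int numPartitions) {
        // Mask the sign bit so the result is never negative.
        return (int) ((key.getUserId() & Long.MAX_VALUE) % numPartitions);
    }
}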

From this vantage point I see two possible ways to do this.
1. Set the Key to be the user number, set the grouper to group by key. This 
results in all records for a user being aggregated into a single reduce group 
(very large)
2. Set the Key to be {user number, time}, set the grouper to group by key. 
This results in each record being emitted to the reducer one at a time, as its 
own group (lots of overhead)
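
For concreteness, the hypothetical UserTimeKey that the sketches above assume 
would look something like this (the name and the long fields are my 
assumptions, not existing code):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical composite key: {user number, time}. Its natural ordering is
// user first, then time; the grouping comparator and partitioner above only
// ever look at the user part.
public class UserTimeKey implements WritableComparable<UserTimeKey> {

    private long userId;
    private long timestamp;

    public UserTimeKey() {                     // required for Hadoop deserialization
    }

    public UserTimeKey(long userId, long timestamp) {
        this.userId = userId;
        this.timestamp = timestamp;
    }

    public long getUserId()    { return userId; }
    public long getTimestamp() { return timestamp; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(userId);
        out.writeLong(timestamp);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        userId = in.readLong();
        timestamp = in.readLong();
    }

    @Override
    public int compareTo(UserTimeKey other) {
        if (userId != other.userId) {
            return userId < other.userId ? -1 : 1;
        }
        if (timestamp != other.timestamp) {
            return timestamp < other.timestamp ? -1 : 1;
        }
        return 0;
    }
}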

Neither of those seems very favorable. Is anyone aware of a different means to 
achieve that goal?


From: Steve Lewis [mailto:lordjoe2...@gmail.com] 
Sent: Thursday, June 28, 2012 3:43 PM
To: mapreduce-user@hadoop.apache.org
Subject: Re: Map Reduce Theory Question, getting OutOfMemoryError while reducing

It is NEVER a good idea to hold items in memory - after all, this is big data 
and you want it to scale.
I do not see what stops you from reading one record, processing it, and writing 
it out without retaining it.
It is OK to keep statistics while iterating through a key and output them at 
the end, but holding all values for a key is almost never a good idea unless 
you can guarantee limits on how many there are.
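A minimal illustration of that pattern, with hypothetical Text types: every 
value is written the moment it is read, and the only per-key state is a counter.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Stream every value straight to the output; keep only O(1) state per key.
public class StreamingReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        long count = 0;
        for (Text value : values) {
            context.write(key, value);    // written immediately, never retained
            count++;
        }
        // "Statistics at the end" as a counter rather than buffered records.
        context.getCounter("records", "per-key-total").increment(count);
    }
}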
On Thu, Jun 28, 2012 at 2:37 PM, Berry, Matt <mwbe...@amazon.com> wrote:
I have a MapReduce job that reads in several gigs of log files and separates 
the records based on who generated them. My MapReduce job looks like this:
-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com
