My end goal is to have all the records sorted chronologically, regardless of the source file. To present it formally:
Let there be X servers. Let each server produce one chronological log file that records who operated on the server and when. Let there be Y users. Assume a given user can operate on any number of servers simultaneously. Assume a given user can perform any number of operations a second. My goal is to have Y output files, each containing the records for only that user, sorted chronologically.

So, working backwards from the output:

In order for records to be written chronologically to the file:
- All records for a given user must arrive at the same reducer (or the file IO will mess with the order)
- All records arriving at a given reducer must be chronological with respect to a given user

In order for records to arrive at a reducer in chronological order with respect to a given user:
- The sorter must be set to sort by time and operate over all records for a user

In order for the sorter to operate over all records for a user:
- The grouper must be set to group by user, or not group at all (each record is its own group)

In order for all records for a given user to arrive at the same reducer:
- The partitioner must be set to partition by user (i.e., user number mod number of partitions)

From this vantage point I see two possible ways to do this:
1. Set the key to be the user number and set the grouper to group by key. This results in all records for a user being aggregated into one group (very large).
2. Set the key to be {user number, time} and set the grouper to group by key. This results in each record being emitted to the reducer one at a time (lots of overhead).

Neither of those seems very favorable. Is anyone aware of a different means to achieve that goal? (A rough sketch of the composite key and partitioner for option 2 is included below, after the quoted thread.)

From: Steve Lewis [mailto:lordjoe2...@gmail.com]
Sent: Thursday, June 28, 2012 3:43 PM
To: mapreduce-user@hadoop.apache.org
Subject: Re: Map Reduce Theory Question, getting OutOfMemoryError while reducing

It is NEVER a good idea to hold items in memory - after all, this is big data and you want it to scale. I do not see what stops you from reading one record, processing it, and writing it out without retaining it. It is OK to keep statistics while iterating through a key and output them at the end, but holding all values for a key is almost never a good idea unless you can guarantee limits on them.

On Thu, Jun 28, 2012 at 2:37 PM, Berry, Matt <mwbe...@amazon.com> wrote:
I have a MapReduce job that reads in several gigs of log files and separates the records based on who generated them. My MapReduce job looks like this:

--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com
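
A minimal sketch of what option 2's plumbing could look like, assuming the new org.apache.hadoop.mapreduce API: a composite key of {user number, time} plus a partitioner that hashes only the user part, so all of a user's records reach the same reducer while the framework's sort delivers them in time order. The names here (UserTimeKey, userId, timestamp, and Text as the value type) are illustrative assumptions, not taken from the actual job.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Partitioner;

// Composite key: {user number, time}. Sorts by user first, then time,
// so the shuffle delivers each user's records in chronological order.
public class UserTimeKey implements WritableComparable<UserTimeKey> {
    private long userId;
    private long timestamp;

    public UserTimeKey() {}  // no-arg constructor required by Hadoop serialization

    public UserTimeKey(long userId, long timestamp) {
        this.userId = userId;
        this.timestamp = timestamp;
    }

    public long getUserId() { return userId; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(userId);
        out.writeLong(timestamp);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        userId = in.readLong();
        timestamp = in.readLong();
    }

    @Override
    public int compareTo(UserTimeKey other) {
        if (userId != other.userId) {
            return userId < other.userId ? -1 : 1;
        }
        if (timestamp != other.timestamp) {
            return timestamp < other.timestamp ? -1 : 1;
        }
        return 0;
    }
}

// Partition on the user only ("user number mod number of partitions"),
// ignoring the time component, so all records for a user go to one reducer.
class UserPartitioner extends Partitioner<UserTimeKey, Text> {
    @Override
    public int getPartition(UserTimeKey key, Text value, int numPartitions) {
        return (int) ((key.getUserId() & Long.MAX_VALUE) % numPartitions);
    }
}

The job would register the partitioner via job.setPartitionerClass(UserPartitioner.class). With a key shaped like this, "group by user" would typically be expressed through job.setGroupingComparatorClass with a comparator that compares only the userId portion of the key, so the reducer is invoked once per user rather than once per record while the values still stream in time order.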