Use a custom partitioner and grouping comparator as described here: http://pkghosh.wordpress.com/2011/04/13/map-reduce-secondary-sort-does-it-all/
In effect, make the time part of the key for sorting, but not for grouping or partitioning. You might also look at frameworks like Pig.
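Concretely, something along these lines should do it with the new mapreduce API (a rough sketch only; the class and field names are illustrative, it assumes a numeric user id and a long timestamp, and the classes would normally each live in their own file):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

// Composite key: user id plus timestamp. compareTo() defines the full sort
// order (user first, then time), so each user's records reach the reducer
// chronologically.
class UserTimeKey implements WritableComparable<UserTimeKey> {
  long userId;
  long timestamp;

  public UserTimeKey() {}
  public UserTimeKey(long userId, long timestamp) {
    this.userId = userId;
    this.timestamp = timestamp;
  }

  public void write(DataOutput out) throws IOException {
    out.writeLong(userId);
    out.writeLong(timestamp);
  }
  public void readFields(DataInput in) throws IOException {
    userId = in.readLong();
    timestamp = in.readLong();
  }
  public int compareTo(UserTimeKey o) {
    if (userId != o.userId) return userId < o.userId ? -1 : 1;
    if (timestamp != o.timestamp) return timestamp < o.timestamp ? -1 : 1;
    return 0;
  }
}

// Partition on the user only, so every record for a user lands in the same
// reduce partition regardless of its timestamp.
class UserPartitioner extends Partitioner<UserTimeKey, Text> {
  public int getPartition(UserTimeKey key, Text value, int numPartitions) {
    return (int) ((key.userId & Long.MAX_VALUE) % numPartitions);
  }
}

// Group on the user only, so one reduce() call sees all of that user's
// records as a single, already time-sorted iterable.
class UserGroupingComparator extends WritableComparator {
  protected UserGroupingComparator() {
    super(UserTimeKey.class, true);
  }
  public int compare(WritableComparable a, WritableComparable b) {
    long ua = ((UserTimeKey) a).userId;
    long ub = ((UserTimeKey) b).userId;
    if (ua != ub) return ua < ub ? -1 : 1;
    return 0;
  }
}

The key's compareTo orders on (user, time), while the partitioner and the grouping comparator look only at the user, so each reducer sees one time-sorted stream per user without anything having to buffer a whole user's records in memory.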
On Thu, Jun 28, 2012 at 4:20 PM, Berry, Matt <mwbe...@amazon.com> wrote:

> My end goal is to have all the records sorted chronologically, regardless
> of the source file. To present it formally:
>
> Let there be X servers.
> Let each server produce one chronological log file that records who
> operated on the server and when.
> Let there be Y users.
> Assume a given user can operate on any number of servers simultaneously.
> Assume a given user can perform any number of operations a second.
>
> My goal would be to have Y output files, each containing the records for
> only that user, sorted chronologically.
> So, working backwards from the output:
>
> In order for records to be written chronologically to the file:
> - All records for a given user must arrive at the same reducer (or the
>   file IO will mess with the order)
> - All records arriving at a given reducer must be chronological with
>   respect to a given user
>
> In order for records to arrive at a reducer chronologically with respect
> to a given user:
> - The sorter must be set to sort by time and operate over all records for
>   a user
>
> In order for the sorter to operate over all records for a user:
> - The grouper must be set to group by user, or not group at all (each
>   record is a group)
>
> In order for all records for a given user to arrive at the same reducer:
> - The partitioner must be set to partition by user (i.e., user number mod
>   number of partitions)
>
> From this vantage point I see two possible ways to do this.
> 1. Set the Key to be the user number, set the grouper to group by key.
>    This results in all records for a user being aggregated (very large).
> 2. Set the Key to be {user number, time}, set the grouper to group by key.
>    This results in each record being emitted to the reducer one at a time
>    (lots of overhead).
>
> Neither of those seems very favorable. Is anyone aware of a different
> means to achieve that goal?
>
>
> From: Steve Lewis [mailto:lordjoe2...@gmail.com]
> Sent: Thursday, June 28, 2012 3:43 PM
> To: mapreduce-user@hadoop.apache.org
> Subject: Re: Map Reduce Theory Question, getting OutOfMemoryError while
> reducing
>
> It is NEVER a good idea to hold items in memory - after all, this is big
> data and you want it to scale.
> I do not see what stops you from reading one record, processing it, and
> writing it out without retaining it.
> It is OK to keep statistics while iterating through a key and output them
> at the end, but holding all values for a key is almost never a good idea
> unless you can guarantee limits to them.
>
> On Thu, Jun 28, 2012 at 2:37 PM, Berry, Matt <mwbe...@amazon.com> wrote:
> I have a MapReduce job that reads in several gigs of log files and
> separates the records based on who generated them. My MapReduce job looks
> like this:
>
> --
> Steven M. Lewis PhD
> 4221 105th Ave NE
> Kirkland, WA 98033
> 206-384-1340 (cell)
> Skype lordjoe_com

--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com
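For completeness, here is roughly how those pieces might be wired into a job, together with a reducer that streams each record straight to the output instead of holding values in memory. The class names are again only illustrative, and the mapper (which would parse a log line into a (UserTimeKey, Text) pair) is omitted:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// The reducer never buffers: the values for each user arrive already sorted
// by time (because of the key's compareTo), so it just writes them through.
class UserLogReducer extends Reducer<UserTimeKey, Text, NullWritable, Text> {
  protected void reduce(UserTimeKey key, Iterable<Text> records, Context context)
      throws IOException, InterruptedException {
    for (Text record : records) {
      context.write(NullWritable.get(), record);  // one record in, one record out
    }
  }
}

public class UserLogSortDriver {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "per-user chronological sort");
    job.setJarByClass(UserLogSortDriver.class);

    // plus job.setMapperClass(...) for a mapper that parses each log line
    // into a (UserTimeKey, Text) pair -- omitted from this sketch
    job.setReducerClass(UserLogReducer.class);

    job.setMapOutputKeyClass(UserTimeKey.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);

    // The secondary-sort wiring: partition and group on the user only; the
    // key's (user, time) compareTo handles the sort order.
    job.setPartitionerClass(UserPartitioner.class);
    job.setGroupingComparatorClass(UserGroupingComparator.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

If you need exactly one output file per user rather than one per reducer, MultipleOutputs (depending on your Hadoop version) can split the reducer's output by user.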