Use a custom partitioner and grouping comparator as described here
http://pkghosh.wordpress.com/2011/04/13/map-reduce-secondary-sort-does-it-all/

In effect, make the time part of the key for sorting but not for grouping or
partitioning.
You might also consider a framework like Pig.
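
Roughly, the pieces look like this. This is only a sketch; the UserTimeKey
class and its field names are made up, and it assumes each record is keyed by
a numeric user id plus a millisecond timestamp:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Composite key: user id plus timestamp. The timestamp takes part in the
    // sort but not in partitioning or grouping.
    public class UserTimeKey implements WritableComparable<UserTimeKey> {
        long userId;
        long timestamp;

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeLong(userId);
            out.writeLong(timestamp);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            userId = in.readLong();
            timestamp = in.readLong();
        }

        // Sort order: by user, then chronologically within a user.
        @Override
        public int compareTo(UserTimeKey other) {
            int cmp = Long.compare(userId, other.userId);
            return cmp != 0 ? cmp : Long.compare(timestamp, other.timestamp);
        }

        // Partition on the user id only, so every record for a given user
        // lands on the same reducer.
        public static class UserPartitioner extends Partitioner<UserTimeKey, Object> {
            @Override
            public int getPartition(UserTimeKey key, Object value, int numPartitions) {
                return (int) ((key.userId & Long.MAX_VALUE) % numPartitions);
            }
        }

        // Group on the user id only, so one reduce() call sees all of a
        // user's records, already sorted by time by compareTo above.
        public static class UserGroupingComparator extends WritableComparator {
            public UserGroupingComparator() {
                super(UserTimeKey.class, true);
            }

            @Override
            public int compare(WritableComparable a, WritableComparable b) {
                return Long.compare(((UserTimeKey) a).userId, ((UserTimeKey) b).userId);
            }
        }
    }

The driver then wires these in with job.setMapOutputKeyClass(UserTimeKey.class),
job.setPartitionerClass(UserTimeKey.UserPartitioner.class), and
job.setGroupingComparatorClass(UserTimeKey.UserGroupingComparator.class).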

On Thu, Jun 28, 2012 at 4:20 PM, Berry, Matt <mwbe...@amazon.com> wrote:

> My end goal is to have all the records sorted chronologically, regardless
> of the source file. To present it formally:
>
> Let there be X servers.
> Let each server produce one chronological log file that records who
> operated on the server and when.
> Let there be Y users.
> Assume a given user can operate on any number of servers simultaneously.
> Assume a given user can perform any number of operations a second.
>
> My goal would be to have Y output files, each containing the records for
> only that user, sorted chronologically.
> So, working backwards from the output:
>
> In order for records to be written chronologically to the file:
> - All records for a given user must arrive at the same reducer (or the
> file IO will mess with the order)
> - All records arriving at a given reducer must be chronological with
> respect to a given user
>
> In order for records to arrive at a reducer in chronological order with
> respect to a given user:
> - The sorter must be set to sort by time and operate over all records for
> a user
>
> In order for the sorter to operate over all records for a user:
> - The grouper must be set to group by user, or not group at all (each
> record is a group)
>
> In order for all records for a given user to arrive at the same reducer:
> - The partitioner must be set to partition by user (i.e., user number mod
> number of partitions)
>
> From this vantage point I see two possible ways to do this.
> 1. Set the Key to be the user number, set the grouper to group by key.
> This results in all records for a user being aggregated (very large)
> 2. Set the Key to be {user number, time}, set the grouper to group by
> key. This results in each record being emitted to the reducer one at a time
> (lots of overhead)
>
> Neither of those seems very favorable. Is anyone aware of a different
> means to achieve that goal?
>
>
> From: Steve Lewis [mailto:lordjoe2...@gmail.com]
> Sent: Thursday, June 28, 2012 3:43 PM
> To: mapreduce-user@hadoop.apache.org
> Subject: Re: Map Reduce Theory Question, getting OutOfMemoryError while
> reducing
>
> It is NEVER a good idea to hold items in memory - after all, this is big
> data and you want it to scale.
> I do not see what stops you from reading one record, processing it, and
> writing it out without retaining it.
> It is OK to keep statistics while iterating through a key and output them
> at the end, but holding all values for a key is almost never a good idea
> unless you can guarantee limits on their number.
> On Thu, Jun 28, 2012 at 2:37 PM, Berry, Matt <mwbe...@amazon.com> wrote:
> I have a MapReduce job that reads in several gigs of log files and
> separates the records based on who generated them. My MapReduce job looks
> like this:
>
>
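
On the earlier point in the quoted thread about not holding values in memory:
with the grouping set up this way, the reducer can stream each record straight
to output as it arrives. A minimal sketch, reusing the hypothetical UserTimeKey
above and assuming Text values (writing one file per user would additionally
need something like MultipleOutputs):

    import java.io.IOException;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // One reduce() call per user (thanks to the grouping comparator); values
    // arrive in chronological order, so nothing needs to be buffered.
    public class UserLogReducer extends Reducer<UserTimeKey, Text, NullWritable, Text> {
        @Override
        protected void reduce(UserTimeKey key, Iterable<Text> records, Context context)
                throws IOException, InterruptedException {
            for (Text record : records) {
                // Write immediately and retain nothing.
                context.write(NullWritable.get(), record);
                // A running statistic is cheap; buffering all values is not.
                context.getCounter("user-logs", "records").increment(1);
            }
        }
    }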


-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com
