Harsh,

Thanks, I think this is what I was looking for. I have 3 related questions:
1.) Will this work in 0.20.2-cdh3u3?
2.) What is the hard limit that you mean?
3.) Can this be applied to streaming?

Thanks,
Subir

On 7/14/12, Harsh J <ha...@cloudera.com> wrote:
> If you wish to impose a limit on the max reducer input to be allowed
> in a job, you may set "mapreduce.reduce.input.limit" on your job, as
> the total bytes allowed per reducer.
>
> But this is more of a hard limit, which I suspect your question wasn't
> about. Your question is indeed better off on the Pig user lists.
>
> On Tue, Jul 10, 2012 at 8:59 PM, Subir S <subir.sasiku...@gmail.com> wrote:
>> Is there any property to convey the maximum amount of data each
>> reducer/partition may take for processing, like Pig's bytes_per_reducer,
>> so that the number of reducers can be controlled based on the size of
>> the intermediate map output data?
>>
>> On 7/10/12, Karthik Kambatla <ka...@cloudera.com> wrote:
>>> The partitioner is configurable. The default partitioner, from what I
>>> remember, computes the partition as the key's hashcode modulo the
>>> number of reducers/partitions. For random input it is balanced, but
>>> some cases can have a very skewed key distribution. Also, as you have
>>> pointed out, the number of values per key can also vary. Together,
>>> both of these determine the "weight" of each partition, as you call it.
>>>
>>> Karthik
>>>
>>> On Mon, Jul 9, 2012 at 8:15 PM, Grandl Robert <rgra...@yahoo.com> wrote:
>>>
>>>> Thanks Arun.
>>>>
>>>> So, just for my clarification: the map will create partitions
>>>> according to the number of reducers, such that each reducer gets
>>>> almost the same number of keys in its partition. However, each key
>>>> can have a different number of values, so the "weight" of each
>>>> partition will depend on that. Also, when a new <key, value> pair is
>>>> added, a hash of the key is computed to find the corresponding
>>>> partition?
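[The hashcode-modulo-reducers behavior Karthik describes, and the per-key hashing Robert asks about, can be sketched in plain Java. This is a standalone mimic of the formula used by Hadoop's default HashPartitioner, not the actual Hadoop class; class and variable names here are illustrative.]

```java
// Standalone sketch of Hadoop's default hash-partitioning formula:
//   partition = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
// The mask clears the sign bit so that negative hash codes still map
// to a valid partition index in [0, numReduceTasks).
public class HashPartitionSketch {

    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReducers = 4;
        String[] keys = {"apple", "banana", "cherry", "apple"};
        for (String k : keys) {
            // Identical keys always land in the same partition, so all
            // values for a key reach a single reducer.
            System.out.println(k + " -> partition " + getPartition(k, numReducers));
        }
    }
}
```

[Note how skew follows directly from this: the formula balances *keys* across partitions, but says nothing about how many *values* each key carries.]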
>>>>
>>>> Robert
>>>>
>>>> ------------------------------
>>>> *From:* Arun C Murthy <a...@hortonworks.com>
>>>> *To:* mapreduce-user@hadoop.apache.org
>>>> *Sent:* Monday, July 9, 2012 4:33 PM
>>>> *Subject:* Re: Basic question on how reducer works
>>>>
>>>> On Jul 9, 2012, at 12:55 PM, Grandl Robert wrote:
>>>>
>>>> Thanks a lot, guys, for the answers.
>>>>
>>>> Still, I am not able to find exactly the code for the following things:
>>>>
>>>> 1. The reducer reading, from a map's output, only its own partition.
>>>> I looked into ReduceTask#getMapOutput, which does the actual read in
>>>> ReduceTask#shuffleInMemory, but I don't see where it specifies which
>>>> partition (reduce ID) to read.
>>>>
>>>> Look at TaskTracker.MapOutputServlet.
>>>>
>>>> 2. I still don't understand very well in which part of the code
>>>> (MapTask.java) the intermediate data is written to which partition.
>>>> So MapOutputBuffer is the one that actually writes the data to the
>>>> buffer and spills after the buffer is full. Could you please
>>>> elaborate a bit on how the data is written to which partition?
>>>>
>>>> Essentially, you can think of the partition-id as the 'primary key'
>>>> and the actual 'key' in the map output <key, value> as the
>>>> 'secondary key'.
>>>>
>>>> hth,
>>>> Arun
>>>>
>>>> Thanks,
>>>> Robert
>>>>
>>>> ------------------------------
>>>> *From:* Arun C Murthy <a...@hortonworks.com>
>>>> *To:* mapreduce-user@hadoop.apache.org
>>>> *Sent:* Monday, July 9, 2012 9:24 AM
>>>> *Subject:* Re: Basic question on how reducer works
>>>>
>>>> Robert,
>>>>
>>>> On Jul 7, 2012, at 6:37 PM, Grandl Robert wrote:
>>>>
>>>> Hi,
>>>>
>>>> I have some questions related to basic functionality in Hadoop.
>>>>
>>>> 1. When a Mapper produces its intermediate output data, how does it
>>>> know how many partitions to create (how many reducers there will be)
>>>> and how much data goes into each partition for each reducer?
>>>>
>>>> 2.
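[Arun's 'primary key'/'secondary key' analogy above can be sketched as a sort order: spilled map output is ordered by partition-id first, then by the record's key, so each reducer's partition becomes one contiguous run in the spill file. `Rec` and `sortLikeSpill` below are hypothetical stand-ins, not Hadoop's actual serialized buffer layout.]

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of the (partition-id, key) ordering used when sorting map
// output: partition-id is the "primary key", the record key is the
// "secondary key". After sorting, records for reducer 0 come first,
// then reducer 1, and so on, each internally sorted by key.
public class SpillOrderSketch {

    static class Rec {
        final int partition;
        final String key;
        final String value;
        Rec(int p, String k, String v) { partition = p; key = k; value = v; }
    }

    static void sortLikeSpill(List<Rec> records) {
        records.sort(Comparator.<Rec>comparingInt(r -> r.partition)
                               .thenComparing(r -> r.key));
    }

    public static void main(String[] args) {
        List<Rec> buffer = new ArrayList<>();
        buffer.add(new Rec(1, "b", "v1"));
        buffer.add(new Rec(0, "c", "v2"));
        buffer.add(new Rec(1, "a", "v3"));
        buffer.add(new Rec(0, "a", "v4"));
        sortLikeSpill(buffer);
        for (Rec r : buffer) {
            System.out.println("partition " + r.partition + "  key " + r.key);
        }
    }
}
```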
>>>> When the JobTracker assigns a task to a reducer, it will also
>>>> specify the locations of the intermediate output data from which
>>>> the reducer should retrieve it, right? But how will a reducer know,
>>>> at each remote location holding intermediate output, which portion
>>>> it has to retrieve?
>>>>
>>>> To add to Harsh's comment: essentially the TT *knows* where the
>>>> output of a given map-id/reduce-id pair is present, via an
>>>> output-file/index-file combination.
>>>>
>>>> Arun
>>>>
>>>> --
>>>> Arun C. Murthy
>>>> Hortonworks Inc.
>>>> http://hortonworks.com/
>>>>
>
> --
> Harsh J
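[The output-file/index-file combination Arun mentions can be sketched as follows: the map side writes all partitions back to back into one "output file" and records an (offset, length) pair per partition in an "index". Serving reducer R then only requires the index entry for R. The byte-array "files" and method names here are illustrative simplifications, not the real on-disk format.]

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch of the output-file/index-file idea: one concatenated output
// "file" plus an index of (offset, length) per partition, so a single
// reducer's slice can be served without scanning the whole file.
public class IndexFileSketch {

    static byte[] file;     // all partitions, back to back
    static int[][] index;   // index[p] = {offset, length} of partition p

    static void writeMapOutput(String[] partitions) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        index = new int[partitions.length][2];
        for (int p = 0; p < partitions.length; p++) {
            byte[] data = partitions[p].getBytes(StandardCharsets.UTF_8);
            index[p][0] = out.size();   // offset where partition p starts
            index[p][1] = data.length;  // length of partition p
            out.write(data, 0, data.length);
        }
        file = out.toByteArray();
    }

    // Serve exactly one reducer's slice, analogous to what the
    // TaskTracker's map-output servlet does per fetch request.
    static byte[] fetchPartition(int reduceId) {
        int off = index[reduceId][0];
        int len = index[reduceId][1];
        return Arrays.copyOfRange(file, off, off + len);
    }

    public static void main(String[] args) {
        writeMapOutput(new String[] {"p0-data", "p1-bytes", "p2"});
        System.out.println(
            new String(fetchPartition(1), StandardCharsets.UTF_8));
    }
}
```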