Harsh,

Thanks, I think this is what I was looking for. I have 3 related questions:
1.) Will this work in 0.20.2-cdh3u3?
2.) What is the hard limit that you mean?
3.) Can this be applied to streaming?

Thanks,
Subir

On 7/14/12, Harsh J <ha...@cloudera.com> wrote:
> If you wish to impose a limit on the max reducer input to be allowed
> in a job, you may set "mapreduce.reduce.input.limit" on your job, as
> the total bytes allowed per reducer.
>
> But this is more of a hard limit, which I suspect your question wasn't
> about. Your question is indeed better off on the Pig user lists.
>
> On Tue, Jul 10, 2012 at 8:59 PM, Subir S <subir.sasiku...@gmail.com> wrote:
>> Is there any property to convey the maximum amount of data each
>> reducer/partition may take for processing, like Pig's bytes_per_reducer,
>> so that the number of reducers can be controlled based on the size of
>> the intermediate map output data?
>>
>> On 7/10/12, Karthik Kambatla <ka...@cloudera.com> wrote:
>>> The partitioner is configurable. The default partitioner, from what I
>>> remember, computes the partition as the key's hashcode modulo the
>>> number of reducers/partitions. For random input it is balanced, but
>>> some cases can have a very skewed key distribution. Also, as you have
>>> pointed out, the number of values per key can also vary. Together,
>>> both of these determine the "weight" of each partition, as you call it.
>>>
>>> Karthik
>>>
>>> On Mon, Jul 9, 2012 at 8:15 PM, Grandl Robert <rgra...@yahoo.com> wrote:
>>>
>>>> Thanks Arun.
>>>>
>>>> So, just for my clarification: the map will create partitions
>>>> according to the number of reducers, such that each reducer gets
>>>> almost the same number of keys in its partition. However, each key
>>>> can have a different number of values, so the "weight" of each
>>>> partition will depend on that. Also, when a new <key, value> pair is
>>>> added, a hash of the key is computed to find the corresponding
>>>> partition?
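[The hashcode-modulo-reducers behavior Karthik describes, and the per-key hashing Robert asks about, can be sketched in plain Java. This is a standalone mimic of the formula used by Hadoop's default HashPartitioner, not the actual Hadoop class; class and variable names here are illustrative.]

```java
// Standalone sketch of Hadoop's default hash-partitioning formula:
//   partition = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
// The mask clears the sign bit so that negative hash codes still map
// to a valid partition index in [0, numReduceTasks).
public class HashPartitionSketch {

    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReducers = 4;
        String[] keys = {"apple", "banana", "cherry", "apple"};
        for (String k : keys) {
            // Identical keys always land in the same partition, so all
            // values for a key reach a single reducer.
            System.out.println(k + " -> partition " + getPartition(k, numReducers));
        }
    }
}
```

[Note how skew follows directly from this: the formula balances *keys* across partitions, but says nothing about how many *values* each key carries.]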
>>>>
>>>> Robert
>>>>
>>>> ------------------------------
>>>> *From:* Arun C Murthy <a...@hortonworks.com>
>>>> *To:* mapreduce-user@hadoop.apache.org
>>>> *Sent:* Monday, July 9, 2012 4:33 PM
>>>> *Subject:* Re: Basic question on how reducer works
>>>>
>>>> On Jul 9, 2012, at 12:55 PM, Grandl Robert wrote:
>>>>
>>>> Thanks a lot, guys, for the answers.
>>>>
>>>> Still, I am not able to find exactly the code for the following things:
>>>>
>>>> 1. The reducer reading, from a map's output, only its own partition.
>>>> I looked into ReduceTask#getMapOutput, which does the actual read in
>>>> ReduceTask#shuffleInMemory, but I don't see where it specifies which
>>>> partition (reduce ID) to read.
>>>>
>>>> Look at TaskTracker.MapOutputServlet.
>>>>
>>>> 2. I still don't understand very well in which part of the code
>>>> (MapTask.java) the intermediate data is written to which partition.
>>>> So MapOutputBuffer is the one that actually writes the data to the
>>>> buffer and spills after the buffer is full. Could you please
>>>> elaborate a bit on how the data is written to which partition?
>>>>
>>>> Essentially, you can think of the partition-id as the 'primary key'
>>>> and the actual 'key' in the map output <key, value> as the
>>>> 'secondary key'.
>>>>
>>>> hth,
>>>> Arun
>>>>
>>>> Thanks,
>>>> Robert
>>>>
>>>> ------------------------------
>>>> *From:* Arun C Murthy <a...@hortonworks.com>
>>>> *To:* mapreduce-user@hadoop.apache.org
>>>> *Sent:* Monday, July 9, 2012 9:24 AM
>>>> *Subject:* Re: Basic question on how reducer works
>>>>
>>>> Robert,
>>>>
>>>> On Jul 7, 2012, at 6:37 PM, Grandl Robert wrote:
>>>>
>>>> Hi,
>>>>
>>>> I have some questions related to basic functionality in Hadoop.
>>>>
>>>> 1. When a Mapper produces its intermediate output data, how does it
>>>> know how many partitions to create (how many reducers there will be)
>>>> and how much data goes into each partition for each reducer?
>>>>
>>>> 2.
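[Arun's 'primary key'/'secondary key' analogy above can be sketched as a sort order: spilled map output is ordered by partition-id first, then by the record's key, so each reducer's partition becomes one contiguous run in the spill file. `Rec` and `sortLikeSpill` below are hypothetical stand-ins, not Hadoop's actual serialized buffer layout.]

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of the (partition-id, key) ordering used when sorting map
// output: partition-id is the "primary key", the record key is the
// "secondary key". After sorting, records for reducer 0 come first,
// then reducer 1, and so on, each internally sorted by key.
public class SpillOrderSketch {

    static class Rec {
        final int partition;
        final String key;
        final String value;
        Rec(int p, String k, String v) { partition = p; key = k; value = v; }
    }

    static void sortLikeSpill(List<Rec> records) {
        records.sort(Comparator.<Rec>comparingInt(r -> r.partition)
                               .thenComparing(r -> r.key));
    }

    public static void main(String[] args) {
        List<Rec> buffer = new ArrayList<>();
        buffer.add(new Rec(1, "b", "v1"));
        buffer.add(new Rec(0, "c", "v2"));
        buffer.add(new Rec(1, "a", "v3"));
        buffer.add(new Rec(0, "a", "v4"));
        sortLikeSpill(buffer);
        for (Rec r : buffer) {
            System.out.println("partition " + r.partition + "  key " + r.key);
        }
    }
}
```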
>>>> When the JobTracker assigns a task to a reducer, it will also
>>>> specify the locations of the intermediate output data from which
>>>> the reducer should retrieve it, right? But how will a reducer know,
>>>> at each remote location holding intermediate output, which portion
>>>> it has to retrieve?
>>>>
>>>> To add to Harsh's comment: essentially the TT *knows* where the
>>>> output of a given map-id/reduce-id pair is present, via an
>>>> output-file/index-file combination.
>>>>
>>>> Arun
>>>>
>>>> --
>>>> Arun C. Murthy
>>>> Hortonworks Inc.
>>>> http://hortonworks.com/
>>>>
>
> --
> Harsh J
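[The output-file/index-file combination Arun mentions can be sketched as follows: the map side writes all partitions back to back into one "output file" and records an (offset, length) pair per partition in an "index". Serving reducer R then only requires the index entry for R. The byte-array "files" and method names here are illustrative simplifications, not the real on-disk format.]

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch of the output-file/index-file idea: one concatenated output
// "file" plus an index of (offset, length) per partition, so a single
// reducer's slice can be served without scanning the whole file.
public class IndexFileSketch {

    static byte[] file;     // all partitions, back to back
    static int[][] index;   // index[p] = {offset, length} of partition p

    static void writeMapOutput(String[] partitions) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        index = new int[partitions.length][2];
        for (int p = 0; p < partitions.length; p++) {
            byte[] data = partitions[p].getBytes(StandardCharsets.UTF_8);
            index[p][0] = out.size();   // offset where partition p starts
            index[p][1] = data.length;  // length of partition p
            out.write(data, 0, data.length);
        }
        file = out.toByteArray();
    }

    // Serve exactly one reducer's slice, analogous to what the
    // TaskTracker's map-output servlet does per fetch request.
    static byte[] fetchPartition(int reduceId) {
        int off = index[reduceId][0];
        int len = index[reduceId][1];
        return Arrays.copyOfRange(file, off, off + len);
    }

    public static void main(String[] args) {
        writeMapOutput(new String[] {"p0-data", "p1-bytes", "p2"});
        System.out.println(
            new String(fetchPartition(1), StandardCharsets.UTF_8));
    }
}
```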