Don't assume that any variables are shared between reducers or between maps,
or between maps and reducers.
If you want to share data, put it into HDFS.
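For example, a small lookup file written into HDFS by one job can be read back
independently by every task of a later job; nothing in memory is shared between
tasks. A rough sketch (the path and the word<TAB>count layout are made up for
illustration, not something from this thread):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SharedDataExample {
    // Each map or reduce task can open the same HDFS file on its own;
    // no in-memory variable is shared between tasks.
    public static Map<String, Integer> readShared(Configuration conf) throws Exception {
        Map<String, Integer> table = new HashMap<String, Integer>();
        FileSystem fs = FileSystem.get(conf);
        Path shared = new Path("/user/aayush/shared/counts.txt");  // illustrative path
        BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(shared)));
        String line;
        while ((line = in.readLine()) != null) {
            String[] parts = line.split("\t");    // assumed word<TAB>count layout
            table.put(parts[0], Integer.valueOf(parts[1]));
        }
        in.close();
        return table;
    }
}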
On 4/17/08 4:01 AM, "Aayush Garg" <[EMAIL PROTECTED]> wrote:
> One more thing:
> Will the HashMap that I am generating in the reduce phase live on a single
> node or on multiple nodes in the distributed environment? If my dataset is
> large, will this approach work? If not, what can I do instead?
> The same question applies to the file that I am writing in the run function
> (opened with a plain FileStream).
>
>
>
> On Thu, Apr 17, 2008 at 6:04 AM, Amar Kamat <[EMAIL PROTECTED]> wrote:
>
>> Ted Dunning wrote:
>>
>>> The easiest solution is to not worry too much about running an extra MR
>>> step.
>>>
>>> So,
>>>
>>> - run a first pass to get the counts. Use word count as the pattern.
>>>   Store the results in a file.
>>>
>>> - run the second pass. You can now read the hash-table from the file you
>>>   stored in pass 1.
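
A driver for those two passes might look roughly like this. It is only a
sketch: CountMapper, CountReducer, FilterMapper and FilterReducer stand for
your own word-count and filtering classes (not shown), the "counts.path"
property is invented here, and older Hadoop releases set the paths with
JobConf.setInputPath/setOutputPath instead of the FileInputFormat and
FileOutputFormat helpers.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class TwoPassDriver {
    public static void main(String[] args) throws Exception {
        // Pass 1: plain word count; one reducer so the result is a single file.
        JobConf count = new JobConf(TwoPassDriver.class);
        count.setJobName("count");
        count.setOutputKeyClass(Text.class);
        count.setOutputValueClass(IntWritable.class);
        count.setMapperClass(CountMapper.class);      // placeholder word-count mapper
        count.setReducerClass(CountReducer.class);    // placeholder word-count reducer
        count.setNumReduceTasks(1);
        FileInputFormat.setInputPaths(count, new Path(args[0]));
        FileOutputFormat.setOutputPath(count, new Path("counts"));
        JobClient.runJob(count);

        // Pass 2: the real job; each task loads counts/part-00000 from HDFS
        // (for example in its configure() method) and uses it as a lookup table.
        JobConf filter = new JobConf(TwoPassDriver.class);
        filter.setJobName("filter");
        filter.set("counts.path", "counts/part-00000");
        filter.setMapperClass(FilterMapper.class);    // placeholder
        filter.setReducerClass(FilterReducer.class);  // placeholder
        FileInputFormat.setInputPaths(filter, new Path(args[0]));
        FileOutputFormat.setOutputPath(filter, new Path(args[1]));
        JobClient.runJob(filter);
    }
}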
>>>
>>> Another approach is to do the counting in your maps as specified and then,
>>> before exiting, you can emit special records for each key to suppress. With
>>> the correct sort and partition functions, you can make these killer records
>>> appear first in the reduce input. Then, if your reducer sees the kill flag
>>> in the front of the values, it can avoid processing any extra data.
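
One way to realize the killer-record idea is a composite map-output key that
carries a kill flag and is partitioned on the word alone. The sketch below is
only an illustration of that pattern; the class names are invented here, not
something Ted specified.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Map-output key: the word plus a kill flag. Kill records sort ahead of the
// normal records for the same word.
public class TaggedKey implements WritableComparable<TaggedKey> {
    private Text word = new Text();
    private boolean kill = false;

    public void set(String w, boolean k) { word.set(w); kill = k; }
    public Text getWord()   { return word; }
    public boolean isKill() { return kill; }

    public void write(DataOutput out) throws IOException {
        word.write(out);
        out.writeBoolean(kill);
    }

    public void readFields(DataInput in) throws IOException {
        word.readFields(in);
        kill = in.readBoolean();
    }

    // Primary order: the word. Secondary order: kill records first.
    public int compareTo(TaggedKey other) {
        int c = word.compareTo(other.word);
        if (c != 0) return c;
        if (kill == other.kill) return 0;
        return kill ? -1 : 1;
    }

    // Partition on the word only, so the kill record and the normal records
    // for a word always reach the same reducer.
    public static class WordPartitioner implements Partitioner<TaggedKey, IntWritable> {
        public void configure(JobConf job) {}
        public int getPartition(TaggedKey key, IntWritable value, int numPartitions) {
            return (key.getWord().hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }
}

With JobConf.setOutputValueGroupingComparator pointed at a comparator that
compares only the word, each word becomes a single reduce group whose first
record is the kill record (if one was emitted), so the reducer can return
early for that key.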
>>>
>>>
>>>
>> Ted,
>> Will this work for the case where the cutoff frequency/count requires a
>> global picture? I guess not.
>>
>>> In general, it is better not to try to communicate between map and reduce
>>> except via the expected mechanisms.
>>>
>>>
>>> On 4/16/08 1:33 PM, "Aayush Garg" <[EMAIL PROTECTED]> wrote:
>>>
>>>
>>>
>>>> We cannot read the HashMap in the configure method of the reducer, because
>>>> configure is called before the reduce job runs.
>>>> I need to eliminate rows from the HashMap only once all the keys have been
>>>> read. Also, my concern is: if the dataset is large, will this HashMap
>>>> approach still work?
>>>>
>>>>
>>>> On Wed, Apr 16, 2008 at 10:07 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>>>>
>>>>
>>>>
>>>>> That design is fine.
>>>>>
>>>>> You should read your map in the configure method of the reducer.
>>>>>
>>>>> There is a MapFile format supported by Hadoop, but it tends to be pretty
>>>>> slow. I usually find it better to just load my hash table by hand. If you
>>>>> do this, you should use whatever format you like.
>>>>>
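
For what it's worth, loading the table by hand in configure() could look
roughly like this. It is only a sketch: the "counts.path" property, the
word<TAB>count layout, and the cutoff value are assumptions, not anything from
this thread.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class FilterReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    private static final int CUTOFF = 1000;   // illustrative frequency threshold
    private final Map<String, Integer> counts = new HashMap<String, Integer>();

    // configure() runs once per reduce task, before any call to reduce(),
    // so it is the place to load the lookup table written by pass 1.
    public void configure(JobConf job) {
        try {
            Path path = new Path(job.get("counts.path"));   // e.g. counts/part-00000
            FileSystem fs = FileSystem.get(job);
            BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(path)));
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split("\t");          // word<TAB>count lines
                counts.put(parts[0], Integer.valueOf(parts[1]));
            }
            in.close();
        } catch (IOException e) {
            throw new RuntimeException("could not load counts from HDFS", e);
        }
    }

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        Integer total = counts.get(key.toString());
        if (total != null && total.intValue() > CUTOFF) {
            return;                                         // suppress frequent keys
        }
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}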
>>>>>
>>>>> On 4/16/08 12:41 PM, "Aayush Garg" <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> The current structure of my program is:
>>>>>>
>>>>>> class Upper {
>>>>>>     class Reduce {
>>>>>>         reduce(K1, V1, K2, V2) {
>>>>>>             // I count the frequency for each key and add the result to a
>>>>>>             // HashMap(key, value) instead of calling output.collect()
>>>>>>         }
>>>>>>     }
>>>>>>
>>>>>>     void run() {
>>>>>>         runJob();
>>>>>>         // Now eliminate the top-frequency keys from the HashMap built in
>>>>>>         // the reduce function; it can only be done here, because only
>>>>>>         // now is the HashMap complete.
>>>>>>         // Then write this HashMap to a file in a format that lets the
>>>>>>         // next MapReduce job use the HashMap's keys as the keys of its
>>>>>>         // map function. How should I do this, and which format should I
>>>>>>         // choose? Is this design and approach ok?
>>>>>>     }
>>>>>>
>>>>>>     public static void main() {}
>>>>>> }
>>>>>>
>>>>>> I hope this makes my question clear.
>>>>>>
>>>>>> Thanks,
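
On the format question above: a plain tab-separated text file in HDFS is
usually enough, because the next job can parse it back into a HashMap in its
configure() method. A rough sketch (the helper class, method, and path are
made up for illustration):

import java.io.PrintWriter;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DumpHashMap {
    // Write the HashMap as one "key<TAB>value" line per entry into HDFS,
    // so that a later MapReduce job can re-read it.
    public static void dump(Map<String, Integer> map, Configuration conf,
                            Path out) throws Exception {
        FileSystem fs = FileSystem.get(conf);
        PrintWriter writer = new PrintWriter(fs.create(out));
        for (Map.Entry<String, Integer> entry : map.entrySet()) {
            writer.println(entry.getKey() + "\t" + entry.getValue());
        }
        writer.close();
    }
}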
>>>>>>
>>>>>>
>>>>>> On Wed, Apr 16, 2008 at 8:33 AM, Amar Kamat <[EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>> Aayush Garg wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Are you sure that another MR job is required for eliminating some rows?
>>>>>>>> Can't I just somehow eliminate them from main() once I know which keys
>>>>>>>> need to be removed?
>>>>>>>>
>>>>>>>
>>>>>>> Can you provide some more details on how exactly you are filtering?
>>>>>>> Amar