Hi, I have a couple of *basic* questions about Hadoop internals.
1) If I understood correctly, the ideal number of Reducers is equal to the number of distinct keys (or custom Partitioner values) emitted from all Mappers in a given Map-Reduce iteration. Is that correct?

2) The maximum number of Reducers can be set in the configuration. How does Hadoop handle the situation when the Mappers emit more intermediate keys than this number? AFAIK the intermediate results are stored in SequenceFiles. Does it mean that this intermediate persistent storage is somehow scanned for all records with the same key (or custom Partitioner value), that such a chunk of data is sent to one Reducer, and that if no Reducer is free the process waits until one of them is done and can be assigned a new chunk of data? (See the Partitioner sketch after my signature for what I mean by a custom Partitioner.)

3) Is there any recommendation on how to set up a job if the number of intermediate keys is not known beforehand?

4) Is there any physical limit on the number of Reducers imposed by the internal Hadoop architecture?

... and finally ...

5) Does anybody know how and for what exactly the folks at Yahoo! use Hadoop? If the biggest reported Hadoop cluster has something like 2000 machines, then the total number of Mappers/Reducers can be something like 2000*200 = 400,000 (assuming, for example, 200 Reducers running on each machine), which is a big number but still probably not big enough to handle processing of really large graph data structures IMHO. As far as I understood, Google is not directly using the Map-Reduce form of PageRank calculation for processing the whole internet graph (see http://www.youtube.com/watch?v=BT-piFBP4fE). So, if Yahoo! needs a scaling algorithm for really large tasks, what do they use if not Hadoop?

Regards,
Lukas

--
http://blog.lukas-vlcek.com/
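
P.S. To make questions 1 and 2 concrete, here is roughly what I have in mind by a "custom Partitioner": a minimal sketch against the org.apache.hadoop.mapred API, with class names of my own choosing (not code from any real job). Every intermediate key is hashed into a fixed number of reduce partitions, which (as far as I understand) mirrors the default HashPartitioner behaviour, so one Reducer can end up with many distinct keys when there are fewer Reducers than keys.

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.Partitioner;

  // Hypothetical partitioner: maps each intermediate key to one of
  // numPartitions reduce partitions, so several distinct keys can land
  // on the same Reducer when there are fewer Reducers than keys.
  public class KeyHashPartitioner implements Partitioner<Text, IntWritable> {

      public void configure(JobConf job) {
          // no per-job configuration needed for this sketch
      }

      public int getPartition(Text key, IntWritable value, int numPartitions) {
          // hash the key into the fixed number of reduce partitions,
          // keeping the result non-negative
          return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
      }
  }

A job would then pin both the partitioner and the Reducer count, something like:

  JobConf conf = new JobConf(MyJob.class);          // MyJob is just a placeholder driver class
  conf.setPartitionerClass(KeyHashPartitioner.class);
  conf.setNumReduceTasks(20);                       // fixed, no matter how many distinct keys appear

This is exactly the setup behind question 2: the Reducer count is fixed up front, while the number of distinct keys is only known at runtime.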
