Hi 李钰,

The size of the map output depends on your Mapper class. The Mapper processes the input data and decides what to emit, so the output can be much larger (or smaller) than the input split.
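To make this concrete, here is a minimal sketch in plain Java (no Hadoop dependency) of a map function whose output outgrows its input. The class name MapOutputSizeDemo and the pair-emitting map function are hypothetical, made up for illustration; they are not taken from any real job. The byte count covers raw keys and values only, so it ignores the serialization and per-record overhead that real map output also carries.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Plain-Java sketch of a mapper whose output is larger than its input (hypothetical example). */
public class MapOutputSizeDemo {

    /**
     * Hypothetical map function: for one input line, emit a (word, otherWord)
     * pair for every ordered pair of distinct word positions in that line.
     * A line with n words therefore yields n * (n - 1) key/value pairs.
     */
    public static List<String[]> map(String line) {
        String[] words = line.trim().split("\\s+");
        List<String[]> pairs = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            for (int j = 0; j < words.length; j++) {
                if (i != j) {
                    pairs.add(new String[] {words[i], words[j]});
                }
            }
        }
        return pairs;
    }

    /** Sum of raw key and value bytes; real map output adds overhead on top of this. */
    public static long outputBytes(List<String[]> pairs) {
        long total = 0;
        for (String[] kv : pairs) {
            total += kv[0].getBytes(StandardCharsets.UTF_8).length;
            total += kv[1].getBytes(StandardCharsets.UTF_8).length;
        }
        return total;
    }

    public static void main(String[] args) {
        String line = "the quick brown fox jumps over the lazy dog";
        long inputBytes = line.getBytes(StandardCharsets.UTF_8).length;
        long mapOutputBytes = outputBytes(map(line));
        System.out.printf("input=%d bytes, map output=%d bytes, ratio=%.1f%n",
                inputBytes, mapOutputBytes, (double) mapOutputBytes / inputBytes);
    }
}
```

With a Mapper like this, a 64 MB split can easily produce more output than the 100 MB default io.sort.mb buffer holds, which is the case Sriguru's rules of thumb below are about. A Mapper that filters records would go the other way and emit far less than it reads.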

2010/6/23 李钰 <car...@gmail.com>:
> Hi Sriguru,
>
> Thanks a lot for your comments and suggestions!
> Here I still have some questions: since map mainly does data preparation,
> say splitting input data into KVPs, then sorting and partitioning before
> the spill, would the size of the map output KVPs be much larger than the
> input data size? If not, since one map task deals with one input split,
> and one input split is usually 64 MB, the map KVP size would be
> approximately 64 MB. Could you please give me an example of map output
> much larger than the input split? It has really confused me for some
> time, thanks.
>
> Others,
>
> I also badly need your help if you know about this, thanks.
>
> Best Regards,
> Carp
>
> On June 23, 2010 at 5:11 PM, Srigurunath Chakravarthi <srig...@yahoo-inc.com> wrote:
>
>> Hi Carp,
>> Your assumption is right that this is a per-map-task setting.
>> However, this buffer stores map output KVPs, not input. Therefore the
>> optimal value depends on how much data your map task is generating.
>>
>> If your output per map is greater than io.sort.mb, these rules of thumb
>> could work for you:
>>
>> 1) Increase max heap of map tasks to use RAM better, but not hit swap.
>> 2) Set io.sort.mb to ~70% of heap.
>>
>> Overall, causing extra "spills" (because of insufficient io.sort.mb) is
>> much better than risking swapping (by setting io.sort.mb and heap too
>> large), in terms of the relative performance penalty you will pay.
>>
>> Cheers,
>> Sriguru
>>
>> >-----Original Message-----
>> >From: 李钰 [mailto:car...@gmail.com]
>> >Sent: Wednesday, June 23, 2010 12:27 PM
>> >To: common-dev@hadoop.apache.org
>> >Subject: Questions about recommendation value of the "io.sort.mb"
>> >parameter
>> >
>> >Dear all,
>> >
>> >Here I've got a question about the "io.sort.mb" parameter. We can find
>> >material from Yahoo! or Cloudera which recommends setting this value to
>> >200 if the job scale is large, but I'm confused about this. As far as I
>> >know, the tasktracker will launch a child JVM for each task, and
>> >“*io.sort.mb*” represents the buffer size in memory inside *one map task
>> >child-JVM*. The default value of 100 MB should be large enough, because
>> >the input split of one map task is usually 64 MB, as large as the block
>> >size we usually set. Then why is the recommendation for “*io.sort.mb*”
>> >200 MB for large jobs (and it really works)? How could the job size
>> >affect the procedure?
>> >Is there any fault in my understanding? Any comment/suggestion will be
>> >highly valued, thanks in advance.
>> >
>> >Best Regards,
>> >Carp
>> >

--
Best Regards

Jeff Zhang