thanks scott, some great things to think about! the only "tuning" i did was to set mapred.reduce.tasks and mapred.map.tasks to 30, to match the capacity reported by the html ui. i admit i did this without a deep understanding of what it meant; i do know that when i did not specify these, only a few mappers would be utilised for the same input data size.
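for reference, what i mean looked roughly like this (a sketch, not my exact command -- paths and script names are placeholders, and on older streaming releases the equivalent flag is -jobconf key=value). worth noting that mapred.map.tasks is only a hint to the framework: the actual mapper count is driven by the input splits, which would explain the few-mappers behaviour on small input:

```shell
# placeholder jar path and mapper/reducer scripts; -D options
# must appear before the streaming-specific options
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
    -D mapred.map.tasks=30 \
    -D mapred.reduce.tasks=30 \
    -input input/ -output output/ \
    -mapper map.rb -reducer reduce.rb
```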
in relation to scheduling i was taking the simple approach of running the streaming jobs sequentially with the default scheduler. even from watching the output scroll past it is obvious that a _lot_ of time is being taken up in setup-related activities. this is most apparent in the single-document case. something is just not right...

i had read in http://issues.apache.org/jira/browse/HADOOP-2721 "Use job control for tasks (and therefore for pipes and streaming)" that jobcontrol (specifically, representing job dependencies) was not yet available for streaming, so i dismissed any scheduling changes. i'll revisit this to make sure i understand what i can and can't do in streaming. if nothing else i can try the fair scheduler with my own rolled version of dependencies. i'm orchestrating the job runs from rake and i've got my own homebrew libraries for this type of dependency management, though i'm also loath to roll my own versions of things.

so lots of ideas and things to check. i'll rerun trying some of the things you've mentioned. thanks again for the feedback!

mat
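p.s. by "my own rolled version of dependencies" i mean little more than deriving a safe sequential order from the job dependency graph before kicking the jobs off one by one. the real thing lives in rake/ruby, but the idea can be sketched with plain tsort (job names here are made up):

```shell
# each input line is "prerequisite dependent"; tsort emits the jobs
# in an order where every job appears after the jobs it depends on
tsort <<'EOF'
tokenise count
count sort
EOF
```

this prints tokenise, count, sort -- i.e. a valid order in which to submit the streaming jobs sequentially.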
