In our clusters, number of containers we can get is high but memory per container is low : which is why avg_nodes_not_hosting data is rarely zero for ML tasks :-)
To update - to unblock our current implementation efforts, we went with broadcast - since it is intutively easier and minimal change; and compress the array as bytes in TaskResult. This is then stored in disk backed maps - to remove memory pressure on master and workers (else MapOutputTracker becomes a memory hog). But I agree, compressed bitmap to represent 'large' blocks (anything larger that maxBytesInFlight actually) and probably existing to track non zero should be fine (we should not really track zero output for reducer - just waste of space). Regards, Mridul On Fri, Jul 4, 2014 at 3:43 AM, Reynold Xin <r...@databricks.com> wrote: > Note that in my original proposal, I was suggesting we could track whether > block size = 0 using a compressed bitmap. That way we can still avoid > requests for zero-sized blocks. > > > > On Thu, Jul 3, 2014 at 3:12 PM, Reynold Xin <r...@databricks.com> wrote: > >> Yes, that number is likely == 0 in any real workload ... >> >> >> On Thu, Jul 3, 2014 at 8:01 AM, Mridul Muralidharan <mri...@gmail.com> >> wrote: >> >>> On Thu, Jul 3, 2014 at 11:32 AM, Reynold Xin <r...@databricks.com> wrote: >>> > On Wed, Jul 2, 2014 at 3:44 AM, Mridul Muralidharan <mri...@gmail.com> >>> > wrote: >>> > >>> >> >>> >> > >>> >> > The other thing we do need is the location of blocks. This is >>> actually >>> >> just >>> >> > O(n) because we just need to know where the map was run. >>> >> >>> >> For well partitioned data, wont this not involve a lot of unwanted >>> >> requests to nodes which are not hosting data for a reducer (and lack >>> >> of ability to throttle). >>> >> >>> > >>> > Was that a question? (I'm guessing it is). What do you mean exactly? >>> >>> >>> I was not sure if I understood the proposal correctly - hence the >>> query : if I understood it right - the number of wasted requests goes >>> up by num_reducers * avg_nodes_not_hosting data. >>> >>> Ofcourse, if avg_nodes_not_hosting data == 0, then we are fine ! >>> >>> Regards, >>> Mridul >>> >> >>