In our clusters, number of containers we can get is high but memory
per container is low : which is why avg_nodes_not_hosting data is
rarely zero for ML tasks :-)

To update - to unblock our current implementation efforts, we went
with broadcast - since it is intutively easier and minimal change; and
compress the array as bytes in TaskResult.
This is then stored in disk backed maps - to remove memory pressure on
master and workers (else MapOutputTracker becomes a memory hog).

But I agree, compressed bitmap to represent 'large' blocks (anything
larger that maxBytesInFlight actually) and probably existing to track
non zero should be fine (we should not really track zero output for
reducer - just waste of space).


Regards,
Mridul

On Fri, Jul 4, 2014 at 3:43 AM, Reynold Xin <r...@databricks.com> wrote:
> Note that in my original proposal, I was suggesting we could track whether
> block size = 0 using a compressed bitmap. That way we can still avoid
> requests for zero-sized blocks.
>
>
>
> On Thu, Jul 3, 2014 at 3:12 PM, Reynold Xin <r...@databricks.com> wrote:
>
>> Yes, that number is likely == 0 in any real workload ...
>>
>>
>> On Thu, Jul 3, 2014 at 8:01 AM, Mridul Muralidharan <mri...@gmail.com>
>> wrote:
>>
>>> On Thu, Jul 3, 2014 at 11:32 AM, Reynold Xin <r...@databricks.com> wrote:
>>> > On Wed, Jul 2, 2014 at 3:44 AM, Mridul Muralidharan <mri...@gmail.com>
>>> > wrote:
>>> >
>>> >>
>>> >> >
>>> >> > The other thing we do need is the location of blocks. This is
>>> actually
>>> >> just
>>> >> > O(n) because we just need to know where the map was run.
>>> >>
>>> >> For well partitioned data, wont this not involve a lot of unwanted
>>> >> requests to nodes which are not hosting data for a reducer (and lack
>>> >> of ability to throttle).
>>> >>
>>> >
>>> > Was that a question? (I'm guessing it is). What do you mean exactly?
>>>
>>>
>>> I was not sure if I understood the proposal correctly - hence the
>>> query : if I understood it right - the number of wasted requests goes
>>> up by num_reducers * avg_nodes_not_hosting data.
>>>
>>> Ofcourse, if avg_nodes_not_hosting data == 0, then we are fine !
>>>
>>> Regards,
>>> Mridul
>>>
>>
>>

Reply via email to