Can you elaborate? Not 100% sure if I understand what you mean.

On Thu, Nov 20, 2014 at 7:14 PM, Shixiong Zhu <zsxw...@gmail.com> wrote:

> Is it possible that Spark buffers the messages
> of mapOutputStatuses(Array[Byte]) according to the size
> of mapOutputStatuses which have already sent but not yet ACKed? The buffer
> will be cheap since the mapOutputStatuses messages are same and the memory
> cost is only a few pointers.
>
> Best Regards,
> Shixiong Zhu
>
> 2014-09-20 16:24 GMT+08:00 Reynold Xin <r...@databricks.com>:
>
>> BTW - a partial solution here: https://github.com/apache/spark/pull/2470
>>
>> This doesn't address the 0 size block problem yet, but makes my large job
>> on hundreds of terabytes of data much more reliable.
>>
>>
>> On Fri, Jul 4, 2014 at 2:28 AM, Mridul Muralidharan <mri...@gmail.com>
>> wrote:
>>
>> > In our clusters, number of containers we can get is high but memory
>> > per container is low : which is why avg_nodes_not_hosting data is
>> > rarely zero for ML tasks :-)
>> >
>> > To update - to unblock our current implementation efforts, we went
>> > with broadcast - since it is intutively easier and minimal change; and
>> > compress the array as bytes in TaskResult.
>> > This is then stored in disk backed maps - to remove memory pressure on
>> > master and workers (else MapOutputTracker becomes a memory hog).
>> >
>> > But I agree, compressed bitmap to represent 'large' blocks (anything
>> > larger that maxBytesInFlight actually) and probably existing to track
>> > non zero should be fine (we should not really track zero output for
>> > reducer - just waste of space).
>> >
>> >
>> > Regards,
>> > Mridul
>> >
>> > On Fri, Jul 4, 2014 at 3:43 AM, Reynold Xin <r...@databricks.com>
>> wrote:
>> > > Note that in my original proposal, I was suggesting we could track
>> > whether
>> > > block size = 0 using a compressed bitmap. That way we can still avoid
>> > > requests for zero-sized blocks.
>> > >
>> > >
>> > >
>> > > On Thu, Jul 3, 2014 at 3:12 PM, Reynold Xin <r...@databricks.com>
>> wrote:
>> > >
>> > >> Yes, that number is likely == 0 in any real workload ...
>> > >>
>> > >>
>> > >> On Thu, Jul 3, 2014 at 8:01 AM, Mridul Muralidharan <
>> mri...@gmail.com>
>> > >> wrote:
>> > >>
>> > >>> On Thu, Jul 3, 2014 at 11:32 AM, Reynold Xin <r...@databricks.com>
>> > wrote:
>> > >>> > On Wed, Jul 2, 2014 at 3:44 AM, Mridul Muralidharan <
>> > mri...@gmail.com>
>> > >>> > wrote:
>> > >>> >
>> > >>> >>
>> > >>> >> >
>> > >>> >> > The other thing we do need is the location of blocks. This is
>> > >>> actually
>> > >>> >> just
>> > >>> >> > O(n) because we just need to know where the map was run.
>> > >>> >>
>> > >>> >> For well partitioned data, wont this not involve a lot of
>> unwanted
>> > >>> >> requests to nodes which are not hosting data for a reducer (and
>> lack
>> > >>> >> of ability to throttle).
>> > >>> >>
>> > >>> >
>> > >>> > Was that a question? (I'm guessing it is). What do you mean
>> exactly?
>> > >>>
>> > >>>
>> > >>> I was not sure if I understood the proposal correctly - hence the
>> > >>> query : if I understood it right - the number of wasted requests
>> goes
>> > >>> up by num_reducers * avg_nodes_not_hosting data.
>> > >>>
>> > >>> Ofcourse, if avg_nodes_not_hosting data == 0, then we are fine !
>> > >>>
>> > >>> Regards,
>> > >>> Mridul
>> > >>>
>> > >>
>> > >>
>> >
>>
>
>

Reply via email to