Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/14239
This sounds interesting. I haven't looked at the code yet, but I have some
questions and concerns. Could you give a more detailed description?
Are you saying that you load all the data for all the maps from disk
into memory and cache it while waiting for the reducer to fetch it? If so, this
may work OK for small data, but you would very quickly run out of memory for
large data, especially if, say, the YARN NodeManager is running the shuffle
handler. Many people run NodeManagers with only 1-2 GB of memory.
Does it do this conditionally, or always?
How exactly does the timing work here? Aren't you going to send the
prepare immediately before sending the fetch? Does the fetch block while
waiting for the prepare to cache the data?
What testing have you done with this, and with what size of data? What type
of load was on the nodes during testing?
Note that in many cases, if the OS has enough free memory (at least on
Linux), the data won't be read from disk anyway; it will still be in the
page cache. If the box doesn't have enough free memory (the data is too large
or other apps are using it), the data will be pushed out of the page cache and
you will have to read it from disk. You have the same problem here, though: if
you are going to read it back into memory, you have to make sure your process
has enough memory to store it, and that can be a huge amount.
I have been looking at adding readahead to the shuffle handler, using the
OS fadvise call on the file. This allows the OS to read the data before it's
needed and increases disk throughput. MapReduce has this functionality and it
helps a lot there. I have a test branch on Spark but still need to finish
evaluating its performance. The other open question is exactly how to
integrate it, because it uses native code from Hadoop.
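
To illustrate the readahead idea, here is a minimal sketch in Python rather
than the Hadoop native code. It assumes a POSIX system where
os.posix_fadvise is available (e.g. Linux); the function names
advise_readahead and read_block are hypothetical and are not taken from Spark
or Hadoop. The hint is purely advisory: the kernel may prefetch the byte range
into the page cache, and the data is never copied into the process's own
memory until it is actually read.

```python
import os
import tempfile


def advise_readahead(path, offset=0, length=0):
    """Hint to the kernel that bytes [offset, offset + length) will be read soon.

    A length of 0 means "through the end of the file". Returns True if the
    hint was issued, False if the platform or filesystem does not support it.
    """
    if not hasattr(os, "posix_fadvise"):
        return False  # e.g. macOS or Windows: no fadvise, so skip the hint
    fd = os.open(path, os.O_RDONLY)
    try:
        os.posix_fadvise(fd, offset, length, os.POSIX_FADV_WILLNEED)
        return True
    except OSError:
        return False  # the hint is advisory, so a failure is not fatal
    finally:
        os.close(fd)


def read_block(path, offset, length):
    """Issue the readahead hint, then read the block normally."""
    advise_readahead(path, offset, length)
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


if __name__ == "__main__":
    # Hypothetical stand-in for a shuffle file: 10 bytes of known content.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"0123456789")
    try:
        print(read_block(tmp.name, 4, 4))
    finally:
        os.remove(tmp.name)
```

In a real shuffle handler the hint would be issued some time before the read,
for example for the next block while the current one is being sent, so that
the disk read overlaps with network transfer.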