[GitHub] spark issue #14239: [SPARK-16593] [CORE] [WIP] Provide a pre-fetch mechanism...

tgravescs Tue, 23 Aug 2016 07:47:33 -0700

Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/14239
  
    This sounds interesting. I haven't looked at the code yet, but I have some
questions/concerns.  Could you give some more description or clarification?
    
    Are you saying that you load all the data for all the maps from disk
into memory and cache it while waiting for the reducer to fetch it?  If so, this
may work OK for small data, but you would very quickly run out of memory for
large data, especially if, say, the YARN NodeManager is running the shuffle
handler.  Many people run NodeManagers with only 1-2 GB of memory.
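
    To make the memory concern concrete, here is a rough back-of-envelope
sketch. The function names, the `usable_fraction` parameter, and all the sizes
are hypothetical and chosen only for illustration; they are not measurements
from this PR or from any real cluster.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB


def cached_shuffle_bytes(num_map_outputs: int, avg_output_bytes: int) -> int:
    """Bytes held in memory if every map output on a node is cached at once."""
    return num_map_outputs * avg_output_bytes


def fits_in_memory(num_map_outputs: int, avg_output_bytes: int,
                   process_memory_bytes: int,
                   usable_fraction: float = 0.5) -> bool:
    """True if the cached data fits in the share of process memory we allow.

    usable_fraction is an assumed safety margin: the process also needs
    memory for its normal work, so only part of it can hold cached data.
    """
    budget = process_memory_bytes * usable_fraction
    return cached_shuffle_bytes(num_map_outputs, avg_output_bytes) <= budget


# Small job: 20 map outputs of 1 MiB each against a 2 GiB process.
small_ok = fits_in_memory(20, 1 * MIB, 2 * GIB)

# Larger job: 200 map outputs of 64 MiB each (12.5 GiB) against the same process.
large_ok = fits_in_memory(200, 64 * MIB, 2 * GIB)
```

    Under these assumed numbers the small job fits easily, while the larger
one needs several times the memory of the whole process.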
    
    Does it do this conditionally, or always?
    
    How exactly does the timing work on this?  Aren't you going to send the
prepare immediately before sending the fetch?  Does the fetch block while
waiting for the prepare to cache the data?
    
    What testing have you done with this, and with what size of data?  What
type of load was on the nodes when testing, etc.?
    
    Note that in many cases, if the OS has enough free memory (at least on
Linux), the data won't be read from disk anyway; it will still be in the
page cache.  If the box doesn't have enough free memory (data too large, or
other apps using it), the data will be pushed out of the cache and you will
have to read it from disk.  But you have the same problem here: if you are
going to read it back into memory, you have to make sure your process has
enough memory to store it, and that can be huge.
    
    I have been looking at adding a readahead to the shuffle handler that uses
the OS fadvise call on the file. This allows the OS to read the data before
it's needed and increases disk throughput.  MapReduce has this functionality
and it helps a lot there. I have a test branch on Spark but still need to
finish evaluating its performance.  The other open question is exactly how to
integrate it, because it uses native code from Hadoop.
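
    As an illustration of the fadvise idea only, here is a minimal sketch
using Python's `os.posix_fadvise`. It is not the Hadoop native code or the
Spark test branch mentioned above; the function name `read_with_readahead` and
the `chunk_size` parameter are hypothetical. The hint is advisory, and the
sketch skips it on platforms where the call is unavailable.

```python
import os


def read_with_readahead(path: str, offset: int, length: int,
                        chunk_size: int = 64 * 1024) -> bytes:
    """Hint the kernel to prefetch a file range, then read that range."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # Ask the OS to start loading the range into the page cache.
        # posix_fadvise only exists on some Unix platforms, so guard it.
        if hasattr(os, "posix_fadvise"):
            os.posix_fadvise(fd, offset, length, os.POSIX_FADV_WILLNEED)

        chunks = []
        pos = offset
        remaining = length
        while remaining > 0:
            data = os.pread(fd, min(chunk_size, remaining), pos)
            if not data:
                break  # reached end of file before reading the full range
            chunks.append(data)
            pos += len(data)
            remaining -= len(data)
        return b"".join(chunks)
    finally:
        os.close(fd)
```

    In a real shuffle handler the hint would presumably be issued for the
next segment while the current one is still being sent, so the disk read
overlaps with network transfer; the sketch issues the hint and the read back to
back only to stay short.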


