[GitHub] [spark] tgravescs commented on issue #23695: [SPARK-26780][CORE]Improve shuffle read using ReadAheadInputStream

GitBox Wed, 01 May 2019 10:46:35 -0700

tgravescs commented on issue #23695: [SPARK-26780][CORE]Improve shuffle read 
using ReadAheadInputStream 
URL: https://github.com/apache/spark/pull/23695#issuecomment-488355875
 
 
   Glad to see someone working on this. I started looking at this a long 
time back, but I was looking at using the Linux readahead call. I assume you 
didn't try the Linux readahead against this to compare performance? I know 
ReadAheadInputStream was already here and so is easier to use, but I would 
also expect the OS one to be more efficient: it just pulls the data into the 
page cache, so nothing has to be copied into user space and then copied again 
to send it out.
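
   For illustration only, here is a minimal Python sketch of the OS-level 
approach. It is not code from this PR and not the Linux readahead call itself: 
it assumes `posix_fadvise` with `POSIX_FADV_WILLNEED` as a closely related way 
to ask the kernel to pull a file range into the page cache. The helper names 
`advise_willneed` and `read_with_os_readahead` are hypothetical.

```python
import os


def advise_willneed(path, offset=0, length=0):
    """Hint that [offset, offset + length) will be read soon.

    A length of 0 means "to the end of the file". Returns True if the
    hint was issued, False if the platform or file system refused it.
    """
    if not hasattr(os, "posix_fadvise"):
        return False  # not available on this platform (assumption: non-POSIX OS)
    fd = os.open(path, os.O_RDONLY)
    try:
        # The kernel may start asynchronous readahead into the page cache;
        # no data is copied into this process by the hint itself.
        os.posix_fadvise(fd, offset, length, os.POSIX_FADV_WILLNEED)
    except OSError:
        return False
    finally:
        os.close(fd)
    return True


def read_with_os_readahead(path, chunk_size=1 << 20):
    """Issue the hint, then read the file sequentially; returns bytes read."""
    advise_willneed(path)
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total
```

   The hint is advisory, so the kernel is free to ignore it; whether it beats 
a user-space read-ahead buffer would have to be measured.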
   
   How did you run your performance tests? If the data was already in the page 
cache then you wouldn't see much benefit. The benefit comes when the data 
actually has to be pulled from disk and the readahead gets it into memory 
before you need it.
   Have you run this through any real Spark applications or on a real cluster 
to see results?
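
   As a rough sketch of a cold-cache measurement (an assumption about how one 
might test, not how the PR author did), the Python below asks the kernel to 
drop a file's cached pages with `POSIX_FADV_DONTNEED` before timing a 
sequential read. The helper names `evict_from_page_cache` and `timed_read` are 
hypothetical, and the eviction is only a hint that the kernel may not fully 
honor.

```python
import os
import time


def evict_from_page_cache(path):
    """Ask the kernel to drop cached pages for `path`.

    Returns True if the request was issued, False otherwise.
    """
    if not hasattr(os, "posix_fadvise"):
        return False
    fd = os.open(path, os.O_RDONLY)
    try:
        # Dirty pages cannot be dropped, so flush them first.
        os.fsync(fd)
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    except OSError:
        return False
    finally:
        os.close(fd)
    return True


def timed_read(path, chunk_size=1 << 20):
    """Read the whole file sequentially; returns (bytes_read, seconds)."""
    start = time.perf_counter()
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total, time.perf_counter() - start
```

   Calling `evict_from_page_cache(path)` before each `timed_read(path)` 
approximates the cold case; skipping it after a first read approximates the 
warm case, where read-ahead is expected to help little.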

