Github user kayousterhout commented on the pull request:
https://github.com/apache/spark/pull/4878#issuecomment-77264094
Sorry to chime in late, but have you done performance tests with this to
see if it makes a difference? There are two issues I see here:
(1) This isn't the place where I observed issues. I only observed problems
for the read that's done in FileSegmentManagedBuffer:
https://github.com/apache/spark/blob/master/network/common/src/main/java/org/apache/spark/network/buffer/FileSegmentManagedBuffer.java#L100
which is the one I had thought should use a buffered input stream.
(2) After much more experimentation, I found that the lack of buffering was
only an issue when shuffle compression is turned off (I had turned it off in
the earlier experiments I was running to do some network benchmarking). When
compression is on, the compression libraries read data in larger chunks, so
they essentially do their own buffering. Given that I don't know of any use
cases where people turn compression off (this is recommended against in the
Spark conf), I wonder if the added complexity from this is worthwhile?
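To make the buffering point concrete, here is a minimal, self-contained sketch (not Spark code; the class and file names are made up for illustration) showing why an unbuffered single-byte read loop hits the underlying file stream once per byte, while wrapping it in a `BufferedInputStream` reduces that to one call per buffer fill. A compression stream reading in large chunks has a similar effect, which is why the lack of buffering only shows up with compression off.

```java
import java.io.*;

public class BufferedReadSketch {
    // Wraps a stream and counts how many read calls actually reach it,
    // so we can compare buffered vs. unbuffered access patterns.
    static class CountingInputStream extends FilterInputStream {
        int reads = 0;
        CountingInputStream(InputStream in) { super(in); }
        @Override public int read() throws IOException {
            reads++; return super.read();
        }
        @Override public int read(byte[] b, int off, int len) throws IOException {
            reads++; return super.read(b, off, len);
        }
    }

    // Drains the file one byte at a time and returns the number of calls
    // that reached the underlying FileInputStream.
    static int drain(File f, boolean buffered) throws IOException {
        CountingInputStream counted = new CountingInputStream(new FileInputStream(f));
        InputStream in = buffered ? new BufferedInputStream(counted) : counted;
        try {
            while (in.read() != -1) { /* consume */ }
        } finally {
            in.close();
        }
        return counted.reads;
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("sketch", ".bin");
        f.deleteOnExit();
        try (OutputStream out = new FileOutputStream(f)) {
            out.write(new byte[64 * 1024]); // 64 KB of zeros
        }
        // Unbuffered: ~65k calls (one per byte). Buffered: a handful,
        // since BufferedInputStream fetches 8 KB chunks by default.
        System.out.println("unbuffered reads: " + drain(f, false));
        System.out.println("buffered reads:   " + drain(f, true));
    }
}
```

The same reasoning applies to a decompression stream that pulls, say, 32 KB blocks from the file: the per-byte reads issued by the consumer are served from the block already in memory.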