Github user squito commented on the issue:
https://github.com/apache/spark/pull/21346
yeah I see what you're saying about better error handling, but I'd really
rather not take that on here. I think some prior attempts at solving the 2gb
limit have tried to take on too much, and I'd like to keep this as simple as
possible and leave more for future improvements. I guess it means that when
(if) we do make the changes you're proposing, we'd have to go back and change
the network layer again, possibly introducing new message types etc. But we're
not really painting ourselves into a corner at all; we can do that if it
becomes necessary.
fwiw, there are other things that are higher on my list to fix when the
basic functionality goes in:
1) when you do a remote read of cached data, even if you fetch to disk,
you memory map the entire file rather than just using a FileInputStream
(see the first sketch after this list)
2) if you replicate a disk-cached block, it'll get written to a temp file
on disk, then read back from that file into memory, and then written to the
new location (second sketch below)
3) when you do a remote read of cached data, you shouldn't actually have
to wait till you fetch all the data; you should just be able to treat it as
an InputStream
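For 1) (and really 3) too), the difference is roughly the sketch below. This
is just hand-wavy illustration, not actual Spark code; `blockFile` stands in
for whatever file the fetch-to-disk path produces, and the buffer size is
made up:

```scala
import java.io.{File, FileInputStream, RandomAccessFile}
import java.nio.channels.FileChannel.MapMode

// stand-in for a block that was fetched to disk
val blockFile = new File("/tmp/fetched-block")

// what happens today (roughly): the whole file gets mapped at once,
// so the entire block occupies address space / page cache together
val channel = new RandomAccessFile(blockFile, "r").getChannel
val mapped = channel.map(MapMode.READ_ONLY, 0, channel.size())
channel.close()

// what we'd want instead: stream it, so only one small buffer is
// ever resident, no matter how big the block is
val in = new FileInputStream(blockFile)
try {
  val buf = new Array[Byte](64 * 1024)
  var n = in.read(buf)
  while (n != -1) {
    // consume buf(0 until n) here, e.g. hand it to the deserializer
    n = in.read(buf)
  }
} finally {
  in.close()
}
```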
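And for 2), the fix would be to let the bytes go file-to-file instead of
taking the round trip through memory, something like this (again just a
sketch, the paths are made up and this isn't the real block manager layout):

```scala
import java.io.File
import java.nio.channels.FileChannel
import java.nio.file.StandardOpenOption.{CREATE, READ, WRITE}

// made-up paths standing in for the temp file and the final location
val tempFile = new File("/tmp/replicated-block.tmp").toPath
val destFile = new File("/tmp/replicated-block").toPath

// transferTo lets the OS move the bytes directly between the files,
// instead of reading the temp file fully into memory and writing it out
val src = FileChannel.open(tempFile, READ)
val dst = FileChannel.open(destFile, WRITE, CREATE)
try {
  var pos = 0L
  val size = src.size()
  while (pos < size) {
    pos += src.transferTo(pos, size - pos, dst)
  }
} finally {
  src.close()
  dst.close()
}
```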