Github user sarutak commented on the pull request:
https://github.com/apache/spark/pull/1619#issuecomment-50442727
@witgo @pwendell I have already noticed that there is no timeout configuration for
ConnectionManager, but a timeout in ConnectionManager would not resolve this issue,
because the channel used for receiving acks is implemented with non-blocking I/O,
and SO_TIMEOUT only takes effect on reads after a connection has been established.
So if a remote executor hangs, the fetching executors cannot even establish
connections to it, and the timeout never applies.
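To illustrate the point about non-blocking I/O (a hypothetical sketch, not Spark's code): SO_TIMEOUT only bounds blocking reads, so a non-blocking channel has to enforce its own deadline, for example via `Selector.select(timeout)`:

```java
import java.nio.channels.Selector;

public class SelectTimeoutSketch {
    public static void main(String[] args) throws Exception {
        // On a non-blocking channel SO_TIMEOUT has no effect, so we bound
        // the wait ourselves with a selector timeout.
        Selector selector = Selector.open();
        long start = System.nanoTime();
        int ready = selector.select(200);  // wait at most ~200 ms for readiness
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        selector.close();
        // No channels are registered, so nothing becomes ready and the call
        // returns 0 after roughly the timeout elapses.
        if (ready == 0 && elapsedMs >= 150) {
            System.out.println("timed out as expected");
        }
    }
}
```
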
Additionally, BasicBlockFetcherIterator waits on LinkedBlockingQueue#take
(result.take), so we should put a FetchResult object whose size is -1 into the
result queue of BasicBlockFetcherIterator.
(A FetchResult whose size is -1 means the fetch failed.)
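A minimal sketch of that failure-signaling pattern (the FetchResult class here is a hypothetical stand-in, not Spark's actual one): on error, the fetcher enqueues a sentinel whose size is -1 so the consumer's blocking take() unblocks instead of waiting forever:

```java
import java.util.concurrent.LinkedBlockingQueue;

public class FetchFailureSketch {
    // Hypothetical stand-in for Spark's FetchResult; size == -1 marks failure.
    static final class FetchResult {
        final String blockId;
        final long size;
        FetchResult(String blockId, long size) {
            this.blockId = blockId;
            this.size = size;
        }
        boolean failed() { return size == -1; }
    }

    public static void main(String[] args) throws InterruptedException {
        LinkedBlockingQueue<FetchResult> results = new LinkedBlockingQueue<>();

        // Fetcher thread: on a remote error, enqueue the sentinel so the
        // consumer's take() returns instead of blocking forever.
        Thread fetcher = new Thread(() -> results.offer(new FetchResult("block_0_1", -1)));
        fetcher.start();

        FetchResult r = results.take();  // would block indefinitely without the sentinel
        System.out.println(r.failed() ? "fetch failed" : "fetch ok");
        fetcher.join();
    }
}
```
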
I think remote errors can be classified into the following 2 cases:
1) The remote executor hangs.
In this case, we need a timeout for the fetch request itself (not a read timeout).
I'm trying to resolve this case in https://github.com/apache/spark/pull/1632
2) The remote executor does not hang, but an error occurs.
In this case, the remote executor should send a message indicating that an error
occurred on the remote executor.
I'm trying to resolve this case in https://github.com/apache/spark/pull/1490
This is ongoing.
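Case 1 can be sketched with a request-level timeout (again hypothetical, not the approach in either PR): bound the whole fetch with `Future.get(timeout)` and treat a TimeoutException as a failed fetch:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class FetchRequestTimeoutSketch {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        // Simulate case 1: the remote executor hangs, so the fetch never completes.
        Future<byte[]> fetch = pool.submit(() -> {
            Thread.sleep(10_000);  // stand-in for a hung remote executor
            return new byte[0];
        });
        try {
            // Timeout on the request as a whole, not on an individual read.
            fetch.get(200, TimeUnit.MILLISECONDS);
            System.out.println("fetch ok");
        } catch (TimeoutException e) {
            fetch.cancel(true);  // give up on the hung request
            System.out.println("fetch request timed out");
        } finally {
            pool.shutdownNow();
        }
    }
}
```
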
Can anyone review this too?