Bill commented on a change in pull request #6561:
URL: https://github.com/apache/geode/pull/6561#discussion_r654062169
##########
File path:
geode-core/src/main/java/org/apache/geode/cache/client/internal/QueryOp.java
##########
@@ -120,7 +122,11 @@ protected Object processResponse(Message msg) throws
Exception {
queryResult = resultPart.getObject();
} catch (Exception e) {
String s = "While deserializing " + getOpName() + " result";
- exceptionRef[0] = new SerializationException(s, e);
+ if (e instanceof PdxSerializationException) {
+ exceptionRef[0] = new IOException(s, e);
Review comment:
Thank you for the extra comment in `processResponse()` @gesterzhou! I
now understand that by throwing an `IOException` from any
`AbstractOp.processResponse()` (like `QueryOp.processResponse()`) it will cause
retrying by `OpExecutorImpl.execute()` and `OpExecutorImpl.handleException()`.
I understand your reasoning, that if those methods in `OpExecutorImpl` are
retrying (on a single server, and across servers in the pool), and if _any_
server returns valid chunks (within the retry window), the client will
eventually succeed.
What I do not understand, from this PR, and what I do not see represented in
tests, is the mechanism by which the corruption of chunks occurs in the first
place. You theorize that when a client communicates with a server that is
crashing, that we might see corrupt chunks. If that's true, I'd ideally like
test code to show that. And if that is infeasible, I'd like a more in-depth
explanation of how that could happen.
Without that understanding, I am left wondering: what are the _other_ ways
(besides your theorized crashing-server-sends-corrupt-chunk) that the second
part of a chunk could be corrupted. And are there ways it could be corrupted on
_all_ the servers—not just a single, crashing one.
And if a chunk were corrupted on all the servers, this PR would cause the
client to retry across all servers, a total of `servers * retries` tries. If I
adopt your theory then perhaps this is not deal. But if I do not adopt your
theory, I have a fear that a corrupt chunk could appear on all servers
frequently and this could cause a lot of extra client-server traffic.
If I believed your theory, I'd wonder: how (and where, in the code) can we
end up with a chunk that is corrupt. That is where the bug seems to lie. The
current PR papers over that problem by masking it as an `IOException`. It makes
me think there was a deeper `IOException` we failed to handle properly.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]