[GitHub] [geode] Bill commented on a change in pull request #6561: GEODE-9346: When client received incorrect byte array of PdxType due …

GitBox Fri, 18 Jun 2021 13:56:35 -0700


Bill commented on a change in pull request #6561:
URL: https://github.com/apache/geode/pull/6561#discussion_r654062169




##########
File path: 
geode-core/src/main/java/org/apache/geode/cache/client/internal/QueryOp.java
##########
@@ -120,7 +122,11 @@ protected Object processResponse(Message msg) throws 
Exception {
           queryResult = resultPart.getObject();
         } catch (Exception e) {
           String s = "While deserializing " + getOpName() + " result";
-          exceptionRef[0] = new SerializationException(s, e);
+          if (e instanceof PdxSerializationException) {
+            exceptionRef[0] = new IOException(s, e);

Review comment:
       Thank you for the extra comment in `processResponse()` @gesterzhou! I 
now understand that by throwing an `IOException` from any 
`AbstractOp.processResponse()` (like `QueryOp.processResponse()`) it will cause 
retrying by `OpExecutorImpl.execute()` and `OpExecutorImpl.handleException()`.
   
   I understand your reasoning, that if those methods in `OpExecutorImpl` are 
retrying (on a single server, and across servers in the pool), and if _any_ 
server returns valid chunks (within the retry window), the client will 
eventually succeed.
   
   What I do not understand, from this PR, and what I do not see represented in 
tests, is the mechanism by which the corruption of chunks occurs in the first 
place. You theorize that when a client communicates with a server that is 
crashing, that we might see corrupt chunks. If that's true, I'd ideally like 
test code to show that. And if that is infeasible, I'd like a more in-depth 
explanation of how that could happen.
   
   Without that understanding, I am left wondering: what are the _other_ ways 
(besides your theorized crashing-server-sends-corrupt-chunk) that the second 
part of a chunk could be corrupted. And are there ways it could be corrupted on 
_all_ the servers—not just a single, crashing one.
   
   And if a chunk were corrupted on all the servers, this PR would cause the 
client to retry across all servers, a total of `servers * retries` tries. If I 
adopt your theory then perhaps this is not deal. But if I do not adopt your 
theory, I have a fear that a corrupt chunk could appear on all servers 
frequently and this could cause a lot of extra client-server traffic.
   
   If I believed your theory, I'd wonder: how (and where, in the code) can we 
end up with a chunk that is corrupt. That is where the bug seems to lie. The 
current PR papers over that problem by masking it as an `IOException`. It makes 
me think there was a deeper `IOException` we failed to handle properly.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [geode] Bill commented on a change in pull request #6561: GEODE-9346: When client received incorrect byte array of PdxType due …

Reply via email to