pradeeee opened a new issue, #18866:
URL: https://github.com/apache/pinot/issues/18866

   ### Summary
   When the broker fails to deserialize a `DataTable` response from a server, 
the failure is silently swallowed: only a log line is emitted and the 
`DATA_TABLE_DESERIALIZATION_EXCEPTIONS` metric is incremented. The query then 
blocks until the broker timeout fires and returns `BROKER_TIMEOUT` with partial 
results — masking the actual root cause and making operational triage difficult.
   
   ### Affected code
   
`pinot-core/src/main/java/org/apache/pinot/core/transport/DataTableHandler.java#channelRead0`
 — the `catch (Exception e)` block (lines 94–98 on current master):
   
   ```java
   } catch (Exception e) {
     LOGGER.error("Caught exception while deserializing data table of size: {} 
from server: {}",
         responseSize, _serverRoutingInstance, e);
     
_brokerMetrics.addMeteredGlobalValue(BrokerMeter.DATA_TABLE_DESERIALIZATION_EXCEPTIONS,
 1);
     // No signal back to QueryRouter — AsyncQueryResponse latch never 
decrements for this server.
   }
   ```
   
   Because nothing notifies `QueryRouter`, the corresponding 
`AsyncQueryResponse` `CountDownLatch` for that server is never decremented. 
`getFinalResponses()` then blocks for the full query timeout.
   
   ### Why it matters
   - Real failure mode (corrupt/incompatible response bytes) is hidden behind a 
generic `BROKER_TIMEOUT` error.
   - Latency penalty: query waits the full timeout (often seconds) instead of 
failing fast.
   - Operators see "broker degraded / timeouts" rather than "server X is 
returning undeserializable responses".
   
   ### Reproduction
   The existing 
`pinot-core/src/test/java/org/apache/pinot/core/transport/QueryRoutingTest.java`
 harness (`getQueryServer(delayMs, responseBytes)` + `mockQueryScheduler`) 
supports injecting arbitrary `byte[]` as a server response. A test that ships 
garbage bytes (e.g., `new byte[]{0,1,2,3,4,5}`) as the response and submits a 
query with a short timeout will:
   - block for the full timeout window,
   - return a `ServerResponse` with `getDataTable() == null` and 
`_finishResponseTimeMs == 0`,
   - increment `DATA_TABLE_DESERIALIZATION_EXCEPTIONS`.
   
   After the fix this same test should return well under the timeout, with a 
specific deserialization error code.
   
   ### Suggested fix
   1. In `DataTableHandler#channelRead0`'s catch block, surface the failure to 
`QueryRouter` so the latch for this server is decremented and the error is 
reported on the response. Two options:
      - Add a package-private 
`QueryRouter#markServerDeserializationFailure(serverRoutingInstance, 
exception)` that fails the in-flight queries for that server with a dedicated 
error.
      - Or, treat the channel as compromised and call 
`markServerDown(_serverRoutingInstance, e)` — simpler but more disruptive 
(drops all in-flight queries on the channel).
   2. Add a `DESERIALIZATION_ERROR` value to 
`org.apache.pinot.spi.exception.QueryErrorCode` so this failure mode is 
distinguishable from `INTERNAL` / `BROKER_TIMEOUT`.
   3. Add a regression test in `QueryRoutingTest` as described above.
   
   ### Environment
   Observed on current master; the code path is unchanged across recent 
releases.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to