paul-rogers opened a new issue #11811: URL: https://github.com/apache/druid/issues/11811
Code inspection revealed a potential bug. I don't have the setup required to verify it.

When a data node runs a query, it reports any of its assigned segments that are, in fact, unavailable. The Broker retries these segments on another node. The data node reports the segments in the response context field `missingSegments`. To save space, the data node may truncate this list. In such a case, the retry "operator" cannot retry: it does not know which segments were actually missing. Instead, the Broker simply runs the query without the missing segments, potentially returning incorrect results.

### Affected Version

Latest GitHub sources as of 2021-10-18.

### Description

Here are more details. The data node writes the list of missing segments into the `missingSegments` field within the response context. The [`ResponseContext.toHeader().serializeWith`](https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/query/context/ResponseContext.java#L361) method forces the response context to fit within a given size limit, dropping list entries as needed. The dropped list entries can include the missing segments. This function sets the `truncated` flag to indicate that truncation occurred.

The response context is passed back to the Broker, which aggregates the individual responses to create the overall response. That response is then passed to [`RetryQueryRunner.getMissingSegments()`](https://github.com/apache/druid/blob/master/server/src/main/java/org/apache/druid/query/RetryQueryRunner.java#L139), which determines which segments to retry. The `getMissingSegments()` function does not check the `truncated` flag, however, and so blindly proceeds without the dropped missing segments. Since those segments may contain data for the query, the query result is incomplete. Further, since the set of missing segments is transitory, different runs of the same query, with the same data, may return differing results.
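To make the failure mode concrete, here is a minimal sketch of how a size-limited serialization can silently drop missing-segment entries. All names (`TruncationSketch`, `Context`, `serializeWithLimit`, `getMissingSegmentsIgnoringFlag`) are hypothetical stand-ins, not the Druid classes or their signatures, and the comma-joined string only approximates the real header encoding.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the behavior described above; NOT the Druid API.
public class TruncationSketch {
    // Simplified response context: the surviving list plus a truncation flag.
    static final class Context {
        final List<String> missingSegments;
        final boolean truncated;

        Context(List<String> missingSegments, boolean truncated) {
            this.missingSegments = missingSegments;
            this.truncated = truncated;
        }
    }

    // Drops trailing entries until the comma-joined list fits in maxLength characters.
    static Context serializeWithLimit(List<String> missing, int maxLength) {
        List<String> kept = new ArrayList<>(missing);
        boolean truncated = false;
        while (!kept.isEmpty() && String.join(",", kept).length() > maxLength) {
            kept.remove(kept.size() - 1);
            truncated = true;
        }
        return new Context(kept, truncated);
    }

    // Models the reported problem: reads the list and never looks at the flag.
    static List<String> getMissingSegmentsIgnoringFlag(Context ctx) {
        return ctx.missingSegments;
    }

    public static void main(String[] args) {
        Context ctx = serializeWithLimit(List.of("seg_a", "seg_b", "seg_c"), 12);
        // seg_c was dropped to fit the limit, so it would never be retried.
        System.out.println(getMissingSegmentsIgnoringFlag(ctx) + " truncated=" + ctx.truncated);
    }
}
```

In this sketch the caller sees only two of the three missing segments, and nothing in the returned list hints that a third existed; only the ignored flag records it.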
Suggestion: fail the query if the number of missing segments is large enough to cause the response header to be truncated. That is, simply check the `truncated` flag and fail the query if it is set.

This issue is only likely to occur for a large table (data source), on a cluster undergoing startup or rebalancing such that the set of assigned segments is unstable. That is, it is transient and occurs only when it is hardest to debug.
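The suggested check could look roughly like the sketch below. It is an illustration under assumptions, not a patch: `RetryCheckSketch` and `getMissingSegmentsOrFail` are hypothetical names, the missing segments are modeled as plain strings, and the choice of `IllegalStateException` is arbitrary. The real change would live in `RetryQueryRunner.getMissingSegments()` and use whatever exception type Druid prefers.

```java
import java.util.List;

// Hypothetical sketch of the proposed fix; NOT the Druid API.
public class RetryCheckSketch {
    // Returns the missing segments to retry, or fails if the list is known to be incomplete.
    static List<String> getMissingSegmentsOrFail(List<String> missingSegments, boolean truncated) {
        if (truncated) {
            // The list was cut short, so a retry cannot cover every missing segment.
            throw new IllegalStateException(
                "missingSegments was truncated; cannot determine which segments to retry");
        }
        return missingSegments;
    }

    public static void main(String[] args) {
        // Complete list: safe to retry exactly these segments.
        System.out.println(getMissingSegmentsOrFail(List.of("seg_a", "seg_b"), false));

        // Truncated list: fail loudly rather than return a partial result.
        try {
            getMissingSegmentsOrFail(List.of("seg_a"), true);
        } catch (IllegalStateException e) {
            System.out.println("Query failed: " + e.getMessage());
        }
    }
}
```

Failing here trades availability for correctness: the user gets an error during an unstable period instead of a result that silently omits data.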
