ngsg commented on PR #4239: URL: https://github.com/apache/hive/pull/4239#issuecomment-2577511386
Thank you to @abstractdog and @deniskuzZ for reviewing the patch. After studying the issue again, I have concluded that the proposed patch is insufficient to fully address it, so I have decided to close this PR.

We reviewed the issue and determined that this patch could be helpful for datasets whose null distribution is similar to that of TPC-DS data. Specifically, if `numNulls` is closely related to the default partition (the null partition), the patch improves the accuracy of the `numNulls` returned by `AggregateStatsCache`. However, not all datasets have a null distribution similar to TPC-DS, so this patch isn't universally applicable.

Additionally, it appears that the non-deterministic behavior we encountered is unrelated to `numNulls`. We observed that `AggregateStatsCache` makes SemiJoin branch removal non-deterministic, but that optimization uses `numRows` and `NDV`, not `numNulls`. Therefore, I think this non-deterministic behavior is not suitable for verifying the patch.

While some optimizers may depend on `numNulls`, I have not yet been able to write a qfile that properly verifies this patch. As a result, I believe it is best to close the PR for now and reopen it once I have either a more effective patch or a feasible qfile to verify it.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org