vasiliy-mikhailov opened a new pull request, #693: URL: https://github.com/apache/commons-collections/pull/693
## Problem `SetOperations.cosineSimilarity` computes the product of the two cardinalities as an `int` before passing it to `Math.sqrt`: ```java // Given that the cardinality is an int then the product as a double will not // overflow, we can use one sqrt: return numerator == 0 ? 0 : numerator / Math.sqrt(cardinality(first) * cardinality(second)); ``` The comment states the product is meant to be computed as a `double` so it will not overflow, but without a cast the multiplication happens in `int`. For large inputs (each cardinality > 46341) the product exceeds `Integer.MAX_VALUE` and wraps to a negative value, so `Math.sqrt` returns `NaN` and both `cosineSimilarity` and `cosineDistance` become `NaN`. For example, two identical extractors of cardinality 50000 give `50000 * 50000 = 2_500_000_000`, which overflows to a negative `int`; the similarity should be `1.0` but is `NaN`. ## Fix Cast the first operand to `double` so the product is computed in `double`, as the comment already intended. Both the `BitMapExtractor` and `BloomFilter` overloads had the same issue and are fixed. ## Test Adds `testCosineSimilarityLargeCardinalityNoOverflow` to `SetOperationsTest`, building extractors with cardinality 50000 and asserting `cosineSimilarity == 1.0` / `cosineDistance == 0.0`. It fails on `master` (`expected: <1.0> but was: <NaN>`) and passes with this change; the existing `SetOperationsTest` suite stays green. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
