tarun11Mavani commented on code in PR #18760:
URL: https://github.com/apache/pinot/pull/18760#discussion_r3479514949
##########
pinot-core/src/main/java/org/apache/pinot/core/query/aggregation/function/funnel/DictIdsWrapper.java:
##########
@@ -18,19 +18,151 @@
*/
package org.apache.pinot.core.query.aggregation.function.funnel;
+import it.unimi.dsi.fastutil.ints.IntArrayList;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
import org.apache.pinot.segment.spi.index.reader.Dictionary;
import org.roaringbitmap.RoaringBitmap;
+/**
+ * Holds per-step RoaringBitmaps keyed by correlation dictionary IDs.
+ *
+ * <p>For single-key CORRELATE_BY, stores raw dictionary IDs directly in the
bitmaps (compact, fits in one
+ * RoaringBitmap container for typical segment sizes).
+ *
+ * <p>For multi-key CORRELATE_BY, composite IDs are assigned via stride-based
arithmetic (when the combined key
+ * space fits in int) or a HashMap fallback for large key spaces.
+ */
final class DictIdsWrapper {
- final Dictionary _dictionary;
+ final Dictionary[] _dictionaries;
final RoaringBitmap[] _stepsBitmaps;
+ // Stride-based composite mapping (non-null only for multi-key when product
of dict sizes fits in int)
+ private final int[] _strides;
+
+ // HashMap-based composite mapping (non-null only for multi-key when stride
overflows int)
+ private final Map<IntArrayList, Integer> _compositeKeyMap;
+ private final List<int[]> _compositeKeyReverse;
+ private final IntArrayList _lookupKey;
+
DictIdsWrapper(int numSteps, Dictionary dictionary) {
- _dictionary = dictionary;
+ _dictionaries = new Dictionary[]{dictionary};
_stepsBitmaps = new RoaringBitmap[numSteps];
for (int n = 0; n < numSteps; n++) {
_stepsBitmaps[n] = new RoaringBitmap();
}
+ _strides = null;
+ _compositeKeyMap = null;
+ _compositeKeyReverse = null;
+ _lookupKey = null;
+ }
+
+ DictIdsWrapper(int numSteps, Dictionary[] dictionaries) {
+ _dictionaries = dictionaries;
+ _stepsBitmaps = new RoaringBitmap[numSteps];
+ for (int n = 0; n < numSteps; n++) {
+ _stepsBitmaps[n] = new RoaringBitmap();
+ }
+
+ if (dictionaries.length > 1) {
+ long totalSpace = 1;
+ boolean fitsInInt = true;
+ for (Dictionary d : dictionaries) {
+ totalSpace *= d.length();
+ if (totalSpace > Integer.MAX_VALUE) {
+ fitsInInt = false;
+ break;
+ }
+ }
+
+ if (fitsInInt) {
+ _strides = new int[dictionaries.length];
+ _strides[dictionaries.length - 1] = 1;
+ for (int k = dictionaries.length - 2; k >= 0; k--) {
+ _strides[k] = _strides[k + 1] * dictionaries[k + 1].length();
+ }
+ _compositeKeyMap = null;
+ _compositeKeyReverse = null;
+ _lookupKey = null;
+ } else {
+ _strides = null;
+ _compositeKeyMap = new HashMap<>();
+ _compositeKeyReverse = new ArrayList<>();
+ _lookupKey = new IntArrayList(dictionaries.length);
+ }
+ } else {
+ _strides = null;
+ _compositeKeyMap = null;
+ _compositeKeyReverse = null;
+ _lookupKey = null;
+ }
+ }
+
+ boolean isMultiKey() {
+ return _dictionaries.length > 1;
+ }
+
+ boolean isHashMapPath() {
+ return _compositeKeyMap != null;
+ }
+
+ /**
+ * Maps a tuple of per-column dictionary IDs to a single composite int
suitable for RoaringBitmap.
+ * Only used for multi-key; for single-key, callers should add the dictId
directly.
+ */
+ int getCompositeCorrelationId(int[] dictIds) {
Review Comment:
Good point on the collision semantics. Here's the current precision
breakdown per strategy for multi-key:
- **bitmap**: stride-based composite IDs (collision-free within segment) →
hashed to 32-bit at extraction for cross-segment merge. Approximate at merge.
- **set**: stride-based → `toCompositeString()` at extraction →
`Set<String>`. Exact end-to-end.
- **theta_sketch**: `toCompositeString()` fed directly into sketch (doesn't
use stride/composite IDs at all). Approximate by design.
- **partitioned**: stride-based, counts directly from segment-local bitmaps
— no conversion to global identity needed. Exact end-to-end.
- **sorted+partitioned**: flat-array tracking with raw dictIds, direct
counters (doesn't use `DictIdsWrapper`). Exact end-to-end.
So the stride machinery is collision-free within a segment, which matters
for set and partitioned (both exact end-to-end). For bitmap, the within-segment
precision gets discarded at extraction so it's wasted for that strategy.
Re: monotonicity contract (counting by user+device should never exceed
counting by user alone), this holds naturally since adding a secondary key can
only split entities into finer groups, never merge them. Will document this.
Open to simplifying the composite ID scheme further in a follow-up if you
have a specific approach in mind for the bitmap path. For this PR the stride
approach covers the set and partitioned strategies cleanly without allocation
overhead.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]