davecromberge commented on issue #14310:
URL: https://github.com/apache/pinot/issues/14310#issuecomment-2449522350
I have investigated further how both options might be implemented and have
concluded that it is best to pursue dimensionality reduction in a first pass
and re-evaluate later whether varying aggregate behaviour is necessary.
#### Dimensionality reduction/erasure
This option requires additional configuration for each time bucket: a list of
dimension names to erase. In this context, erasing a dimension means
overwriting its value with the `defaultNullValue` from that dimension's
`fieldSpec`.
The configuration might change to include an additional array configuration
field as follows:
```json
"MergeRollupTask": {
"1hour.mergeType": "rollup",
"1hour.bucketTimePeriod": "1h",
"1hour.bufferTimePeriod": "3h",
"1hour.maxNumRecordsPerSegment": "1000000",
"1hour.maxNumRecordsPerTask": "5000000",
"1hour.maxNumParallelBuckets": "5",
"1day.eraseDimensionValues": ["dimColA"],
"1day.mergeType": "rollup",
"1day.bucketTimePeriod": "1d",
"1day.bufferTimePeriod": "1d",
"1day.roundBucketTimePeriod": "1d",
"1day.maxNumRecordsPerSegment": "1000000",
"1day.maxNumRecordsPerTask": "5000000",
"1day.eraseDimensionValues": ["dimColA", "dimColB"],
"metricColA.aggregationType": "sum",
"metricColB.aggregationType": "max"
}
```
In the example above, only `dimColA` is eliminated in the 1 hour merge task,
whereas both `dimColA` and `dimColB` are eliminated in the 1 day merge task.
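As a hypothetical sketch of how the per-bucket configuration might be parsed (the `eraseDimensionValues` key and the comma-separated value encoding are assumptions from this proposal, not an existing Pinot API):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class EraseConfigParser {
  private static final String ERASE_SUFFIX = ".eraseDimensionValues";

  /**
   * Extracts per-bucket dimension-erasure lists from flattened task config
   * keys such as "1day.eraseDimensionValues". Assumes list values arrive as
   * comma-separated strings, e.g. "dimColA,dimColB".
   */
  public static Map<String, List<String>> parse(Map<String, String> taskConfig) {
    Map<String, List<String>> eraseByBucket = new HashMap<>();
    for (Map.Entry<String, String> entry : taskConfig.entrySet()) {
      String key = entry.getKey();
      if (key.endsWith(ERASE_SUFFIX)) {
        // Bucket prefix, e.g. "1day" from "1day.eraseDimensionValues".
        String bucket = key.substring(0, key.length() - ERASE_SUFFIX.length());
        List<String> dims = new ArrayList<>();
        for (String dim : entry.getValue().split(",")) {
          dims.add(dim.trim());
        }
        eraseByBucket.put(bucket, dims);
      }
    }
    return eraseByBucket;
  }
}
```

Keeping the erasure list keyed by bucket prefix would let each merge level look up its own dimensions without touching the other levels' configuration.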
Concerning implementation, if the new field is provided, the MergeRollupTask
will have to pass a custom `RecordTransformer` to the
`SegmentProcessorFramework`. This custom record transformer will:
1. For each dimension name, look up the corresponding `fieldSpec` in the
table configuration.
2. Overwrite the existing record value with the `defaultNullValue` from that
`fieldSpec`.
3. Log a warning for each dimension name that is invalid.
Note: The custom record transformer runs before all existing record
transformers. Finally, the rollup process consolidates all records whose
dimension coordinates match. The transformed records should therefore yield a
greater degree of rollup, i.e. fewer output records as a fraction of the
number of input records.
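To make the transform-then-rollup flow concrete, here is a minimal, self-contained sketch; plain maps stand in for Pinot's `GenericRow` and `fieldSpec`, and the real implementation would be a custom `RecordTransformer` handed to the `SegmentProcessorFramework`:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class EraseAndRollup {
  /** Overwrites each erased dimension with its defaultNullValue (steps 1-2 above). */
  public static Map<String, Object> erase(Map<String, Object> record,
      List<String> eraseDims, Map<String, Object> defaultNullValues) {
    Map<String, Object> out = new LinkedHashMap<>(record);
    for (String dim : eraseDims) {
      Object nullValue = defaultNullValues.get(dim);
      if (nullValue == null) {
        // Step 3: warn on invalid dimension names rather than failing the task.
        System.err.println("Skipping invalid dimension: " + dim);
        continue;
      }
      out.put(dim, nullValue);
    }
    return out;
  }

  /** Rolls up records whose dimension coordinates match, summing "metric". */
  public static List<Map<String, Object>> rollup(List<Map<String, Object>> records,
      List<String> dimensions) {
    Map<List<Object>, Map<String, Object>> grouped = new LinkedHashMap<>();
    for (Map<String, Object> record : records) {
      // The grouping key is the record's coordinates on the dimension columns.
      List<Object> key = new ArrayList<>();
      for (String dim : dimensions) {
        key.add(record.get(dim));
      }
      grouped.merge(key, record, (a, b) -> {
        Map<String, Object> merged = new LinkedHashMap<>(a);
        merged.put("metric", (Long) a.get("metric") + (Long) b.get("metric"));
        return merged;
      });
    }
    return new ArrayList<>(grouped.values());
  }
}
```

Two records that differed only in an erased dimension end up with identical coordinates after `erase`, so `rollup` collapses them into one, which is exactly where the improved rollup ratio comes from.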
#### Varying aggregate behaviour over time (abandoned?)
Varying aggregate behaviour over time introduces complexity for
indeterminate gains.
Firstly, sketch precision would have to be configured per metric and per time
period, which muddies how the current task is configured. Example:
```json
"MergeRollupTask": {
"1hour.mergeType": "rollup",
"1hour.bucketTimePeriod": "1h",
"1hour.bufferTimePeriod": "3h",
"1hour.maxNumRecordsPerSegment": "1000000",
"1hour.maxNumRecordsPerTask": "5000000",
"1hour.maxNumParallelBuckets": "5",
"1hour.metricColA.functionParameters": { "nominalEntries": "4096" },
"1hour.metricColB.functionParameters": { "nominalEntries": "8192" },
"1day.mergeType": "rollup",
"1day.bucketTimePeriod": "1d",
"1day.bufferTimePeriod": "1d",
"1day.roundBucketTimePeriod": "1d",
"1day.maxNumRecordsPerSegment": "1000000",
"1day.maxNumRecordsPerTask": "5000000",
"1day.metricColA.functionParameters": { "nominalEntries": "2048" },
"1day.metricColB.functionParameters": { "nominalEntries": "4096" },
"metricColA.aggregationType": "distinctCountThetaSketch",
"metricColB.aggregationType": "distinctCountThetaSketch"
}
```
In the example above, the function parameters are configured within the time
buckets while the aggregation type is configured on the metrics directly,
which might be confusing. Alternatively, the function parameters could be
supplied directly on the metrics, but that still requires per-time-period
configuration for each parameter.
Secondly, and more importantly, varying aggregate behaviour over time can
lead to incorrect results. This is because StarTree indexes are constructed
using the `functionParameters` configuration that is present on the StarTree.
Constructing new trees from merged segments may no longer be possible at the
given `functionParameters` configuration if the underlying aggregates have
varying precision (in the case of sketches). This would not necessarily be a
problem for Apache DataSketches, but it might not hold true for other
aggregation types. Finally, the `canUseStarTree` method consults the
configured `functionParameters` to determine whether queries can be serviced
directly from the StarTree, even though the underlying sketch aggregates
might have a different precision.
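The last point can be illustrated with a hypothetical guard (the method name and parameter layout below are illustrative, not Pinot's actual `canUseStarTree` signature): a query can only be safely answered from the StarTree if the precision the tree was configured with matches the precision of the sketches actually stored in the merged segments.

```java
import java.util.Map;
import java.util.Objects;

public class StarTreePrecisionCheck {
  /**
   * Hypothetical check mirroring the concern above: if segments were merged
   * at a coarser precision than the StarTree's configured functionParameters,
   * serving the query from the tree would silently return results at a
   * different precision than the configuration promises.
   */
  public static boolean canServeFromStarTree(Map<String, String> configuredParams,
      Map<String, String> segmentSketchParams) {
    return Objects.equals(configuredParams.get("nominalEntries"),
        segmentSketchParams.get("nominalEntries"));
  }
}
```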
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]