davecromberge commented on issue #14310:
URL: https://github.com/apache/pinot/issues/14310#issuecomment-2449522350
I have investigated further how both options might be implemented and have
concluded that it is best to pursue dimensionality reduction in a first pass
and re-evaluate later whether varying aggregate behaviour is necessary.
#### Dimensionality reduction/erasure
This option requires additional configuration for each time bucket: a list of
dimension names to erase. In this context, erasing a dimension means
overwriting its value with the `defaultNullValue` from that dimension's
`fieldSpec`.
The configuration might change to include an additional array configuration
field as follows:
```json
"MergeRollupTask": {
"1hour.mergeType": "rollup",
"1hour.bucketTimePeriod": "1h",
"1hour.bufferTimePeriod": "3h",
"1hour.maxNumRecordsPerSegment": "1000000",
"1hour.maxNumRecordsPerTask": "5000000",
"1hour.maxNumParallelBuckets": "5",
"1day.eraseDimensionValues": ["dimColA"],
"1day.mergeType": "rollup",
"1day.bucketTimePeriod": "1d",
"1day.bufferTimePeriod": "1d",
"1day.roundBucketTimePeriod": "1d",
"1day.maxNumRecordsPerSegment": "1000000",
"1day.maxNumRecordsPerTask": "5000000",
"1day.eraseDimensionValues": ["dimColA", "dimColB"],
"metricColA.aggregationType": "sum",
"metricColB.aggregationType": "max"
}
```
In the example above, only `dimColA` is eliminated in the 1 hour merge task,
whereas both `dimColA` and `dimColB` are eliminated in the 1 day merge task.
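As a hypothetical sketch of how the per-bucket configuration might be parsed (the `eraseDimensionValues` key and the comma-separated value encoding are assumptions from this proposal, not an existing Pinot API):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class EraseConfigParser {
  private static final String ERASE_SUFFIX = ".eraseDimensionValues";

  /**
   * Extracts per-bucket dimension-erasure lists from flattened task config
   * keys such as "1day.eraseDimensionValues". Assumes list values arrive as
   * comma-separated strings, e.g. "dimColA,dimColB".
   */
  public static Map<String, List<String>> parse(Map<String, String> taskConfig) {
    Map<String, List<String>> eraseByBucket = new HashMap<>();
    for (Map.Entry<String, String> entry : taskConfig.entrySet()) {
      String key = entry.getKey();
      if (key.endsWith(ERASE_SUFFIX)) {
        // Bucket prefix, e.g. "1day" from "1day.eraseDimensionValues".
        String bucket = key.substring(0, key.length() - ERASE_SUFFIX.length());
        List<String> dims = new ArrayList<>();
        for (String dim : entry.getValue().split(",")) {
          dims.add(dim.trim());
        }
        eraseByBucket.put(bucket, dims);
      }
    }
    return eraseByBucket;
  }
}
```

Keeping the erasure list keyed by bucket prefix would let each merge level look up its own dimensions without touching the other levels' configuration.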
Concerning implementation, if the new field is provided, the MergeRollupTask
will have to pass a custom `RecordTransformer` to the
`SegmentProcessorFramework`. This custom record transformer will:
1. For each dimension name, look up the corresponding `fieldSpec` in the
table configuration.
2. Overwrite the existing record value with the `defaultNullValue` from that
`fieldSpec`.
3. Log a warning for each dimension name that is invalid.
Note: The custom record transformer runs before all existing record
transformers. Finally, the rollup process consolidates all records whose
dimension coordinates match. The transformed records should therefore yield a
greater degree of rollup, i.e. fewer output records as a fraction of the
number of input records.
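To make the transform-then-rollup flow concrete, here is a minimal, self-contained sketch; plain maps stand in for Pinot's `GenericRow` and `fieldSpec`, and the real implementation would be a custom `RecordTransformer` handed to the `SegmentProcessorFramework`:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class EraseAndRollup {
  /** Overwrites each erased dimension with its defaultNullValue (steps 1-2 above). */
  public static Map<String, Object> erase(Map<String, Object> record,
      List<String> eraseDims, Map<String, Object> defaultNullValues) {
    Map<String, Object> out = new LinkedHashMap<>(record);
    for (String dim : eraseDims) {
      Object nullValue = defaultNullValues.get(dim);
      if (nullValue == null) {
        // Step 3: warn on invalid dimension names rather than failing the task.
        System.err.println("Skipping invalid dimension: " + dim);
        continue;
      }
      out.put(dim, nullValue);
    }
    return out;
  }

  /** Rolls up records whose dimension coordinates match, summing "metric". */
  public static List<Map<String, Object>> rollup(List<Map<String, Object>> records,
      List<String> dimensions) {
    Map<List<Object>, Map<String, Object>> grouped = new LinkedHashMap<>();
    for (Map<String, Object> record : records) {
      // The grouping key is the record's coordinates on the dimension columns.
      List<Object> key = new ArrayList<>();
      for (String dim : dimensions) {
        key.add(record.get(dim));
      }
      grouped.merge(key, record, (a, b) -> {
        Map<String, Object> merged = new LinkedHashMap<>(a);
        merged.put("metric", (Long) a.get("metric") + (Long) b.get("metric"));
        return merged;
      });
    }
    return new ArrayList<>(grouped.values());
  }
}
```

Two records that differed only in an erased dimension end up with identical coordinates after `erase`, so `rollup` collapses them into one, which is exactly where the improved rollup ratio comes from.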
#### Varying aggregate behaviour over time (abandoned?)
Varying aggregate behaviour over time introduces complexity for
indeterminate gains.
Firstly, sketch precision would have to be configured per metric and per time
period, which muddies how the current task is configured. Example:
```json
"MergeRollupTask": {
"1hour.mergeType": "rollup",
"1hour.bucketTimePeriod": "1h",
"1hour.bufferTimePeriod": "3h",
"1hour.maxNumRecordsPerSegment": "1000000",
"1hour.maxNumRecordsPerTask": "5000000",
"1hour.maxNumParallelBuckets": "5",
"1hour.metricColA.functionParameters": { "nominalEntries": "4096" },
"1hour.metricColB.functionParameters": { "nominalEntries": "8192" },
"1day.mergeType": "rollup",
"1day.bucketTimePeriod": "1d",
"1day.bufferTimePeriod": "1d",
"1day.roundBucketTimePeriod": "1d",
"1day.maxNumRecordsPerSegment": "1000000",
"1day.maxNumRecordsPerTask": "5000000",
"1day.metricColA.functionParameters": { "nominalEntries": "2048" },
"1day.metricColB.functionParameters": { "nominalEntries": "4096" },
"metricColA.aggregationType": "distinctCountThetaSketch",
"metricColB.aggregationType": "distinctCountThetaSketch"
}
```
In the example above, the function parameters are configured within the time
buckets while the aggregation type is configured on the metrics directly,
which might be confusing. Alternatively, the function parameters could be
supplied directly on the metrics, but that still requires per-time-period
configuration for each parameter.
Secondly, and more importantly, varying aggregate behaviour over time can
lead to incorrect results. This is because StarTree indexes are constructed
using the `functionParameters` configuration that is present on the StarTree.
Constructing new trees from merged segments may no longer be possible at the
given `functionParameters` configuration if the underlying aggregates have
varying precision (in the case of sketches). This would not necessarily be a
problem for Apache DataSketches, but it might not hold true for other
aggregation types. Finally, the `canUseStarTree` method consults the
configured `functionParameters` to determine whether queries can be serviced
directly from the StarTree, even though the underlying sketch aggregates
might have a different precision.
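The last point can be illustrated with a hypothetical guard (the method name and parameter layout below are illustrative, not Pinot's actual `canUseStarTree` signature): a query can only be safely answered from the StarTree if the precision the tree was configured with matches the precision of the sketches actually stored in the merged segments.

```java
import java.util.Map;
import java.util.Objects;

public class StarTreePrecisionCheck {
  /**
   * Hypothetical check mirroring the concern above: if segments were merged
   * at a coarser precision than the StarTree's configured functionParameters,
   * serving the query from the tree would silently return results at a
   * different precision than the configuration promises.
   */
  public static boolean canServeFromStarTree(Map<String, String> configuredParams,
      Map<String, String> segmentSketchParams) {
    return Objects.equals(configuredParams.get("nominalEntries"),
        segmentSketchParams.get("nominalEntries"));
  }
}
```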
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]