davecromberge opened a new issue, #14310:
URL: https://github.com/apache/pinot/issues/14310

   ### What needs to be done?
   
   Extend the merge-rollup framework to create additional transformations:
   - dimensionality reduction/erasure
   - varying aggregate behaviour over time
   
   #### Dimensionality reduction/erasure
   
Eliminate a particular dimension column's values so that more rows become duplicates and can be rolled up together.
   
   For example:
   
| Dimension | Pre-transformation | Post-transformation |
   | --------- | ------------------ | ------------------- |
   | Country   | United States      | United States       |
   | Device    | Mobile             | Mobile              |
   | Browser   | Safari             | Null / Other        |
   
The example above shows the Browser dimension erased, or set to some default value, after a configured time window has passed.
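
   To make the effect concrete, here is a minimal sketch (plain Python with hypothetical column names and counts, not actual Pinot records) of why erasing a dimension increases rollup: rows that previously differed only in Browser become duplicates and their metrics aggregate together.

```python
from collections import Counter

# Hypothetical pre-transformation rows: (country, device, browser, count).
rows = [
    ("United States", "Mobile", "Safari", 10),
    ("United States", "Mobile", "Chrome", 15),
    ("United States", "Mobile", "Firefox", 5),
]

# Erase the Browser dimension by mapping every value to a default ("Other"),
# then roll up: rows that now share all dimension values aggregate together.
rolled_up = Counter()
for country, device, _browser, count in rows:
    rolled_up[(country, device, "Other")] += count

# Three input rows collapse into a single aggregated row.
print(rolled_up)
```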
   
   #### Varying aggregate behaviour over time
   
   Some aggregate values could change precision over time.  The multi-level 
merge functionality can be used to reduce the resolution or precision of 
aggregates for older segments.   This applies primarily to sketches, but could 
also be used for other binary aggregate types.
   
| Sketch | Pre-transformation | Post-transformation |
   | ------ | ------------------ | ------------------- |
   | Theta sketch 1 | 512 KB | 256 KB |
   | Theta sketch 2 | 400 KB | 200 KB |
   | Theta sketch 3 | 512 KB | 256 KB |
   
The example above shows a 2x size reduction on existing sketches, which could be achieved by decrementing the lgK value by 1 as data ages.  Be aware that this could cause varying precision for queries that span time ranges, where the sketch implementation supports this.
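
   The 2x figure follows directly from the sketch size model: a Theta sketch retains at most k = 2^lgK hash entries, so decrementing lgK by 1 halves the maximum footprint while increasing the relative standard error (≈ 1/√k) by a factor of √2. A quick back-of-the-envelope check (plain Python; the 8-bytes-per-retained-entry constant is an illustrative assumption, not the exact on-disk layout):

```python
import math

def theta_sketch_stats(lg_k: int, bytes_per_entry: int = 8):
    """Approximate max size (bytes) and relative standard error of a Theta sketch."""
    k = 2 ** lg_k
    return k * bytes_per_entry, 1.0 / math.sqrt(k)

before_bytes, before_rse = theta_sketch_stats(16)  # lgK = 16 -> ~512 KB, as in the table
after_bytes, after_rse = theta_sketch_stats(15)    # lgK decremented by 1

print(before_bytes / after_bytes)  # 2.0 -> the 2x size reduction
print(after_rse / before_rse)      # ~1.414 -> error grows by sqrt(2)
```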
   
   ### Why the feature is needed (e.g. describing the use case).
   
The primary justification for such a feature is more aggressive space saving for historic data.  As the merge rollup task processes older time windows, users could eliminate non-critical dimensions, resulting in a greater number of documents rolling up into a single aggregate.  Similarly, users could sacrifice aggregate accuracy for historic queries in exchange for a smaller storage footprint - especially when dealing with Theta / Tuple sketches, which can be on the order of megabytes at lgK = 16.
   
   ### Idea on how this may be implemented
   
   Both extensions would require changes to the 
[configuration](https://docs.pinot.apache.org/operators/operating-pinot/minion-merge-rollup-task#configure-the-minion-merge-rollup-task)
 for the Minion Merge rollup task.  In particular, the most flexible approach would be a dynamic bag of properties applying to each individual aggregation function, which would be interpreted before rolling up or merging the data.
   
   #### Dimensionality reduction/erasure
   
- applies to the “map” phase of the [SegmentProcessorFramework](https://github.com/apache/pinot/blob/master/pinot-core/src/main/java/org/apache/pinot/core/segment/processing/framework/SegmentProcessorFramework.java#L167).
- default reducer will function as normal
   - configuration should include:
       - time bucket periods
       - dimension name
       - whether to substitute the column's default value
   - configuration should be part of the merge rollup task / segment refresh config, e.g.:
       - `"dimensionName.eliminate.after": "7d"`
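
   As a sketch of what the table task config could look like (the `eliminate.*` keys are hypothetical and named for illustration only; the `mergeType`/`bucketTimePeriod` keys follow the existing MergeRollupTask convention):

```json
"taskTypeConfigsMap": {
  "MergeRollupTask": {
    "30days.mergeType": "rollup",
    "30days.bucketTimePeriod": "30d",
    "30days.bufferTimePeriod": "30d",
    "browser.eliminate.after": "7d",
    "browser.eliminate.defaultValue": "Other"
  }
}
```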
   
   #### Varying aggregate behaviour over time
   
- applies to the “map” phase of the [SegmentProcessorFramework](https://github.com/apache/pinot/blob/master/pinot-core/src/main/java/org/apache/pinot/core/segment/processing/framework/SegmentProcessorFramework.java#L167).
- configuration could be applied uniformly in a global manner or as part of the specific table task config:
       - hard-coded parameters for Theta and Tuple sketch lgK (cumbersome)
       - dynamic bag of properties associated with each time bucket (hard to validate)
       - not necessary to extend the function name parameter parser
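
   For example, a dynamic bag of per-function properties keyed by time bucket might look like the following (all parameter keys here are hypothetical, for illustration only):

```json
"taskTypeConfigsMap": {
  "MergeRollupTask": {
    "90days.mergeType": "rollup",
    "90days.bucketTimePeriod": "90d",
    "90days.distinctCountThetaSketch.lgK": "15",
    "90days.distinctCountTupleSketch.lgK": "14"
  }
}
```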
   
_Note: This issue should be treated as a PEP request._


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

