[GitHub] [pinot] lksvenoy-r7 opened a new issue, #9082: Support multi-valued dimensions in the StarTree index

GitBox Thu, 21 Jul 2022 03:43:10 -0700


lksvenoy-r7 opened a new issue, #9082:
URL: https://github.com/apache/pinot/issues/9082


   Currently, the star-tree index can not be used with multi-valued dimensional 
columns. The problem with supporting multi-valued columns in the split order of 
the star-tree index is that the behaviour for splitting on multiple values in 
the split-order would be defined, as the behaviour is no longer additive.
   
   This ticket is a suggestion to implement support for multi-valued 
dimensional columns in the split order for the star-tree index, with a simply 
constraint of only allowing one specific value to be used when splitting for 
the index.
   
   Suppose we have a table which has the following dimensions:
   ```
   entityId: String
   groupIds: List of strings
   status: String
   ```
   And a query looking like this
   `SELECT COUNT(*) FROM table WHERE entityId = 'ID' AND groupIds CONTAINS 
'group_1' AND status = 'ONLINE'`
   
   If there are many rows here, this could be a very intense query to resolve. 
If the star tree index could pre-aggregate on each value in groupId, we could 
make this lookup much more performant. This is a problem that we are facing 
today. A group can contain multiple entities, and an entity can be in multiple 
groups. This many to many relationship means there is a dynamic constraint on 
our queries.
   
   Some queries will ignore group (that would be the star node), while some 
need to access the aggregation on a group basis.
   
   The problem with implementing support for this, is that it wouldn't work for 
a query defined as this:
   `SELECT COUNT(*) FROM table WHERE entityId = 'ID' AND groupIds IN 
('group_1', 'group_2') AND status = 'ONLINE'`
   
   This behaviour would be undefined, as there is no way to create one 
aggregation per groupId if we also need to handle all the permutations of 
possible values in the multi-valued column. This is why I propose that we 
support multi-valued columns for the star-tree index with the constraint that 
each value would have an aggregation associated with it, but not allowing it to 
span a selection of multiple values.
   
   For example..
   
   With the following star-tree index:
   ```
   "starTreeIndexConfigs": [
           {
             "dimensionsSplitOrder": [
               "entityId",
               "groupIds",
               "status"
             ],
             "skipStarNodeCreationForDimensions": [],
             "functionColumnPairs": [
               "COUNT__*"
             ],
             "maxLeafRecords": 10000
           }
         ],
   ```
   This would be possible
   ```
   entity_1 -> count(*) = 500
      entity_1 -> group_1 -> count(*) = (200)
          entity_1 -> group_1 -> ONLINE -> count(*) = 100
          entity_1 -> group_1 -> OFFLINE -> count(*) = 100
      entity_1 -> group_2 -> count(*) = (300)
          entity_1 -> group_2 -> ONLINE -> count(*) = 150
          entity_1 -> group_2 -> OFFLINE -> count(*) = 150
   ```
      
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [pinot] lksvenoy-r7 opened a new issue, #9082: Support multi-valued dimensions in the StarTree index

Reply via email to