[GitHub] [pinot] gortiz commented on pull request #10553: Allow to overwrite index configs at tier level

via GitHub Tue, 11 Apr 2023 00:52:41 -0700


gortiz commented on PR #10553:
URL: https://github.com/apache/pinot/pull/10553#issuecomment-1502856849


   In order to add the ability to override configs by adding a tiered field, 
here you follow a pattern that is quite common in Pinot but I tried to avoid in 
`index-spi`. This pattern is the following:
   1. We add a new concept that modify several different parts of the code.
   2. This new concept change how different parts of the code behave.
   3. We drag this concept until the specific parts of the code that are 
actually modified by the new concept.
   
   A clear example of this (used in the index spi PEP) is the bloom filter 
config. The first Pinot versions that supported bloom filter expected to 
configure them in `bloomFilterColumns` attribute. Then we wanted to be able to 
configure the bloom filter on each column, so we added `bloomFilterConfigs`. 
Then each time we wanted to know the bloom filter config for a column we had to 
first look for the column in `bloomFIlterConfig` and in case we don't find it, 
look for it in `bloomFilterColumns` (and if we found it there we generate a 
default config). Not being centralized, we needed to do that several times in 
different parts of the code. This is a problem because each time we add a new 
concept like that we have to be sure we touch all the places where this logic 
is affected. In index-spi PEP I said that this is a problem I associate with 
not having different models for user specific syntax (in this case, the 
different ways a user may define a bloom filter) and internal and developer 
 oriented syntax (where we don't care about the different ways the user may 
configure a bloom filter, we just want them to be unified in a single model). 
`index-spi` solved this problem by extracting all the different ways an index 
can be configured in the TableConfig (customer specific syntax/model) and 
returning a single config object (developer specific model).
   
   In this PR I think we are falling in the same pattern. The new concept is 
the ability to override configs by tier. The customer specific syntax is still 
the TableConfig and instead of centralize the transformation between how the 
user writes something and how the code reacts to it, we are again spreading the 
logic in several parts of the code. In this case, index types and startree 
indexes need to know about tiered (when in fact they don't actually care). IICU 
the code, encodingType is not actually being override, so we should also add 
this logic in at least another place.
   
   Instead of doing this, I suggest to try to follow another pattern. For 
example:
   1. We still add `tiered` concept in different places in TableConfig
   2. We add a preprocessor layer where the original TableConfig with N tiered 
configs is transformed into N+1 TableConfig objects without tiered.
   3. For each N+1 TableConfig, we apply the logic we have right now.
   
   I think it is easier to see this with an example. In this PR you modifies 
the airlineStats_offline_table_config.json in order to add the following in 
FieldConfigList (I'm going to ignore the changes in Startree indexes, but it 
would also apply there):
   ```js
   {
         "name": "ArrTimeBlk",
         "encodingType": "DICTIONARY",
         "indexes": {
           "inverted": {
             "enabled": "true"
           }
         },
         "tierOverwrites": {
           "hotTier": {
             "encodingType": "DICTIONARY",
             "indexes": {
               "bloom": {
                 "enabled": "true"
               }
             }
           },
           "coldTier": {
             "encodingType": "RAW",
             "indexes": {
               "text": {
                 "enabled": "true"
               }
             }
           }
         }
   ```
   
   This means that we have 2 tiers plus the default one, so the preprocessor 
would generate 3 tables that will be equal except for the FieldConfig named 
`ArrTimeBlk`, where we will have 3 different versions:
   
   The default would be:
   ```js
         "name": "ArrTimeBlk",
         "encodingType": "DICTIONARY",
         "indexes": {
           "inverted": {
             "enabled": "true"
           }
         }
   ```
   
   The hotTier would be:
   ```js
         "name": "ArrTimeBlk",
         "encodingType": "DICTIONARY",
         "indexes": {
           "inverted": {
             "enabled": "true"
           },
           "bloom": {
             "enabled": "true"
           }
         }
   ```
   
   The coldTier would be:
   
   ```js
         "name": "ArrTimeBlk",
         "encodingType": "RAW", // This was also modified!
         "indexes": {
           "inverted": {
             "enabled": "true"
           },
           "indexes": {
             "text": {
               "enabled": "true"
             }
           }
         }
   ```
   
   In these 3 new TableConfigs there is no tierOverride, so we can read indexes 
(and any other config!) without actually needed to know about them. 
   
   The preprocessor should not be very complex. It needs to know how to merge 
jsons (which is not that difficult) and needs to verify that the thing that has 
been override is correct (for example in this case we could forbid to override 
the name).
   
   In the abstract void, this approach seems more elegant than the other. The 
separation of responsibilities is clear: The preprocessor takes care of the 
tieredOverride, the IndexType.getConfig() take a TableConfig without overrides 
and generates an index config. The main issue here is that I don't know how 
much it may affect the rest of the code. In fact I would like to use different 
classes for TableConfigWithTieredOverride and TieredConfigWithoutOverride, but 
that would imply so many changes in the code that I don't think it would be 
feasible right now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [pinot] gortiz commented on pull request #10553: Allow to overwrite index configs at tier level

Reply via email to