gortiz commented on PR #10553:
URL: https://github.com/apache/pinot/pull/10553#issuecomment-1502856849
In order to add the ability to override configs by adding a tiered field,
here you follow a pattern that is quite common in Pinot but I tried to avoid in
`index-spi`. This pattern is the following:
1. We add a new concept that modify several different parts of the code.
2. This new concept change how different parts of the code behave.
3. We drag this concept until the specific parts of the code that are
actually modified by the new concept.
A clear example of this (used in the index spi PEP) is the bloom filter
config. The first Pinot versions that supported bloom filter expected to
configure them in `bloomFilterColumns` attribute. Then we wanted to be able to
configure the bloom filter on each column, so we added `bloomFilterConfigs`.
Then each time we wanted to know the bloom filter config for a column we had to
first look for the column in `bloomFIlterConfig` and in case we don't find it,
look for it in `bloomFilterColumns` (and if we found it there we generate a
default config). Not being centralized, we needed to do that several times in
different parts of the code. This is a problem because each time we add a new
concept like that we have to be sure we touch all the places where this logic
is affected. In index-spi PEP I said that this is a problem I associate with
not having different models for user specific syntax (in this case, the
different ways a user may define a bloom filter) and internal and developer
oriented syntax (where we don't care about the different ways the user may
configure a bloom filter, we just want them to be unified in a single model).
`index-spi` solved this problem by extracting all the different ways an index
can be configured in the TableConfig (customer specific syntax/model) and
returning a single config object (developer specific model).
In this PR I think we are falling in the same pattern. The new concept is
the ability to override configs by tier. The customer specific syntax is still
the TableConfig and instead of centralize the transformation between how the
user writes something and how the code reacts to it, we are again spreading the
logic in several parts of the code. In this case, index types and startree
indexes need to know about tiered (when in fact they don't actually care). IICU
the code, encodingType is not actually being override, so we should also add
this logic in at least another place.
Instead of doing this, I suggest to try to follow another pattern. For
example:
1. We still add `tiered` concept in different places in TableConfig
2. We add a preprocessor layer where the original TableConfig with N tiered
configs is transformed into N+1 TableConfig objects without tiered.
3. For each N+1 TableConfig, we apply the logic we have right now.
I think it is easier to see this with an example. In this PR you modifies
the airlineStats_offline_table_config.json in order to add the following in
FieldConfigList (I'm going to ignore the changes in Startree indexes, but it
would also apply there):
```js
{
"name": "ArrTimeBlk",
"encodingType": "DICTIONARY",
"indexes": {
"inverted": {
"enabled": "true"
}
},
"tierOverwrites": {
"hotTier": {
"encodingType": "DICTIONARY",
"indexes": {
"bloom": {
"enabled": "true"
}
}
},
"coldTier": {
"encodingType": "RAW",
"indexes": {
"text": {
"enabled": "true"
}
}
}
}
```
This means that we have 2 tiers plus the default one, so the preprocessor
would generate 3 tables that will be equal except for the FieldConfig named
`ArrTimeBlk`, where we will have 3 different versions:
The default would be:
```js
"name": "ArrTimeBlk",
"encodingType": "DICTIONARY",
"indexes": {
"inverted": {
"enabled": "true"
}
}
```
The hotTier would be:
```js
"name": "ArrTimeBlk",
"encodingType": "DICTIONARY",
"indexes": {
"inverted": {
"enabled": "true"
},
"bloom": {
"enabled": "true"
}
}
```
The coldTier would be:
```js
"name": "ArrTimeBlk",
"encodingType": "RAW", // This was also modified!
"indexes": {
"inverted": {
"enabled": "true"
},
"indexes": {
"text": {
"enabled": "true"
}
}
}
```
In these 3 new TableConfigs there is no tierOverride, so we can read indexes
(and any other config!) without actually needed to know about them.
The preprocessor should not be very complex. It needs to know how to merge
jsons (which is not that difficult) and needs to verify that the thing that has
been override is correct (for example in this case we could forbid to override
the name).
In the abstract void, this approach seems more elegant than the other. The
separation of responsibilities is clear: The preprocessor takes care of the
tieredOverride, the IndexType.getConfig() take a TableConfig without overrides
and generates an index config. The main issue here is that I don't know how
much it may affect the rest of the code. In fact I would like to use different
classes for TableConfigWithTieredOverride and TieredConfigWithoutOverride, but
that would imply so many changes in the code that I don't think it would be
feasible right now.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]