npawar opened a new issue #5135: Transform functions in Pinot schema URL: https://github.com/apache/incubator-pinot/issues/5135

Consider:
- X: Data at the source. This can be either a stream or data files. The formats are typically JSON, Avro, CSV, etc.
- Y: Data in Pinot, i.e. the record/document in Pinot.

When data is ingested into Pinot (either realtime ingestion or batch ingestion), every column in X must map directly to a column in Y. The only exception is the time column, where we allow transformation from one time format to another, but we are limited to a single column. This means that every column in the destination schema must be present exactly as-is in the source schema (except the time column).

This is not always practical. It is often desirable to apply some amount of transformation to the source columns before they reach the destination. For example, consider this sample ads data schema.

Source columns - **userID, name.firstName, name.lastName, IP, eventType, cost, timestamp**
```
{ "userID": 1, "name": { "firstName": "John", "lastName": "Doe"}, "IP": "10.1.2.3", "eventType": "IMPRESSION", "cost": 2000, "timestamp": 1583882502198 },
{ "userID": 2, "name": { "firstName": "Mary", "lastName": "Smith"}, "IP": "10.5.6.7", "eventType": "IMPRESSION", "cost": 4000, "timestamp": 1583882502198 },
{ "userID": 3, "name": { "firstName": "Rita", "lastName": "Skeeter"}, "IP": "10.9.8.7", "eventType": "CLICK", "cost": 600, "timestamp": 1583882502198 }
```

Destination columns - **userId, fullName, country, zipcode, impressions, clicks, cost, hoursSinceEpoch, daysSinceEpoch**
- userId - map userID to userId
- fullName - concat name.firstName and name.lastName
- country - extract country from IP
- zipcode - extract zipcode from IP
- impressions - 1 if eventType=IMPRESSION, 0 otherwise
- clicks - 1 if eventType=CLICK, 0 otherwise
- cost - maps directly from cost, no transformation
- hoursSinceEpoch - convert timestamp to epoch hours
- daysSinceEpoch - convert timestamp to epoch days

Today, the only way to achieve this in Pinot is for the user to write a custom transformation job that prepares the data according to the destination schema.

Hence, the motivations for this proposal are as follows:
1. Source and destination are not always 1:1 - users have to write a transformation job, separately for realtime and for offline, which can lead to inconsistencies. It also adds an extra step to user onboarding.
2. Be able to read nested source data fields.
3. Be able to support multiple time columns - in order to use dateTimeFieldSpec, we need support for derived functions.
4. Be able to share transformation functions across use cases, instead of each user writing one for themselves.
5. Better schema evolution - when a new column is added, if it is derived from existing columns it can be backfilled with the correct values instead of default null values.
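To make the desired mapping concrete, here is a minimal sketch of the per-record transformations described above, written in plain Python rather than any proposed Pinot syntax. The `lookup_country` and `lookup_zipcode` helpers are hypothetical stand-ins for a real IP-geolocation lookup, which is outside the scope of this example.

```python
def lookup_country(ip):
    # Hypothetical stub: a real implementation would query a geo-IP database.
    return "UNKNOWN"

def lookup_zipcode(ip):
    # Hypothetical stub: a real implementation would query a geo-IP database.
    return "00000"

def transform(src):
    """Map one source record (X) to one destination record (Y)."""
    ts_millis = src["timestamp"]
    return {
        "userId": src["userID"],                       # simple rename
        "fullName": src["name"]["firstName"] + " " + src["name"]["lastName"],
        "country": lookup_country(src["IP"]),
        "zipcode": lookup_zipcode(src["IP"]),
        "impressions": 1 if src["eventType"] == "IMPRESSION" else 0,
        "clicks": 1 if src["eventType"] == "CLICK" else 0,
        "cost": src["cost"],                           # direct mapping
        "hoursSinceEpoch": ts_millis // (1000 * 60 * 60),
        "daysSinceEpoch": ts_millis // (1000 * 60 * 60 * 24),
    }

record = {"userID": 1, "name": {"firstName": "John", "lastName": "Doe"},
          "IP": "10.1.2.3", "eventType": "IMPRESSION", "cost": 2000,
          "timestamp": 1583882502198}
print(transform(record))
```

Note that `hoursSinceEpoch` and `daysSinceEpoch` are both derived from the same source `timestamp`, which is exactly the multiple-time-column case the current single-time-column transformation cannot express.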
