npawar opened a new issue #5135: Transform functions in Pinot schema URL: https://github.com/apache/incubator-pinot/issues/5135

Consider:
- X: Data at the source. This can be either a stream or data files. The formats are typically JSON, Avro, CSV, etc.
- Y: Data in Pinot, i.e. the record/document in Pinot.

When data is ingested into Pinot (either realtime ingestion or batch ingestion), every column in X must map directly to a column in Y. The only exception is the time column, where we allow transformation from one time format to another, but we are limited to a single column. This means that every column in the destination schema must be present exactly as-is in the source schema (except the time column).

This is not always practical. It is often desirable to apply some amount of transformation to the source columns before they reach the destination. For example, consider this sample ads data schema.

Source columns - **userID, name.firstName, name.lastName, IP, eventType, cost, timestamp**
```
{ "userID": 1, "name": { "firstName": "John", "lastName": "Doe"}, "IP": "10.1.2.3", "eventType": "IMPRESSION", "cost": 2000, "timestamp": 1583882502198 },
{ "userID": 2, "name": { "firstName": "Mary", "lastName": "Smith"}, "IP": "10.5.6.7", "eventType": "IMPRESSION", "cost": 4000, "timestamp": 1583882502198 },
{ "userID": 3, "name": { "firstName": "Rita", "lastName": "Skeeter"}, "IP": "10.9.8.7", "eventType": "CLICK", "cost": 600, "timestamp": 1583882502198 }
```

Destination columns - **userId, fullName, country, zipcode, impressions, clicks, cost, hoursSinceEpoch, daysSinceEpoch**
- userId - map userID to userId
- fullName - concat name.firstName and name.lastName
- country - extract country from IP
- zipcode - extract zipcode from IP
- impressions - 1 if eventType=IMPRESSION, 0 otherwise
- clicks - 1 if eventType=CLICK, 0 otherwise
- cost - maps directly from cost, no transformation
- hoursSinceEpoch - convert timestamp to epoch hours
- daysSinceEpoch - convert timestamp to epoch days

Today, the only way to achieve this in Pinot is for the user to write a custom transformation job that prepares the data according to the destination schema.

Hence, the motivations for this proposal are as follows:
1. Source and destination are not always 1:1 - users have to write a transformation job, separately for realtime and for offline, which can lead to inconsistencies. It also adds an extra step to user onboarding.
2. Be able to read nested source data fields.
3. Be able to support multiple time columns - in order to use dateTimeFieldSpec, we need support for derived functions.
4. Be able to share transformation functions across use cases, instead of each user writing one for themselves.
5. Better schema evolution - when a new column is added, if it is derived from existing columns it can be backfilled with the correct values instead of default null values.
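To make the desired mapping concrete, here is a minimal sketch of the per-record transformations described above, written in plain Python rather than any proposed Pinot syntax. The `lookup_country` and `lookup_zipcode` helpers are hypothetical stand-ins for a real IP-geolocation lookup, which is outside the scope of this example.

```python
def lookup_country(ip):
    # Hypothetical stub: a real implementation would query a geo-IP database.
    return "UNKNOWN"

def lookup_zipcode(ip):
    # Hypothetical stub: a real implementation would query a geo-IP database.
    return "00000"

def transform(src):
    """Map one source record (X) to one destination record (Y)."""
    ts_millis = src["timestamp"]
    return {
        "userId": src["userID"],                       # simple rename
        "fullName": src["name"]["firstName"] + " " + src["name"]["lastName"],
        "country": lookup_country(src["IP"]),
        "zipcode": lookup_zipcode(src["IP"]),
        "impressions": 1 if src["eventType"] == "IMPRESSION" else 0,
        "clicks": 1 if src["eventType"] == "CLICK" else 0,
        "cost": src["cost"],                           # direct mapping
        "hoursSinceEpoch": ts_millis // (1000 * 60 * 60),
        "daysSinceEpoch": ts_millis // (1000 * 60 * 60 * 24),
    }

record = {"userID": 1, "name": {"firstName": "John", "lastName": "Doe"},
          "IP": "10.1.2.3", "eventType": "IMPRESSION", "cost": 2000,
          "timestamp": 1583882502198}
print(transform(record))
```

Note that `hoursSinceEpoch` and `daysSinceEpoch` are both derived from the same source `timestamp`, which is exactly the multiple-time-column case the current single-time-column transformation cannot express.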
