npawar opened a new pull request #5238: Evaluate schema transform expressions 
during ingestion
URL: https://github.com/apache/incubator-pinot/pull/5238
 
 
   This PR adds support for executing transform functions defined in the schema. The transformation happens during ingestion. Detailed design doc listing all changes and the design: [Column Transformation during ingestion in Pinot](https://docs.google.com/document/d/13BywJncHrLAFLm-qy4kfKaPxXfAg9XE5v3_fk9sGVSo/edit?usp=sharing)
   
   The main changes are:
   1. A `transformFunction` can be defined in the schema for any fieldSpec, using Groovy. The convention is:
   `"transformFunction": "FunctionType({function}, argument1, argument2,...argumentN)"`
   For example: `"transformFunction" : "Groovy({firstName + ' ' + lastName}, firstName, lastName)"`
   2. The `RecordReader` provides the Pinot schema to the `SourceFieldNamesExtractor` utility to get the source field names.
   3. A `RecordExtractor` interface is introduced, one per input format. The `RecordReader` passes the source field names and the source record to the `RecordExtractor`, which produces the destination `GenericRow`.
   4. The `ExpressionTransformer` creates an `ExpressionEvaluator` for each transform function and executes the functions.
   5. The `ExpressionTransformer` runs before all other `RecordTransformer`s, so that every subsequent transformer sees the real values.
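   For concreteness, a fieldSpec carrying the example transform above might look like this in the schema (column names are illustrative, not from the PR):

   ```json
   {
     "dimensionFieldSpecs": [
       {
         "name": "fullName",
         "dataType": "STRING",
         "transformFunction": "Groovy({firstName + ' ' + lastName}, firstName, lastName)"
       }
     ]
   }
   ```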
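   The ordering in point 5 can be sketched as follows. This is a minimal illustration with hypothetical stand-in classes (plain `Map` instead of `GenericRow`, hard-coded logic instead of a Groovy evaluator), not Pinot's actual API:

   ```java
   import java.util.HashMap;
   import java.util.Map;

   public class TransformOrderSketch {

     interface RecordTransformer {
       Map<String, Object> transform(Map<String, Object> row);
     }

     // Stand-in for the ExpressionTransformer: derives fullName from
     // firstName/lastName, mimicking Groovy({firstName + ' ' + lastName}, ...).
     static class ExpressionTransformer implements RecordTransformer {
       public Map<String, Object> transform(Map<String, Object> row) {
         Object first = row.get("firstName");
         Object last = row.get("lastName");
         if (first != null && last != null) { // skip if any input is null (current behavior)
           row.put("fullName", first + " " + last);
         }
         return row;
       }
     }

     // Stand-in for a later transformer (e.g. DataTypeTransformer): it already
     // sees the derived fullName column because it runs second.
     static class TrimTransformer implements RecordTransformer {
       public Map<String, Object> transform(Map<String, Object> row) {
         row.replaceAll((k, v) -> v instanceof String ? ((String) v).trim() : v);
         return row;
       }
     }

     public static void main(String[] args) {
       Map<String, Object> row = new HashMap<>();
       row.put("firstName", "Jane");
       row.put("lastName", "Doe");
       // ExpressionTransformer goes first, so every subsequent transformer
       // operates on the real (derived) values.
       RecordTransformer[] pipeline = {new ExpressionTransformer(), new TrimTransformer()};
       for (RecordTransformer t : pipeline) {
         row = t.transform(row);
       }
       System.out.println(row.get("fullName"));
     }
   }
   ```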
   
   
   **I'll be happy to break this down into smaller PRs if this is getting too big to review.**
   
   
   **Pending**
   1) Add transform functions to an integration test
   2) JMH benchmarks
   
   **Some open questions**
   1. We no longer know the data type of the source fields. This is a problem for CSV and JSON.
   **CSV**
   a) Everything is read as a String until the `DataTypeTransformer`; the transform function has to handle type conversion itself.
   b) Cannot distinguish an MV column containing a single value from an SV column; the function has to handle this.
   c) All empty values become null values; cannot distinguish a genuine "" from null for Strings.
   **JSON**
   a) Cannot distinguish between INT/LONG and DOUBLE/FLOAT until the `DataTypeTransformer`.
   2. What should we do if any input to the transform function is null? Currently, execution of the function is skipped. But should we make it the function's responsibility to handle nulls?
   3. `KafkaJSONDecoder` needs to create a `JSONRecordExtractor` by default, but we cannot access the `JSONRecordExtractor` of the `input-format` module from the `stream-ingestion` module. We did not face this problem with Avro, because everything is in `input-format`.
   4. Before the `ExpressionTransformer`, the `GenericRow` contains only **source** columns. After the `ExpressionTransformer`, it contains **source + destination** columns, all the way up to indexing. Should we introduce a transformer that creates a new `GenericRow` with only the destination columns, to avoid the memory consumed by the extra columns?
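   To illustrate question 1(a): with a CSV source, the Groovy expression itself would have to coerce the String inputs, along these lines (hypothetical column names, not from the PR):

   ```json
   {
     "metricFieldSpecs": [
       {
         "name": "totalCents",
         "dataType": "LONG",
         "transformFunction": "Groovy({Long.parseLong(dollars) * 100 + Long.parseLong(cents)}, dollars, cents)"
       }
     ]
   }
   ```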
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 