npawar opened a new pull request #5238: Evaluate schema transform expressions during ingestion
URL: https://github.com/apache/incubator-pinot/pull/5238

This PR adds support for executing transform functions written in the schema. The transformation happens during ingestion. Detailed design doc listing all changes and design: [Column Transformation during ingestion in Pinot](https://docs.google.com/document/d/13BywJncHrLAFLm-qy4kfKaPxXfAg9XE5v3_fk9sGVSo/edit?usp=sharing)

The main changes are:

1. A `transformFunction` can be written in the schema for any fieldSpec, using Groovy. The convention to follow is `"transformFunction": "FunctionType({function}, argument1, argument2,...argumentN)"`. For example: `"transformFunction" : "Groovy({firstName + ' ' + lastName}, firstName, lastName)"`
2. The `RecordReader` provides the Pinot schema to the `SourceFieldNamesExtractor` utility to get the source field names.
3. A `RecordExtractor` interface is introduced, one per input format. The `RecordReader` passes the source field names and the source record to the `RecordExtractor`, which produces the destination `GenericRow`.
4. The `ExpressionTransformer` creates an `ExpressionEvaluator` for each transform function and executes the functions.
5. The `ExpressionTransformer` runs before all other `RecordTransformer`s, so that every other transformer sees the real values.

**I'll be happy to break this down into smaller PRs if this is getting too big to review.**

**Pending**
1) Add transform functions to some integration test
2) JMH benchmarks

**Some open questions**

1. We no longer know the data type of the source fields. This is a problem in CSV and JSON.

   **CSV**
   a) Everything is read as a String, right up until the `DataTypeTransformer`. The function will have to take care of the type conversion.
   b) Cannot distinguish a multi-value column containing a single value from a single-value column. The function will have to take care of this.
   c) All empty values will be null values.
   Cannot distinguish a genuine empty String "" from null.

   **JSON**
   a) Cannot distinguish between INT/LONG and DOUBLE/FLOAT until the `DataTypeTransformer`.

2. What should we do if any of the inputs to the transform function is null? Currently, it is skipped. But should we make it the responsibility of the function to handle this?

3. `KafkaJSONDecoder` needs to create a `JSONRecordExtractor` by default, but we cannot access the `JSONRecordExtractor` of the `input-format` module from the `stream-ingestion` module. We did not face this problem with Avro, because everything is in `input-format`.

4. Before the `ExpressionTransformer`, the `GenericRecord` contains only **source** columns. After the `ExpressionTransformer`, the `GenericRecord` contains **source + destination** columns, all the way up to indexing. Should we introduce a transformer that creates a new `GenericRecord` with only the destination columns, to avoid the memory consumed by the extra columns?
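To make the `transformFunction` convention from change (1) concrete, a schema fieldSpec carrying a transform might look like the sketch below. The `fullName`/`firstName`/`lastName` fields are hypothetical and the surrounding schema layout is abbreviated; the design doc linked above is the authoritative reference.

```json
{
  "schemaName": "exampleSchema",
  "dimensionFieldSpecs": [
    { "name": "firstName", "dataType": "STRING" },
    { "name": "lastName", "dataType": "STRING" },
    {
      "name": "fullName",
      "dataType": "STRING",
      "transformFunction": "Groovy({firstName + ' ' + lastName}, firstName, lastName)"
    }
  ]
}
```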
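The extractor/transformer flow described in changes (2)–(5) can be sketched in miniature as follows. This is a Python stand-in for illustration only: the names `extract_source_fields` and `ExpressionTransformer` are hypothetical, a plain lambda stands in for the compiled Groovy expression, and the null-input behaviour mirrors the current "skip" semantics from open question 2.

```python
import re

# Matches the convention "Groovy({expression}, arg1, arg2, ...)" and
# captures the trailing argument list (the source field names).
TRANSFORM_PATTERN = re.compile(r"Groovy\(\{.*?\},\s*(?P<args>.*)\)")

def extract_source_fields(transform_function: str):
    """Parse 'Groovy({expr}, arg1, arg2)' and return ['arg1', 'arg2']."""
    match = TRANSFORM_PATTERN.fullmatch(transform_function.strip())
    if match is None:
        return []
    return [arg.strip() for arg in match.group("args").split(",")]

class ExpressionTransformer:
    """Runs before all other transformers, so downstream transformers
    see the derived (destination) values."""

    def __init__(self, transforms):
        # transforms: destination column -> (source field names, callable
        # standing in for the evaluated Groovy expression)
        self.transforms = transforms

    def transform(self, row):
        for dest, (sources, fn) in self.transforms.items():
            values = [row.get(s) for s in sources]
            if any(v is None for v in values):
                continue  # current behaviour: skip when any input is null
            row[dest] = fn(*values)
        return row

fields = extract_source_fields(
    "Groovy({firstName + ' ' + lastName}, firstName, lastName)")
# fields == ['firstName', 'lastName']

transformer = ExpressionTransformer({
    "fullName": (fields, lambda first, last: first + " " + last),
})
row = transformer.transform({"firstName": "Ada", "lastName": "Lovelace"})
# row["fullName"] == "Ada Lovelace"
```

Note that after `transform` the row holds both source and destination columns, which is exactly the memory concern raised in open question 4.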
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
