npawar opened a new pull request #5934:
URL: https://github.com/apache/incubator-pinot/pull/5934
## Description
A Segment Processing Framework to convert "m" segments to "n" segments
The phases of the Segment Processor are:
1. **Map**
- Record transformation (using transform functions)
- Partitioning (based on column value, a transform function, or the table config's partition config)
- Partition filtering (using a filter function)
2. **Reduce**
- Rollup/concat of records
- Split into parts
3. **Segment generation**
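The three phases above can be sketched end to end on in-memory rows. This is a hypothetical, simplified illustration, not the framework's actual code; the `Row` record and the floor-based rounding are assumptions for the example:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SegmentProcessorSketch {
    // Hypothetical record: a timestamp and a click count.
    record Row(long epochMillis, long clicks) {}

    public static void main(String[] args) {
        List<Row> input = List.of(
            new Row(1_000_123L, 2), new Row(1_000_456L, 3), new Row(90_000_000L, 5));

        long dayMillis = 86_400_000L;

        // Map: transform (round the timestamp down to the day) and partition
        // on the transformed value.
        Map<Long, List<Row>> partitions = new HashMap<>();
        for (Row r : input) {
            long day = (r.epochMillis() / dayMillis) * dayMillis;
            partitions.computeIfAbsent(day, k -> new ArrayList<>())
                      .add(new Row(day, r.clicks()));
        }

        // Reduce: roll up rows within each partition by summing clicks.
        List<Row> output = new ArrayList<>();
        for (Map.Entry<Long, List<Row>> e : partitions.entrySet()) {
            long sum = e.getValue().stream().mapToLong(Row::clicks).sum();
            output.add(new Row(e.getKey(), sum));
        }

        // Segment generation: the real framework would now write the
        // rolled-up rows out as Pinot segments.
        System.out.println(output.size());  // prints 2
    }
}
```

With the sample rows, the first two records land in the same day partition and roll up into one row, so three input rows become two output rows.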
A `SegmentProcessorFrameworkCommand` is provided to run this on demand:
`bin/pinot-admin.sh SegmentProcessorFramework -segmentProcessorFrameworkSpec /<path>/spec.json`
where `spec.json` looks like (the `//` comments are annotations only and must be removed from an actual JSON file):
```
{
  "inputSegmentsDir": "/<base_dir>/segmentsDir",
  "outputSegmentsDir": "/<base_dir>/outputDir/",
  "schemaFile": "/<base_dir>/schema.json",
  "tableConfigFile": "/<base_dir>/table.json",
  "recordTransformerConfig": {
    "transformFunctionsMap": {
      "epochMillis": "round(epochMillis, 86400000)" // round to nearest day
    }
  },
  "partitioningConfig": {
    "partitionerType": "COLUMN_VALUE", // partition on epochMillis
    "columnName": "epochMillis"
  },
  "collectorConfig": {
    "collectorType": "ROLLUP", // roll up clicks by summing
    "aggregatorTypeMap": {
      "clicks": "SUM"
    }
  },
  "segmentConfig": {
    "maxNumRecordsPerSegment": 200000
  }
}
```
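The `maxNumRecordsPerSegment` setting bounds segment size: if the reducer output for a partition exceeds it, the output is split across multiple segments. A minimal sketch of that split, with hypothetical record counts:

```java
public class SegmentSplitSketch {
    public static void main(String[] args) {
        // Hypothetical: 450,000 rolled-up records, with at most
        // maxNumRecordsPerSegment = 200,000 records per segment.
        int totalRecords = 450_000;
        int maxPerSegment = 200_000;
        // Ceiling division: every segment is full except possibly the last.
        int numSegments = (totalRecords + maxPerSegment - 1) / maxPerSegment;
        System.out.println(numSegments);  // prints 3
    }
}
```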
Note:
1. Currently the framework does not parallelize the map/reduce/segment-creation jobs: each input file is processed sequentially in the map phase, each part is processed sequentially in the reduce phase, and segments are built one after another. This can be changed in the future if the need arises.
2. The framework assumes there is enough memory to hold all records of a partition in memory during rollups in the reducer. As a safety measure, the reducer is limited to collecting 5M records before forcing a flush. If memory becomes a problem, off-heap processing could be considered in the future.
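The forced-flush safety measure can be illustrated with a small sketch. The class and method names here are hypothetical, and a tiny limit is used in place of 5M so the behavior is visible:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical reducer that flushes whenever its buffer reaches a record limit,
// keeping memory usage bounded regardless of partition size.
public class FlushingReducerSketch {
    private final int maxRecords;
    private final List<String> buffer = new ArrayList<>();
    private int flushes = 0;

    FlushingReducerSketch(int maxRecords) {
        this.maxRecords = maxRecords;
    }

    void collect(String record) {
        buffer.add(record);
        if (buffer.size() >= maxRecords) {
            flush();  // forced flush once the limit is hit
        }
    }

    void flush() {
        // The real reducer would roll up and persist the buffered records here.
        buffer.clear();
        flushes++;
    }

    int getFlushes() {
        return flushes;
    }

    public static void main(String[] args) {
        // Limit of 3 stands in for the framework's 5M-record limit.
        FlushingReducerSketch reducer = new FlushingReducerSketch(3);
        for (int i = 0; i < 7; i++) {
            reducer.collect("row-" + i);
        }
        System.out.println(reducer.getFlushes());  // prints 2
    }
}
```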
This framework will typically be used by minion tasks that need to perform some processing on segments (e.g. a task that merges segments, or a task that aligns segments to time boundaries). The existing segment merge jobs can be changed to use this framework.
**Pending**
An end-to-end test for the framework (WIP)