npawar opened a new pull request #5934:
URL: https://github.com/apache/incubator-pinot/pull/5934
## Description
A Segment Processing Framework to convert "m" segments to "n" segments
The phases of the Segment Processor are:
1. **Map**
- Record transformation (using transform functions)
- Partitioning (based on column value, a transform function, or the table config's partition config)
- Partition filtering (using a filter function)
2. **Reduce**
- Rollup/concat of records
- Split into parts
3. **Segment generation**
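The three phases above can be sketched end to end on in-memory rows. This is a hypothetical, simplified illustration, not the framework's actual code; the `Row` record and the floor-based rounding are assumptions for the example:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SegmentProcessorSketch {
    // Hypothetical record: a timestamp and a click count.
    record Row(long epochMillis, long clicks) {}

    public static void main(String[] args) {
        List<Row> input = List.of(
            new Row(1_000_123L, 2), new Row(1_000_456L, 3), new Row(90_000_000L, 5));

        long dayMillis = 86_400_000L;

        // Map: transform (round the timestamp down to the day) and partition
        // on the transformed value.
        Map<Long, List<Row>> partitions = new HashMap<>();
        for (Row r : input) {
            long day = (r.epochMillis() / dayMillis) * dayMillis;
            partitions.computeIfAbsent(day, k -> new ArrayList<>())
                      .add(new Row(day, r.clicks()));
        }

        // Reduce: roll up rows within each partition by summing clicks.
        List<Row> output = new ArrayList<>();
        for (Map.Entry<Long, List<Row>> e : partitions.entrySet()) {
            long sum = e.getValue().stream().mapToLong(Row::clicks).sum();
            output.add(new Row(e.getKey(), sum));
        }

        // Segment generation: the real framework would now write the
        // rolled-up rows out as Pinot segments.
        System.out.println(output.size());  // prints 2
    }
}
```

With the sample rows, the first two records land in the same day partition and roll up into one row, so three input rows become two output rows.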
A `SegmentProcessorFrameworkCommand` is provided to run this on demand:
`bin/pinot-admin.sh SegmentProcessorFramework -segmentProcessorFrameworkSpec /<path>/spec.json`
where `spec.json` looks like (the `//` comments are annotations only and must be removed from an actual JSON file):
```
{
  "inputSegmentsDir": "/<base_dir>/segmentsDir",
  "outputSegmentsDir": "/<base_dir>/outputDir/",
  "schemaFile": "/<base_dir>/schema.json",
  "tableConfigFile": "/<base_dir>/table.json",
  "recordTransformerConfig": {
    "transformFunctionsMap": {
      "epochMillis": "round(epochMillis, 86400000)" // round to nearest day
    }
  },
  "partitioningConfig": {
    "partitionerType": "COLUMN_VALUE", // partition on epochMillis
    "columnName": "epochMillis"
  },
  "collectorConfig": {
    "collectorType": "ROLLUP", // roll up clicks by summing
    "aggregatorTypeMap": {
      "clicks": "SUM"
    }
  },
  "segmentConfig": {
    "maxNumRecordsPerSegment": 200000
  }
}
```
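The `maxNumRecordsPerSegment` setting bounds segment size: if the reducer output for a partition exceeds it, the output is split across multiple segments. A minimal sketch of that split, with hypothetical record counts:

```java
public class SegmentSplitSketch {
    public static void main(String[] args) {
        // Hypothetical: 450,000 rolled-up records, with at most
        // maxNumRecordsPerSegment = 200,000 records per segment.
        int totalRecords = 450_000;
        int maxPerSegment = 200_000;
        // Ceiling division: every segment is full except possibly the last.
        int numSegments = (totalRecords + maxPerSegment - 1) / maxPerSegment;
        System.out.println(numSegments);  // prints 3
    }
}
```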
Note:
1. Currently the framework does not parallelize the map/reduce/segment-creation jobs: each input file is processed sequentially in the map phase, each part is processed sequentially in the reduce phase, and segments are built one after another. This can be changed in the future if the need arises.
2. The framework assumes there is enough memory to hold all records of a partition in memory during rollups in the reducer. As a safety measure, the reducer is limited to collecting 5M records before forcing a flush. If memory becomes a problem, off-heap processing could be considered in the future.
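The forced-flush safety measure can be illustrated with a small sketch. The class and method names here are hypothetical, and a tiny limit is used in place of 5M so the behavior is visible:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical reducer that flushes whenever its buffer reaches a record limit,
// keeping memory usage bounded regardless of partition size.
public class FlushingReducerSketch {
    private final int maxRecords;
    private final List<String> buffer = new ArrayList<>();
    private int flushes = 0;

    FlushingReducerSketch(int maxRecords) {
        this.maxRecords = maxRecords;
    }

    void collect(String record) {
        buffer.add(record);
        if (buffer.size() >= maxRecords) {
            flush();  // forced flush once the limit is hit
        }
    }

    void flush() {
        // The real reducer would roll up and persist the buffered records here.
        buffer.clear();
        flushes++;
    }

    int getFlushes() {
        return flushes;
    }

    public static void main(String[] args) {
        // Limit of 3 stands in for the framework's 5M-record limit.
        FlushingReducerSketch reducer = new FlushingReducerSketch(3);
        for (int i = 0; i < 7; i++) {
            reducer.collect("row-" + i);
        }
        System.out.println(reducer.getFlushes());  // prints 2
    }
}
```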
This framework will typically be used by minion tasks that need to perform some processing on segments (e.g. a task that merges segments, or a task that aligns segments to time boundaries). The existing segment merge jobs can be changed to use this framework.
**Pending**
An end-to-end test for the framework (WIP)