chrajeshbabu opened a new issue, #14530:
URL: https://github.com/apache/pinot/issues/14530
When running Spark batch ingestion of data contained in multiple raw data
files, the time taken to push the segments is very high compared to segment
generation. Going through the code, segments are pushed sequentially, one at
a time, within a single Spark task, whereas the push is parallelized in
SparkSegmentUriPushJobRunner. It would be better to parallelize segment
pushing similar to SparkSegmentUriPushJobRunner to improve the ingestion time.
Time taken at each stage:
```
SparkSegmentGenerationJobRunner.java (483 tasks): 16 min
SparkSegmentTarPushJobRunner.java   (2 tasks):   1.4 h
```
```
public static void pushSegments(SegmentGenerationJobSpec spec, PinotFS fileSystem,
    List<String> tarFilePaths, List<Header> headers, List<NameValuePair> parameters)
    throws RetriableOperationException, AttemptsExceededException {
  String tableName = spec.getTableSpec().getTableName();
  TableType tableType =
      tableName.endsWith("_" + TableType.REALTIME.name()) ? TableType.REALTIME : TableType.OFFLINE;
  boolean cleanUpOutputDir = spec.isCleanUpOutputDir();
  LOGGER.info("Start pushing segments: {}... to locations: {} for table {}",
      Arrays.toString(tarFilePaths.subList(0, Math.min(5, tarFilePaths.size())).toArray()),
      Arrays.toString(spec.getPinotClusterSpecs()), tableName);
  // Segments are pushed sequentially, one tar file per loop iteration:
  for (String tarFilePath : tarFilePaths) {
    URI tarFileURI = URI.create(tarFilePath);
    try {
      // ... per-segment upload elided ...
    } finally {
      if (cleanUpOutputDir) {
        fileSystem.delete(tarFileURI, true);
      }
    }
  }
}
```
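To illustrate the proposed change, here is a minimal, self-contained sketch of pushing segments concurrently with a bounded thread pool instead of the sequential loop above. This is only an assumption of how the fix could look: in the actual runner the distribution would happen via Spark tasks (as SparkSegmentUriPushJobRunner does), and `pushOneSegment` is a hypothetical placeholder for the real per-segment upload to the controller.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelPushSketch {

  // Hypothetical stand-in for the per-segment upload done inside pushSegments().
  static void pushOneSegment(String tarFilePath) {
    // In the real code this would be the HTTP upload of the tar file.
  }

  // Push all segments with a bounded thread pool instead of a sequential loop.
  // Returns the number of segments pushed.
  static int pushSegmentsInParallel(List<String> tarFilePaths, int parallelism) {
    ExecutorService pool = Executors.newFixedThreadPool(parallelism);
    AtomicInteger pushed = new AtomicInteger();
    for (String path : tarFilePaths) {
      pool.submit(() -> {
        pushOneSegment(path);
        pushed.incrementAndGet();
      });
    }
    pool.shutdown();
    try {
      pool.awaitTermination(1, TimeUnit.MINUTES);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return pushed.get();
  }

  public static void main(String[] args) {
    List<String> paths = new ArrayList<>();
    for (int i = 0; i < 10; i++) {
      paths.add("segment_" + i + ".tar.gz");
    }
    System.out.println(pushSegmentsInParallel(paths, 4));
  }
}
```

The same fan-out could be expressed with `JavaSparkContext.parallelize(tarFilePaths).foreach(...)` so the uploads run across executors, which is the pattern the URI push runner already uses.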
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]