[I] Make use of move APIs instead of copy segments from staging directory to output directory in the SparkSegmentGenerationJobRunner [pinot]

via GitHub Thu, 02 Jan 2025 23:01:59 -0800


chrajeshbabu opened a new issue, #14746:
URL: https://github.com/apache/pinot/issues/14746


   Currently, SparkSegmentGenerationJobRunner uses the copyDir API after 
generating the segments to transfer them from the staging directory to the 
output directory. Load jobs take too long when a single job produces hundreds 
of segments. 
   `      if (stagingDirURI != null) {
           LOGGER.info("Trying to copy segment tars from staging directory: [{}] to output directory [{}]", stagingDirURI,
               outputDirURI);
           outputDirFS.copyDir(stagingDirURI, outputDirURI);
         }
       } finally {
         if (stagingDirURI != null) {
           LOGGER.info("Trying to clean up staging directory: [{}]", stagingDirURI);
           outputDirFS.delete(stagingDirURI, true);
         }`
         
   At the end of the job, the staging directory is cleared even when only 
some of the segments were copied successfully.
   It would be better to use move APIs instead, which should reduce load time 
considerably, especially on HDFS-like file systems.
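   The proposal can be illustrated with a minimal sketch. It uses the JDK's `java.nio.file` API as a stand-in for the file system behind `outputDirFS`, so the class name `StagingMoveSketch` and the method `moveSegments` are hypothetical and not part of Pinot. Each file is moved (a rename when source and target are on the same file system) instead of being copied, so a later cleanup of the staging directory cannot discard segments that were already transferred.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration only: local-file-system stand-in for the staging-to-output step.
final class StagingMoveSketch {
  private StagingMoveSketch() {
  }

  /**
   * Moves every regular file under stagingDir to the same relative path under outputDir
   * and returns the target paths of the files that were moved.
   */
  static List<Path> moveSegments(Path stagingDir, Path outputDir) throws IOException {
    // Collect the files first so the directory stream is closed before anything is moved.
    List<Path> sources;
    try (Stream<Path> walk = Files.walk(stagingDir)) {
      sources = walk.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
    }

    List<Path> moved = new ArrayList<>();
    for (Path source : sources) {
      // Preserve the directory layout of the staging directory in the output directory.
      Path target = outputDir.resolve(stagingDir.relativize(source));
      Files.createDirectories(target.getParent());
      // A move avoids rewriting the file contents when both paths share a file system.
      Files.move(source, target, StandardCopyOption.REPLACE_EXISTING);
      moved.add(target);
    }
    return moved;
  }
}
```

   In the job runner the same idea would go through the file system abstraction rather than `java.nio.file`. Whether the move is a cheap rename there depends on the underlying file system, which is why the benefit is expected mainly on HDFS-like systems.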


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
