chrajeshbabu opened a new issue, #14746:
URL: https://github.com/apache/pinot/issues/14746
Currently, SparkSegmentGenerationJobRunner uses the copyDir API after
generating the segments to move them from the staging directory to the output
directory. Loading jobs take too much time when a single job produces hundreds
of segments.
`    if (stagingDirURI != null) {
      LOGGER.info("Trying to copy segment tars from staging directory: [{}] to output directory [{}]",
          stagingDirURI, outputDirURI);
      outputDirFS.copyDir(stagingDirURI, outputDirURI);
    }
  } finally {
    if (stagingDirURI != null) {
      LOGGER.info("Trying to clean up staging directory: [{}]", stagingDirURI);
      outputDirFS.delete(stagingDirURI, true);
    }`
At the end, we clear the staging directory even if only some of the segments
were copied successfully.
It would be better to use the move APIs, which would reduce the loading time
drastically, mainly on HDFS-like file systems, where a move within the same
file system is a metadata rename rather than a copy of the data.
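
A minimal sketch of the proposed behaviour, using java.nio.file as a stand-in
for the PinotFS instance so that it runs without Pinot on the classpath. The
class and method names (StagingMoveSketch, moveSegments) are hypothetical and
not part of SparkSegmentGenerationJobRunner, and a flat staging directory is
assumed. It moves each segment file instead of copying it, and removes the
staging directory only after every move has succeeded.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical illustration only: local paths stand in for stagingDirURI and
// outputDirURI, and Files.move stands in for a file-system move API.
public class StagingMoveSketch {

  /**
   * Moves every regular file directly under stagingDir into outputDir and
   * returns the number of files moved. Assumes a flat staging directory.
   */
  public static int moveSegments(Path stagingDir, Path outputDir) throws IOException {
    Files.createDirectories(outputDir);
    int moved = 0;
    try (DirectoryStream<Path> entries = Files.newDirectoryStream(stagingDir)) {
      for (Path src : entries) {
        if (!Files.isRegularFile(src)) {
          continue;
        }
        // On the same file system this is a rename, not a byte-by-byte copy.
        Files.move(src, outputDir.resolve(src.getFileName()), StandardCopyOption.REPLACE_EXISTING);
        moved++;
      }
    }
    // Reached only if every move succeeded. Files.delete refuses to remove a
    // non-empty directory, so anything that was not moved is never discarded.
    Files.delete(stagingDir);
    return moved;
  }

  public static void main(String[] args) throws IOException {
    Path staging = Files.createTempDirectory("staging");
    Path output = Files.createTempDirectory("output");
    Files.write(staging.resolve("segment_0.tar.gz"), new byte[] {1});
    Files.write(staging.resolve("segment_1.tar.gz"), new byte[] {2});

    int moved = moveSegments(staging, output);
    System.out.println("Moved " + moved + " segment files to " + output);
  }
}
```

If a move fails partway, the exception propagates before the staging directory
is deleted, so the segments that were not yet moved remain available for a
retry.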