chrajeshbabu opened a new issue, #14530:
URL: https://github.com/apache/pinot/issues/14530
When running Spark batch ingestion of data contained in multiple raw data
files, the time taken to push the segments is very high compared to segment
generation. Going through the code, segments are pushed sequentially, one at
a time, within a single Spark task, whereas the push is parallelized in
SparkSegmentUriPushJobRunner. It would be better to parallelize segment
pushing similar to SparkSegmentUriPushJobRunner to improve the ingestion time.
Time taken at each stage:
```
SparkSegmentGenerationJobRunner.java (483 tasks): 16 min
SparkSegmentTarPushJobRunner.java   (2 tasks):   1.4 h
```
```
public static void pushSegments(SegmentGenerationJobSpec spec, PinotFS fileSystem,
    List<String> tarFilePaths, List<Header> headers, List<NameValuePair> parameters)
    throws RetriableOperationException, AttemptsExceededException {
  String tableName = spec.getTableSpec().getTableName();
  TableType tableType =
      tableName.endsWith("_" + TableType.REALTIME.name()) ? TableType.REALTIME : TableType.OFFLINE;
  boolean cleanUpOutputDir = spec.isCleanUpOutputDir();
  LOGGER.info("Start pushing segments: {}... to locations: {} for table {}",
      Arrays.toString(tarFilePaths.subList(0, Math.min(5, tarFilePaths.size())).toArray()),
      Arrays.toString(spec.getPinotClusterSpecs()), tableName);
  // Segments are pushed sequentially, one tar file per loop iteration:
  for (String tarFilePath : tarFilePaths) {
    URI tarFileURI = URI.create(tarFilePath);
    try {
      // ... per-segment upload elided ...
    } finally {
      if (cleanUpOutputDir) {
        fileSystem.delete(tarFileURI, true);
      }
    }
  }
}
```
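To illustrate the proposed change, here is a minimal, self-contained sketch of pushing segments concurrently with a bounded thread pool instead of the sequential loop above. This is only an assumption of how the fix could look: in the actual runner the distribution would happen via Spark tasks (as SparkSegmentUriPushJobRunner does), and `pushOneSegment` is a hypothetical placeholder for the real per-segment upload to the controller.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelPushSketch {

  // Hypothetical stand-in for the per-segment upload done inside pushSegments().
  static void pushOneSegment(String tarFilePath) {
    // In the real code this would be the HTTP upload of the tar file.
  }

  // Push all segments with a bounded thread pool instead of a sequential loop.
  // Returns the number of segments pushed.
  static int pushSegmentsInParallel(List<String> tarFilePaths, int parallelism) {
    ExecutorService pool = Executors.newFixedThreadPool(parallelism);
    AtomicInteger pushed = new AtomicInteger();
    for (String path : tarFilePaths) {
      pool.submit(() -> {
        pushOneSegment(path);
        pushed.incrementAndGet();
      });
    }
    pool.shutdown();
    try {
      pool.awaitTermination(1, TimeUnit.MINUTES);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return pushed.get();
  }

  public static void main(String[] args) {
    List<String> paths = new ArrayList<>();
    for (int i = 0; i < 10; i++) {
      paths.add("segment_" + i + ".tar.gz");
    }
    System.out.println(pushSegmentsInParallel(paths, 4));
  }
}
```

The same fan-out could be expressed with `JavaSparkContext.parallelize(tarFilePaths).foreach(...)` so the uploads run across executors, which is the pattern the URI push runner already uses.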
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]