priyen commented on code in PR #14139:
URL: https://github.com/apache/pinot/pull/14139#discussion_r1792190115


##########
pinot-plugins/pinot-batch-ingestion/pinot-batch-ingestion-spark-3/src/main/java/org/apache/pinot/plugin/ingestion/batch/spark3/SparkSegmentMetadataPushJobRunner.java:
##########
@@ -175,28 +192,52 @@ private void handleNonConsistentPush(List<String> segmentsToPush, PinotFS output
     } else {
       JavaSparkContext sparkContext = JavaSparkContext.fromSparkContext(SparkContext.getOrCreate());
       JavaRDD<String> pathRDD = sparkContext.parallelize(segmentsToPush, pushParallelism);
-      // Prevent using lambda expression in Spark to avoid potential serialization exceptions, use inner function
-      // instead.
-      pathRDD.foreach(new VoidFunction<String>() {
-        @Override
-        public void call(String segmentTarPath)
-            throws Exception {
-          PluginManager.get().init();
-          setupFileSystems();
-          try {
-            Map<String, String> segmentUriToTarPathMap =
-                SegmentPushUtils.getSegmentUriToTarPathMap(outputDirURI, _spec.getPushJobSpec(),
-                    new String[]{segmentTarPath});
-            SegmentPushUtils.sendSegmentUriAndMetadata(_spec, PinotFSFactory.create(outputDirURI.getScheme()),
-                segmentUriToTarPathMap);
-          } catch (RetriableOperationException | AttemptsExceededException e) {
-            throw new RuntimeException(e);
+
+      if (_spec.getPushJobSpec().isBatchSegmentUpload()) {
+        // Process segments in batch mode using foreachPartition
+        pathRDD.foreachPartition(new VoidFunction<Iterator<String>>() {

Review Comment:
   [swaminathanmanish](https://github.com/swaminathanmanish), @rajagopr, and I 
discussed this in Slack, and we are happy with the current implementation in 
this PR: it gives users the flexibility to push all segments in one go (push 
parallelism == 1) or to increase the parallelism if there are too many 
segments to push in one go.
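
   To illustrate the trade-off, here is a minimal sketch in plain Java (no 
Spark dependency) of how the push parallelism maps to batches. It assumes 
that `parallelize` splits a local list into contiguous slices using the index 
arithmetic shown below, and that `foreachPartition` handles each slice as one 
batch. The class name, the `slice` helper, and the segment names are 
hypothetical and are not part of this PR.

```java
import java.util.ArrayList;
import java.util.List;

public class PushParallelismSketch {

  // Hypothetical helper: splits the items into numSlices contiguous slices.
  // Assumption: this mirrors how Spark slices a local collection in parallelize().
  static List<List<String>> slice(List<String> items, int numSlices) {
    if (numSlices < 1) {
      throw new IllegalArgumentException("numSlices must be at least 1");
    }
    List<List<String>> slices = new ArrayList<>();
    int n = items.size();
    for (int i = 0; i < numSlices; i++) {
      // Use long arithmetic so i * n cannot overflow for large inputs.
      int start = (int) ((long) i * n / numSlices);
      int end = (int) ((long) (i + 1) * n / numSlices);
      slices.add(new ArrayList<>(items.subList(start, end)));
    }
    return slices;
  }

  public static void main(String[] args) {
    // Made-up segment tar names, for illustration only.
    List<String> segments = List.of(
        "seg_0.tar.gz", "seg_1.tar.gz", "seg_2.tar.gz", "seg_3.tar.gz", "seg_4.tar.gz");

    for (int pushParallelism : new int[] {1, 2}) {
      List<List<String>> batches = slice(segments, pushParallelism);
      // In batch mode, each slice would correspond to one batch upload.
      System.out.println("pushParallelism=" + pushParallelism + ": " + batches);
    }
  }
}
```

   With a parallelism of 1 every segment lands in a single slice, so there is 
one batch. A higher value spreads the segments across several smaller 
batches, and a value larger than the number of segments would leave some 
slices empty.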



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

