siddharthteotia commented on a change in pull request #4464: Delete extra
refreshed segments after segment push
URL: https://github.com/apache/incubator-pinot/pull/4464#discussion_r307036800
##########
File path:
pinot-hadoop/src/main/java/org/apache/pinot/hadoop/job/SegmentTarPushJob.java
##########
@@ -47,9 +54,57 @@ protected boolean isDataFile(String fileName) {
public void run()
throws Exception {
FileSystem fileSystem = FileSystem.get(_conf);
+ List<Path> segmentsToPush = getDataFilePaths(_segmentPattern);
try (ControllerRestApi controllerRestApi = getControllerRestApi()) {
- controllerRestApi.pushSegments(fileSystem,
getDataFilePaths(_segmentPattern));
+ // TODO: Deal with invalid prefixes in the future
+
+ List<String> allSegments = controllerRestApi.getAllSegments("OFFLINE");
+
+ controllerRestApi.pushSegments(fileSystem, segmentsToPush);
+
+ if (_deleteExtraSegments) {
+ controllerRestApi.deleteSegmentUris(getSegmentsToDelete(allSegments,
segmentsToPush));
+ }
+ }
+ }
+
+ /**
+ * Deletes extra segments after pushing to the controller
+ * @param allSegments all segments on the controller for the table
+ * @param segmentsToPush segments that will be pushed to the controller
+ * @throws IOException
+ */
+ public List<String> getSegmentsToDelete(List<String> allSegments, List<Path>
segmentsToPush) {
+ Set<String> uniqueSegmentPrefixes = new HashSet<>();
+
+ // Get all relevant segment prefixes that we are planning on pushing
+ List<String> segmentsToPushNames = segmentsToPush.stream().map(s ->
s.getName()).collect(Collectors.toList());
+ for (String segmentName : segmentsToPushNames) {
+ String segmentNamePrefix = removeSequenceId(segmentName);
+ uniqueSegmentPrefixes.add(segmentNamePrefix);
+ }
+
+ List<String> relevantSegments = new ArrayList<>();
+ // Get relevant segments already pushed that we are planning on refreshing
+ for (String segmentName : allSegments) {
+ if (uniqueSegmentPrefixes.contains(removeSequenceId(segmentName))) {
+ relevantSegments.add(segmentName);
+ }
}
+
+ relevantSegments.removeAll(segmentsToPushNames);
Review comment:
This will be a O(N^2) operation and potentially more heap heavy because of
constant over-writing.
I think we should use a hashset (or any alternative) to make this constant
time operation -- more efficient at the cost of extra space.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]