[GitHub] [carbondata] akashrn5 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update

GitBox Mon, 19 Oct 2020 08:44:02 -0700


akashrn5 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r507834166




##########
File path: 
core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentUpdateStatusManager.java
##########
@@ -415,44 +415,66 @@ public boolean accept(CarbonFile pathName) {
   }
 
   /**
-   * Return all delta file for a block.
-   * @param segmentId
-   * @param blockName
-   * @return
+   * Get all delete delta files mapped to each block of the specified segment.
+   * First list all deletedelta files in the segment dir, then loop the files 
and find
+   * a map of blocks and .deletedelta files related to each block.
+   *
+   * @param seg the segment which is to find blocks
+   * @return a map of block and its file list
    */
-  public CarbonFile[] getDeleteDeltaFilesList(final Segment segmentId, final 
String blockName) {
-    String segmentPath = CarbonTablePath.getSegmentPath(
-        identifier.getTablePath(), segmentId.getSegmentNo());
-    CarbonFile segDir =
-        FileFactory.getCarbonFile(segmentPath);
+  public Map<String, List<CarbonFile>> getDeleteDeltaFilesList(final Segment 
seg) {
+
+    Map<String, long[]> blockDeltaStartAndEndTimestampMap = new HashMap<>();

Review comment:
       @shenjiayu17 why exactly do we need to change here? Introduce map and 
all, still we are doing the list files which is costly operation. I have some 
points, please check
   
   1. Here no need to create these map, again listing files to fill map.
   we are already getting the blockname and every blovck will have one 
corresponding deletedelta file only right. So always the delta files per block 
block will be one. 
   2. The updatedetails contains the blockname, actual blockname, timestamps 
and delete delta timestamps so you can check if the timestamp not empty, you 
can yourself form the delta file name based on these info and return from this 
method.
   
   With the above approach, you will avoid list files operation, filtering 
operation based on timestamp and creating all these maps. So you can avoid all 
these changes.
   
   Always we keep the horizontal compaction threshold as 1, and we dont change 
and dont recommend users to change to get the better performance.

##########
File path: 
integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/HorizontalCompaction.scala
##########
@@ -173,6 +176,9 @@ object HorizontalCompaction {
 
     val db = carbonTable.getDatabaseName
     val table = carbonTable.getTableName
+
+    LOG.info(s"Horizontal Delete Compaction operation is getting valid 
segments for [$db.$table].")

Review comment:
       same as above

##########
File path: 
integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/HorizontalCompaction.scala
##########
@@ -125,6 +125,9 @@ object HorizontalCompaction {
       segLists: util.List[Segment]): Unit = {
     val db = carbonTable.getDatabaseName
     val table = carbonTable.getTableName
+
+    LOG.info(s"Horizontal Update Compaction operation is getting valid 
segments for [$db.$table].")

Review comment:
       i think this log does not give any useful info here, if you put some log 
after line 133 and print `validSegList ` it looks little useful




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [carbondata] akashrn5 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update

Reply via email to