akashrn5 commented on a change in pull request #4072:
URL: https://github.com/apache/carbondata/pull/4072#discussion_r578921513



##########
File path: core/src/main/java/org/apache/carbondata/core/metadata/SegmentFileStore.java
##########
@@ -1135,13 +1135,16 @@ public static void deleteSegment(String tablePath, Segment segment,
     Map<String, List<String>> indexFilesMap = fileStore.getIndexFilesMap();
     for (Map.Entry<String, List<String>> entry : indexFilesMap.entrySet()) {
       FileFactory.deleteFile(entry.getKey());
+      LOGGER.info("File deleted after clean files operation: " + entry.getKey());
       for (String file : entry.getValue()) {
         String[] deltaFilePaths =
             updateStatusManager.getDeleteDeltaFilePath(file, segment.getSegmentNo());
         for (String deltaFilePath : deltaFilePaths) {
           FileFactory.deleteFile(deltaFilePath);
+          LOGGER.info("File deleted after clean files operation: " + deltaFilePath);

Review comment:
      Instead of logging each file name on every delete, which will blow up the logs when there are many delta files, you can log once for all the files after the loop completes (in line 1147), along with the actual block path. Or you can just log the deleted block file (print the block file name) and say the corresponding delta files were deleted too, as the timestamp will be the same, I guess. Just check once and add.
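
      Something like this rough sketch (only the `deletedDeltaFiles` local is new; the rest is taken from the diff above):
   ```
   List<String> deletedDeltaFiles = new ArrayList<>();
   for (String file : entry.getValue()) {
     String[] deltaFilePaths =
         updateStatusManager.getDeleteDeltaFilePath(file, segment.getSegmentNo());
     for (String deltaFilePath : deltaFilePaths) {
       FileFactory.deleteFile(deltaFilePath);
       deletedDeltaFiles.add(deltaFilePath);
     }
   }
   // one log entry per block file instead of one per delta file
   LOGGER.info("Deleted block file " + entry.getKey()
       + " and the corresponding delta files " + deletedDeltaFiles + " during clean files");
   ```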

##########
File path: core/src/main/java/org/apache/carbondata/core/metadata/SegmentFileStore.java
##########
@@ -1135,13 +1135,16 @@ public static void deleteSegment(String tablePath, Segment segment,
     Map<String, List<String>> indexFilesMap = fileStore.getIndexFilesMap();
     for (Map.Entry<String, List<String>> entry : indexFilesMap.entrySet()) {
       FileFactory.deleteFile(entry.getKey());
+      LOGGER.info("File deleted after clean files operation: " + entry.getKey());

Review comment:
   ```suggestion
         LOGGER.info("Deleted file: " + entry.getKey() + ", on clean files");
   ```

   Can do the same for all.

##########
File path: core/src/main/java/org/apache/carbondata/core/util/TrashUtil.java
##########
@@ -152,46 +153,77 @@ public static void copyFilesToTrash(List<String> filesToCopy,
    * The below method deletes timestamp subdirectories in the trash folder which have expired as
    * per the user defined retention time
    */
-  public static void deleteExpiredDataFromTrash(String tablePath) {
+  public static long[] deleteExpiredDataFromTrash(String tablePath, Boolean isDryRun) {

Review comment:
      Update the method comment based on the method signature changes and the return values; follow the same for the other methods too, if changed.
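
      For example, something along these lines (just a sketch; the return semantics are taken from the Javadoc added to `emptyTrash` later in this diff):
   ```
   /**
    * Deletes the expired timestamp subdirectories in the trash folder as per the
    * user defined retention time. When isDryRun is true, nothing is deleted and
    * only the sizes are computed. Returns an array where the first element is the
    * size freed (or that can be freed in case of dry run) and the second element
    * is the size of the data remaining in the trash folder.
    */
   public static long[] deleteExpiredDataFromTrash(String tablePath, Boolean isDryRun) {
   ```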

##########
File path: docs/clean-files.md
##########
@@ -64,4 +64,40 @@ The stale_inprogress option with force option will delete Marked for delete, Com
 
   ```
   CLEAN FILES FOR TABLE TABLE_NAME options('stale_inprogress'='true', 'force'='true')
-  ```
\ No newline at end of file
+  ```
+### DRY RUN OPTION
+Clean files also support a dry run option which will let the user know how much space fill we freed 
+during the actual clean files operation. The dry run operation will not delete any data but will just give
+size bases statistics to the data. Dry run operation will return two columns where the first will 
+show how much space will be freed by that clean files operation and the second column will show the 
+remaining stale data(data which can be deleted but has not yet expired as per the ```max.query.execution.time``` and ``` carbon.trash.retention.days``` values
+).  By default the value of ```dryrun``` option is ```false```.
+
+Dry Run Operation is supported with four types of commands:
+  ```
+  CLEAN FILES FOR TABLE TABLE_NAME options('dryrun'='true')
+  ```
+  ```
+  CLEAN FILES FOR TABLE TABLE_NAME options('force'='true', 'dryrun'='true')
+  ```
+  ```
  CLEAN FILES FOR TABLE TABLE_NAME options('stale_inprogress'='true','dryrun'='true')
+  ```
+
+  ```
  CLEAN FILES FOR TABLE TABLE_NAME options('stale_inprogress'='true', 'force'='true','dryrun'='true')
+  ```
+
+**NOTE**:
+  * Since the dry run operation will calculate size and will access File level API's, the operation can
+  be a costly and a time consuming operation in case of tables with large number of segments.

Review comment:
      Here it's better to add a point that when dry run is true, the other options don't matter except force = true. And I hope you have handled this in the code.
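
      Roughly this kind of handling (a hypothetical sketch; the `options` map and the key names are only assumed from the documented syntax):
   ```
   boolean isDryRun = Boolean.parseBoolean(options.getOrDefault("dryrun", "false"));
   boolean isForce = Boolean.parseBoolean(options.getOrDefault("force", "false"));
   if (isDryRun) {
     // on a dry run only 'force' changes what gets reported;
     // all other options have no effect and nothing is deleted
   }
   ```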

##########
File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentStatusManager.java
##########
@@ -983,7 +983,28 @@ public static boolean isLoadInProgress(AbsoluteTableIdentifier absoluteTableIden
     }
   }
 
-  private static boolean isLoadDeletionRequired(LoadMetadataDetails[] details) {
+  public static boolean isExpiredSegment(LoadMetadataDetails oneLoad, AbsoluteTableIdentifier
+      absoluteTableIdentifier) {
+    boolean result = false;

Review comment:
       rename to `isExpiredSegment`

##########
File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentStatusManager.java
##########
@@ -983,7 +983,28 @@ public static boolean isLoadInProgress(AbsoluteTableIdentifier absoluteTableIden
     }
   }
 
-  private static boolean isLoadDeletionRequired(LoadMetadataDetails[] details) {
+  public static boolean isExpiredSegment(LoadMetadataDetails oneLoad, AbsoluteTableIdentifier
+      absoluteTableIdentifier) {
+    boolean result = false;
+    if (oneLoad.getSegmentStatus() == SegmentStatus.COMPACTED || oneLoad.getSegmentStatus() ==
+        SegmentStatus.MARKED_FOR_DELETE) {
+      return true;
+    } else if (oneLoad.getSegmentStatus() == SegmentStatus.INSERT_IN_PROGRESS || oneLoad
+        .getSegmentStatus() == SegmentStatus.INSERT_OVERWRITE_IN_PROGRESS) {
+      // check if lock can be acquired
+      ICarbonLock segmentLock = CarbonLockFactory.getCarbonLockObj(absoluteTableIdentifier,
+          CarbonTablePath.addSegmentPrefix(oneLoad.getLoadName()) + LockUsage.LOCK);
+      if (segmentLock.lockWithRetries()) {
+        result = true;
+        segmentLock.unlock();

Review comment:
      Use the `CarbonLockUtil.fileUnlock` method, as it also logs; please follow the other places that take a lock, and it's better to have the unlock in a finally block.
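
      That is, something like this (a sketch, assuming the `CarbonLockUtil.fileUnlock(ICarbonLock, String)` helper used elsewhere in the codebase):
   ```
   ICarbonLock segmentLock = CarbonLockFactory.getCarbonLockObj(absoluteTableIdentifier,
       CarbonTablePath.addSegmentPrefix(oneLoad.getLoadName()) + LockUsage.LOCK);
   if (segmentLock.lockWithRetries()) {
     try {
       result = true;
     } finally {
       // releases the lock and also logs the release
       CarbonLockUtil.fileUnlock(segmentLock, LockUsage.LOCK);
     }
   }
   ```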

##########
File path: core/src/main/java/org/apache/carbondata/core/util/CleanFilesUtil.java
##########
@@ -71,14 +71,19 @@ public static void cleanStaleSegments(CarbonTable carbonTable)
             // Deleting the stale Segment folders and the segment file.
             try {
               CarbonUtil.deleteFoldersAndFiles(segmentPath);
+              LOGGER.info("Deleted the segment folder :" + segmentPath.getAbsolutePath() + " after"
+                  + " moving it to the trash folder");
               // delete the segment file as well
               FileFactory.deleteFile(CarbonTablePath.getSegmentFilePath(carbonTable.getTablePath(),
                   staleSegmentFile));
+              LOGGER.info("Deleted stale segment file after moving it to the trash folder :"
+                  + staleSegmentFile);

Review comment:
   ```suggestion
              LOGGER.info("Deleted stale segment file " + staleSegmentFile
                  + " after moving it to the trash folder");
   ```

##########
File path: core/src/main/java/org/apache/carbondata/core/util/TrashUtil.java
##########
@@ -152,46 +153,77 @@ public static void copyFilesToTrash(List<String> filesToCopy,
    * The below method deletes timestamp subdirectories in the trash folder which have expired as
    * per the user defined retention time
    */
-  public static void deleteExpiredDataFromTrash(String tablePath) {
+  public static long[] deleteExpiredDataFromTrash(String tablePath, Boolean isDryRun) {
     CarbonFile trashFolder = FileFactory.getCarbonFile(CarbonTablePath
         .getTrashFolderPath(tablePath));
+    long sizeFreed = 0;
+    long trashFolderSize = 0;
     // Deleting the timestamp based subdirectories in the trashfolder by the given timestamp.
     try {
       if (trashFolder.isFileExist()) {
+        trashFolderSize = FileFactory.getDirectorySize(trashFolder.getAbsolutePath());

Review comment:
      I think you should not calculate the size when the statistics flag is false; calculate it only when it is true, or else it will affect the clean files time. Please take care not to impact the time of clean files.
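
      A minimal sketch of the guard, using the `isDryRun` flag from the diff as the switch:
   ```
   // walk the trash directory only when the size statistics are actually needed
   if (isDryRun) {
     trashFolderSize = FileFactory.getDirectorySize(trashFolder.getAbsolutePath());
   }
   ```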

##########
File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentStatusManager.java
##########
@@ -983,7 +983,28 @@ public static boolean isLoadInProgress(AbsoluteTableIdentifier absoluteTableIden
     }
   }
 
-  private static boolean isLoadDeletionRequired(LoadMetadataDetails[] details) {
+  public static boolean isExpiredSegment(LoadMetadataDetails oneLoad, AbsoluteTableIdentifier
+      absoluteTableIdentifier) {
+    boolean result = false;
+    if (oneLoad.getSegmentStatus() == SegmentStatus.COMPACTED || oneLoad.getSegmentStatus() ==
+        SegmentStatus.MARKED_FOR_DELETE) {
+      return true;

Review comment:
      Here also assign to the boolean variable instead of returning, as it's just checking one segment.
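
      That is, a sketch of the single-return style (everything here is from the diff; only the early return is replaced by an assignment):
   ```
   boolean result = false;
   if (oneLoad.getSegmentStatus() == SegmentStatus.COMPACTED
       || oneLoad.getSegmentStatus() == SegmentStatus.MARKED_FOR_DELETE) {
     result = true;
   } else if (oneLoad.getSegmentStatus() == SegmentStatus.INSERT_IN_PROGRESS
       || oneLoad.getSegmentStatus() == SegmentStatus.INSERT_OVERWRITE_IN_PROGRESS) {
     // lock-based check from the diff, assigning to result instead of returning
     ICarbonLock segmentLock = CarbonLockFactory.getCarbonLockObj(absoluteTableIdentifier,
         CarbonTablePath.addSegmentPrefix(oneLoad.getLoadName()) + LockUsage.LOCK);
     if (segmentLock.lockWithRetries()) {
       result = true;
       segmentLock.unlock();
     }
   }
   return result;
   ```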

##########
File path: core/src/main/java/org/apache/carbondata/core/util/CleanFilesUtil.java
##########
@@ -71,14 +71,19 @@ public static void cleanStaleSegments(CarbonTable carbonTable)
             // Deleting the stale Segment folders and the segment file.
             try {
               CarbonUtil.deleteFoldersAndFiles(segmentPath);
+              LOGGER.info("Deleted the segment folder :" + segmentPath.getAbsolutePath() + " after"
+                  + " moving it to the trash folder");
               // delete the segment file as well
               FileFactory.deleteFile(CarbonTablePath.getSegmentFilePath(carbonTable.getTablePath(),
                   staleSegmentFile));
+              LOGGER.info("Deleted stale segment file after moving it to the trash folder :"
+                  + staleSegmentFile);
               for (String duplicateStaleSegmentFile : redundantSegmentFile) {
                 if (DataFileUtil.getSegmentNoFromSegmentFile(duplicateStaleSegmentFile)
                     .equals(segmentNumber)) {
                   FileFactory.deleteFile(CarbonTablePath.getSegmentFilePath(carbonTable
                       .getTablePath(), duplicateStaleSegmentFile));
+                  LOGGER.info("Deleted redundant segment file :" + duplicateStaleSegmentFile);

Review comment:
      You can log all the deleted files together after the loop. Also, please check all the places and try to follow the same; this is just to make the logs cleaner.
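
      For example (a sketch; only the `deletedRedundantFiles` local is new, the rest is from the diff above):
   ```
   List<String> deletedRedundantFiles = new ArrayList<>();
   for (String duplicateStaleSegmentFile : redundantSegmentFile) {
     if (DataFileUtil.getSegmentNoFromSegmentFile(duplicateStaleSegmentFile)
         .equals(segmentNumber)) {
       FileFactory.deleteFile(CarbonTablePath.getSegmentFilePath(carbonTable
           .getTablePath(), duplicateStaleSegmentFile));
       deletedRedundantFiles.add(duplicateStaleSegmentFile);
     }
   }
   if (!deletedRedundantFiles.isEmpty()) {
     // a single log entry for all the redundant segment files
     LOGGER.info("Deleted redundant segment files: " + deletedRedundantFiles);
   }
   ```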

##########
File path: core/src/main/java/org/apache/carbondata/core/util/TrashUtil.java
##########
@@ -152,46 +153,77 @@ public static void copyFilesToTrash(List<String> filesToCopy,
    * The below method deletes timestamp subdirectories in the trash folder which have expired as
    * per the user defined retention time
    */
-  public static void deleteExpiredDataFromTrash(String tablePath) {
+  public static long[] deleteExpiredDataFromTrash(String tablePath, Boolean isDryRun) {
     CarbonFile trashFolder = FileFactory.getCarbonFile(CarbonTablePath
         .getTrashFolderPath(tablePath));
+    long sizeFreed = 0;
+    long trashFolderSize = 0;
     // Deleting the timestamp based subdirectories in the trashfolder by the given timestamp.
     try {
       if (trashFolder.isFileExist()) {
+        trashFolderSize = FileFactory.getDirectorySize(trashFolder.getAbsolutePath());
         CarbonFile[] timestampFolderList = trashFolder.listFiles();
+        List<CarbonFile> filesToDelete = new ArrayList<>();
         for (CarbonFile timestampFolder : timestampFolderList) {
           // If the timeStamp at which the timeStamp subdirectory has expired as per the user
           // defined value, delete the complete timeStamp subdirectory
-          if (timestampFolder.isDirectory() && isTrashRetentionTimeoutExceeded(Long
-              .parseLong(timestampFolder.getName()))) {
-            FileFactory.deleteAllCarbonFilesOfDir(timestampFolder);
-            LOGGER.info("Timestamp subfolder from the Trash folder deleted: " + timestampFolder
+          if (isTrashRetentionTimeoutExceeded(Long.parseLong(timestampFolder.getName()))) {
+            if (timestampFolder.isDirectory()) {
+              sizeFreed += FileFactory.getDirectorySize(timestampFolder.getAbsolutePath());
+              filesToDelete.add(timestampFolder);
+            }
+          }
+        }
+        if (!isDryRun) {
+          for (CarbonFile carbonFile : filesToDelete) {
+            LOGGER.info("Timestamp subfolder from the Trash folder deleted: " + carbonFile
                 .getAbsolutePath());
+            FileFactory.deleteAllCarbonFilesOfDir(carbonFile);
           }
         }
       }
     } catch (IOException e) {
       LOGGER.error("Error during deleting expired timestamp folder from the trash folder", e);
     }
+    return new long[] {sizeFreed, trashFolderSize - sizeFreed};
   }
 
   /**
    * The below method deletes all the files and folders in the trash folder of a carbon table.
+   * Returns an array in which the first element contains the size freed in case of clean files
+   * operation or size that can be freed in case of dry run and the second element contains the
+   * remaining size.
    */
-  public static void emptyTrash(String tablePath) {
+  public static long[] emptyTrash(String tablePath, Boolean isDryRun) {
     CarbonFile trashFolder = FileFactory.getCarbonFile(CarbonTablePath
         .getTrashFolderPath(tablePath));
     // if the trash folder exists delete the contents of the trash folder
+    long sizeFreed = 0;
+    long[] sizeStatistics = new long[]{0, 0};
     try {
       if (trashFolder.isFileExist()) {
         CarbonFile[] carbonFileList = trashFolder.listFiles();
+        List<CarbonFile> filesToDelete = new ArrayList<>();
         for (CarbonFile carbonFile : carbonFileList) {
-          FileFactory.deleteAllCarbonFilesOfDir(carbonFile);
+          sizeFreed += FileFactory.getDirectorySize(carbonFile.getAbsolutePath());

Review comment:
       same as above

##########
File path: docs/clean-files.md
##########
@@ -64,4 +64,40 @@ The stale_inprogress option with force option will delete Marked for delete, Com
 
   ```
   CLEAN FILES FOR TABLE TABLE_NAME options('stale_inprogress'='true', 'force'='true')
-  ```
\ No newline at end of file
+  ```
+### DRY RUN OPTION
+Clean files also support a dry run option which will let the user know how much space fill we freed 
+during the actual clean files operation. The dry run operation will not delete any data but will just give
+size bases statistics to the data. Dry run operation will return two columns where the first will 

Review comment:
   ```suggestion
   size based statistics on the data which will be cleaned in clean files. Dry run operation will return two columns where the first will 
   ```

##########
File path: integration/spark/src/main/scala/org/apache/carbondata/trash/DataTrashManager.scala
##########
@@ -72,11 +78,26 @@ object DataTrashManager {
       carbonDeleteSegmentLock = CarbonLockUtil.getLockObject(carbonTable
         .getAbsoluteTableIdentifier, LockUsage.DELETE_SEGMENT_LOCK, deleteSegmentErrorMsg)
       // step 1: check and clean trash folder
-      checkAndCleanTrashFolder(carbonTable, isForceDelete)
+      // trashFolderSizeStats(0) contains the size that is freed/or can be freed and
+      // trashFolderSizeStats(1) contains the size of remaining data in the trash folder
+      val trashFolderSizeStats = checkAndCleanTrashFolder(carbonTable, isForceDelete,
+          isDryRun = false)
       // step 2: move stale segments which are not exists in metadata into .Trash
       moveStaleSegmentsToTrash(carbonTable)

Review comment:
      If you are calculating the trash size before calling `moveStaleSegmentsToTrash`, it gives wrong info, right? Because after the size calculation, some folders are getting moved into the trash folder again.

##########
File path: integration/spark/src/main/scala/org/apache/carbondata/trash/DataTrashManager.scala
##########
@@ -87,13 +108,70 @@ object DataTrashManager {
     }
   }
 
-  private def checkAndCleanTrashFolder(carbonTable: CarbonTable, isForceDelete: Boolean): Unit = {
+  /**
+   * Checks the size of the segment files as well as datafiles, this method is used before and after
+   * clean files operation to check how much space is actually freed, during the operation.
+   */
+  def getSizeScreenshot(carbonTable: CarbonTable): Long = {
+    val segmentPath = CarbonTablePath.getSegmentFilesLocation(carbonTable.getTablePath)
+    var size : Long = 0
+    if (!carbonTable.isHivePartitionTable) {
+      if (carbonTable.getTableInfo.getFactTable.getTableProperties.containsKey(
+          CarbonCommonConstants.FLAT_FOLDER)) {
+        // the size is table size + segment folder size - (metadata folder size + lockFiles size)
+        (FileFactory.getDirectorySize(carbonTable.getTablePath) + FileFactory.getDirectorySize(

Review comment:
      Here do not calculate the fact folder size directly, because when there are 1000s of segments it will be slow. We are already storing the data and index size in the metadata; please make use of it, it will be faster.
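
      Roughly this idea (a Java-side sketch; it assumes the data and index sizes recorded per load in `LoadMetadataDetails` as strings, so treat it as an illustration rather than the exact change):
   ```
   long tableSize = 0;
   for (LoadMetadataDetails oneLoad
       : SegmentStatusManager.readLoadMetadata(carbonTable.getMetadataPath())) {
     // sum the sizes already recorded in the table status file
     // instead of walking the fact folder on disk
     if (oneLoad.getDataSize() != null) {
       tableSize += Long.parseLong(oneLoad.getDataSize());
     }
     if (oneLoad.getIndexSize() != null) {
       tableSize += Long.parseLong(oneLoad.getIndexSize());
     }
   }
   ```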

##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/secondaryindex/events/CleanFilesPostEventListener.scala
##########
@@ -56,7 +56,6 @@ class CleanFilesPostEventListener extends OperationEventListener with Logging {
           cleanFilesPostEvent.carbonTable,
           cleanFilesPostEvent.options.getOrElse("force", "false").toBoolean,
           cleanFilesPostEvent.options.getOrElse("stale_inprogress", "false").toBoolean)
-

Review comment:
       revert this file

##########
File path: integration/spark/src/main/scala/org/apache/carbondata/trash/DataTrashManager.scala
##########
@@ -112,13 +141,91 @@ object DataTrashManager {
       carbonTable: CarbonTable,
       isForceDelete: Boolean,
       cleanStaleInProgress: Boolean,
-      partitionSpecsOption: Option[Seq[PartitionSpec]]): Unit = {
+      partitionSpecsOption: Option[Seq[PartitionSpec]]): Seq[Long] = {
     val partitionSpecs = partitionSpecsOption.map(_.asJava).orNull
-    SegmentStatusManager.deleteLoadsAndUpdateMetadata(carbonTable,
+    val sizeStatistics = SegmentStatusManager.deleteLoadsAndUpdateMetadata(carbonTable,
       isForceDelete, partitionSpecs, cleanStaleInProgress, true)
     if (carbonTable.isHivePartitionTable && partitionSpecsOption.isDefined) {
       SegmentFileStore.cleanSegments(carbonTable, partitionSpecs, isForceDelete)
     }
+    sizeStatistics
+  }
+
+  private def dryRunOnExpiredSegments(
+      carbonTable: CarbonTable,
+      isForceDelete: Boolean,
+      cleanStaleInProgress: Boolean,
+      partitionSpecsOption: Option[Seq[PartitionSpec]]): Seq[Long] = {
+    var sizeFreed: Long = 0
+    var trashSizeRemaining: Long = 0
+    val loadMetadataDetails = SegmentStatusManager.readLoadMetadata(carbonTable.getMetadataPath)
+    if (SegmentStatusManager.isLoadDeletionRequired(loadMetadataDetails)) {
+      loadMetadataDetails.foreach { oneLoad =>
+        val segmentFilePath = CarbonTablePath.getSegmentFilePath(carbonTable.getTablePath,
+          oneLoad.getSegmentFile)
+        if (DeleteLoadFolders.canDeleteThisLoad(oneLoad, isForceDelete, cleanStaleInProgress)) {
+          // No need to consider physical data for external segments, only consider metadata.
+          if (oneLoad.getPath() == null || oneLoad.getPath().equalsIgnoreCase("NA")) {
+            if (!carbonTable.isHivePartitionTable) {
+              sizeFreed += FileFactory.getDirectorySize(CarbonTablePath.getSegmentPath(carbonTable
+                  .getTablePath, oneLoad.getLoadName))
+            } else {
+              sizeFreed += partitionTableSegmentSize(carbonTable, oneLoad, loadMetadataDetails,
+                partitionSpecsOption)
+            }
+          }
+          sizeFreed += FileFactory.getCarbonFile(segmentFilePath).getSize
+        } else {
+          if (SegmentStatusManager.isExpiredSegment(oneLoad, carbonTable
+              .getAbsoluteTableIdentifier)) {
+            if (!carbonTable.isHivePartitionTable) {
+              trashSizeRemaining += FileFactory.getDirectorySize(CarbonTablePath.getSegmentPath(
+                  carbonTable.getTablePath, oneLoad.getLoadName))
+            } else {
+              trashSizeRemaining += partitionTableSegmentSize(carbonTable, oneLoad,
+                  loadMetadataDetails, partitionSpecsOption)
+            }
+            trashSizeRemaining += FileFactory.getCarbonFile(segmentFilePath).getSize
+          }
+        }
+      }
+    }
+    Seq(sizeFreed, trashSizeRemaining)
+  }
+
+  def partitionTableSegmentSize( carbonTable: CarbonTable, oneLoad: LoadMetadataDetails,
+      loadMetadataDetails: Array[LoadMetadataDetails], partitionSpecsOption:
+      Option[Seq[PartitionSpec]]) : Long = {
+    var segmentSize: Long = 0

Review comment:
      I agree with @ajantha-bhat. Maybe use the load metadata details' data and index size only, and for the update files you can use some logic; please check.
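
      Along these lines (a sketch; assumes the per-load data and index sizes recorded in `LoadMetadataDetails` as strings, so only an illustration):
   ```
   // per-segment size from the recorded metadata instead of listing partition files
   long segmentSize = Long.parseLong(oneLoad.getDataSize())
       + Long.parseLong(oneLoad.getIndexSize());
   // the update/delete delta files are not covered by these fields and
   // would still need separate handling
   ```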




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

