akashrn5 commented on a change in pull request #4072:
URL: https://github.com/apache/carbondata/pull/4072#discussion_r578921513
##########
File path: core/src/main/java/org/apache/carbondata/core/metadata/SegmentFileStore.java
##########
@@ -1135,13 +1135,16 @@ public static void deleteSegment(String tablePath, Segment segment,
 Map<String, List<String>> indexFilesMap = fileStore.getIndexFilesMap();
 for (Map.Entry<String, List<String>> entry : indexFilesMap.entrySet()) {
 FileFactory.deleteFile(entry.getKey());
+ LOGGER.info("File deleted after clean files operation: " + entry.getKey());
 for (String file : entry.getValue()) {
 String[] deltaFilePaths =
 updateStatusManager.getDeleteDeltaFilePath(file, segment.getSegmentNo());
 for (String deltaFilePath : deltaFilePaths) {
 FileFactory.deleteFile(deltaFilePath);
+ LOGGER.info("File deleted after clean files operation: " + deltaFilePath);
Review comment:
Instead of logging each file name on every delete, which will bloat the logs when there are many delta files, you can log once for all the files after the loop completes (around line 1147), along with the actual block path. Or else you can log that the block file (print the block file name) and the corresponding delta files were deleted, as the timestamp will be the same, I guess. Just check once and add.
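For illustration, a minimal sketch of the single-log shape the comment asks for (all identifiers are reused from the diff above; the snippet assumes the surrounding method context of `deleteSegment`, and the joined message format is only illustrative):
```java
// Collect deleted paths and log once after the loops complete,
// instead of one log line per file.
List<String> deletedFiles = new ArrayList<>();
for (Map.Entry<String, List<String>> entry : indexFilesMap.entrySet()) {
  FileFactory.deleteFile(entry.getKey());
  deletedFiles.add(entry.getKey());
  for (String file : entry.getValue()) {
    String[] deltaFilePaths =
        updateStatusManager.getDeleteDeltaFilePath(file, segment.getSegmentNo());
    for (String deltaFilePath : deltaFilePaths) {
      FileFactory.deleteFile(deltaFilePath);
      deletedFiles.add(deltaFilePath);
    }
  }
}
// one log entry covering the block files and their delta files
LOGGER.info("Files deleted on clean files operation: " + String.join(", ", deletedFiles));
```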
##########
File path: core/src/main/java/org/apache/carbondata/core/metadata/SegmentFileStore.java
##########
@@ -1135,13 +1135,16 @@ public static void deleteSegment(String tablePath, Segment segment,
 Map<String, List<String>> indexFilesMap = fileStore.getIndexFilesMap();
 for (Map.Entry<String, List<String>> entry : indexFilesMap.entrySet()) {
 FileFactory.deleteFile(entry.getKey());
+ LOGGER.info("File deleted after clean files operation: " + entry.getKey());
Review comment:
```suggestion
LOGGER.info("Deleted file: " + entry.getKey() + ", on clean files" );
```
Can do the same for all.
##########
File path: core/src/main/java/org/apache/carbondata/core/util/TrashUtil.java
##########
@@ -152,46 +153,77 @@ public static void copyFilesToTrash(List<String> filesToCopy,
 * The below method deletes timestamp subdirectories in the trash folder which have expired as
 * per the user defined retention time
 */
- public static void deleteExpiredDataFromTrash(String tablePath) {
+ public static long[] deleteExpiredDataFromTrash(String tablePath, Boolean isDryRun) {
Review comment:
Update the method comment based on the method signature changes and return values; follow the same for the others too, if changed.
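For example, the updated comment could read something like this (wording is only a sketch, derived from the `long[]` return value visible in the diff later in this thread):
```java
/**
 * Deletes the expired timestamp subdirectories in the trash folder as per the
 * user defined retention time, or only measures them when isDryRun is true.
 *
 * @param tablePath the table whose trash folder is to be cleaned
 * @param isDryRun  when true, nothing is deleted and only sizes are computed
 * @return a long array where index 0 is the size freed (or freeable on dry run)
 *         and index 1 is the size remaining in the trash folder
 */
```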
##########
File path: docs/clean-files.md
##########
@@ -64,4 +64,40 @@ The stale_inprogress option with force option will delete Marked for delete, Com
 ```
 CLEAN FILES FOR TABLE TABLE_NAME options('stale_inprogress'='true', 'force'='true')
- ```
\ No newline at end of file
+ ```
+### DRY RUN OPTION
+Clean files also support a dry run option which will let the user know how much space fill we freed
+during the actual clean files operation. The dry run operation will not delete any data but will just give
+size bases statistics to the data. Dry run operation will return two columns where the first will
+show how much space will be freed by that clean files operation and the second column will show the
+remaining stale data(data which can be deleted but has not yet expired as per the ```max.query.execution.time``` and ``` carbon.trash.retention.days``` values
+). By default the value of ```dryrun``` option is ```false```.
+
+Dry Run Operation is supported with four types of commands:
+ ```
+ CLEAN FILES FOR TABLE TABLE_NAME options('dryrun'='true')
+ ```
+ ```
+ CLEAN FILES FOR TABLE TABLE_NAME options('force'='true', 'dryrun'='true')
+ ```
+ ```
 CLEAN FILES FOR TABLE TABLE_NAME options('stale_inprogress'='true','dryrun'='true')
+ ```
+
+ ```
 CLEAN FILES FOR TABLE TABLE_NAME options('stale_inprogress'='true', 'force'='true','dryrun'='true')
+ ```
+
+**NOTE**:
 * Since the dry run operation will calculate size and will access File level API's, the operation can
 be a costly and a time consuming operation in case of tables with large number of segments.
Review comment:
Better to add a point here that when dry run is true, the other options don't matter except force = true. And I hope you have handled this in the code.
##########
File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentStatusManager.java
##########
@@ -983,7 +983,28 @@ public static boolean isLoadInProgress(AbsoluteTableIdentifier absoluteTableIden
 }
 }
- private static boolean isLoadDeletionRequired(LoadMetadataDetails[] details) {
+ public static boolean isExpiredSegment(LoadMetadataDetails oneLoad, AbsoluteTableIdentifier
+ absoluteTableIdentifier) {
+ boolean result = false;
Review comment:
Rename `result` to `isExpiredSegment`.
##########
File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentStatusManager.java
##########
@@ -983,7 +983,28 @@ public static boolean isLoadInProgress(AbsoluteTableIdentifier absoluteTableIden
 }
 }
- private static boolean isLoadDeletionRequired(LoadMetadataDetails[] details) {
+ public static boolean isExpiredSegment(LoadMetadataDetails oneLoad, AbsoluteTableIdentifier
+ absoluteTableIdentifier) {
+ boolean result = false;
+ if (oneLoad.getSegmentStatus() == SegmentStatus.COMPACTED || oneLoad.getSegmentStatus() ==
+ SegmentStatus.MARKED_FOR_DELETE) {
+ return true;
+ } else if (oneLoad.getSegmentStatus() == SegmentStatus.INSERT_IN_PROGRESS || oneLoad
+ .getSegmentStatus() == SegmentStatus.INSERT_OVERWRITE_IN_PROGRESS) {
+ // check if lock can be acquired
+ ICarbonLock segmentLock = CarbonLockFactory.getCarbonLockObj(absoluteTableIdentifier,
+ CarbonTablePath.addSegmentPrefix(oneLoad.getLoadName()) + LockUsage.LOCK);
+ if (segmentLock.lockWithRetries()) {
+ result = true;
+ segmentLock.unlock();
Review comment:
Use the `CarbonLockUtil.fileUnlock` method, as it also logs; please follow the other places that take a lock, and it is better to have the unlock in a finally block.
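A rough sketch of the suggested shape (assuming `CarbonLockUtil.fileUnlock(lock, lockType)` as used elsewhere in the codebase; identifiers come from the diff above):
```java
ICarbonLock segmentLock = CarbonLockFactory.getCarbonLockObj(absoluteTableIdentifier,
    CarbonTablePath.addSegmentPrefix(oneLoad.getLoadName()) + LockUsage.LOCK);
if (segmentLock.lockWithRetries()) {
  try {
    // the lock could be acquired, so no in-progress load owns this segment
    result = true;
  } finally {
    // fileUnlock also logs the unlock, unlike a bare segmentLock.unlock()
    CarbonLockUtil.fileUnlock(segmentLock, LockUsage.LOCK);
  }
}
```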
##########
File path: core/src/main/java/org/apache/carbondata/core/util/CleanFilesUtil.java
##########
@@ -71,14 +71,19 @@ public static void cleanStaleSegments(CarbonTable carbonTable)
 // Deleting the stale Segment folders and the segment file.
 try {
 CarbonUtil.deleteFoldersAndFiles(segmentPath);
+ LOGGER.info("Deleted the segment folder :" + segmentPath.getAbsolutePath() + " after"
+ + " moving it to the trash folder");
 // delete the segment file as well
 FileFactory.deleteFile(CarbonTablePath.getSegmentFilePath(carbonTable.getTablePath(),
 staleSegmentFile));
+ LOGGER.info("Deleted stale segment file after moving it to the trash folder :"
+ + staleSegmentFile);
Review comment:
```suggestion
deleted stale segment file <segment_file_name> after moving it to the trash folder
```
##########
File path: core/src/main/java/org/apache/carbondata/core/util/TrashUtil.java
##########
@@ -152,46 +153,77 @@ public static void copyFilesToTrash(List<String> filesToCopy,
 * The below method deletes timestamp subdirectories in the trash folder which have expired as
 * per the user defined retention time
 */
- public static void deleteExpiredDataFromTrash(String tablePath) {
+ public static long[] deleteExpiredDataFromTrash(String tablePath, Boolean isDryRun) {
 CarbonFile trashFolder = FileFactory.getCarbonFile(CarbonTablePath
 .getTrashFolderPath(tablePath));
+ long sizeFreed = 0;
+ long trashFolderSize = 0;
 // Deleting the timestamp based subdirectories in the trashfolder by the given timestamp.
 try {
 if (trashFolder.isFileExist()) {
+ trashFolderSize = FileFactory.getDirectorySize(trashFolder.getAbsolutePath());
Review comment:
I think you should not calculate the size when statistics is false; calculate it only when it is true, else it will affect the clean files time. Please take care not to impact the time of clean files.
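Something along these lines, assuming the dry run flag is what decides whether the statistics are needed (a sketch, not the final shape):
```java
// Walk the trash directory only when the size statistics are requested,
// so the regular clean files path skips the expensive directory scan.
if (isDryRun) {
  trashFolderSize = FileFactory.getDirectorySize(trashFolder.getAbsolutePath());
}
```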
##########
File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentStatusManager.java
##########
@@ -983,7 +983,28 @@ public static boolean isLoadInProgress(AbsoluteTableIdentifier absoluteTableIden
 }
 }
- private static boolean isLoadDeletionRequired(LoadMetadataDetails[] details) {
+ public static boolean isExpiredSegment(LoadMetadataDetails oneLoad, AbsoluteTableIdentifier
+ absoluteTableIdentifier) {
+ boolean result = false;
+ if (oneLoad.getSegmentStatus() == SegmentStatus.COMPACTED || oneLoad.getSegmentStatus() ==
+ SegmentStatus.MARKED_FOR_DELETE) {
+ return true;
+ return true;
Review comment:
Here also assign to the boolean variable, as it's just checking one segment.
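That is, roughly (sketch based on the diff above):
```java
// assign to the flag instead of returning early, keeping a single exit point
if (oneLoad.getSegmentStatus() == SegmentStatus.COMPACTED
    || oneLoad.getSegmentStatus() == SegmentStatus.MARKED_FOR_DELETE) {
  result = true;
}
// ... remaining status checks ...
return result;
```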
##########
File path: core/src/main/java/org/apache/carbondata/core/util/CleanFilesUtil.java
##########
@@ -71,14 +71,19 @@ public static void cleanStaleSegments(CarbonTable carbonTable)
 // Deleting the stale Segment folders and the segment file.
 try {
 CarbonUtil.deleteFoldersAndFiles(segmentPath);
+ LOGGER.info("Deleted the segment folder :" + segmentPath.getAbsolutePath() + " after"
+ + " moving it to the trash folder");
 // delete the segment file as well
 FileFactory.deleteFile(CarbonTablePath.getSegmentFilePath(carbonTable.getTablePath(),
 staleSegmentFile));
+ LOGGER.info("Deleted stale segment file after moving it to the trash folder :"
+ + staleSegmentFile);
 for (String duplicateStaleSegmentFile : redundantSegmentFile) {
 if (DataFileUtil.getSegmentNoFromSegmentFile(duplicateStaleSegmentFile)
 .equals(segmentNumber)) {
 FileFactory.deleteFile(CarbonTablePath.getSegmentFilePath(carbonTable
 .getTablePath(), duplicateStaleSegmentFile));
+ LOGGER.info("Deleted redundant segment file :" + duplicateStaleSegmentFile);
Review comment:
Can log all the deleted files together after the loop; please also check all the other places and try to follow the same, just to make the logs cleaner.
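A possible sketch for this loop (identifiers from the diff above; the joined message is illustrative):
```java
// gather the deleted redundant segment files and emit a single log line
List<String> deletedSegmentFiles = new ArrayList<>();
for (String duplicateStaleSegmentFile : redundantSegmentFile) {
  if (DataFileUtil.getSegmentNoFromSegmentFile(duplicateStaleSegmentFile)
      .equals(segmentNumber)) {
    FileFactory.deleteFile(CarbonTablePath.getSegmentFilePath(carbonTable
        .getTablePath(), duplicateStaleSegmentFile));
    deletedSegmentFiles.add(duplicateStaleSegmentFile);
  }
}
if (!deletedSegmentFiles.isEmpty()) {
  LOGGER.info("Deleted redundant segment files: " + String.join(", ", deletedSegmentFiles));
}
```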
##########
File path: core/src/main/java/org/apache/carbondata/core/util/TrashUtil.java
##########
@@ -152,46 +153,77 @@ public static void copyFilesToTrash(List<String> filesToCopy,
 * The below method deletes timestamp subdirectories in the trash folder which have expired as
 * per the user defined retention time
 */
- public static void deleteExpiredDataFromTrash(String tablePath) {
+ public static long[] deleteExpiredDataFromTrash(String tablePath, Boolean isDryRun) {
 CarbonFile trashFolder = FileFactory.getCarbonFile(CarbonTablePath
 .getTrashFolderPath(tablePath));
+ long sizeFreed = 0;
+ long trashFolderSize = 0;
 // Deleting the timestamp based subdirectories in the trashfolder by the given timestamp.
 try {
 if (trashFolder.isFileExist()) {
+ trashFolderSize = FileFactory.getDirectorySize(trashFolder.getAbsolutePath());
 CarbonFile[] timestampFolderList = trashFolder.listFiles();
+ List<CarbonFile> filesToDelete = new ArrayList<>();
 for (CarbonFile timestampFolder : timestampFolderList) {
 // If the timeStamp at which the timeStamp subdirectory has expired as per the user
 // defined value, delete the complete timeStamp subdirectory
- if (timestampFolder.isDirectory() && isTrashRetentionTimeoutExceeded(Long
- .parseLong(timestampFolder.getName()))) {
- FileFactory.deleteAllCarbonFilesOfDir(timestampFolder);
- LOGGER.info("Timestamp subfolder from the Trash folder deleted: " + timestampFolder
+ if (isTrashRetentionTimeoutExceeded(Long.parseLong(timestampFolder.getName()))) {
+ if (timestampFolder.isDirectory()) {
+ sizeFreed += FileFactory.getDirectorySize(timestampFolder.getAbsolutePath());
+ filesToDelete.add(timestampFolder);
+ }
+ }
+ }
+ if (!isDryRun) {
+ for (CarbonFile carbonFile : filesToDelete) {
+ LOGGER.info("Timestamp subfolder from the Trash folder deleted: " + carbonFile
 .getAbsolutePath());
+ FileFactory.deleteAllCarbonFilesOfDir(carbonFile);
 }
 }
 }
 } catch (IOException e) {
 LOGGER.error("Error during deleting expired timestamp folder from the trash folder", e);
 }
+ return new long[] {sizeFreed, trashFolderSize - sizeFreed};
 }
/**
 * The below method deletes all the files and folders in the trash folder of a carbon table.
+ * Returns an array in which the first element contains the size freed in case of clean files
+ * operation or size that can be freed in case of dry run and the second element contains the
+ * remaining size.
*/
- public static void emptyTrash(String tablePath) {
+ public static long[] emptyTrash(String tablePath, Boolean isDryRun) {
CarbonFile trashFolder = FileFactory.getCarbonFile(CarbonTablePath
.getTrashFolderPath(tablePath));
// if the trash folder exists delete the contents of the trash folder
+ long sizeFreed = 0;
+ long[] sizeStatistics = new long[]{0, 0};
try {
if (trashFolder.isFileExist()) {
CarbonFile[] carbonFileList = trashFolder.listFiles();
+ List<CarbonFile> filesToDelete = new ArrayList<>();
for (CarbonFile carbonFile : carbonFileList) {
- FileFactory.deleteAllCarbonFilesOfDir(carbonFile);
+ sizeFreed += FileFactory.getDirectorySize(carbonFile.getAbsolutePath());
Review comment:
same as above
##########
File path: docs/clean-files.md
##########
@@ -64,4 +64,40 @@ The stale_inprogress option with force option will delete Marked for delete, Com
 ```
 CLEAN FILES FOR TABLE TABLE_NAME options('stale_inprogress'='true', 'force'='true')
- ```
\ No newline at end of file
+ ```
+### DRY RUN OPTION
+Clean files also support a dry run option which will let the user know how much space fill we freed
+during the actual clean files operation. The dry run operation will not delete any data but will just give
+size bases statistics to the data. Dry run operation will return two columns where the first will
Review comment:
```suggestion
size based statistics on the data which will be cleaned in clean files. Dry run operation will return two columns where the first will
```
##########
File path: integration/spark/src/main/scala/org/apache/carbondata/trash/DataTrashManager.scala
##########
@@ -72,11 +78,26 @@ object DataTrashManager {
 carbonDeleteSegmentLock = CarbonLockUtil.getLockObject(carbonTable
 .getAbsoluteTableIdentifier, LockUsage.DELETE_SEGMENT_LOCK, deleteSegmentErrorMsg)
 // step 1: check and clean trash folder
- checkAndCleanTrashFolder(carbonTable, isForceDelete)
+ // trashFolderSizeStats(0) contains the size that is freed/or can be freed and
+ // trashFolderSizeStats(1) contains the size of remaining data in the trash folder
+ val trashFolderSizeStats = checkAndCleanTrashFolder(carbonTable, isForceDelete,
+ isDryRun = false)
 // step 2: move stale segments which are not exists in metadata into .Trash
 moveStaleSegmentsToTrash(carbonTable)
Review comment:
If you calculate the trash size before calling `moveStaleSegmentsToTrash`, it gives wrong info, right? Because after the size calculation, some folders get moved into the trash folder again.
##########
File path: integration/spark/src/main/scala/org/apache/carbondata/trash/DataTrashManager.scala
##########
@@ -87,13 +108,70 @@ object DataTrashManager {
}
}
- private def checkAndCleanTrashFolder(carbonTable: CarbonTable, isForceDelete: Boolean): Unit = {
+ /**
+ * Checks the size of the segment files as well as datafiles, this method is used before and after
+ * clean files operation to check how much space is actually freed, during the operation.
+ */
+ def getSizeScreenshot(carbonTable: CarbonTable): Long = {
+ val segmentPath = CarbonTablePath.getSegmentFilesLocation(carbonTable.getTablePath)
+ var size : Long = 0
+ if (!carbonTable.isHivePartitionTable) {
+ if (carbonTable.getTableInfo.getFactTable.getTableProperties.containsKey(
+ CarbonCommonConstants.FLAT_FOLDER)) {
+ // the size is table size + segment folder size - (metadata folder size + lockFiles size)
+ (FileFactory.getDirectorySize(carbonTable.getTablePath) + FileFactory.getDirectorySize(
Review comment:
Here, do not calculate the fact folder size directly, because when there are 1000s of segments it will be slow. We are already storing the data and index size in the metadata; please make use of it and it will be faster.
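A hedged sketch of that idea, shown in Java like the other snippets (`readLoadMetadata`, `getDataSize` and `getIndexSize` are assumed from CarbonData's metadata classes, where the sizes are stored as strings in the table status):
```java
// sum the per-segment data and index sizes recorded in the table status
// instead of walking the fact directory on the file system
long sizeFromMetadata = 0L;
LoadMetadataDetails[] details =
    SegmentStatusManager.readLoadMetadata(carbonTable.getMetadataPath());
for (LoadMetadataDetails detail : details) {
  if (detail.getDataSize() != null) {
    sizeFromMetadata += Long.parseLong(detail.getDataSize());
  }
  if (detail.getIndexSize() != null) {
    sizeFromMetadata += Long.parseLong(detail.getIndexSize());
  }
}
```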
##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/secondaryindex/events/CleanFilesPostEventListener.scala
##########
@@ -56,7 +56,6 @@ class CleanFilesPostEventListener extends OperationEventListener with Logging {
 cleanFilesPostEvent.carbonTable,
 cleanFilesPostEvent.options.getOrElse("force", "false").toBoolean,
 cleanFilesPostEvent.options.getOrElse("stale_inprogress", "false").toBoolean)
-
Review comment:
Revert this file.
##########
File path: integration/spark/src/main/scala/org/apache/carbondata/trash/DataTrashManager.scala
##########
@@ -112,13 +141,91 @@ object DataTrashManager {
carbonTable: CarbonTable,
isForceDelete: Boolean,
cleanStaleInProgress: Boolean,
- partitionSpecsOption: Option[Seq[PartitionSpec]]): Unit = {
+ partitionSpecsOption: Option[Seq[PartitionSpec]]): Seq[Long] = {
val partitionSpecs = partitionSpecsOption.map(_.asJava).orNull
- SegmentStatusManager.deleteLoadsAndUpdateMetadata(carbonTable,
+ val sizeStatistics = SegmentStatusManager.deleteLoadsAndUpdateMetadata(carbonTable,
isForceDelete, partitionSpecs, cleanStaleInProgress, true)
if (carbonTable.isHivePartitionTable && partitionSpecsOption.isDefined) {
 SegmentFileStore.cleanSegments(carbonTable, partitionSpecs, isForceDelete)
}
+ sizeStatistics
+ }
+
+ private def dryRunOnExpiredSegments(
+ carbonTable: CarbonTable,
+ isForceDelete: Boolean,
+ cleanStaleInProgress: Boolean,
+ partitionSpecsOption: Option[Seq[PartitionSpec]]): Seq[Long] = {
+ var sizeFreed: Long = 0
+ var trashSizeRemaining: Long = 0
+ val loadMetadataDetails = SegmentStatusManager.readLoadMetadata(carbonTable.getMetadataPath)
+ if (SegmentStatusManager.isLoadDeletionRequired(loadMetadataDetails)) {
+ loadMetadataDetails.foreach { oneLoad =>
+ val segmentFilePath = CarbonTablePath.getSegmentFilePath(carbonTable.getTablePath,
+ oneLoad.getSegmentFile)
+ if (DeleteLoadFolders.canDeleteThisLoad(oneLoad, isForceDelete, cleanStaleInProgress)) {
+ // No need to consider physical data for external segments, only consider metadata.
+ if (oneLoad.getPath() == null || oneLoad.getPath().equalsIgnoreCase("NA")) {
+ if (!carbonTable.isHivePartitionTable) {
+ sizeFreed += FileFactory.getDirectorySize(CarbonTablePath.getSegmentPath(carbonTable
+ .getTablePath, oneLoad.getLoadName))
+ } else {
+ sizeFreed += partitionTableSegmentSize(carbonTable, oneLoad, loadMetadataDetails,
+ partitionSpecsOption)
+ }
+ }
+ sizeFreed += FileFactory.getCarbonFile(segmentFilePath).getSize
+ } else {
+ if (SegmentStatusManager.isExpiredSegment(oneLoad, carbonTable
+ .getAbsoluteTableIdentifier)) {
+ if (!carbonTable.isHivePartitionTable) {
+ trashSizeRemaining += FileFactory.getDirectorySize(CarbonTablePath.getSegmentPath(
+ carbonTable.getTablePath, oneLoad.getLoadName))
+ } else {
+ trashSizeRemaining += partitionTableSegmentSize(carbonTable, oneLoad,
+ loadMetadataDetails, partitionSpecsOption)
+ }
+ trashSizeRemaining += FileFactory.getCarbonFile(segmentFilePath).getSize
+ }
+ }
+ }
+ }
+ Seq(sizeFreed, trashSizeRemaining)
+ }
+
+ def partitionTableSegmentSize( carbonTable: CarbonTable, oneLoad: LoadMetadataDetails,
+ loadMetadataDetails: Array[LoadMetadataDetails], partitionSpecsOption:
+ Option[Seq[PartitionSpec]]) : Long = {
+ var segmentSize: Long = 0
Review comment:
I agree with @ajantha-bhat. Maybe use only the data and index size from the load metadata details, and for the update files you can use some logic; please check.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]