[GitHub] [carbondata] marchpure commented on a change in pull request #3999: [CARBONDATA-4044] Fix dirty data in indexfile while IUD with stale data in segment folder

GitBox Tue, 03 Nov 2020 05:38:50 -0800


marchpure commented on a change in pull request #3999:
URL: https://github.com/apache/carbondata/pull/3999#discussion_r515819017




##########
File path: 
processing/src/main/java/org/apache/carbondata/processing/merger/CompactionType.java
##########
@@ -26,7 +26,6 @@
 public enum CompactionType {
     MINOR,
     MAJOR,
-    IUD_UPDDEL_DELTA,

Review comment:
       I have modified code according to your suggestion

##########
File path: docs/configuration-parameters.md
##########
@@ -104,8 +104,7 @@ This section provides the details of all the configurations 
required for the Car
 | carbon.number.of.cores.while.compacting | 2 | Number of cores to be used 
while compacting data. This also determines the number of threads to be used to 
read carbondata files in parallel. |
 | carbon.compaction.level.threshold | 4, 3 | Each CarbonData load will create 
one segment, if every load is small in size it will generate many small file 
over a period of time impacting the query performance. This configuration is 
for minor compaction which decides how many segments to be merged. 
Configuration is of the form (x,y). Compaction will be triggered for every x 
segments and form a single level 1 compacted segment. When the number of 
compacted level 1 segments reach y, compaction will be triggered again to merge 
them to form a single level 2 segment. For example: If it is set as 2, 3 then 
minor compaction will be triggered for every 2 segments. 3 is the number of 
level 1 compacted segments which is further compacted to new segment. **NOTE:** 
When ***carbon.enable.auto.load.merge*** is **true**, configuring higher values 
cause overall data loading time to increase as compaction will be triggered 
after data loading is complete but status is not returned till compaction is c
 omplete. But compacting more number of segments can increase query 
performance. Hence optimal values needs to be configured based on the business 
scenario. Valid values are between 0 to 100. |
 | carbon.major.compaction.size | 1024 | To improve query performance and all 
the segments can be merged and compacted to a single segment upto configured 
size. This Major compaction size can be configured using this parameter. Sum of 
the segments which is below this threshold will be merged. This value is 
expressed in MB. |
-| carbon.horizontal.compaction.enable | true | CarbonData supports 
DELETE/UPDATE functionality by creating delta data files for existing 
carbondata files. These delta files would grow as more number of DELETE/UPDATE 
operations are performed. Compaction of these delta files are termed as 
horizontal compaction. This configuration is used to turn ON/OFF horizontal 
compaction. After every DELETE and UPDATE statement, horizontal compaction may 
occur in case the delta (DELETE/ UPDATE) files becomes more than specified 
threshold. **NOTE:** Having many delta files will reduce the query performance 
as scan has to happen on all these files before the final state of data can be 
decided. Hence it is advisable to keep horizontal compaction enabled and 
configure reasonable values to 
***carbon.horizontal.UPDATE.compaction.threshold*** and 
***carbon.horizontal.DELETE.compaction.threshold*** |
-| carbon.horizontal.update.compaction.threshold | 1 | This configuration 
specifies the threshold limit on number of UPDATE delta files within a segment. 
In case the number of delta files goes beyond the threshold, the UPDATE delta 
files within the segment becomes eligible for horizontal compaction and are 
compacted into single UPDATE delta file. Values range between 1 to 10000. |
+| carbon.horizontal.compaction.enable | true | CarbonData supports 
DELETE/UPDATE functionality by creating delta data files for existing 
carbondata files. These delta files would grow as more number of DELETE/UPDATE 
operations are performed. Compaction of these delete delta files are termed as 
horizontal compaction. This configuration is used to turn ON/OFF horizontal 
compaction. After every DELETE and UPDATE statement, horizontal compaction may 
occur in case the delete delta files becomes more than specified threshold. 
**NOTE:** Having many delta files will reduce the query performance as scan has 
to happen on all these files before the final state of data can be decided. 
Hence it is advisable to keep horizontal compaction enabled and configure 
reasonable values to ***carbon.horizontal.UPDATE.compaction.threshold*** and 
***carbon.horizontal.DELETE.compaction.threshold*** |

Review comment:
       yes. it won't be unused anywhere

##########
File path: 
core/src/main/java/org/apache/carbondata/core/mutate/CarbonUpdateUtil.java
##########
@@ -373,9 +374,7 @@ public static void cleanStaleDeltaFiles(CarbonTable table, 
final String timeStam
         @Override
         public boolean accept(CarbonFile file) {
           String fileName = file.getName();
-          return (fileName.endsWith(timeStamp + 
CarbonCommonConstants.UPDATE_DELTA_FILE_EXT)

Review comment:
       I have modified code according to your suggestion

##########
File path: 
core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java
##########
@@ -808,19 +808,6 @@ private CarbonCommonConstants() {
    */
   public static final int NUMBER_OF_SEGMENT_COMPACTED_PERTIME_UPPER_LIMIT = 
10000;
 
-  /**
-   * Number of Update Delta files which is the Threshold for IUD compaction.
-   * Only accepted Range is 0 - 10000. Outside this range system will pick 
default value.
-   */
-  @CarbonProperty
-  public static final String UPDATE_DELTAFILE_COUNT_THRESHOLD_IUD_COMPACTION =

Review comment:
       I have modified code according to your suggestion

##########
File path: 
integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/CarbonProjectForUpdateCommand.scala
##########
@@ -268,6 +267,11 @@ private[sql] case class CarbonProjectForUpdateCommand(
         // When the table has too many segemnts, it will take a long time.

Review comment:
       I have modified code according to your suggestion

##########
File path: 
integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/CarbonProjectForUpdateCommand.scala
##########
@@ -268,6 +267,11 @@ private[sql] case class CarbonProjectForUpdateCommand(
         // When the table has too many segemnts, it will take a long time.
         // So moving it to the end and it is outside of locking.
         CarbonUpdateUtil.cleanStaleDeltaFiles(carbonTable, fileTimestamp)

Review comment:
       I will raise another PR and give an optimazation in this week

##########
File path: 
integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/CarbonProjectForUpdateCommand.scala
##########
@@ -158,8 +159,6 @@ private[sql] case class CarbonProjectForUpdateCommand(
                 "for the update key")
             }
           }
-          // handle the clean up of IUD.
-          CarbonUpdateUtil.cleanUpDeltaFiles(carbonTable, false)

Review comment:
       Clean Files Command maybe a better solution.

##########
File path: 
integration/spark/src/main/scala/org/apache/carbondata/spark/rdd/CarbonDataRDDFactory.scala
##########
@@ -1050,14 +752,15 @@ object CarbonDataRDDFactory {
     }
     var done = true
     // If the updated data should be added as new segment then update the 
segment information
-    if (updateModel.isDefined && updateModel.get.loadAsNewSegment) {
+    if (updateModel.isDefined) {

Review comment:
       I will raise another PR and give an optimazation in this week

##########
File path: 
processing/src/main/java/org/apache/carbondata/processing/util/CarbonLoaderUtil.java
##########
@@ -1243,10 +1256,14 @@ public static void 
updateTableStatusInCaseOfFailure(String loadName,
             SegmentStatusManager.readLoadMetadata(metaDataPath);
         boolean ifTableStatusUpdateRequired = false;
         for (LoadMetadataDetails loadMetadataDetail : loadMetadataDetails) {
-          if (loadMetadataDetail.getSegmentStatus() == 
SegmentStatus.INSERT_IN_PROGRESS && loadName
-              .equalsIgnoreCase(loadMetadataDetail.getLoadName())) {
+          if ((loadMetadataDetail.getSegmentStatus() == 
SegmentStatus.INSERT_IN_PROGRESS
+              || loadMetadataDetail.getSegmentStatus() == SegmentStatus.SUCCESS

Review comment:
       I have modified code according to your suggestion

##########
File path: 
integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/CarbonProjectForUpdateCommand.scala
##########
@@ -268,6 +267,11 @@ private[sql] case class CarbonProjectForUpdateCommand(
         // When the table has too many segemnts, it will take a long time.
         // So moving it to the end and it is outside of locking.
         CarbonUpdateUtil.cleanStaleDeltaFiles(carbonTable, fileTimestamp)
+
+        // Delete The New Inserted Segment
+        CarbonLoaderUtil.updateTableStatusInCaseOfFailure(

Review comment:
       I have modified code according to your suggestion

##########
File path: 
integration/spark/src/main/scala/org/apache/carbondata/spark/rdd/CarbonDataRDDFactory.scala
##########
@@ -342,33 +336,8 @@ object CarbonDataRDDFactory {
 
     try {
       if (!carbonLoadModel.isCarbonTransactionalTable || 
segmentLock.lockWithRetries()) {
-        if (updateModel.isDefined && !updateModel.get.loadAsNewSegment) {
-          res = loadDataFrameForUpdate(
-            sqlContext,
-            dataFrame,
-            carbonLoadModel,
-            updateModel,
-            carbonTable,
-            hadoopConf,
-            segmentMetaDataAccumulator)
-          res.foreach { resultOfSeg =>
-            resultOfSeg.foreach { resultOfBlock =>
-              if (resultOfBlock._2._1.getSegmentStatus == 
SegmentStatus.LOAD_FAILURE) {
-                loadStatus = SegmentStatus.LOAD_FAILURE
-                if (resultOfBlock._2._2.failureCauses == FailureCauses.NONE) {
-                  updateModel.get.executorErrors.failureCauses = 
FailureCauses.EXECUTOR_FAILURE
-                  updateModel.get.executorErrors.errorMsg = "Failure in the 
Executor."
-                } else {
-                  updateModel.get.executorErrors = resultOfBlock._2._2
-                }
-              } else if (resultOfBlock._2._1.getSegmentStatus ==
-                         SegmentStatus.LOAD_PARTIAL_SUCCESS) {
-                loadStatus = SegmentStatus.LOAD_PARTIAL_SUCCESS
-                updateModel.get.executorErrors.failureCauses = 
resultOfBlock._2._2.failureCauses
-                updateModel.get.executorErrors.errorMsg = 
resultOfBlock._2._2.errorMsg
-              }
-            }
-          }
+        if (updateModel.isDefined && dataFrame.get.rdd.isEmpty()) {
+          // if the rowtoupdated is empty, do nothing

Review comment:
       I have modified code according to your suggestion

##########
File path: 
core/src/main/java/org/apache/carbondata/core/mutate/CarbonUpdateUtil.java
##########
@@ -276,14 +278,13 @@ public static boolean 
updateTableMetadataStatus(Set<Segment> updatedSegmentsList
                 SegmentStatusManager.readLoadMetadata(metaDataFilepath);
 
         for (LoadMetadataDetails loadMetadata : listOfLoadFolderDetailsArray) {
+          if (isUpdateStatusFileUpdateRequired &&
+              loadMetadata.getLoadName().equalsIgnoreCase("0")) {
+            loadMetadata.setUpdateStatusFileName(
+                CarbonUpdateUtil.getUpdateStatusFileName(updatedTimeStamp));
+          }
 
           if (isTimestampUpdateRequired) {
-            // we are storing the link between the 2 status files in the 
segment 0 only.

Review comment:
       I have modified code according to your suggestion

##########
File path: 
core/src/main/java/org/apache/carbondata/core/indexstore/blockletindex/SegmentIndexFileStore.java
##########
@@ -101,8 +101,8 @@ public SegmentIndexFileStore(Configuration configuration) {
    * @param segmentPath
    * @throws IOException
    */
-  public void readAllIIndexOfSegment(String segmentPath) throws IOException {
-    CarbonFile[] carbonIndexFiles = getCarbonIndexFiles(segmentPath, 
configuration);
+  public void readAllIIndexOfSegment(String segmentPath, String uuid) throws 
IOException {

Review comment:
       I have modified code according to your suggestion




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [carbondata] marchpure commented on a change in pull request #3999: [CARBONDATA-4044] Fix dirty data in indexfile while IUD with stale data in segment folder

Reply via email to