Background

Currently, in data management scenarios (data loading, segment compaction, etc.), there are several data deletion actions. These actions are dangerous because they are implemented in different places, and some corner cases can cause data to be deleted accidentally.
<http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/file/t431/image1.png>

Current Data Deletion in the Data Loading Process

First, an introduction to the current data loading flow:

1. Delete Stale Segments
This method deletes the segments that are not consistent with the table status. In the loading flow, it scans all the segments, adds the original segments (such as Segment_1, whose name part[1] does not contain a ".") to a staleSegments list, and then deletes every segment in that list.
<http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/file/t431/image2.png>

2. Delete Invalid Segments
There are 3 steps in deleting invalid segments:
(1) Delete expired locks. This method deletes locks that have expired (older than 48 hours).
(2) Check whether the data needs to be deleted, and move segments to the proper place. In the current design, it scans for and removes segments in 4 statuses (MARK_FOR_DELETE, COMPACTED, INSERT_IN_PROGRESS, INSERT_OVERWRITE_IN_PROGRESS). When this deletion method is reached from the loading flow, it scans the segments; if a segment meets the deletion requirement and invisibleSegmentCnt > invisibleSegmentPreserveCnt, the segment is added to the history file and then deleted.
<http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/file/t431/image3.png> <http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/file/t431/image4.png> <http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/file/t431/image5.png>
(3) Delete invalid data. In the final step, the data files that were moved to the history file are deleted.

3. Delete Temporary Files
By default, during the loading process CarbonData writes to a temp file first and copies it to the target path at the end of loading. This method deletes those temp files.
Data Deletion Hotfix in the Loading Process

By analysing the deletion actions during the loading process, we are going to make some modifications to the loading-flow deletion to prevent data from being deleted by accident. There are two steps to fix the problem:
(1) Replace the stale-segment cleaning function with the CleanFiles action.
(2) Ignore segments whose status is INSERT_IN_PROGRESS or INSERT_OVERWRITE_IN_PROGRESS, because a loading process might take a long time in a highly concurrent situation. These two kinds of segments are left to be deleted by the CleanFiles command.

Besides, there will be a recycle bin to store deleted files temporarily, so users can find their deleted segments in the recycle bin.
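The fixed behavior can be sketched as follows (again a hypothetical illustration, not the actual patch): in the loading flow, in-progress segments are skipped rather than deleted, and deletions go to a recycle bin instead of being removed outright; only an explicit CleanFiles command may reclaim in-progress segments:

```java
import java.util.Set;

public class LoadingCleanupFix {
    // After the fix, the loading flow itself only reclaims these statuses.
    static final Set<String> LOADING_REMOVABLE =
        Set.of("MARK_FOR_DELETE", "COMPACTED");

    // Concurrent writes may legitimately be in these statuses for a long
    // time, so the loading flow must not touch them.
    static final Set<String> DEFERRED_TO_CLEAN_FILES =
        Set.of("INSERT_IN_PROGRESS", "INSERT_OVERWRITE_IN_PROGRESS");

    // Hypothetical decision function: returns what happens to a segment of
    // the given status, depending on whether cleanup was triggered by the
    // loading flow or by an explicit CleanFiles command.
    static String decide(String status, boolean isCleanFilesCommand) {
        if (LOADING_REMOVABLE.contains(status)) {
            // Deleted files go to the recycle bin, not straight to removal.
            return "MOVE_TO_RECYCLE_BIN";
        }
        if (DEFERRED_TO_CLEAN_FILES.contains(status)) {
            return isCleanFilesCommand ? "MOVE_TO_RECYCLE_BIN" : "SKIP";
        }
        return "KEEP";
    }
}
```

The key design choice is that the loading flow never decides on its own that an in-progress segment is garbage; that judgement is deferred to the user-invoked CleanFiles command, and the recycle bin gives a window to recover anything removed by mistake.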