GitHub user dhatchayani reopened a pull request:
https://github.com/apache/carbondata/pull/1702
[CARBONDATA-1896] Clean files operation improvement
**Problem:**
When a session is brought up, the clean operation marks every
INSERT_OVERWRITE_IN_PROGRESS or INSERT_IN_PROGRESS segment as
MARKED_FOR_DELETE in the tablestatus file. This clean operation does not take
other parallel sessions into account. If another session's data load is
IN_PROGRESS when a session is brought up, the running load is also changed to
MARKED_FOR_DELETE, irrespective of its actual load status. Cleaning stale
segments during session bring-up also increases the time taken to bring up a
session.
**Solution:**
A SEGMENT_LOCK should be taken on the new segment while loading.
While cleaning segments, both the tablestatus file and the SEGMENT_LOCK should
be considered.
Cleaning of stale files during session bring-up should be removed. It can
instead be done manually on the required tables through the existing
CLEAN FILES DDL, or the next load on the table will clean them.
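
The proposed check can be sketched roughly as follows. This is a minimal illustration, not the code in this pull request: the class `SegmentCleanSketch`, its methods, and the in-memory set standing in for segment lock files are all hypothetical, and the tablestatus file is modelled as a plain map from segment id to status. A segment is marked for delete only if it is still in progress according to the table status and its segment lock can be acquired, which is taken to mean that no live load owns it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; names do not correspond to CarbonData classes.
class SegmentCleanSketch {

    enum Status { SUCCESS, INSERT_IN_PROGRESS, INSERT_OVERWRITE_IN_PROGRESS, MARKED_FOR_DELETE }

    // Assumption: an in-memory set stands in for the per-segment lock files.
    private static final Set<String> HELD_LOCKS = ConcurrentHashMap.newKeySet();

    // A load calls this before writing a new segment; returns false if the lock is already held.
    static boolean tryLockSegment(String segmentId) {
        return HELD_LOCKS.add(segmentId);
    }

    // A load calls this when it finishes or fails.
    static void unlockSegment(String segmentId) {
        HELD_LOCKS.remove(segmentId);
    }

    // Marks in-progress segments as MARKED_FOR_DELETE only when their lock is free.
    // Returns the ids of the segments that were marked.
    static List<String> cleanStaleSegments(Map<String, Status> tableStatus) {
        List<String> marked = new ArrayList<>();
        for (Map.Entry<String, Status> entry : tableStatus.entrySet()) {
            Status status = entry.getValue();
            boolean inProgress = status == Status.INSERT_IN_PROGRESS
                    || status == Status.INSERT_OVERWRITE_IN_PROGRESS;
            if (!inProgress) {
                continue; // finished or already marked segments are left alone
            }
            String segmentId = entry.getKey();
            if (tryLockSegment(segmentId)) {
                try {
                    // Lock acquired: no running load owns this segment, so it is stale.
                    entry.setValue(Status.MARKED_FOR_DELETE);
                    marked.add(segmentId);
                } finally {
                    unlockSegment(segmentId);
                }
            }
            // Lock not acquired: another session is still loading; skip the segment.
        }
        return marked;
    }
}
```

With segment "1" in progress and unlocked, and segment "2" in progress with its lock held by a running load, `cleanStaleSegments` would mark only "1" and leave "2" untouched, which is the behaviour the problem statement says is missing today.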
- [ ] Any interfaces changed?
- [ ] Any backward compatibility impacted?
- [ ] Document update required?
- [x] Testing done
Manual Testing
- [ ] For large changes, please consider breaking it into sub-tasks under
an umbrella JIRA.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dhatchayani/incubator-carbondata clean_files
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/carbondata/pull/1702.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1702
----
commit 4573f5fbcc7d0414323513e8746f9050f9eb1e78
Author: dhatchayani <dhatcha.official@...>
Date: 2017-12-20T17:05:31Z
[CARBONDATA-1896] Clean files operation improvement
----
---