GitHub user xuchuanyin opened a pull request:
https://github.com/apache/carbondata/pull/1632
[CARBONDATA-1839] [DataLoad] Fix bugs in compressing sort temp files
Be sure to complete the following checklist to help us incorporate
your contribution quickly and easily:
- [X] Any interfaces changed?
`YES, ONLY CHANGE INTERNAL INTERFACES`
- [X] Any backward compatibility impacted?
`NO`
- [X] Document update required?
`YES`
- [X] Testing done
Please provide details on
- Whether new unit test cases have been added or why no new tests
are required?
`ADDED TESTS`
- How it is tested? Please attach test report.
`TESTED IN LOCAL CLUSTER`
- Is it a performance related change? Please attach the performance
test report.
`YES`
- Any additional information to help reviewers in testing this
change.
`Some duplicate code in writing sort temp files was found
during this bug fix; I plan to optimize it in a successive PR, not in this
one.`
- [X] For large changes, please consider breaking it into sub-tasks under
an umbrella JIRA.
`NOT RELATED`
RESOLVE
===
1. Fix bugs in compressing sort temp file
2. Reduce duplicate code in reading & writing sort temp file
and make it more readable
3. Optimize sort procedure:
Before:
```flow
st=>start: raw row that has been converted (call it 'RawRow' for short)
e=>end: write 'PartedRow' to DataFile in write procedure
op1=>operation: read RawRow from temp sort file
op2=>operation: sort on RawRow
op3=>operation: write RawRow to temp sort file
cond=>condition: final sort?
op4=>operation: sort on RawRow
op5=>operation: convert each RawRow to 3 'PartedRow'
st->op1->op2->op3->cond
cond(no)->op1
cond(yes)->op4->op5->e
```
After:
```flow
st=>start: raw row that has been converted (call it 'RawRow' for short)
e=>end: write 'PartedRow' to DataFile in write procedure
op1=>operation: convert RawRow to 3 'PartedRow'
op2=>operation: read PartedRow from temp sort file
op3=>operation: sort on PartedRow
op4=>operation: write PartedRow to temp sort file
cond=>condition: final sort?
op5=>operation: sort on PartedRow
st->op1->op2->op3->op4->cond
cond(no)->op2
cond(yes)->op5->e
```
4. Add tests to enable sort_temp_file_compressed while doing data loading
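
The "After" flow in item 3 can be illustrated with a minimal Java sketch. It is not the code in this PR: `SortTempSketch`, `PartedRow`, `convert`, and the `"key,payload"` row format are hypothetical names made up for illustration, GZIP from `java.util.zip` stands in for whatever compressor CarbonData actually uses, and the final step re-sorts everything in memory where a real loader would presumably merge the sorted files. The point it shows is that each row is converted once, before the first sort, so the temp files (compressed or not) only ever hold the converted form.

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;
import java.util.zip.*;

class SortTempSketch {
    /** Hypothetical stand-in for a 'PartedRow': a sort key plus the remaining payload. */
    static final class PartedRow {
        final int key;
        final String payload;
        PartedRow(int key, String payload) { this.key = key; this.payload = payload; }
    }

    static final Comparator<PartedRow> BY_KEY = Comparator.comparingInt(r -> r.key);

    /** Convert a raw row (assumed format "key,payload") once, before any sorting. */
    static PartedRow convert(String rawRow) {
        int comma = rawRow.indexOf(',');
        return new PartedRow(Integer.parseInt(rawRow.substring(0, comma)),
                             rawRow.substring(comma + 1));
    }

    /** Sort one batch in memory and write it to a temp file, optionally compressed. */
    static Path writeSortedBatch(List<PartedRow> batch, boolean compress) throws IOException {
        batch.sort(BY_KEY);
        Path file = Files.createTempFile("sorttemp", ".tmp");
        OutputStream raw = new BufferedOutputStream(Files.newOutputStream(file));
        try (DataOutputStream out =
                 new DataOutputStream(compress ? new GZIPOutputStream(raw) : raw)) {
            out.writeInt(batch.size());
            for (PartedRow r : batch) {
                out.writeInt(r.key);
                out.writeUTF(r.payload);
            }
        }
        return file;
    }

    /** Read a batch back; the reader must use the same compression setting as the writer. */
    static List<PartedRow> readBatch(Path file, boolean compress) throws IOException {
        InputStream raw = new BufferedInputStream(Files.newInputStream(file));
        try (DataInputStream in =
                 new DataInputStream(compress ? new GZIPInputStream(raw) : raw)) {
            int n = in.readInt();
            List<PartedRow> rows = new ArrayList<>(n);
            for (int i = 0; i < n; i++) {
                rows.add(new PartedRow(in.readInt(), in.readUTF()));
            }
            return rows;
        }
    }

    /** Convert -> sort batch -> write temp file -> read temp files -> final sort. */
    static List<PartedRow> sortWithTempFiles(List<String> rawRows, int batchSize,
                                             boolean compress) throws IOException {
        List<Path> tempFiles = new ArrayList<>();
        for (int start = 0; start < rawRows.size(); start += batchSize) {
            int end = Math.min(start + batchSize, rawRows.size());
            List<PartedRow> batch = new ArrayList<>();
            for (String raw : rawRows.subList(start, end)) {
                batch.add(convert(raw)); // conversion happens here, not at the final sort
            }
            tempFiles.add(writeSortedBatch(batch, compress));
        }
        List<PartedRow> result = new ArrayList<>();
        for (Path f : tempFiles) {
            result.addAll(readBatch(f, compress));
            Files.delete(f);
        }
        result.sort(BY_KEY); // simplification: a real loader would merge the sorted files
        return result;
    }

    public static void main(String[] args) throws IOException {
        List<String> raw = Arrays.asList("3,c", "1,a", "2,b", "5,e", "4,d");
        for (PartedRow r : sortWithTempFiles(raw, 2, true)) {
            System.out.println(r.key + " " + r.payload);
        }
    }
}
```

Running the same input with `compress` set to `false` and to `true` should give identical results, which is roughly the property the tests in item 4 are meant to cover for data loading.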
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/xuchuanyin/carbondata bug_sort_temp_compress_1207
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/carbondata/pull/1632.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1632
----
commit fb46e1288ae3150700a6508298f1ec9dcc8d37c2
Author: xuchuanyin <[email protected]>
Date: 2017-12-07T08:31:58Z
Fix bugs in compressing sort temp file
1. fix bugs in compressing sort temp file
2. reduce duplicate code in reading & writing sort temp file
and make it more readable
3. optimize sort procedure:
Before:
raw row that has been converted(call it 'RawRow' for short) ->
sort on RawRow ->
write RawRow to temp sort file ->
read RawRow from temp sort file ->
sort on RawRow -> ... ->
at the final sort, sort on RawRow and convert the RawRow to 3 'PartedRow' ->
write 'PartedRow' to DataFile in write procedure.
After:
raw row that has been converted(call it 'RawRow' for short) ->
convert RawRow to 3 'PartedRow' ->
sort on PartedRow ->
write PartedRow to temp sort file ->
read PartedRow from temp sort file ->
sort on PartedRow -> ... ->
at the final sort, sort on PartedRow ->
write 'PartedRow' to DataFile in write procedure.
4. add tests
----
---