GitHub user xuchuanyin opened a pull request:
https://github.com/apache/carbondata/pull/1707
[CARBONDATA-1839] [DataLoad] Fix bugs and optimize compression of sort temp
files
1. Fix bugs in compressing sort temp files: use file-level compression
instead of batch-record-level compression.
2. Reduce duplicated code in reading and writing sort temp files
and make it more readable.
3. Optimize the sort procedure:
Before:
raw row that has been converted (call it 'RawRow' for short) ->
sort on RawRow ->
write RawRow to temp sort file ->
read RawRow from temp sort file ->
sort on RawRow -> ... ->
at the final sort, sort on RawRow and convert the RawRow to the 3-parted
'PartedRow' ->
write PartedRow to DataFile in the write procedure.
After:
raw row that has been converted (call it 'RawRow' for short) ->
convert RawRow to the 3-parted 'PartedRow' ->
sort on PartedRow ->
write PartedRow to temp sort file ->
read PartedRow from temp sort file ->
sort on PartedRow -> ... ->
at the final sort, sort on PartedRow ->
write PartedRow to DataFile in the write procedure.
4. Add tests.
5. Remove unused code.
6. Update docs and add a property to configure the compressor.
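To illustrate the difference between the two compression strategies in point 1: with file-level compression the temp file's stream is wrapped once, so every row batch written through it shares a single compression context. The sketch below is illustrative only; it uses `java.util.zip` from the JDK as a stand-in for CarbonData's own compressor abstraction, and the class and method names are hypothetical, not CarbonData's real API.

```java
import java.io.*;
import java.util.zip.*;

// Sketch of file-level compression: wrap the stream once for the whole
// file instead of compressing each record batch separately.
public class FileLevelCompressionDemo {

    /** Write n ints through one compressed stream, read them back through
     *  one decompressing stream, and return their sum. */
    static long roundTrip(int n) throws IOException {
        ByteArrayOutputStream file = new ByteArrayOutputStream();

        // Writing side: the DeflaterOutputStream wraps the file stream once;
        // all rows pass through the same compression context.
        try (DataOutputStream out =
                 new DataOutputStream(new DeflaterOutputStream(file))) {
            for (int row = 0; row < n; row++) {
                out.writeInt(row);  // stand-in for a serialized sort row
            }
        }

        // Reading side mirrors the writing side: wrap the input stream once.
        long sum = 0;
        try (DataInputStream in = new DataInputStream(
                 new InflaterInputStream(
                     new ByteArrayInputStream(file.toByteArray())))) {
            for (int row = 0; row < n; row++) {
                sum += in.readInt();
            }
        }
        return sum;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(roundTrip(1000)); // sum of 0..999 = 499500
    }
}
```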
Please refer to the
[mailing list discussion](http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discussion-Compression-for-sort-temp-files-in-Carbomdata-td31747.html)
for more information.
Be sure to complete all of the following checklist items to help us
incorporate your contribution quickly and easily:
- [X] Any interfaces changed?
`YES, ONLY CHANGE INTERNAL INTERFACES`
- [X] Any backward compatibility impacted?
`NO`
- [X] Document update required?
`YES, RELATED DOCUMENT HAS BEEN UPDATED`
- [X] Testing done
Please provide details on
- Whether new unit test cases have been added or why no new tests
are required?
`ADDED TESTS`
- How it is tested? Please attach test report.
`TESTED IN LOCAL CLUSTER`
- Is it a performance related change? Please attach the performance
test report.
`YES`
- Any additional information to help reviewers in testing this
change.
`The key point lies in` **`SortStepRowHandler`**`. It is used to
convert a raw row to a 3-parted row and to read/write rows from/to sort temp
files and unsafe memory.`
- [X] For large changes, please consider breaking it into sub-tasks under
an umbrella JIRA.
`NOT RELATED`
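The split that `SortStepRowHandler` performs can be pictured as follows. This is a hypothetical sketch only: the three parts (dictionary dimensions, no-dictionary dimensions, measures), the column counts, and all names below are assumptions for illustration, not CarbonData's actual layout or API.

```java
import java.util.Arrays;

// Hypothetical sketch of converting a raw row into a 3-parted row, in the
// spirit of SortStepRowHandler: split once, up front, so later merge-sort
// passes never have to re-convert the row.
public class PartedRowDemo {
    // Assumed column layout: 2 dictionary dims, 1 no-dictionary dim,
    // then measures (illustrative values, not CarbonData's schema).
    static final int DICT_DIM_COUNT = 2;
    static final int NO_DICT_DIM_COUNT = 1;

    /** Holds the three parts of one row. */
    static class PartedRow {
        final int[] dictDims;       // dictionary-encoded dimension surrogates
        final byte[][] noDictDims;  // raw bytes of no-dictionary dimensions
        final Object[] measures;    // measure column values
        PartedRow(int[] d, byte[][] n, Object[] m) {
            dictDims = d; noDictDims = n; measures = m;
        }
    }

    /** Split one converted raw row into its three parts. */
    static PartedRow split(Object[] rawRow) {
        int[] dict = new int[DICT_DIM_COUNT];
        for (int i = 0; i < DICT_DIM_COUNT; i++) {
            dict[i] = (Integer) rawRow[i];
        }
        byte[][] noDict = new byte[NO_DICT_DIM_COUNT][];
        for (int i = 0; i < NO_DICT_DIM_COUNT; i++) {
            noDict[i] = (byte[]) rawRow[DICT_DIM_COUNT + i];
        }
        Object[] measures = Arrays.copyOfRange(
            rawRow, DICT_DIM_COUNT + NO_DICT_DIM_COUNT, rawRow.length);
        return new PartedRow(dict, noDict, measures);
    }

    public static void main(String[] args) {
        Object[] raw = {7, 42, "city".getBytes(), 3.14d, 100L};
        PartedRow row = split(raw);
        System.out.println(Arrays.toString(row.dictDims));  // [7, 42]
        System.out.println(new String(row.noDictDims[0]));  // city
        System.out.println(Arrays.toString(row.measures));  // [3.14, 100]
    }
}
```

Once a row is in this form, every subsequent sort and merge pass (and the final write to the DataFile) operates on the same PartedRow, which is what removes the repeated conversion from the old flow.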
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/xuchuanyin/carbondata
bug_compress_sort_temp_1222
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/carbondata/pull/1707.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1707
----
commit 78684a172f0346584ee992bfc40750b03a9f814b
Author: xuchuanyin <xuchuanyin@...>
Date: 2017-12-07T08:31:58Z
Fix bugs in compressing sort temp file
----
---