GitHub user xuchuanyin opened a pull request:
https://github.com/apache/carbondata/pull/1707
[CARBONDATA-1839] [DataLoad] Fix bugs and optimize compression of sort temp
files
1. Fix bugs in compressing sort temp files: use file-level compression
instead of batch-record-level compression.
2. Reduce duplicated code in reading and writing sort temp files
and make it more readable.
3. Optimize the sort procedure:
Before:
raw row that has been converted (call it 'RawRow' for short) ->
sort on RawRow ->
write RawRow to temp sort file ->
read RawRow from temp sort file ->
sort on RawRow -> ... ->
at the final sort, sort on RawRow and convert the RawRow to the 3-parted
'PartedRow' ->
write PartedRow to DataFile in the write procedure.
After:
raw row that has been converted (call it 'RawRow' for short) ->
convert RawRow to the 3-parted 'PartedRow' ->
sort on PartedRow ->
write PartedRow to temp sort file ->
read PartedRow from temp sort file ->
sort on PartedRow -> ... ->
at the final sort, sort on PartedRow ->
write PartedRow to DataFile in the write procedure.
4. Add tests.
5. Remove unused code.
6. Update docs and add a property to configure the compressor.
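To illustrate the difference between the two compression strategies in point 1: with file-level compression the temp file's stream is wrapped once, so every row batch written through it shares a single compression context. The sketch below is illustrative only; it uses `java.util.zip` from the JDK as a stand-in for CarbonData's own compressor abstraction, and the class and method names are hypothetical, not CarbonData's real API.

```java
import java.io.*;
import java.util.zip.*;

// Sketch of file-level compression: wrap the stream once for the whole
// file instead of compressing each record batch separately.
public class FileLevelCompressionDemo {

    /** Write n ints through one compressed stream, read them back through
     *  one decompressing stream, and return their sum. */
    static long roundTrip(int n) throws IOException {
        ByteArrayOutputStream file = new ByteArrayOutputStream();

        // Writing side: the DeflaterOutputStream wraps the file stream once;
        // all rows pass through the same compression context.
        try (DataOutputStream out =
                 new DataOutputStream(new DeflaterOutputStream(file))) {
            for (int row = 0; row < n; row++) {
                out.writeInt(row);  // stand-in for a serialized sort row
            }
        }

        // Reading side mirrors the writing side: wrap the input stream once.
        long sum = 0;
        try (DataInputStream in = new DataInputStream(
                 new InflaterInputStream(
                     new ByteArrayInputStream(file.toByteArray())))) {
            for (int row = 0; row < n; row++) {
                sum += in.readInt();
            }
        }
        return sum;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(roundTrip(1000)); // sum of 0..999 = 499500
    }
}
```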
Please refer to the
[mailing list discussion](http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discussion-Compression-for-sort-temp-files-in-Carbomdata-td31747.html)
for more information.
Be sure to complete all of the following checklist items to help us
incorporate your contribution quickly and easily:
- [X] Any interfaces changed?
`YES, ONLY CHANGE INTERNAL INTERFACES`
- [X] Any backward compatibility impacted?
`NO`
- [X] Document update required?
`YES, RELATED DOCUMENT HAS BEEN UPDATED`
- [X] Testing done
Please provide details on
- Whether new unit test cases have been added or why no new tests
are required?
`ADDED TESTS`
- How it is tested? Please attach test report.
`TESTED IN LOCAL CLUSTER`
- Is it a performance related change? Please attach the performance
test report.
`YES`
- Any additional information to help reviewers in testing this
change.
`The key point lies in` **`SortStepRowHandler`**`. It is used to
convert a raw row to a 3-parted row and to read/write rows from/to sort temp
files and unsafe memory.`
- [X] For large changes, please consider breaking it into sub-tasks under
an umbrella JIRA.
`NOT RELATED`
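The split that `SortStepRowHandler` performs can be pictured as follows. This is a hypothetical sketch only: the three parts (dictionary dimensions, no-dictionary dimensions, measures), the column counts, and all names below are assumptions for illustration, not CarbonData's actual layout or API.

```java
import java.util.Arrays;

// Hypothetical sketch of converting a raw row into a 3-parted row, in the
// spirit of SortStepRowHandler: split once, up front, so later merge-sort
// passes never have to re-convert the row.
public class PartedRowDemo {
    // Assumed column layout: 2 dictionary dims, 1 no-dictionary dim,
    // then measures (illustrative values, not CarbonData's schema).
    static final int DICT_DIM_COUNT = 2;
    static final int NO_DICT_DIM_COUNT = 1;

    /** Holds the three parts of one row. */
    static class PartedRow {
        final int[] dictDims;       // dictionary-encoded dimension surrogates
        final byte[][] noDictDims;  // raw bytes of no-dictionary dimensions
        final Object[] measures;    // measure column values
        PartedRow(int[] d, byte[][] n, Object[] m) {
            dictDims = d; noDictDims = n; measures = m;
        }
    }

    /** Split one converted raw row into its three parts. */
    static PartedRow split(Object[] rawRow) {
        int[] dict = new int[DICT_DIM_COUNT];
        for (int i = 0; i < DICT_DIM_COUNT; i++) {
            dict[i] = (Integer) rawRow[i];
        }
        byte[][] noDict = new byte[NO_DICT_DIM_COUNT][];
        for (int i = 0; i < NO_DICT_DIM_COUNT; i++) {
            noDict[i] = (byte[]) rawRow[DICT_DIM_COUNT + i];
        }
        Object[] measures = Arrays.copyOfRange(
            rawRow, DICT_DIM_COUNT + NO_DICT_DIM_COUNT, rawRow.length);
        return new PartedRow(dict, noDict, measures);
    }

    public static void main(String[] args) {
        Object[] raw = {7, 42, "city".getBytes(), 3.14d, 100L};
        PartedRow row = split(raw);
        System.out.println(Arrays.toString(row.dictDims));  // [7, 42]
        System.out.println(new String(row.noDictDims[0]));  // city
        System.out.println(Arrays.toString(row.measures));  // [3.14, 100]
    }
}
```

Once a row is in this form, every subsequent sort and merge pass (and the final write to the DataFile) operates on the same PartedRow, which is what removes the repeated conversion from the old flow.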
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/xuchuanyin/carbondata
bug_compress_sort_temp_1222
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/carbondata/pull/1707.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1707
----
commit 78684a172f0346584ee992bfc40750b03a9f814b
Author: xuchuanyin <xuchuanyin@...>
Date: 2017-12-07T08:31:58Z
Fix bugs in compressing sort temp file
----
---