GitHub user xuchuanyin opened a pull request:
https://github.com/apache/carbondata/pull/1808
[CARBONDATA-2023][DataLoad] Add size base block allocation in data loading
CarbonData assigns blocks to nodes at the beginning of data loading.
The previous block allocation strategy is based on the number of blocks,
so it suffers from data skew if the sizes of the input files differ a lot.
We introduce a size-based block allocation strategy to optimize data
loading performance in skewed-data scenarios.
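To illustrate the idea of size-based allocation, here is a minimal sketch. It is not the code in CarbonLoaderUtil: the class `SizeBasedAllocationSketch`, the type `NodeLoad`, and the method `allocateBySize` are hypothetical names, and the sketch ignores data locality. It assumes a simple greedy rule: sort the blocks by size in descending order and always give the next block to the node with the smallest total so far.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class SizeBasedAllocationSketch {

    /** Hypothetical per-node bucket: the block sizes assigned to a node and their running total. */
    static final class NodeLoad {
        final String node;
        final List<Long> blockSizes = new ArrayList<>();
        long totalBytes = 0L;

        NodeLoad(String node) {
            this.node = node;
        }
    }

    /**
     * Greedy size-based allocation (an assumption for illustration, not CarbonData's code):
     * the largest remaining block always goes to the node with the least data so far.
     */
    static List<NodeLoad> allocateBySize(List<Long> blockSizes, List<String> nodes) {
        if (nodes.isEmpty()) {
            throw new IllegalArgumentException("at least one node is required");
        }
        // Min-heap ordered by the bytes already assigned to each node.
        PriorityQueue<NodeLoad> leastLoaded =
            new PriorityQueue<>(Comparator.comparingLong((NodeLoad n) -> n.totalBytes));
        List<NodeLoad> result = new ArrayList<>();
        for (String node : nodes) {
            NodeLoad load = new NodeLoad(node);
            result.add(load);
            leastLoaded.add(load);
        }
        // Place the largest blocks first so the small ones can even out the totals.
        List<Long> sorted = new ArrayList<>(blockSizes);
        sorted.sort(Comparator.reverseOrder());
        for (long size : sorted) {
            NodeLoad target = leastLoaded.poll();
            target.blockSizes.add(size);
            target.totalBytes += size;
            leastLoaded.add(target); // re-insert so the heap sees the updated total
        }
        return result;
    }

    public static void main(String[] args) {
        // One large block and five tiny ones, spread over two nodes.
        List<Long> sizes = Arrays.asList(5000L, 1L, 1L, 1L, 1L, 1L);
        for (NodeLoad load : allocateBySize(sizes, Arrays.asList("node-1", "node-2"))) {
            System.out.println(load.node + ": blocks=" + load.blockSizes.size()
                + ", bytes=" + load.totalBytes);
        }
    }
}
```

With a count-based split, each of the two nodes in this example would receive three blocks regardless of size. The size-based rule instead leaves the large block alone on one node and sends all the small blocks to the other, so the block counts diverge while the byte totals are as close as the input allows.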
Be sure to complete the following checklist to help us incorporate
your contribution quickly and easily:
- [x] Any interfaces changed?
`Only changed the internal interfaces`
- [x] Any backward compatibility impacted?
`No`
- [x] Document update required?
`Updated the document`
- [x] Testing done
Please provide details on
- Whether new unit test cases have been added or why no new tests
are required?
`Added tests to verify the block-allocation correctness`
- How it is tested? Please attach test report.
`Tested in local 3-node cluster`
- Is it a performance related change? Please attach the performance
test report.
```
In my scenario, the size of the input data files varies from 1KB to about 5GB.
Before enabling this feature (block-number-based allocation), each executor
processed the same number of blocks, and the processed data size had a 5X gap.
After enabling this feature (block-size-based allocation), each executor
processed almost the same amount of data, and the number of processed blocks
had a 6X gap.
Data loading performance improved from 41MB/s/Node to 61MB/s/Node,
a gain of about 50%.
```
- Any additional information to help reviewers in testing this
change.
`I refactored the code to make it more readable. The core code
mainly lies in CarbonLoaderUtil`
- [x] For large changes, please consider breaking it into sub-tasks under
an umbrella JIRA.
`Not related`
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/xuchuanyin/carbondata opt_size_base_block_allocation
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/carbondata/pull/1808.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1808
----
commit f9ee9eaa1d0289c958a0dcbc665a383ea190a812
Author: xuchuanyin <xuchuanyin@...>
Date: 2018-01-16T02:59:37Z
Add size base block allocation in data loading
CarbonData assigns blocks to nodes at the beginning of data loading.
The previous block allocation strategy is based on the number of blocks,
so it suffers from data skew if the sizes of the input files differ a lot.
We introduce a size-based block allocation strategy to optimize data
loading performance in skewed-data scenarios.
----
---