[
https://issues.apache.org/jira/browse/CARBONDATA-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
xuchuanyin resolved CARBONDATA-2023.
------------------------------------
Resolution: Fixed
> Optimization in data loading for skewed data
> --------------------------------------------
>
> Key: CARBONDATA-2023
> URL: https://issues.apache.org/jira/browse/CARBONDATA-2023
> Project: CarbonData
> Issue Type: Improvement
> Components: data-load
> Affects Versions: 1.3.0
> Reporter: xuchuanyin
> Assignee: xuchuanyin
> Priority: Major
> Time Spent: 16h 40m
> Remaining Estimate: 0h
>
> In one of my use cases, CarbonData has to load skewed data files, whose
> sizes range from 1KB to about 5GB.
> In the current implementation, CarbonData distributes the file blocks
> (splits) among the nodes to maximize data locality and to keep the data
> evenly distributed. We call this `block-node-assignment` for short.
> However, the current implementation has a problem.
> The assignment is based on block count: the goal is to make sure that all
> nodes process the same number of blocks. In the skewed data scenario
> described above, a block of a small file and a block of a big file differ
> greatly in size (1KB vs. 64MB). As a result, the total data size assigned
> to each node varies widely.
> To solve this problem, the size of each block should be taken into account
> during block-node-assignment. One node may process more blocks than another
> as long as the total size of the blocks on each node is almost the same.
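
The difference between count-based and size-based assignment can be illustrated with a minimal sketch. This is not CarbonData's actual implementation: the function names `assign_by_count` and `assign_by_size` are hypothetical, the size-aware variant uses a plain greedy "largest block first, to the least-loaded node" strategy, and data locality is ignored entirely for brevity.

```python
import heapq

KB = 1024
MB = 1024 * KB


def assign_by_count(block_sizes, num_nodes):
    """Count-based baseline (hypothetical): round-robin, so every node
    receives the same number of blocks regardless of their sizes."""
    nodes = [[] for _ in range(num_nodes)]
    for i, size in enumerate(block_sizes):
        nodes[i % num_nodes].append(size)
    return nodes


def assign_by_size(block_sizes, num_nodes):
    """Size-aware sketch (hypothetical): take blocks from largest to
    smallest and give each one to the node with the fewest bytes so far."""
    # Min-heap of (total_bytes_assigned, node_index).
    heap = [(0, n) for n in range(num_nodes)]
    heapq.heapify(heap)
    nodes = [[] for _ in range(num_nodes)]
    for size in sorted(block_sizes, reverse=True):
        total, n = heapq.heappop(heap)
        nodes[n].append(size)
        heapq.heappush(heap, (total + size, n))
    return nodes


# Illustrative workload: 64MB blocks from a big file interleaved with
# 1KB blocks from small files, spread over two nodes.
blocks = [64 * MB, 1 * KB] * 4

for name, assign in (("by count", assign_by_count), ("by size", assign_by_size)):
    totals = [sum(node) for node in assign(blocks, 2)]
    print(name, totals)
```

With this workload, the count-based baseline gives one node all four 64MB blocks (256MB) and the other only 4KB, while the size-aware sketch gives each node 128MB + 2KB. A real assignment would also have to weigh locality against balance, which this sketch does not attempt.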