[ 
https://issues.apache.org/jira/browse/CARBONDATA-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xuchuanyin resolved CARBONDATA-2023.
------------------------------------
    Resolution: Fixed

> Optimization in data loading for skewed data
> --------------------------------------------
>
>                 Key: CARBONDATA-2023
>                 URL: https://issues.apache.org/jira/browse/CARBONDATA-2023
>             Project: CarbonData
>          Issue Type: Improvement
>          Components: data-load
>    Affects Versions: 1.3.0
>            Reporter: xuchuanyin
>            Assignee: xuchuanyin
>            Priority: Major
>          Time Spent: 16h 40m
>  Remaining Estimate: 0h
>
> In one of my use cases, CarbonData has to load skewed data files whose sizes 
> range from 1KB to about 5GB.
> In the current implementation, CarbonData distributes the file blocks (splits) 
> among the nodes to maximize data locality and keep the data evenly 
> distributed. We call this `block-node-assignment` for short.
> However, the current implementation has some problems.
> The assignment is based on the number of blocks: the goal is to make sure 
> that all nodes process the same number of blocks. In the skewed data scenario 
> described above, a block from a small file and a block from a big file differ 
> greatly in size (1KB vs. 64MB). As a result, the total data size assigned to 
> each node varies widely.
> To solve this problem, the size of each block should be considered during 
> block-node-assignment. One node may process more blocks than another as long 
> as the total size of the blocks assigned to each node is almost the same.
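
To illustrate the idea in the description above, here is a minimal sketch that compares a count-based assignment with a size-aware one. It is not CarbonData's actual implementation: the function names, node names, and block sizes are all hypothetical, and data locality is ignored to keep the example short. The size-aware variant uses a simple greedy heuristic (largest block first, always to the node with the smallest total so far).

```python
import heapq

KB, MB = 1024, 1024 * 1024


def assign_by_count(block_sizes, nodes):
    """Count-based baseline: give each node the same number of blocks,
    in contiguous chunks, without looking at block sizes."""
    per_node = -(-len(block_sizes) // len(nodes))  # ceiling division
    assignment = {node: [] for node in nodes}
    for i, size in enumerate(block_sizes):
        assignment[nodes[i // per_node]].append(size)
    return assignment


def assign_by_size(block_sizes, nodes):
    """Size-aware sketch: take blocks from largest to smallest and give
    each one to the node with the smallest total size so far."""
    heap = [(0, node) for node in nodes]  # (bytes assigned so far, node)
    heapq.heapify(heap)
    assignment = {node: [] for node in nodes}
    for size in sorted(block_sizes, reverse=True):
        load, node = heapq.heappop(heap)
        assignment[node].append(size)
        heapq.heappush(heap, (load + size, node))
    return assignment


def total_per_node(assignment):
    """Total bytes assigned to each node."""
    return {node: sum(sizes) for node, sizes in assignment.items()}


# Hypothetical skewed workload: a few large blocks plus many 1KB blocks.
blocks = [64 * MB, 64 * MB, 64 * MB, 32 * MB, 16 * MB, 16 * MB] + [1 * KB] * 6
nodes = ["node-1", "node-2"]

by_count = assign_by_count(blocks, nodes)
by_size = assign_by_size(blocks, nodes)

for name, result in (("count-based", by_count), ("size-aware", by_size)):
    for node, total in total_per_node(result).items():
        print(f"{name:11s} {node}: {len(result[node])} blocks, {total} bytes")
```

In this sketch the count-based split hands every node six blocks but leaves one node with nearly all the bytes, while the size-aware split lets the block counts differ so that the byte totals line up. A real assignment would also have to weigh data locality, which this example leaves out.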



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
