GitHub user ajantha-bhat opened a pull request:
https://github.com/apache/carbondata/pull/2653
[CARBONDATA-2874] Support SDK writer as thread safe api
Problem: Currently CarbonWriter.write() is not thread safe. If multiple
threads call .write() on the same writer, data count inconsistency is
observed.
Root cause: All the threads write to the same batch of the blocking queue,
and this access is not synchronized, so one thread's data can overwrite
another thread's data.
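The root cause above can be illustrated with a minimal sketch. This is not CarbonData code: SharedBatchSketch, writeUnsafe, write and runConcurrentWrites are hypothetical names, and the fixed-size array only stands in for the shared batch. It assumes the race is an unsynchronized "store row, then advance index" sequence, and shows that synchronizing that sequence keeps the row count consistent.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a writer whose threads all append to one shared batch.
class SharedBatchSketch {
    private final Object[] batch;
    private int size = 0;

    SharedBatchSketch(int capacity) {
        batch = new Object[capacity];
    }

    // Unsafe variant: two threads can read the same 'size', so one row
    // overwrites the other and the final count comes up short.
    void writeUnsafe(Object row) {
        batch[size] = row;
        size = size + 1;
    }

    // Safe variant: the store and the index update happen under one lock.
    synchronized void write(Object row) {
        batch[size] = row;
        size++;
    }

    synchronized int rowCount() {
        return size;
    }

    // Starts 'threads' workers that each write 'rowsPerThread' rows
    // through the synchronized path, then returns the final row count.
    static int runConcurrentWrites(int threads, int rowsPerThread) {
        SharedBatchSketch writer = new SharedBatchSketch(threads * rowsPerThread);
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            Thread worker = new Thread(() -> {
                for (int i = 0; i < rowsPerThread; i++) {
                    writer.write(new Object[] {i});
                }
            });
            workers.add(worker);
            worker.start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return writer.rowCount();
    }

    public static void main(String[] args) {
        System.out.println(runConcurrentWrites(4, 10000));
    }
}
```

Swapping write for writeUnsafe in the worker loop may reproduce the count mismatch, although a race like this does not show up on every run.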
Solution:
a) DataLoadExecutor uses only one iterator. Take the number of threads as
input and internally create that many iterators to loop over the data. This
reduces the blocking time of the queue, as each iterator has its own queue.
b) The InputProcessor step uses only the default 2 cores (2 threads) for data
load in the SDK flow. It can use the same number of threads as created by the
user.
c) The writer step uses only 2 cores (2 threads). It can use the same number
of threads as created by the user.
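The idea in (a) to (c) can be sketched as one queue per user thread, each drained by its own consumer. This is only an illustration under stated assumptions, not the patch itself: PerThreadQueueSketch, load, the END marker and the queue capacity of 64 are all hypothetical, and a plain counter stands in for the real input-processing and writer steps.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: each producer thread gets its own queue and consumer,
// so producers never contend on a shared batch.
class PerThreadQueueSketch {
    // Marker object that tells a consumer its producer has finished.
    private static final Object END = new Object();

    static long load(int threads, int rowsPerThread) {
        // Two tasks per user thread (producer and consumer), all running at once.
        ExecutorService pool = Executors.newFixedThreadPool(threads * 2);
        AtomicLong written = new AtomicLong();
        List<Future<?>> tasks = new ArrayList<>();
        try {
            for (int t = 0; t < threads; t++) {
                BlockingQueue<Object> queue = new ArrayBlockingQueue<>(64);
                // Producer: stands in for one user thread calling write().
                tasks.add(pool.submit(() -> {
                    for (int i = 0; i < rowsPerThread; i++) {
                        queue.put(new Object[] {i});
                    }
                    queue.put(END);
                    return null;
                }));
                // Consumer: stands in for the iterator bound to this queue.
                tasks.add(pool.submit(() -> {
                    while (queue.take() != END) {
                        written.incrementAndGet();
                    }
                    return null;
                }));
            }
            for (Future<?> task : tasks) {
                task.get(); // wait for completion and surface any failure
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return written.get();
    }

    public static void main(String[] args) {
        System.out.println(load(4, 1000));
    }
}
```

Sizing the pool from the caller's thread count mirrors the proposal to stop relying on the default of 2 cores. The pool holds twice as many threads as producers so that every consumer can run alongside its producer and no bounded queue stays full indefinitely.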
Be sure to do all of the following checklist to help us incorporate
your contribution quickly and easily:
- [ ] Any interfaces changed? NA. Added new interface
- [ ] Any backward compatibility impacted? NA
- [ ] Document update required? Yes, updated
- [ ] Testing done. Yes, updated the test case
- [ ] For large changes, please consider breaking it into sub-tasks under
an umbrella JIRA. NA
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/ajantha-bhat/carbondata master_new
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/carbondata/pull/2653.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2653
----
commit ad991851c109a1231e5aea088001f1c80097d3b3
Author: ajantha-bhat <ajanthabhat@...>
Date: 2018-08-21T05:22:35Z
multi-thread by iterators
----