GitHub user ajantha-bhat opened a pull request:
https://github.com/apache/carbondata/pull/2664
[CARBONDATA-2895] Fix Query result mismatch with Batch-sort in save to disk
(sort temp files) scenario.
**Problem:** Query results are wrong with batch sort when sort data is saved
to disk (sort temp files).
**Scenario:**
a) Configure batch sort, but set a batch size larger than
UnsafeMemoryManager.INSTANCE.getUsableMemory().
b) Load data that is larger than the batch size. Observe that sort data is
saved to disk, because a single batch cannot be processed in unsafe memory.
c) The load therefore happens in 2 batches.
d) Query the table. The result contains more rows than expected.
**Root cause:**
createSortDataRows() is called for each batch.
Files saved to disk while sorting the previous batch were also considered
for the current batch.
**Solution:**
Files saved to disk while sorting the previous batch should not be
considered for the current batch.
Hence, use the batch ID as the rangeID field of the sort temp files,
so that getFilesToMergeSort() selects only the files of the current batch.
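
The idea can be illustrated with a minimal, standalone sketch. It is not the CarbonData code: the class name, the helper methods, and the `<table>_<rangeId>_<sequence>.sorttemp` file naming scheme are all hypothetical and only show why tagging temp files with the batch ID keeps batches apart.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; names and file layout are assumptions, not CarbonData APIs.
class BatchSortTempFileSketch {

    // Assumed naming scheme: <table>_<rangeId>_<sequence>.sorttemp
    static String tempFileName(String table, int rangeId, int sequence) {
        return table + "_" + rangeId + "_" + sequence + ".sorttemp";
    }

    // Behaviour before the fix: every temp file in the directory is picked up,
    // including files left over from earlier batches.
    static List<String> selectAll(List<String> tempFiles) {
        return new ArrayList<>(tempFiles);
    }

    // Behaviour after the fix: only files whose rangeId equals the current
    // batch ID are selected for the merge sort.
    static List<String> selectForBatch(List<String> tempFiles, String table, int batchId) {
        String prefix = table + "_" + batchId + "_";
        List<String> selected = new ArrayList<>();
        for (String file : tempFiles) {
            if (file.startsWith(prefix)) {
                selected.add(file);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> tempDir = new ArrayList<>();
        tempDir.add(tempFileName("t1", 0, 0)); // spilled while sorting batch 0
        tempDir.add(tempFileName("t1", 0, 1)); // spilled while sorting batch 0
        tempDir.add(tempFileName("t1", 1, 0)); // spilled while sorting batch 1

        // Without the range filter, batch 1 would merge 3 files (rows of batch 0 again).
        System.out.println(selectAll(tempDir).size());
        // With the batch ID as rangeId, batch 1 merges only its own file.
        System.out.println(selectForBatch(tempDir, "t1", 1));
    }
}
```

In this sketch, merging for batch 1 without the filter would read the two batch 0 files a second time, which corresponds to the extra rows seen in the query result.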
Be sure to do all of the following checklist to help us incorporate
your contribution quickly and easily:
- [ ] Any interfaces changed? NA
- [ ] Any backward compatibility impacted? NA
- [ ] Document update required? NA
- [ ] Testing done. done
- [ ] For large changes, please consider breaking it into sub-tasks under
an umbrella JIRA. NA
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/ajantha-bhat/carbondata master_new
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/carbondata/pull/2664.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2664
----
commit bad70a37508a2bad650aae2b150eecef59449a30
Author: ajantha-bhat <ajanthabhat@...>
Date: 2018-08-27T15:25:03Z
[CARBONDATA-2895] Fix Query result mismatch with Batch-sort in save to disk
(sort temp files) scenario.
----