GitHub user manishgupta88 opened a pull request:
https://github.com/apache/carbondata/pull/1715
[CARBONDATA-1934] Incorrect results are returned by select query in case
when the number of blocklets for one part file are > 1 in the same task
Problem: When a select query is triggered, the driver prunes the segments
and returns the list of blocklets that need to be scanned. The number of Spark
tasks equals the number of blocklets identified.
When one task has more than one blocklet for the same file, the
BlockExecutionInfo that gets formed is incorrect, and as a result the query
returns incorrect results.
Fix: Use the abstract index to fill all the details in BlockExecutionInfo.
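
To illustrate the failure mode described above, here is a minimal, self-contained
sketch. It is not the CarbonData code: the Blocklet class, both method names and the
part file names are hypothetical, and the description above does not state the exact
mechanism of the bug. The sketch only assumes that per-task execution details were
effectively tracked per file, so that a second blocklet of the same file was lost.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BlockletTaskSketch {

    // Hypothetical stand-in for a pruned blocklet: the part file it belongs to
    // and its position inside that file.
    static final class Blocklet {
        final String filePath;
        final int blockletId;

        Blocklet(String filePath, int blockletId) {
            this.filePath = filePath;
            this.blockletId = blockletId;
        }
    }

    // Assumed faulty shape: one entry per file, so a later blocklet of the
    // same file overwrites the earlier one and its rows are never scanned.
    static Map<String, Integer> keyedByFileOnly(List<Blocklet> taskBlocklets) {
        Map<String, Integer> executionInfos = new LinkedHashMap<>();
        for (Blocklet b : taskBlocklets) {
            executionInfos.put(b.filePath, b.blockletId);
        }
        return executionInfos;
    }

    // Intended shape: one entry per (file, blocklet) pair, so every blocklet
    // assigned to the task is scanned.
    static List<String> keyedByFileAndBlocklet(List<Blocklet> taskBlocklets) {
        List<String> executionInfos = new ArrayList<>();
        for (Blocklet b : taskBlocklets) {
            executionInfos.add(b.filePath + "#" + b.blockletId);
        }
        return executionInfos;
    }

    public static void main(String[] args) {
        // One task holding two blocklets of the same (hypothetical) part file.
        List<Blocklet> task = Arrays.asList(
                new Blocklet("part-0-0.carbondata", 0),
                new Blocklet("part-0-0.carbondata", 1),
                new Blocklet("part-0-1.carbondata", 0));

        System.out.println("per file only:     " + keyedByFileOnly(task).size());
        System.out.println("per file+blocklet: " + keyedByFileAndBlocklet(task).size());
    }
}
```

In this sketch the per-file variant yields two entries for three blocklets, which
is the kind of silent data loss that would show up as incorrect select results.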
- [ ] Any interfaces changed?
No
- [ ] Any backward compatibility impacted?
No
- [ ] Document update required?
No
- [ ] Testing done
Manual testing
- [ ] For large changes, please consider breaking it into sub-tasks under
an umbrella JIRA.
NA
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/manishgupta88/carbondata data_loss_fix
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/carbondata/pull/1715.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1715
----
commit b0c518d4aa7d4b2387899deefc0f9ed39b5c463c
Author: manishgupta88 <tomanishgupta18@...>
Date: 2017-12-22T10:35:31Z
Incorrect results are returned by select query in case when the number of
blocklets for one part file are > 1 in the same task
----