[
https://issues.apache.org/jira/browse/CARBONDATA-726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jacky Li resolved CARBONDATA-726.
---------------------------------
Resolution: Fixed
Fix Version/s: 1.1.0-incubating
> Update with V3 format for better IO and processing optimization.
> ----------------------------------------------------------------
>
> Key: CARBONDATA-726
> URL: https://issues.apache.org/jira/browse/CARBONDATA-726
> Project: CarbonData
> Issue Type: Improvement
> Reporter: Ravindra Pesala
> Fix For: 1.1.0-incubating
>
> Time Spent: 10h 10m
> Remaining Estimate: 0h
>
> Problems with the current format:
> 1. IO read is slow, since scanning a column requires multiple seeks on the
> file to read its blocklets. The current blocklet size is 120000 rows, so
> reading all the data for a column means several reads from the file.
> Alternatively we could increase the blocklet size, but that hurts filter
> queries, which then get a bigger blocklet to filter.
> 2. Decompression is slow in the current format. We use an inverted index for
> faster filter queries and compress it with NumberCompressor using bit-wise
> packing, which is slow, so we should avoid NumberCompressor. One alternative
> is to keep the blocklet size within 32000 rows so the inverted index can be
> written as shorts, but then IO read suffers a lot.
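A minimal sketch of the idea behind point 2 (hypothetical names, not CarbonData's actual writer code): once a page is capped at 32000 rows, every row position fits in a signed short (max 32767), so the inverted index can be stored at a fixed 2 bytes per entry with no bit-packing step to undo at read time.

```java
// Sketch: store an inverted index as short[] instead of bit-packed ints.
// PAGE_SIZE and the class name are illustrative assumptions.
public class InvertedIndexSketch {
    static final int PAGE_SIZE = 32000; // fits within Short.MAX_VALUE (32767)

    // Encode row positions as shorts; valid only while positions < PAGE_SIZE.
    static short[] encode(int[] rowPositions) {
        short[] packed = new short[rowPositions.length];
        for (int i = 0; i < rowPositions.length; i++) {
            if (rowPositions[i] >= PAGE_SIZE) {
                throw new IllegalArgumentException("row position exceeds page size");
            }
            packed[i] = (short) rowPositions[i];
        }
        return packed;
    }

    public static void main(String[] args) {
        short[] packed = encode(new int[]{0, 5, 31999});
        System.out.println(packed[2]); // last position survives the narrowing cast
    }
}
```

Reading the index back is then a plain cast per entry, rather than unpacking variable-width bit fields.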
> To overcome the above two issues we are introducing a new format, V3.
> Here each blocklet has multiple pages of 32000 rows each, and the number of
> pages per blocklet is configurable. Since each page stays within the short
> limit, there is no need to compress the inverted index.
> We also maintain the max/min for each page to further prune filter queries.
> The blocklet is read with all its pages at once and kept in off-heap memory.
> During filtering, the max/min range is checked first, and only if the page
> might match do we decompress it to filter further.
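The max/min pruning step described above can be sketched as follows (hypothetical names, not CarbonData's actual filter API): a page is decompressed only when the filter value can lie inside the page's recorded value range.

```java
// Sketch: per-page min/max statistics let a filter skip pages entirely,
// avoiding decompression for pages whose value range cannot match.
public class PagePruningSketch {
    static final class PageStats {
        final long min, max; // illustrative per-page statistics
        PageStats(long min, long max) { this.min = min; this.max = max; }
    }

    // True if the page might contain filterValue and must be decompressed;
    // false means the page can be pruned without any further work.
    static boolean mightContain(PageStats stats, long filterValue) {
        return filterValue >= stats.min && filterValue <= stats.max;
    }

    public static void main(String[] args) {
        PageStats page = new PageStats(100, 500);
        System.out.println(mightContain(page, 50));  // outside range: prune
        System.out.println(mightContain(page, 250)); // inside range: decompress
    }
}
```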
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)